Class Lucene90BlockTreeTermsWriter
- java.lang.Object
  - org.apache.lucene.codecs.FieldsConsumer
    - org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter

All Implemented Interfaces: java.io.Closeable, java.lang.AutoCloseable
public final class Lucene90BlockTreeTermsWriter extends FieldsConsumer
Block-based terms index and dictionary writer.
Writes terms dict and index, block-encoding (column stride) each term's metadata for each set of terms between two index terms.
Files:
- .tim: Term Dictionary
- .tmd: Term Metadata
- .tip: Term Index
Term Dictionary
The .tim file contains the list of terms in each field along with per-term statistics (such as docfreq) and per-term metadata (typically pointers to the postings list for that term in the inverted index).
The .tim file is arranged in blocks, each containing a variable number of entries (by default 25-48), where each entry is either a term or a reference to a sub-block.
NOTE: The term dictionary can plug into different postings implementations: the postings writer/reader are actually responsible for encoding and decoding the Postings Metadata and Term Metadata sections.
- TermsDict (.tim) --> Header, FieldDict^NumFields, Footer
- FieldDict --> PostingsHeader, NodeBlock^NumBlocks
- NodeBlock --> (OuterNode | InnerNode)
- OuterNode --> EntryCount, SuffixLength, Byte^SuffixLength, StatsLength, <TermStats>^EntryCount, MetaLength, <TermMetadata>^EntryCount
- InnerNode --> EntryCount, SuffixLength[,Sub?], Byte^SuffixLength, StatsLength, <TermStats?>^EntryCount, MetaLength, <TermMetadata?>^EntryCount
- TermStats --> DocFreq, TotalTermFreq
- Header --> CodecHeader
- EntryCount, SuffixLength, StatsLength, DocFreq, MetaLength --> VInt
- TotalTermFreq --> VLong
- Footer --> CodecFooter
Here X^N denotes N consecutive occurrences of X.
Notes:
- Header is a CodecHeader storing the version information for the BlockTree implementation.
- DocFreq is the count of documents which contain the term.
- TotalTermFreq is the total number of occurrences of the term. This is encoded as the difference between the total number of occurrences and the DocFreq.
- PostingsHeader and TermMetadata are plugged into by the specific postings implementation: these contain arbitrary per-file data (such as parameters or versioning information) and per-term data (such as pointers to inverted files).
- For inner nodes of the tree, every entry steals one bit to mark whether it points to a child node (sub-block). If so, the corresponding TermStats and TermMetadata are omitted.
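The two encoding notes above (TotalTermFreq stored as a delta from DocFreq, and the bit stolen from inner-node entries) can be sketched as follows. This is a minimal illustration that follows the grammar on this page, not the writer's actual code, which may compact the stats further. The class TermEntrySketch and its methods are hypothetical, and placing the stolen bit in the low bit of the suffix-length code is an assumption.

```java
import java.io.ByteArrayOutputStream;

final class TermEntrySketch {
    // Variable-length encoding in the style of Lucene's VInt/VLong: 7 payload bits
    // per byte, low-order group first, high bit set on every byte except the last.
    static void writeVLong(ByteArrayOutputStream out, long value) {
        while ((value & ~0x7FL) != 0L) {
            out.write((int) ((value & 0x7FL) | 0x80L));
            value >>>= 7;
        }
        out.write((int) value);
    }

    // TermStats --> DocFreq, TotalTermFreq, with TotalTermFreq written as
    // (totalTermFreq - docFreq). For non-negative values a VInt and a VLong
    // produce the same bytes, so one helper serves both.
    static byte[] encodeTermStats(int docFreq, long totalTermFreq) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVLong(out, docFreq);
        writeVLong(out, totalTermFreq - docFreq);
        return out.toByteArray();
    }

    // Inner-node entry: ASSUMPTION: the stolen bit is the low bit of the code.
    static int encodeSuffixCode(int suffixLength, boolean isSubBlock) {
        return (suffixLength << 1) | (isSubBlock ? 1 : 0);
    }

    static int suffixLength(int code) {
        return code >>> 1;
    }

    static boolean isSubBlock(int code) {
        return (code & 1) != 0;
    }

    public static void main(String[] args) {
        byte[] stats = encodeTermStats(3, 10);
        System.out.println("stats bytes: " + stats[0] + ", " + stats[1]);
        int code = encodeSuffixCode(5, true);
        System.out.println("suffixLength=" + suffixLength(code) + " subBlock=" + isSubBlock(code));
    }
}
```

Storing the delta keeps the second value small (often zero when a term occurs once per document), so it usually fits in a single byte.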
Term Metadata
The .tmd file contains the list of term metadata (such as FST index metadata) and field level statistics (such as sum of total term freq).
- TermsMeta (.tmd) --> Header, NumFields, <FieldStats>^NumFields, TermIndexLength, TermDictLength, Footer
- FieldStats --> FieldNumber, NumTerms, RootCodeLength, Byte^RootCodeLength, SumTotalTermFreq?, SumDocFreq, DocCount, MinTerm, MaxTerm, IndexStartFP, FSTHeader, FSTMetadata
- Header, FSTHeader --> CodecHeader
- TermIndexLength, TermDictLength --> Uint64
- MinTerm, MaxTerm --> VInt length followed by the byte[]
- NumFields, FieldNumber, RootCodeLength, DocCount --> VInt
- NumTerms, SumTotalTermFreq, SumDocFreq, IndexStartFP --> VLong
- Footer --> CodecFooter
Notes:
- FieldNumber is the field's number from FieldInfos (.fnm).
- NumTerms is the number of unique terms for the field.
- RootCode points to the root block for the field.
- SumDocFreq is the total number of postings, the number of term-document pairs across the entire field.
- DocCount is the number of documents that have at least one posting for this field.
- MinTerm, MaxTerm are the lowest and highest term in this field.
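As an illustration of the "VInt length followed by the byte[]" layout used for MinTerm and MaxTerm, here is a small round-trip sketch. The class LengthPrefixedTermSketch and its methods are hypothetical and use only the JDK; they mirror the layout described on this page rather than the writer's own code.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class LengthPrefixedTermSketch {
    // VInt: 7 payload bits per byte, low-order group first, high bit = "more bytes follow".
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Writes the term as its length (VInt) followed by the raw bytes.
    static byte[] encodeTerm(byte[] term) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVInt(out, term.length);
        out.write(term, 0, term.length);
        return out.toByteArray();
    }

    // Reads the VInt length, then copies that many bytes.
    static byte[] decodeTerm(byte[] encoded) {
        int length = 0;
        int shift = 0;
        int pos = 0;
        byte b;
        do {
            b = encoded[pos++];
            length |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return Arrays.copyOfRange(encoded, pos, pos + length);
    }

    public static void main(String[] args) {
        byte[] encoded = encodeTerm("apple".getBytes(StandardCharsets.UTF_8));
        System.out.println(encoded.length + " bytes, decoded: "
            + new String(decodeTerm(encoded), StandardCharsets.UTF_8));
    }
}
```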
Term Index
The .tip file contains an index into the term dictionary, so that it can be accessed randomly. The index is also used to determine when a given term cannot exist on disk (in the .tim file), saving a disk seek.
- TermsIndex (.tip) --> Header, FSTIndex^NumFields, Footer
- Header --> CodecHeader
- FSTIndex --> FST<byte[]>
- Footer --> CodecFooter
Notes:
- The .tip file contains a separate FST for each field. The FST maps a term prefix to the on-disk block that holds all terms starting with that prefix. Each field's IndexStartFP points to its FST.
- An on-disk block may end up containing too many terms (more than the allowed maximum, 48 by default). When this happens, the block is subdivided into new blocks (called "floor blocks"), and the output in the FST for the block's prefix encodes the leading byte of each sub-block along with its file pointer.
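The floor-block split described in the last note can be sketched roughly as below. This is a simplified illustration under stated assumptions, not the writer's actual algorithm: the class FloorBlockSketch and its method are hypothetical. The cut rule is also an assumption: start a new floor block when the lead byte changes, the current block already holds at least minItems entries, and the entries not yet assigned still exceed maxItems.

```java
import java.util.ArrayList;
import java.util.List;

final class FloorBlockSketch {
    /**
     * Given the sorted lead bytes of the entries that share one prefix, returns the
     * start offset of each floor block. Cuts only happen where the lead byte changes,
     * so every floor block can be identified by its first lead byte.
     */
    static List<Integer> floorBlockStarts(int[] leadLabels, int minItems, int maxItems) {
        List<Integer> starts = new ArrayList<>();
        starts.add(0);
        int blockStart = 0;
        int lastLabel = -1;
        for (int i = 0; i < leadLabels.length; i++) {
            if (leadLabels[i] != lastLabel) {
                int itemsInBlock = i - blockStart;
                boolean restTooLarge = leadLabels.length - blockStart > maxItems;
                if (itemsInBlock >= minItems && restTooLarge) {
                    blockStart = i;
                    starts.add(i);
                }
                lastLabel = leadLabels[i];
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        int[] labels = {'a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'};
        // With minItems=2 and maxItems=4, nine entries do not fit in one block.
        System.out.println(floorBlockStarts(labels, 2, 4));
    }
}
```

In this sketch the FST output for the shared prefix would then need to record, for each start offset after the first, the lead byte at that offset and the file pointer of the corresponding floor block.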
See Also: Lucene90BlockTreeTermsReader
Nested Class Summary
Nested Classes
Modifier and Type        Class
private class            Lucene90BlockTreeTermsWriter.PendingBlock
private static class     Lucene90BlockTreeTermsWriter.PendingEntry
private static class     Lucene90BlockTreeTermsWriter.PendingTerm
private static class     Lucene90BlockTreeTermsWriter.StatsWriter
(package private) class  Lucene90BlockTreeTermsWriter.TermsWriter
-
Field Summary
Fields
Modifier and Type                               Field                   Description
private boolean                                 closed
static int                                      DEFAULT_MAX_BLOCK_SIZE  Suggested default value for the maxItemsInBlock parameter to Lucene90BlockTreeTermsWriter(SegmentWriteState,PostingsWriterBase,int,int).
static int                                      DEFAULT_MIN_BLOCK_SIZE  Suggested default value for the minItemsInBlock parameter to Lucene90BlockTreeTermsWriter(SegmentWriteState,PostingsWriterBase,int,int).
(package private) static BytesRef               EMPTY_BYTES_REF
(package private) FieldInfos                    fieldInfos
private java.util.List<ByteBuffersDataOutput>   fields
private IndexOutput                             indexOut
(package private) int                           maxDoc
(package private) int                           maxItemsInBlock
private IndexOutput                             metaOut
(package private) int                           minItemsInBlock
(package private) PostingsWriterBase            postingsWriter
private ByteBuffersDataOutput                   scratchBytes
private IntsRefBuilder                          scratchIntsRef
private IndexOutput                             termsOut
(package private) int                           version
-
Constructor Summary
Constructors
Constructor and Description
Lucene90BlockTreeTermsWriter(SegmentWriteState state, PostingsWriterBase postingsWriter, int minItemsInBlock, int maxItemsInBlock)
    Create a new writer.
Lucene90BlockTreeTermsWriter(SegmentWriteState state, PostingsWriterBase postingsWriter, int minItemsInBlock, int maxItemsInBlock, int version)
    Expert constructor that allows configuring the version, used for bw tests.
-
Method Summary
Modifier and Type                          Method                                                      Description
(package private) static java.lang.String  brToString(byte[] b)
(package private) static java.lang.String  brToString(BytesRef b)
void                                       close()
(package private) static long              encodeOutput(long fp, boolean hasTerms, boolean isFloor)
static void                                validateSettings(int minItemsInBlock, int maxItemsInBlock)  Throws IllegalArgumentException if any of these settings is invalid.
void                                       write(Fields fields, NormsProducer norms)                   Write all fields, terms and postings.
private static void                        writeBytesRef(DataOutput out, BytesRef bytes)
(package private) static void              writeMSBVLong(long l, DataOutput scratchBytes)              Encodes long value to variable length byte[], in MSB order.
Methods inherited from class org.apache.lucene.codecs.FieldsConsumer
merge
-
-
-
-
Field Detail
-
DEFAULT_MIN_BLOCK_SIZE
public static final int DEFAULT_MIN_BLOCK_SIZE
Suggested default value for the minItemsInBlock parameter to Lucene90BlockTreeTermsWriter(SegmentWriteState,PostingsWriterBase,int,int).
See Also: Constant Field Values
-
DEFAULT_MAX_BLOCK_SIZE
public static final int DEFAULT_MAX_BLOCK_SIZE
Suggested default value for the maxItemsInBlock parameter to Lucene90BlockTreeTermsWriter(SegmentWriteState,PostingsWriterBase,int,int).
See Also: Constant Field Values
-
metaOut
private final IndexOutput metaOut
-
termsOut
private final IndexOutput termsOut
-
indexOut
private final IndexOutput indexOut
-
maxDoc
final int maxDoc
-
minItemsInBlock
final int minItemsInBlock
-
maxItemsInBlock
final int maxItemsInBlock
-
version
final int version
-
postingsWriter
final PostingsWriterBase postingsWriter
-
fieldInfos
final FieldInfos fieldInfos
-
fields
private final java.util.List<ByteBuffersDataOutput> fields
-
scratchBytes
private final ByteBuffersDataOutput scratchBytes
-
scratchIntsRef
private final IntsRefBuilder scratchIntsRef
-
EMPTY_BYTES_REF
static final BytesRef EMPTY_BYTES_REF
-
closed
private boolean closed
-
-
Constructor Detail
-
Lucene90BlockTreeTermsWriter
public Lucene90BlockTreeTermsWriter(SegmentWriteState state, PostingsWriterBase postingsWriter, int minItemsInBlock, int maxItemsInBlock) throws java.io.IOException
Create a new writer. The number of items (terms or sub-blocks) per block will aim to be between minItemsInBlock and maxItemsInBlock, though in some cases the blocks may be smaller than the min.
Throws: java.io.IOException
-
Lucene90BlockTreeTermsWriter
public Lucene90BlockTreeTermsWriter(SegmentWriteState state, PostingsWriterBase postingsWriter, int minItemsInBlock, int maxItemsInBlock, int version) throws java.io.IOException
Expert constructor that allows configuring the version, used for bw tests.
Throws: java.io.IOException
-
-
Method Detail
-
validateSettings
public static void validateSettings(int minItemsInBlock, int maxItemsInBlock)
Throws IllegalArgumentException if any of these settings is invalid.
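This page does not list what counts as an invalid setting. The sketch below shows one plausible set of checks: at least two items per block, max not below min, and max at least 2*(min-1). These rules are an assumption, not confirmed by this page. The class BlockSizeSettingsSketch is hypothetical and does not call into Lucene.

```java
final class BlockSizeSettingsSketch {
    // ASSUMED rules; the real method's checks and messages may differ.
    static void validateSettings(int minItemsInBlock, int maxItemsInBlock) {
        if (minItemsInBlock <= 1) {
            throw new IllegalArgumentException(
                "minItemsInBlock must be >= 2; got " + minItemsInBlock);
        }
        if (minItemsInBlock > maxItemsInBlock) {
            throw new IllegalArgumentException(
                "maxItemsInBlock must be >= minItemsInBlock; got maxItemsInBlock="
                    + maxItemsInBlock + " minItemsInBlock=" + minItemsInBlock);
        }
        if (2 * (minItemsInBlock - 1) > maxItemsInBlock) {
            throw new IllegalArgumentException(
                "maxItemsInBlock must be at least 2*(minItemsInBlock-1); got maxItemsInBlock="
                    + maxItemsInBlock + " minItemsInBlock=" + minItemsInBlock);
        }
    }

    // Convenience wrapper for the example: true when no exception is thrown.
    static boolean isValid(int minItemsInBlock, int maxItemsInBlock) {
        try {
            validateSettings(minItemsInBlock, maxItemsInBlock);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("25-48 valid: " + isValid(25, 48));
        System.out.println("26-48 valid: " + isValid(26, 48));
    }
}
```

Under these assumed rules the documented default range of 25-48 passes, since 2*(25-1) equals 48.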
-
write
public void write(Fields fields, NormsProducer norms) throws java.io.IOException
Description copied from class: FieldsConsumer
Write all fields, terms and postings. This is the "pull" API, allowing you to iterate more than once over the postings, somewhat analogous to using a DOM API to traverse an XML tree.
Notes:
- You must compute index statistics, including each Term's docFreq and totalTermFreq, as well as the summary sumTotalTermFreq, sumDocFreq and docCount.
- You must skip terms that have no docs and fields that have no terms, even though the provided Fields API will expose them; this typically requires lazily writing the field or term until you've actually seen the first term or document.
- The provided Fields instance is limited: you cannot call any methods that return statistics/counts; you cannot pass a non-null live docs when pulling docs/positions enums.
Specified by: write in class FieldsConsumer
Throws: java.io.IOException
-
encodeOutput
static long encodeOutput(long fp, boolean hasTerms, boolean isFloor)
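This method is undocumented here. Judging from its signature, it packs a block's file pointer and two flags into a single long. The sketch below shows one way such a packing could work. The two-bit shift and the flag values are assumptions, and the class EncodeOutputSketch is hypothetical.

```java
final class EncodeOutputSketch {
    // ASSUMED layout: file pointer in the high bits, two flag bits at the bottom.
    static final int NUM_FLAG_BITS = 2;
    static final int FLAG_IS_FLOOR = 0x1;
    static final int FLAG_HAS_TERMS = 0x2;

    static long encodeOutput(long fp, boolean hasTerms, boolean isFloor) {
        if (fp < 0 || fp >= (1L << 62)) {
            throw new IllegalArgumentException("file pointer out of range: " + fp);
        }
        return (fp << NUM_FLAG_BITS)
            | (hasTerms ? FLAG_HAS_TERMS : 0)
            | (isFloor ? FLAG_IS_FLOOR : 0);
    }

    static long filePointer(long code) {
        return code >>> NUM_FLAG_BITS;
    }

    static boolean hasTerms(long code) {
        return (code & FLAG_HAS_TERMS) != 0;
    }

    static boolean isFloor(long code) {
        return (code & FLAG_IS_FLOOR) != 0;
    }

    public static void main(String[] args) {
        long code = encodeOutput(100L, true, false);
        System.out.println(code + " -> fp=" + filePointer(code)
            + " hasTerms=" + hasTerms(code) + " isFloor=" + isFloor(code));
    }
}
```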
-
brToString
static java.lang.String brToString(BytesRef b)
-
brToString
static java.lang.String brToString(byte[] b)
-
writeMSBVLong
static void writeMSBVLong(long l, DataOutput scratchBytes) throws java.io.IOException
Encodes long value to variable length byte[], in MSB order. Use FieldReader.readMSBVLong(org.apache.lucene.store.DataInput) to decode.
Package private for testing.
Throws: java.io.IOException
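To make "variable length, in MSB order" concrete, here is a standalone sketch of one such encoding with a matching decoder. It packs 7 bits per byte, most significant group first, with the high bit marking "more bytes follow". The class MsbVLongSketch is hypothetical and writes to a byte array instead of a DataOutput, so the details may differ from the real method.

```java
import java.io.ByteArrayOutputStream;

final class MsbVLongSketch {
    // Encodes a non-negative long, most significant 7-bit group first.
    static byte[] writeMSBVLong(long l) {
        if (l < 0) {
            throw new IllegalArgumentException("value must be non-negative: " + l);
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Number of 7-bit groups needed; zero still takes one byte.
        int bytesNeeded = (Long.SIZE - Long.numberOfLeadingZeros(l) - 1) / 7 + 1;
        // Left-align the groups so the next one to emit always sits in the top 7 bits.
        l <<= Long.SIZE - bytesNeeded * 7;
        for (int i = 1; i < bytesNeeded; i++) {
            out.write((int) (((l >>> 57) & 0x7FL) | 0x80L));
            l <<= 7;
        }
        out.write((int) ((l >>> 57) & 0x7FL));
        return out.toByteArray();
    }

    // Decodes a value written by writeMSBVLong, starting at index 0.
    static long readMSBVLong(byte[] in) {
        long l = 0L;
        int pos = 0;
        while (true) {
            byte b = in[pos++];
            l = (l << 7) | (b & 0x7FL);
            if ((b & 0x80) == 0) {
                return l;
            }
        }
    }

    public static void main(String[] args) {
        byte[] encoded = writeMSBVLong(0x7FFFL);
        StringBuilder hex = new StringBuilder();
        for (byte b : encoded) {
            hex.append(String.format("%02X ", b & 0xFF));
        }
        System.out.println(hex.toString().trim() + " -> " + readMSBVLong(encoded));
    }
}
```

With the most significant bits first, numerically close values tend to start with the same bytes, which may help when the encoded bytes are stored in a prefix-sharing structure such as an FST.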
-
close
public void close() throws java.io.IOException
Specified by: close in interface java.lang.AutoCloseable
Specified by: close in interface java.io.Closeable
Specified by: close in class FieldsConsumer
Throws: java.io.IOException
-
writeBytesRef
private static void writeBytesRef(DataOutput out, BytesRef bytes) throws java.io.IOException
Throws: java.io.IOException
-
-