Class FSTCompiler<T>


  • public class FSTCompiler<T>
    extends java.lang.Object
    Builds a minimal FST (maps an IntsRef term to an arbitrary output) from pre-sorted terms with outputs. The FST becomes an FSA if you use NoOutputs. The FST is written on-the-fly into a compact serialized format byte array, which can be saved to / loaded from a Directory or used directly for traversal. The FST is always finite (no cycles).

    NOTE: The algorithm is described at http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.24.3698

    The parameterized type T is the output type. See the subclasses of Outputs.

    FSTs larger than 2.1GB are now possible (as of Lucene 4.2). FSTs containing more than 2.1B nodes are also now possible, however they cannot be packed.

    • Constructor Detail

      • FSTCompiler

        private FSTCompiler​(FST.INPUT_TYPE inputType,
                            double suffixRAMLimitMB,
                            Outputs<T> outputs,
                            boolean allowFixedLengthArcs,
                            int bytesPageBits,
                            float directAddressingMaxOversizingFactor,
                            int version)
    • Method Detail

      • getDirectAddressingMaxOversizingFactor

        public float getDirectAddressingMaxOversizingFactor()
      • getNodeCount

        public long getNodeCount()
      • getArcCount

        public long getArcCount()
      • getMappedStateCount

        public long getMappedStateCount()
      • writeLabel

        private void writeLabel​(DataOutput out,
                                int v)
                         throws java.io.IOException
        Throws:
        java.io.IOException
      • shouldExpandNodeWithFixedLengthArcs

        private boolean shouldExpandNodeWithFixedLengthArcs​(FSTCompiler.UnCompiledNode<T> node)
        Returns whether the given node should be expanded with fixed length arcs. Nodes will be expanded depending on their depth (distance from the root node) and their number of arcs.

        Nodes with fixed length arcs use more space, because they encode all arcs with a fixed number of bytes, but they allow either binary search or direct addressing on the arcs (instead of linear scan) on lookup by arc label.

      • shouldExpandNodeWithDirectAddressing

        private boolean shouldExpandNodeWithDirectAddressing​(FSTCompiler.UnCompiledNode<T> nodeIn,
                                                             int numBytesPerArc,
                                                             int maxBytesPerArcWithoutLabel,
                                                             int labelRange)
        Returns whether the given node should be expanded with direct addressing instead of binary search.

        Prefer direct addressing for performance if it does not oversize binary search byte size too much, so that the arcs can be directly addressed by label.

        See Also:
        getDirectAddressingMaxOversizingFactor()
      • writeNodeForBinarySearch

        private void writeNodeForBinarySearch​(FSTCompiler.UnCompiledNode<T> nodeIn,
                                              long startAddress,
                                              int maxBytesPerArc)
      • writeNodeForDirectAddressingOrContinuous

        private void writeNodeForDirectAddressingOrContinuous​(FSTCompiler.UnCompiledNode<T> nodeIn,
                                                              long startAddress,
                                                              int maxBytesPerArcWithoutLabel,
                                                              int labelRange,
                                                              boolean continuous)
      • freezeTail

        private void freezeTail​(int prefixLenPlus1)
                         throws java.io.IOException
        Throws:
        java.io.IOException
      • add

        public void add​(IntsRef input,
                        T output)
                 throws java.io.IOException
        Add the next input/output pair. The provided input must be sorted after the previous one according to IntsRef.compareTo(org.apache.lucene.util.IntsRef). It's also OK to add the same input twice in a row with different outputs, as long as Outputs implements the Outputs.merge(T, T) method. Note that input is fully consumed after this method is returned (so caller is free to reuse), but output is not. So if your outputs are changeable (eg ByteSequenceOutputs or IntSequenceOutputs) then you cannot reuse across calls.
        Throws:
        java.io.IOException
      • setEmptyOutput

        void setEmptyOutput​(T v)
      • finish

        void finish​(long newStartNode)
      • validOutput

        private boolean validOutput​(T output)
      • compile

        public FST<T> compile()
                       throws java.io.IOException
        Returns final FST. NOTE: this will return null if nothing is accepted by the FST.
        Throws:
        java.io.IOException
      • fstRamBytesUsed

        public long fstRamBytesUsed()
      • fstSizeInBytes

        public long fstSizeInBytes()