Class WordStorage


  • class WordStorage
    extends java.lang.Object
    A data structure for memory-efficient word storage and fast lookup/enumeration. Each dictionary entry is stored as:
    1. the last character
    2. pointer to a similar entry for the prefix (all characters except the last one)
    3. value data: a list of ints representing word flags and morphological data, and a pointer to hash collisions, if any
    There's only one entry for each prefix, so it's like a trie/FST, but a reversed one: each node points to a single previous node instead of several following ones. For example, "abc" and "abd" point to the same prefix entry "ab" which points to "a" which points to 0.

    The entries are stored in a contiguous byte array, identified by their offsets, using the VINT format (see DataOutput.writeVInt(int)) for compression.
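    The reversed-trie idea described above can be sketched as follows. This is a minimal illustration, not the class's actual implementation: the name ReversedTrieSketch, its list-based node storage, and the nodeByPrefix helper map are all assumptions made for the example, and it ignores the byte-array encoding entirely.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: each node keeps its last character and a link to its prefix node. */
class ReversedTrieSketch {
  // Node 0 stands for the empty prefix, mirroring "points to 0" in the description.
  private final List<Character> lastChar = new ArrayList<>();
  private final List<Integer> prefix = new ArrayList<>();
  // Hypothetical helper index that guarantees one node per distinct prefix.
  private final Map<String, Integer> nodeByPrefix = new HashMap<>();

  ReversedTrieSketch() {
    lastChar.add('\0');
    prefix.add(0);
    nodeByPrefix.put("", 0);
  }

  /** Adds a word, creating missing prefix nodes, and returns the id of the word's node. */
  int add(String word) {
    int node = 0;
    for (int i = 1; i <= word.length(); i++) {
      String p = word.substring(0, i);
      Integer existing = nodeByPrefix.get(p);
      if (existing == null) {
        lastChar.add(word.charAt(i - 1));
        prefix.add(node); // single back-pointer to the shorter prefix
        existing = lastChar.size() - 1;
        nodeByPrefix.put(p, existing);
      }
      node = existing;
    }
    return node;
  }

  int prefixOf(int node) {
    return prefix.get(node);
  }

  /** Rebuilds the word by following prefix links back to node 0. */
  String wordAt(int node) {
    StringBuilder sb = new StringBuilder();
    while (node != 0) {
      sb.append(lastChar.get(node));
      node = prefix.get(node);
    }
    return sb.reverse().toString();
  }

  public static void main(String[] args) {
    ReversedTrieSketch trie = new ReversedTrieSketch();
    int abc = trie.add("abc");
    int abd = trie.add("abd");
    // Both words share the single node for "ab".
    System.out.println(trie.prefixOf(abc) == trie.prefixOf(abd));
    System.out.println(trie.wordAt(abd));
  }
}
```

    Because every node has exactly one back-pointer, reading a word means walking towards the root and reversing the collected characters; there is no per-node child table to store.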
    • Field Summary

      Fields 
      Modifier and Type Field Description
      private static int COLLISION_MASK  
      private boolean hasCustomMorphData  
      private int[] hashTable
      A map from a word's hash (modulo the array's length) to an int containing, in the lower OFFSET_BITS, the offset in wordData of the last entry with this hash and, in the remaining highest bits, the COLLISION+SUGGESTIBLE+LENGTH info for that entry.
      private static int MAX_STORED_LENGTH  
      private int maxEntryLength  
      private static int OFFSET_BITS  
      private static int OFFSET_MASK  
      private static int SUGGESTIBLE_MASK  
      private byte[] wordData
      An array of word entries, each beginning with a VINT holding the word's last character and a VINT delta pointer to the entry for the same word without the last character.
    • Constructor Summary

      Constructors 
      Modifier Constructor Description
      private WordStorage​(int maxEntryLength, boolean hasCustomMorphData, int[] hashTable, byte[] wordData)  
    • Method Summary

      All Methods Static Methods Instance Methods Concrete Methods 
      Modifier and Type Method Description
      private static boolean hasCollision​(int mask)  
      private boolean hasLength​(int mask, int length)  
      private static boolean hasLengthInRange​(int mask, int minLength, int maxLength)  
      private static boolean hasSuggestibleEntries​(int mask)  
      private boolean isSameString​(char[] word, int offset, int length, int dataPos, ByteArrayDataInput in)  
      (package private) IntsRef lookupWord​(char[] word, int offset, int length)  
      (package private) void processAllWords​(int minLength, int maxLength, boolean suggestibleOnly, java.util.function.BiConsumer<CharsRef,​java.util.function.Supplier<IntsRef>> processor)  
      (package private) void processSuggestibleWords​(int minLength, int maxLength, java.util.function.BiConsumer<CharsRef,​java.util.function.Supplier<IntsRef>> processor)
      Calls the processor for every dictionary entry with length between minLength and maxLength, both ends inclusive, and at least one suggestible alternative (without NOSUGGEST, FORBIDDENWORD or ONLYINCOMPOUND flags).
      private void readForms​(IntsRef forms, ByteArrayDataInput in, int length)  
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • maxEntryLength

        private final int maxEntryLength
      • hasCustomMorphData

        private final boolean hasCustomMorphData
      • hashTable

        private final int[] hashTable
        A map from word's hash (modulo array's length) into an int containing:
        • lower OFFSET_BITS: the offset in wordData of the last entry with this hash
        • the remaining highest bits: COLLISION+SUGGESTIBLE+LENGTH info for that entry, i.e. one bit indicating whether there are other entries with the same hash, one bit indicating whether this entry makes sense to be used in suggestions, and the length of the entry in chars, or MAX_STORED_LENGTH if the length exceeds that limit (next highest bits)
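        The packing described above might look like the sketch below. The constant values here (a 25-bit offset, one collision bit, one suggestible bit, five length bits) are assumptions chosen so that everything fits in 32 bits; this page does not state the real values, and pack, offset, mask and storedLength are hypothetical helper names.

```java
/** Illustrative bit packing for one hash table slot; all constant values are assumptions. */
class HashEntrySketch {
  static final int OFFSET_BITS = 25; // assumed width of the offset part
  static final int OFFSET_MASK = (1 << OFFSET_BITS) - 1;
  static final int COLLISION_MASK = 0x40; // assumed: top bit of the 7-bit info part
  static final int SUGGESTIBLE_MASK = 0x20; // assumed: next bit
  static final int MAX_STORED_LENGTH = SUGGESTIBLE_MASK - 1; // assumed: 31, the low five bits

  /** Combines the offset with the COLLISION+SUGGESTIBLE+LENGTH info into a single int. */
  static int pack(int offset, boolean collision, boolean suggestible, int length) {
    int mask = Math.min(length, MAX_STORED_LENGTH); // longer words are capped
    if (collision) mask |= COLLISION_MASK;
    if (suggestible) mask |= SUGGESTIBLE_MASK;
    return (mask << OFFSET_BITS) | (offset & OFFSET_MASK);
  }

  static int offset(int entry) {
    return entry & OFFSET_MASK;
  }

  /** Unsigned shift, because the collision bit lands in the sign bit of the int. */
  static int mask(int entry) {
    return entry >>> OFFSET_BITS;
  }

  static boolean hasCollision(int mask) {
    return (mask & COLLISION_MASK) != 0;
  }

  static boolean hasSuggestibleEntries(int mask) {
    return (mask & SUGGESTIBLE_MASK) != 0;
  }

  static int storedLength(int mask) {
    return mask & MAX_STORED_LENGTH;
  }

  public static void main(String[] args) {
    int entry = pack(1234, true, false, 5);
    int mask = mask(entry);
    System.out.println(offset(entry) + " " + hasCollision(mask) + " " + storedLength(mask));
  }
}
```

        Keeping the length and flags next to the offset lets a lookup reject a slot without reading wordData at all when the length or suggestibility does not match.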
      • wordData

        private final byte[] wordData
        An array of word entries:
        • VINT: the word's last character
        • VINT: a delta pointer to the entry for the same word without the last character. Precisely, it's the difference of this entry's start and the prefix's entry start. 0 for single-character entries
        • (Optional, for hash-colliding entries only)
          • BYTE: COLLISION+SUGGESTIBLE+LENGTH info (see hashTable) for the previous entry with the same hash
          • VINT: (delta) pointer to the previous entry
        • (Optional, for non-leaf entries only) VINT+: word form data, returned from lookupWord(char[], int, int), preceded by its length
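        A simplified sketch of the first two fields of this layout follows. It writes only the last character and the delta pointer, omitting the optional collision and word form data, and it hand-rolls the variable-length int encoding (7 bits per byte, high bit as continuation flag) instead of using Lucene's classes. WordDataSketch, addEntry and readWord are hypothetical names, and reserving offset 0 is an assumption made so that no real entry starts there.

```java
import java.io.ByteArrayOutputStream;

/** Illustrative only: entries of the form VINT(last char) + VINT(delta to prefix entry). */
class WordDataSketch {
  private final ByteArrayOutputStream out = new ByteArrayOutputStream();

  WordDataSketch() {
    out.write(0); // assumption: offset 0 is reserved and never holds an entry
  }

  /** Writes 7 bits per byte, low bits first; a set high bit means more bytes follow. */
  private void writeVInt(int i) {
    while ((i & ~0x7F) != 0) {
      out.write((i & 0x7F) | 0x80);
      i >>>= 7;
    }
    out.write(i);
  }

  /** Appends an entry and returns its start offset; pass prefixStart = 0 for one-char words. */
  int addEntry(char lastChar, int prefixStart) {
    int start = out.size();
    writeVInt(lastChar);
    // The delta is this entry's start minus the prefix entry's start, or 0 if there is no prefix.
    writeVInt(prefixStart == 0 ? 0 : start - prefixStart);
    return start;
  }

  byte[] toBytes() {
    return out.toByteArray();
  }

  /** Follows delta pointers back to a single-character entry and returns the whole word. */
  static String readWord(byte[] data, int entryStart) {
    StringBuilder sb = new StringBuilder();
    int pos = entryStart;
    while (true) {
      int[] cursor = {pos};
      char c = (char) readVInt(data, cursor);
      int delta = readVInt(data, cursor);
      sb.append(c);
      if (delta == 0) break;
      pos -= delta;
    }
    return sb.reverse().toString();
  }

  private static int readVInt(byte[] data, int[] cursor) {
    int value = 0;
    int shift = 0;
    byte b;
    do {
      b = data[cursor[0]++];
      value |= (b & 0x7F) << shift;
      shift += 7;
    } while ((b & 0x80) != 0);
    return value;
  }

  public static void main(String[] args) {
    WordDataSketch w = new WordDataSketch();
    int a = w.addEntry('a', 0);
    int ab = w.addEntry('b', a);
    int abc = w.addEntry('c', ab);
    System.out.println(readWord(w.toBytes(), abc));
  }
}
```

        Since a prefix entry is normally written shortly before the entries that extend it, the deltas tend to be small and usually fit in a single VINT byte.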
    • Constructor Detail

      • WordStorage

        private WordStorage​(int maxEntryLength,
                            boolean hasCustomMorphData,
                            int[] hashTable,
                            byte[] wordData)
    • Method Detail

      • lookupWord

        IntsRef lookupWord​(char[] word,
                           int offset,
                           int length)
      • hasCollision

        private static boolean hasCollision​(int mask)
      • hasSuggestibleEntries

        private static boolean hasSuggestibleEntries​(int mask)
      • processSuggestibleWords

        void processSuggestibleWords​(int minLength,
                                     int maxLength,
                                     java.util.function.BiConsumer<CharsRef,​java.util.function.Supplier<IntsRef>> processor)
        Calls the processor for every dictionary entry with length between minLength and maxLength, both ends inclusive, and at least one suggestible alternative (without NOSUGGEST, FORBIDDENWORD or ONLYINCOMPOUND flags). Note that the callback arguments (word and forms) are reused between calls, so the processor can modify them in any way but must not save them for later.
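        The reuse contract can be illustrated with a stand-alone sketch. It does not call this class or use Lucene's CharsRef and IntsRef; CharBuf, processWords and collect are hypothetical stand-ins that only mimic the "same object on every call" behaviour, and the int[] form data is a placeholder.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;
import java.util.function.Supplier;

/** Illustrative only: why a processor has to copy reused callback arguments. */
class ReusedCallbackArgsSketch {
  /** Hypothetical stand-in for a reusable character buffer. */
  static final class CharBuf {
    char[] chars = new char[16];
    int length;

    @Override
    public String toString() {
      return new String(chars, 0, length);
    }
  }

  /** Hypothetical enumerator that refills one shared buffer for every word. */
  static void processWords(List<String> words, BiConsumer<CharBuf, Supplier<int[]>> processor) {
    CharBuf shared = new CharBuf();
    int[] forms = new int[1];
    for (String w : words) {
      if (shared.chars.length < w.length()) {
        shared.chars = new char[w.length()];
      }
      w.getChars(0, w.length(), shared.chars, 0);
      shared.length = w.length();
      forms[0] = w.length(); // placeholder for real form data
      processor.accept(shared, () -> forms);
    }
  }

  /** Safe processor: copies the word into an immutable String before keeping it. */
  static List<String> collect(List<String> words) {
    List<String> saved = new ArrayList<>();
    processWords(words, (word, forms) -> saved.add(word.toString()));
    return saved;
  }

  public static void main(String[] args) {
    List<String> input = new ArrayList<>();
    input.add("ab");
    input.add("cde");
    System.out.println(collect(input));
  }
}
```

        A processor that stored the CharBuf reference itself would end up with every saved element showing the last word enumerated, which is the failure the note above warns about.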
      • processAllWords

        void processAllWords​(int minLength,
                             int maxLength,
                             boolean suggestibleOnly,
                             java.util.function.BiConsumer<CharsRef,​java.util.function.Supplier<IntsRef>> processor)
      • hasLength

        private boolean hasLength​(int mask,
                                  int length)
      • hasLengthInRange

        private static boolean hasLengthInRange​(int mask,
                                                int minLength,
                                                int maxLength)
      • isSameString

        private boolean isSameString​(char[] word,
                                     int offset,
                                     int length,
                                     int dataPos,
                                     ByteArrayDataInput in)