Class AbstractDictionary

  • Direct Known Subclasses:
    BigramDictionary, WordDictionary

    abstract class AbstractDictionary
    extends java.lang.Object
    SmartChineseAnalyzer abstract dictionary implementation.

    Contains methods for dealing with GB2312 encoding.

    • Field Summary

      Fields 
      Modifier and Type Field Description
      static int CHAR_NUM_IN_FILE
      Dictionary data contains 6768 Chinese characters with frequency statistics.
      static int GB2312_CHAR_NUM
      Last Chinese Character in GB2312 (87 * 94).
      static int GB2312_FIRST_CHAR
      First Chinese Character in GB2312 (15 * 94). Characters in GB2312 are arranged in a 94 * 94 grid; rows 0-14 are unassigned or punctuation.
    • Method Summary

      Modifier and Type Method Description
      java.lang.String getCCByGB2312Id(int ccid)
      Transcode from GB2312 ID to Unicode
      short getGB2312Id(char ch)
      Transcode from Unicode to GB2312
      long hash1(char c)
      32-bit FNV Hash Function
      long hash1(char[] carray)
      32-bit FNV Hash Function
      int hash2(char c)
      djb2 hash algorithm. This algorithm (k=33) was first reported by Dan Bernstein many years ago in comp.lang.c.
      int hash2(char[] carray)
      djb2 hash algorithm. This algorithm (k=33) was first reported by Dan Bernstein many years ago in comp.lang.c.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • GB2312_FIRST_CHAR

        public static final int GB2312_FIRST_CHAR
        First Chinese Character in GB2312 (15 * 94). Characters in GB2312 are arranged in a 94 * 94 grid; rows 0-14 are unassigned or punctuation.
        See Also:
        Constant Field Values
      • GB2312_CHAR_NUM

        public static final int GB2312_CHAR_NUM
        Last Chinese Character in GB2312 (87 * 94). Characters in GB2312 are arranged in a 94 * 94 grid; rows 88-94 are unassigned.
        See Also:
        Constant Field Values
      • CHAR_NUM_IN_FILE

        public static final int CHAR_NUM_IN_FILE
        Dictionary data contains 6768 Chinese characters with frequency statistics.
        See Also:
        Constant Field Values
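
The three constants are related by simple grid arithmetic. The sketch below assumes that a GB2312 ID is a zero-based linear index, row * 94 + column, which is consistent with the "(15 * 94)" and "(87 * 94)" annotations above. The class name and the toId, rowOf and colOf helpers are hypothetical and not part of AbstractDictionary.

```java
/** Sketch of the GB2312 grid arithmetic implied by the field descriptions. */
final class Gb2312GridSketch {
    // Values taken from the field descriptions: 15 * 94 and 87 * 94.
    static final int GB2312_FIRST_CHAR = 15 * 94; // 1410
    static final int GB2312_CHAR_NUM = 87 * 94;   // 8178
    static final int CHAR_NUM_IN_FILE = 6768;

    /** Hypothetical helper: flatten a zero-based (row, column) cell into a linear id. */
    static int toId(int row, int col) {
        return row * 94 + col;
    }

    /** Hypothetical helper: zero-based row of a linear id. */
    static int rowOf(int id) {
        return id / 94;
    }

    /** Hypothetical helper: zero-based column of a linear id. */
    static int colOf(int id) {
        return id % 94;
    }

    public static void main(String[] args) {
        System.out.println(GB2312_FIRST_CHAR);                   // 1410
        System.out.println(GB2312_CHAR_NUM);                     // 8178
        System.out.println(GB2312_CHAR_NUM - GB2312_FIRST_CHAR); // 6768
    }
}
```

Under this assumption the span between the two bounds is (87 - 15) * 94 = 6768 cells, which matches the count quoted for CHAR_NUM_IN_FILE.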
    • Constructor Detail

      • AbstractDictionary

        AbstractDictionary()
    • Method Detail

      • getCCByGB2312Id

        public java.lang.String getCCByGB2312Id(int ccid)
        Transcode from GB2312 ID to Unicode

        GB2312 is divided into a 94 * 94 grid, containing 7445 characters consisting of 6763 Chinese characters and 682 symbols. Some regions are unassigned (reserved).

        Parameters:
        ccid - GB2312 id
        Returns:
        Unicode String
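
One way such a transcoding could work is sketched below. It is not the library's implementation. It assumes the ID is a zero-based row * 94 + column index and that the runtime provides the "GB2312" charset; the class name, the helper names and the out-of-range behavior (returning an empty string) are illustrative assumptions.

```java
import java.nio.charset.Charset;

/** Sketch: map a linear GB2312 id to a Unicode string via the EUC-CN byte pair. */
final class Gb2312DecodeSketch {
    /** Hypothetical helper: split a linear id into the two EUC-CN bytes. */
    static byte[] toBytes(int ccid) {
        int row = ccid / 94;
        int col = ccid % 94;
        // EUC-CN stores each zero-based coordinate offset by 0xA1 (161).
        return new byte[] {(byte) (row + 0xA1), (byte) (col + 0xA1)};
    }

    /** Returns the decoded string, or "" when the id is outside the 94 * 94 grid (assumed policy). */
    static String decode(int ccid) {
        if (ccid < 0 || ccid >= 94 * 94) {
            return "";
        }
        // Charset.forName throws UnsupportedCharsetException if GB2312 is unavailable.
        return new String(toBytes(ccid), Charset.forName("GB2312"));
    }

    public static void main(String[] args) {
        // Id 15 * 94 maps to bytes 0xB0 0xA1, which decode to U+554A.
        System.out.println(decode(15 * 94));
    }
}
```

IDs that fall in unassigned cells decode to whatever replacement the charset decoder produces, so callers may want to check the result.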
      • getGB2312Id

        public short getGB2312Id(char ch)
        Transcode from Unicode to GB2312
        Parameters:
        ch - input character in Unicode, or character in Basic Latin range.
        Returns:
        position in GB2312
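
The reverse direction can be sketched the same way. This is an illustration, not the library's code: it assumes the position is the zero-based index row * 94 + column, that the "GB2312" charset is available, and that characters which do not encode to two bytes (such as Basic Latin) are reported as -1. All names below are hypothetical.

```java
import java.nio.charset.Charset;

/** Sketch: map a Unicode character to its linear position in the GB2312 grid. */
final class Gb2312EncodeSketch {
    /** Hypothetical helper: turn the two EUC-CN bytes back into a linear id. */
    static int idFromBytes(byte hi, byte lo) {
        int row = (hi & 0xFF) - 0xA1; // undo the 0xA1 offset on the row byte
        int col = (lo & 0xFF) - 0xA1; // undo the 0xA1 offset on the column byte
        return row * 94 + col;
    }

    /** Returns the grid position of ch, or -1 when ch does not encode to two bytes (assumed policy). */
    static int positionOf(char ch) {
        byte[] bytes = String.valueOf(ch).getBytes(Charset.forName("GB2312"));
        if (bytes.length != 2) {
            return -1; // single-byte results cover Basic Latin and unmappable characters
        }
        return idFromBytes(bytes[0], bytes[1]);
    }

    public static void main(String[] args) {
        System.out.println(positionOf('\u554A')); // 1410, i.e. 15 * 94
        System.out.println(positionOf('a'));      // -1
    }
}
```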
      • hash1

        public long hash1(char c)
        32-bit FNV Hash Function
        Parameters:
        c - input character
        Returns:
        hashcode
      • hash1

        public long hash1(char[] carray)
        32-bit FNV Hash Function
        Parameters:
        carray - character array
        Returns:
        hashcode
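
For orientation, a textbook 32-bit FNV-1a hash looks roughly like the sketch below. The description above does not say which FNV variant or constants hash1 uses, so this should be read as a generic illustration rather than the method's actual body. The byte order used to feed each char (low byte, then high byte) and all names are assumptions.

```java
/** Sketch of the standard 32-bit FNV-1a hash; not taken from the library. */
final class Fnv32Sketch {
    private static final int OFFSET_BASIS = 0x811C9DC5; // 2166136261
    private static final int PRIME = 0x01000193;        // 16777619

    /** FNV-1a over raw bytes; returns the unsigned 32-bit hash widened to a long. */
    static long fnv1a(byte[] data) {
        int hash = OFFSET_BASIS;
        for (byte b : data) {
            hash ^= (b & 0xFF); // mix in the next byte
            hash *= PRIME;      // int overflow wraps modulo 2^32
        }
        return hash & 0xFFFFFFFFL;
    }

    /** Hypothetical adapter: feed each char as its low byte followed by its high byte. */
    static long fnv1a(char[] chars) {
        byte[] data = new byte[chars.length * 2];
        for (int i = 0; i < chars.length; i++) {
            data[2 * i] = (byte) (chars[i] & 0xFF);
            data[2 * i + 1] = (byte) (chars[i] >>> 8);
        }
        return fnv1a(data);
    }
}
```

Returning a long keeps the 32-bit result non-negative, which may be why the documented methods use long as their return type.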
      • hash2

        public int hash2(char c)
        djb2 hash algorithm. This algorithm (k=33) was first reported by Dan Bernstein many years ago in comp.lang.c. Another version of this algorithm (now favored by Bernstein) uses XOR: hash(i) = hash(i - 1) * 33 ^ str[i]. The magic of the number 33 (why it works better than many other constants, prime or not) has never been adequately explained.
        Parameters:
        c - character
        Returns:
        hashcode
      • hash2

        public int hash2(char[] carray)
        djb2 hash algorithm. This algorithm (k=33) was first reported by Dan Bernstein many years ago in comp.lang.c. Another version of this algorithm (now favored by Bernstein) uses XOR: hash(i) = hash(i - 1) * 33 ^ str[i]. The magic of the number 33 (why it works better than many other constants, prime or not) has never been adequately explained.
        Parameters:
        carray - character array
        Returns:
        hashcode
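
The two djb2 forms mentioned above can be sketched as follows. The seed 5381 is the value conventionally used with djb2 and is not stated in this page; how hash2 splits a char into bytes is also not documented here, so the sketch simply mixes in each char whole. Class and method names are hypothetical.

```java
/** Sketch of the classic djb2 hash and its XOR variant; not taken from the library. */
final class Djb2Sketch {
    /** Additive form: hash = hash * 33 + c, seeded with 5381 (conventional seed). */
    static int djb2(char[] chars) {
        int hash = 5381;
        for (char c : chars) {
            hash = ((hash << 5) + hash) + c; // (hash << 5) + hash == hash * 33
        }
        return hash;
    }

    /** XOR form from the description: hash(i) = hash(i - 1) * 33 ^ str[i]. */
    static int djb2Xor(char[] chars) {
        int hash = 5381;
        for (char c : chars) {
            hash = (hash * 33) ^ c; // multiplication binds tighter than XOR
        }
        return hash;
    }

    public static void main(String[] args) {
        System.out.println(djb2("a".toCharArray()));    // 177670
        System.out.println(djb2Xor("a".toCharArray())); // 177604
    }
}
```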