Class WordDictionary
- java.lang.Object
-
- org.apache.lucene.analysis.cn.smart.hhmm.AbstractDictionary
-
- org.apache.lucene.analysis.cn.smart.hhmm.WordDictionary
-
class WordDictionary extends AbstractDictionary
SmartChineseAnalyzer Word Dictionary
-
-
Field Summary
Modifier and Type | Field | Description
private char[] | charIndexTable |
static int | PRIME_INDEX_LENGTH | Large prime number for hash function
private static WordDictionary | singleInstance |
private short[] | wordIndexTable | wordIndexTable guarantees to hash all Chinese characters in Unicode into PRIME_INDEX_LENGTH array.
private char[][][] | wordItem_charArrayTable | To avoid taking too much space, the data structure needed to store the lexicon requires two multidimensional arrays to store word and frequency.
private int[][] | wordItem_frequencyTable |
-
Fields inherited from class org.apache.lucene.analysis.cn.smart.hhmm.AbstractDictionary
CHAR_NUM_IN_FILE, GB2312_CHAR_NUM, GB2312_FIRST_CHAR
-
-
Constructor Summary
Modifier | Constructor | Description
private | WordDictionary() |
-
Method Summary
Modifier and Type | Method | Description
private void | expandDelimiterData() | The original lexicon puts all information with punctuation into a chart (from 1 to 3755).
private int | findInTable(short knownHashIndex, char[] charArray) | Look up the word given as a char array and return its position in the word list.
private short | getAvaliableTableIndex(char c) |
int | getFrequency(char[] charArray) | Get the frequency of a word from the dictionary.
static WordDictionary | getInstance() | Get the singleton dictionary instance.
int | getPrefixMatch(char[] charArray) | Find the first word in the dictionary that starts with the supplied prefix.
int | getPrefixMatch(char[] charArray, int knownStart) | Find the nth word in the dictionary that starts with the supplied prefix.
private short | getWordItemTableIndex(char c) |
boolean | isEqual(char[] charArray, int itemIndex) | Return true if the dictionary entry at itemIndex for table charArray[0] is charArray.
void | load() | Load coredict.mem internally from the jar file.
void | load(java.lang.String dctFileRoot) | Attempt to load the dictionary from the provided directory, first trying coredict.mem and falling back on coredict.dct.
private boolean | loadFromObj(java.nio.file.Path serialObj) |
private void | loadFromObjectInputStream(java.io.InputStream serialObjectInputStream) |
private int | loadMainDataFromFile(java.lang.String dctFilePath) | Load the data file into this WordDictionary.
private void | mergeSameWords() |
private void | saveToObj(java.nio.file.Path serialObj) |
private boolean | setTableIndex(char c, int j) |
private void | sortEachItems() |
-
Methods inherited from class org.apache.lucene.analysis.cn.smart.hhmm.AbstractDictionary
getCCByGB2312Id, getGB2312Id, hash1, hash1, hash2, hash2
-
-
-
-
Field Detail
-
singleInstance
private static WordDictionary singleInstance
-
PRIME_INDEX_LENGTH
public static final int PRIME_INDEX_LENGTH
Large prime number for hash function.
- See Also:
- Constant Field Values
-
wordIndexTable
private short[] wordIndexTable
wordIndexTable hashes every Chinese character in Unicode into an array of length PRIME_INDEX_LENGTH. Collisions are possible, but in practice this program only handles the 6768 characters found in GB2312 plus some ASCII characters. To keep lookups accurate, the original character is therefore also retained in charIndexTable.
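The pairing of an index table with a table of original characters can be illustrated with a small open-addressing sketch. This is not the Lucene implementation: the class name `CharIndexSketch`, the table size of 13, the linear probing strategy, and the use of `'\0'` as the empty marker are all assumptions made for illustration. The inherited `hash1`/`hash2` methods suggest the real class probes differently.

```java
import java.util.Arrays;

// Hypothetical sketch of a char -> table-index map; not the Lucene code.
final class CharIndexSketch {
    // Assumed tiny prime for illustration only.
    static final int TABLE_SIZE = 13;

    // slot -> index of the word table for the char stored in that slot
    private final short[] wordIndexTable = new short[TABLE_SIZE];
    // slot -> the original char, kept so that collisions can be detected
    private final char[] charIndexTable = new char[TABLE_SIZE];

    CharIndexSketch() {
        Arrays.fill(wordIndexTable, (short) -1);
    }

    // Linear probing (an assumption): returns the slot holding c,
    // the first empty slot on its probe path, or -1 if the table is full.
    private int findSlot(char c) {
        int start = c % TABLE_SIZE;
        for (int i = 0; i < TABLE_SIZE; i++) {
            int slot = (start + i) % TABLE_SIZE;
            if (charIndexTable[slot] == 0 || charIndexTable[slot] == c) {
                return slot;
            }
        }
        return -1;
    }

    boolean put(char c, short tableIndex) {
        int slot = findSlot(c);
        if (slot < 0) {
            return false;
        }
        charIndexTable[slot] = c;
        wordIndexTable[slot] = tableIndex;
        return true;
    }

    // Returns the table index registered for c, or -1 if c is unknown.
    short get(char c) {
        int slot = findSlot(c);
        if (slot < 0 || charIndexTable[slot] != c) {
            return -1;
        }
        return wordIndexTable[slot];
    }

    public static void main(String[] args) {
        CharIndexSketch table = new CharIndexSketch();
        table.put('\u4e2d', (short) 0);
        table.put('\u4e3a', (short) 1); // same start slot as the char above (modulo 13)
        System.out.println(table.get('\u4e2d') + " " + table.get('\u4e3a')); // prints "0 1"
    }
}
```

Without the stored characters, the two colliding characters in `main` would be indistinguishable; comparing against `charIndexTable` is what lets the lookup skip past the wrong entry.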
-
charIndexTable
private char[] charIndexTable
-
wordItem_charArrayTable
private char[][][] wordItem_charArrayTable
To avoid taking too much space, the lexicon is stored in two multidimensional arrays, one for words and one for frequencies. Each word is placed in a char[], where each char represents a Chinese character or other symbol. Each frequency is stored as an int. The two arrays correspond to each other one-to-one: wordItem_charArrayTable[i][j] looks up a word in the lexicon, and wordItem_frequencyTable[i][j] looks up the corresponding frequency.
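The one-to-one correspondence between the two arrays can be shown with a toy example. The class name `ParallelLexiconSketch` and all of the entries and frequencies below are invented for illustration and say nothing about the real dictionary contents.

```java
// Hypothetical sketch of the parallel-array layout; the data is made up.
final class ParallelLexiconSketch {
    // WORDS[i][j] is the j-th word of table i, stored as a char[].
    static final char[][][] WORDS = {
        { {'a'}, {'a', 'b'} },
        { {'b', 'c'} }
    };

    // FREQS[i][j] is the frequency of WORDS[i][j].
    static final int[][] FREQS = {
        { 10, 3 },
        { 7 }
    };

    static String wordAt(int i, int j) {
        return new String(WORDS[i][j]);
    }

    static int frequencyAt(int i, int j) {
        return FREQS[i][j];
    }

    public static void main(String[] args) {
        // The same (i, j) pair addresses both the word and its frequency.
        System.out.println(wordAt(0, 1) + " -> " + frequencyAt(0, 1)); // prints "ab -> 3"
    }
}
```

Storing primitives in parallel arrays avoids allocating one object per lexicon entry, which is the space saving the description refers to.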
-
wordItem_frequencyTable
private int[][] wordItem_frequencyTable
-
-
Method Detail
-
getInstance
public static WordDictionary getInstance()
Get the singleton dictionary instance.
- Returns:
- singleton
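A private constructor, a private static singleInstance field, and a static getInstance() accessor together form the usual lazy singleton pattern. The sketch below shows one common way to write it; the class name `SingletonDictionarySketch`, the `synchronized` accessor, and loading the data on first access are assumptions, not a description of the actual WordDictionary code.

```java
// Hypothetical sketch of a lazily initialised singleton; not the Lucene code.
final class SingletonDictionarySketch {
    private static SingletonDictionarySketch singleInstance;

    private boolean loaded;

    // Private constructor: instances can only be obtained via getInstance().
    private SingletonDictionarySketch() {}

    static synchronized SingletonDictionarySketch getInstance() {
        if (singleInstance == null) {
            SingletonDictionarySketch dictionary = new SingletonDictionarySketch();
            dictionary.load(); // assumption: data is loaded on first access
            singleInstance = dictionary;
        }
        return singleInstance;
    }

    // Stand-in for reading the dictionary data.
    private void load() {
        loaded = true;
    }

    boolean isLoaded() {
        return loaded;
    }

    public static void main(String[] args) {
        // Both calls return the same object.
        System.out.println(getInstance() == getInstance()); // prints "true"
    }
}
```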
-
load
public void load(java.lang.String dctFileRoot)
Attempt to load the dictionary from the provided directory, first trying coredict.mem and falling back on coredict.dct.
- Parameters:
dctFileRoot - path to dictionary directory
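The "try coredict.mem first, then coredict.dct" order can be sketched as a small file-selection helper. The class `DictionaryLocatorSketch` and its `choose` method are hypothetical; the sketch only decides which file would be read and does not parse either format.

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of the fallback order; not the Lucene code.
final class DictionaryLocatorSketch {
    // Returns the file a "serialized first, raw second" policy would pick,
    // or null if the directory contains neither file.
    static Path choose(Path dctFileRoot) {
        Path serialized = dctFileRoot.resolve("coredict.mem");
        if (Files.isRegularFile(serialized)) {
            return serialized;
        }
        Path raw = dctFileRoot.resolve("coredict.dct");
        if (Files.isRegularFile(raw)) {
            return raw;
        }
        return null;
    }

    public static void main(String[] args) {
        // Pass the dictionary directory as the first argument.
        Path root = Path.of(args.length > 0 ? args[0] : ".");
        System.out.println(choose(root));
    }
}
```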
-
load
public void load() throws java.io.IOException, java.lang.ClassNotFoundException
Load coredict.mem internally from the jar file.
- Throws:
java.io.IOException - If there is a low-level I/O error.
java.lang.ClassNotFoundException
-
loadFromObj
private boolean loadFromObj(java.nio.file.Path serialObj)
-
loadFromObjectInputStream
private void loadFromObjectInputStream(java.io.InputStream serialObjectInputStream) throws java.io.IOException, java.lang.ClassNotFoundException
- Throws:
java.io.IOException
java.lang.ClassNotFoundException
-
saveToObj
private void saveToObj(java.nio.file.Path serialObj)
-
loadMainDataFromFile
private int loadMainDataFromFile(java.lang.String dctFilePath) throws java.io.IOException
Load the data file into this WordDictionary.
- Parameters:
dctFilePath - path to word dictionary (coredict.dct)
- Returns:
- number of words read
- Throws:
java.io.IOException - If there is a low-level I/O error.
-
expandDelimiterData
private void expandDelimiterData()
The original lexicon puts all punctuation-related information into a single chart (from 1 to 3755). This method expands that data, placing each entry into the chart of its corresponding symbol.
-
mergeSameWords
private void mergeSameWords()
-
sortEachItems
private void sortEachItems()
-
setTableIndex
private boolean setTableIndex(char c, int j)
-
getAvaliableTableIndex
private short getAvaliableTableIndex(char c)
-
getWordItemTableIndex
private short getWordItemTableIndex(char c)
-
findInTable
private int findInTable(short knownHashIndex, char[] charArray)
Look up the word given as a char array and return its position in the word list.
- Parameters:
knownHashIndex - the already-computed position of the first character, charArray[0], in the hash table. If it has not been calculated yet, the function int findInTable(char[] charArray) can be used instead.
charArray - the char array of the word to look up.
- Returns:
- the word's position in the word array, or -1 if it is not found.
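A lookup that returns a position or -1 can be sketched as a binary search over one table of entries. This assumes the entries are sorted lexicographically (the class has a sortEachItems() method, but its ordering is not documented here); the class `FindInTableSketch` and its data are hypothetical, and `Arrays.compare` requires Java 9 or later.

```java
import java.util.Arrays;

// Hypothetical sketch of an exact-match lookup; not the Lucene code.
final class FindInTableSketch {
    // Binary search over entries sorted in Arrays.compare order.
    // Returns the position of word, or -1 if it is absent.
    static int find(char[][] sortedItems, char[] word) {
        int low = 0;
        int high = sortedItems.length - 1;
        while (low <= high) {
            int mid = (low + high) >>> 1;
            int cmp = Arrays.compare(sortedItems[mid], word);
            if (cmp == 0) {
                return mid;
            } else if (cmp < 0) {
                low = mid + 1;
            } else {
                high = mid - 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        char[][] items = { "ab".toCharArray(), "abc".toCharArray(), "b".toCharArray() };
        System.out.println(find(items, "abc".toCharArray())); // prints "1"
        System.out.println(find(items, "zz".toCharArray()));  // prints "-1"
    }
}
```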
-
getPrefixMatch
public int getPrefixMatch(char[] charArray)
Find the first word in the dictionary that starts with the supplied prefix.
- Parameters:
charArray - input prefix
- Returns:
- index of word, or -1 if not found
- See Also:
getPrefixMatch(char[], int)
-
getPrefixMatch
public int getPrefixMatch(char[] charArray, int knownStart)
Find the nth word in the dictionary that starts with the supplied prefix.
- Parameters:
charArray - input prefix
knownStart - relative position in the dictionary to start
- Returns:
- index of word, or -1 if not found
- See Also:
getPrefixMatch(char[])
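Prefix matching over a sorted table can be sketched as a lower-bound binary search followed by a prefix check. This is an illustration under the assumption that entries are sorted lexicographically, and it interprets the start argument simply as the index at which the search begins; the class `PrefixMatchSketch` and its methods are hypothetical and may not mirror how getPrefixMatch treats knownStart.

```java
import java.util.Arrays;

// Hypothetical sketch of prefix matching; not the Lucene code.
final class PrefixMatchSketch {
    // Returns the index of the first entry at or after 'start' that begins
    // with 'prefix', or -1 if there is none. Entries must be sorted in
    // Arrays.compare order.
    static int firstPrefixMatch(char[][] sortedItems, char[] prefix, int start) {
        int low = Math.max(start, 0);
        int high = sortedItems.length;
        // Lower bound: first entry that is not smaller than the prefix.
        while (low < high) {
            int mid = (low + high) >>> 1;
            if (Arrays.compare(sortedItems[mid], prefix) < 0) {
                low = mid + 1;
            } else {
                high = mid;
            }
        }
        if (low < sortedItems.length && startsWith(sortedItems[low], prefix)) {
            return low;
        }
        return -1;
    }

    static boolean startsWith(char[] word, char[] prefix) {
        if (word.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (word[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        char[][] items = {
            "ab".toCharArray(), "abc".toCharArray(), "abd".toCharArray(), "b".toCharArray()
        };
        System.out.println(firstPrefixMatch(items, "ab".toCharArray(), 0)); // prints "0"
        System.out.println(firstPrefixMatch(items, "ab".toCharArray(), 1)); // prints "1"
        System.out.println(firstPrefixMatch(items, "c".toCharArray(), 0));  // prints "-1"
    }
}
```

Because entries sharing a prefix sit next to each other in sorted order, a caller can pass the previous result plus one as the start index to walk through successive matches.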
-
getFrequency
public int getFrequency(char[] charArray)
Get the frequency of a word from the dictionary.
- Parameters:
charArray - input word
- Returns:
- word frequency, or zero if the word is not found
-
isEqual
public boolean isEqual(char[] charArray, int itemIndex)
Return true if the dictionary entry at itemIndex for table charArray[0] is charArray.
- Parameters:
charArray - input word
itemIndex - item index for table charArray[0]
- Returns:
- true if the entry exists
-
-