Class HMMChineseTokenizer
- java.lang.Object
  - org.apache.lucene.util.AttributeSource
    - org.apache.lucene.analysis.TokenStream
      - org.apache.lucene.analysis.Tokenizer
        - org.apache.lucene.analysis.util.SegmentingTokenizerBase
          - org.apache.lucene.analysis.cn.smart.HMMChineseTokenizer
-
- All Implemented Interfaces:
java.io.Closeable, java.lang.AutoCloseable
public class HMMChineseTokenizer extends SegmentingTokenizerBase
Tokenizer for Chinese or mixed Chinese-English text. The analyzer uses probabilistic knowledge to find the optimal word segmentation for Simplified Chinese text. The text is first broken into sentences, then each sentence is segmented into words.
-
-
Nested Class Summary
-
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
AttributeSource.State
-
-
Field Summary
Fields (Modifier and Type, Field, Description)
- private OffsetAttribute offsetAtt
- private static java.text.BreakIterator sentenceProto: used for breaking the text into sentences
- private CharTermAttribute termAtt
- private java.util.Iterator<SegToken> tokens
- private TypeAttribute typeAtt
- private WordSegmenter wordSegmenter
-
Fields inherited from class org.apache.lucene.analysis.util.SegmentingTokenizerBase
buffer, BUFFERMAX, offset
-
Fields inherited from class org.apache.lucene.analysis.TokenStream
DEFAULT_TOKEN_ATTRIBUTE_FACTORY
-
-
Constructor Summary
Constructors (Constructor, Description)
- HMMChineseTokenizer(): Creates a new HMMChineseTokenizer
- HMMChineseTokenizer(AttributeFactory factory): Creates a new HMMChineseTokenizer, supplying the AttributeFactory
-
Method Summary
Methods (Modifier and Type, Method, Description)
- protected boolean incrementWord(): Returns true if another word is available
- void reset(): This method is called by a consumer before it begins consumption using TokenStream.incrementToken().
- protected void setNextSentence(int sentenceStart, int sentenceEnd): Provides the next input sentence for analysis
-
Methods inherited from class org.apache.lucene.analysis.util.SegmentingTokenizerBase
end, incrementToken, isSafeEnd
-
Methods inherited from class org.apache.lucene.analysis.Tokenizer
close, correctOffset, setReader, setReaderTestPoint
-
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString
-
Field Detail
-
sentenceProto
private static final java.text.BreakIterator sentenceProto
used for breaking the text into sentences
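The sentence-breaking step can be illustrated with the JDK class alone. The following is a minimal sketch, not the tokenizer's own code: it assumes a sentence `BreakIterator` obtained with `Locale.ROOT`, and the class name `SentenceSplitSketch` and the helper `sentences` are hypothetical names chosen for this example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SentenceSplitSketch {
  /** Splits text into consecutive sentence substrings using a JDK sentence BreakIterator. */
  static List<String> sentences(String text) {
    // Assumption: the root locale; the real field may be configured differently.
    BreakIterator it = BreakIterator.getSentenceInstance(Locale.ROOT);
    it.setText(text);
    List<String> out = new ArrayList<>();
    int start = it.first();
    // Each boundary pair [start, end) delimits one sentence, trailing whitespace included.
    for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
      out.add(text.substring(start, end));
    }
    return out;
  }

  public static void main(String[] args) {
    for (String s : sentences("Hello world. This is a test.")) {
      System.out.println("[" + s + "]");
    }
  }
}
```

Because the boundaries are contiguous, concatenating the returned pieces reproduces the input, so character offsets into the original text are preserved for the later word-segmentation step.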
-
termAtt
private final CharTermAttribute termAtt
-
offsetAtt
private final OffsetAttribute offsetAtt
-
typeAtt
private final TypeAttribute typeAtt
-
wordSegmenter
private final WordSegmenter wordSegmenter
-
tokens
private java.util.Iterator<SegToken> tokens
-
-
Constructor Detail
-
HMMChineseTokenizer
public HMMChineseTokenizer()
Creates a new HMMChineseTokenizer
-
HMMChineseTokenizer
public HMMChineseTokenizer(AttributeFactory factory)
Creates a new HMMChineseTokenizer, supplying the AttributeFactory
-
-
Method Detail
-
setNextSentence
protected void setNextSentence(int sentenceStart, int sentenceEnd)
Description copied from class: SegmentingTokenizerBase
Provides the next input sentence for analysis
- Specified by:
setNextSentence in class SegmentingTokenizerBase
-
incrementWord
protected boolean incrementWord()
Description copied from class: SegmentingTokenizerBase
Returns true if another word is available
- Specified by:
incrementWord in class SegmentingTokenizerBase
-
reset
public void reset() throws java.io.IOException
Description copied from class: TokenStream
This method is called by a consumer before it begins consumption using TokenStream.incrementToken().
Resets this stream to a clean state. Stateful implementations must implement this method so that they can be reused, just as if they had been created fresh.
If you override this method, always call super.reset(), otherwise some internal state will not be correctly reset (e.g., Tokenizer will throw IllegalStateException on further usage).
- Overrides:
reset in class SegmentingTokenizerBase
- Throws:
java.io.IOException
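A consumer that follows the reset contract described above might look like the sketch below. It assumes the Lucene smartcn analysis module is on the classpath; the class name `HmmTokenizerUsageSketch`, the helper `tokenize`, and the sample input are hypothetical and serve only as an illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.cn.smart.HMMChineseTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

class HmmTokenizerUsageSketch {
  /** Collects one "term [start,end) type" string per token produced for the text. */
  static List<String> tokenize(String text) throws IOException {
    List<String> out = new ArrayList<>();
    try (Tokenizer tokenizer = new HMMChineseTokenizer()) {
      // Attribute instances are views onto the tokenizer's current token.
      CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
      OffsetAttribute offset = tokenizer.addAttribute(OffsetAttribute.class);
      TypeAttribute type = tokenizer.addAttribute(TypeAttribute.class);

      tokenizer.setReader(new StringReader(text));
      tokenizer.reset(); // must be called before the first incrementToken()
      while (tokenizer.incrementToken()) {
        out.add(term.toString()
            + " [" + offset.startOffset() + "," + offset.endOffset() + ") "
            + type.type());
      }
      tokenizer.end(); // finalizes end-of-stream state such as the final offset
    } // close() is invoked here by try-with-resources
    return out;
  }

  public static void main(String[] args) throws IOException {
    // Arbitrary sample input; the exact segmentation depends on the bundled dictionary.
    for (String line : tokenize("我是中国人")) {
      System.out.println(line);
    }
  }
}
```

The order setReader, reset, incrementToken, end, close matters: skipping reset() is the situation in which the description above says Tokenizer will throw IllegalStateException.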
-
-