Package org.apache.lucene.analysis.icu.segmentation
Tokenizer that breaks text into words with the Unicode Text Segmentation algorithm.
Class Summary
Class: Description
BreakIteratorWrapper: Wraps RuleBasedBreakIterator, making object reuse convenient and emitting a rule status for emoji sequences.
CharArrayIterator: Wraps a char[] as CharacterIterator for processing with a BreakIterator
CompositeBreakIterator: An internal BreakIterator for multilingual text, following recommendations from: UAX #29: Unicode Text Segmentation.
DefaultICUTokenizerConfig: Default ICUTokenizerConfig that is generally applicable to many languages.
ICUTokenizer: Breaks text into words according to UAX #29: Unicode Text Segmentation (http://www.unicode.org/reports/tr29/)
ICUTokenizerConfig: Class that allows for tailored Unicode Text Segmentation on a per-writing system basis.
ICUTokenizerFactory: Factory for ICUTokenizer.
ScriptIterator: An iterator that locates ISO 15924 script boundaries in text.
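
To illustrate the general idea of word-boundary segmentation that the classes above build on, here is a minimal sketch using only the JDK's java.text.BreakIterator. It is a stand-in, not ICUTokenizer and not ICU's RuleBasedBreakIterator: the JDK's rules are not tailored per script as this package's are. The class name WordBreakSketch and the method words are hypothetical, and keeping only segments that contain a letter or digit is a simplifying assumption (the real tokenizer decides from the break rule status instead).

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration class; not part of Lucene or ICU.
public class WordBreakSketch {

    /**
     * Splits text at word boundaries reported by the JDK BreakIterator and
     * keeps only segments containing at least one letter or digit
     * (an assumption made here to drop whitespace and punctuation).
     */
    public static List<String> words(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);

        List<String> out = new ArrayList<>();
        int start = it.first();
        // Walk consecutive boundaries; each [start, end) pair is one segment.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(segment);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world!"));
    }
}
```

The boundary-walking loop mirrors how a tokenizer built on a break iterator typically advances: each pair of adjacent boundaries delimits a candidate token, and a separate check decides whether that candidate is emitted.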