com.ibm.icu.text

Class DictionaryBasedBreakIterator

Implemented Interfaces:
Cloneable

public class DictionaryBasedBreakIterator
extends RuleBasedBreakIterator_Old

A subclass of RuleBasedBreakIterator_Old that uses a dictionary to subdivide ranges of text further than the state-table-based algorithm alone can. This is necessary, for example, for word and line breaking in Thai, which does not put spaces between words. The state-table-based algorithm of RuleBasedBreakIterator_Old first divides the text as far as it can; contiguous ranges of letters are then repeatedly compared against a list of known words (the dictionary) to divide them into words.

DictionaryBasedBreakIterator uses the same rule language as RuleBasedBreakIterator_Old, but adds one more special substitution name: _dictionary_. This substitution name identifies the characters that occur in dictionary words. If the iterator passes over a chunk of text containing two or more consecutive characters that belong to _dictionary_, it goes back through that range and derives additional break positions (where possible) using the dictionary.

DictionaryBasedBreakIterator is also constructed with the dictionary itself, which the constructor takes as an InputStream of dictionary data. The dictionary data is in a serialized binary format. We have a very primitive (and slow) BuildDictionaryFile utility for creating dictionary files, but aren't currently making it public. Contact us for help.
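
The dictionary pass described above can be pictured with a small, self-contained sketch. It is not ICU's implementation: the class name DictionarySplitSketch, the method splitRange, and the in-memory Set<String> standing in for the serialized dictionary file are all assumptions made for illustration. Given a range of text made up only of dictionary characters, it returns the end offset of each word if the whole range can be divided into known words, and an empty list otherwise.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Illustrative only: NOT ICU's algorithm or dictionary format.
final class DictionarySplitSketch {

    /**
     * Divides text[start, end) into words taken from the given set.
     * Returns the end offset of each word (the last one equals end),
     * or an empty list if the range cannot be fully covered by known words.
     */
    static List<Integer> splitRange(String text, int start, int end, Set<String> words) {
        int n = end - start;
        // prev[i] is the relative offset where the last word of a valid
        // division of the first i characters begins, or -1 if there is none.
        int[] prev = new int[n + 1];
        Arrays.fill(prev, -1);
        prev[0] = 0; // the empty prefix is trivially divisible
        for (int i = 1; i <= n; i++) {
            // Ascending j tries the longest candidate for the last word first.
            for (int j = 0; j < i; j++) {
                if (prev[j] != -1 && words.contains(text.substring(start + j, start + i))) {
                    prev[i] = j;
                    break;
                }
            }
        }
        List<Integer> breaks = new ArrayList<>();
        if (prev[n] == -1) {
            return breaks; // no additional break positions can be derived
        }
        // Walk the chain of word starts backwards, then restore text order.
        for (int i = n; i > 0; i = prev[i]) {
            breaks.add(start + i);
        }
        Collections.reverse(breaks);
        return breaks;
    }
}
```

With the hypothetical word list {the, cat, cats, sat}, splitRange("thecatsat", 0, 9, words) yields [3, 6, 9]: the longer match "cats" leads to a dead end because "at" is not in the list, so the split after "cat" is used instead.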

Nested Class Summary

protected class
DictionaryBasedBreakIterator.Builder
The Builder class for DictionaryBasedBreakIterator inherits almost all of its functionality from the Builder class for RuleBasedBreakIterator_Old, but extends it with extra logic to handle the DICTIONARY_VAR token.

Nested classes/interfaces inherited from class com.ibm.icu.text.RuleBasedBreakIterator_Old

RuleBasedBreakIterator_Old.Builder

Field Summary

Fields inherited from class com.ibm.icu.text.RuleBasedBreakIterator_Old

IGNORE

Fields inherited from class com.ibm.icu.text.RuleBasedBreakIterator

WORD_IDEO, WORD_IDEO_LIMIT, WORD_KANA, WORD_KANA_LIMIT, WORD_LETTER, WORD_LETTER_LIMIT, WORD_NONE, WORD_NONE_LIMIT, WORD_NUMBER, WORD_NUMBER_LIMIT

Fields inherited from class com.ibm.icu.text.BreakIterator

DONE, KIND_CHARACTER, KIND_LINE, KIND_SENTENCE, KIND_TITLE, KIND_WORD

Constructor Summary

DictionaryBasedBreakIterator(String description, InputStream dictionaryStream)
Constructs a DictionaryBasedBreakIterator.

Method Summary

int
first()
Sets the current iteration position to the beginning of the text.
int
following(int offset)
Sets the current iteration position to the first boundary position after the specified position.
protected int
handleNext()
This is the implementation function for next().
int
last()
Sets the current iteration position to the end of the text.
protected int
lookupCategory(char c)
Looks up a character category for a character.
protected RuleBasedBreakIterator_Old.Builder
makeBuilder()
Returns a Builder that is customized to build a DictionaryBasedBreakIterator.
int
preceding(int offset)
Sets the current iteration position to the last boundary position before the specified position.
int
previous()
Advances the iterator one step backwards.
void
setText(CharacterIterator newText)
void
writeTablesToFile(FileOutputStream file, boolean littleEndian)

Methods inherited from class com.ibm.icu.text.RuleBasedBreakIterator_Old

checkOffset, clone, current, debugDumpTables, debugPrintln, equals, first, following, getRuleStatus, getRuleStatusVec, getText, handleNext, handlePrevious, hashCode, isBoundary, last, lookupBackwardState, lookupCategory, lookupState, makeBuilder, next, next, preceding, previous, setText, toString, writeSwappedInt, writeSwappedShort, writeTablesToFile

Methods inherited from class com.ibm.icu.text.RuleBasedBreakIterator

clone, current, equals, first, following, getInstanceFromCompiledRules, getRuleStatus, getRuleStatusVec, getText, hashCode, isBoundary, last, next, next, preceding, previous, setText, toString

Methods inherited from class com.ibm.icu.text.BreakIterator

clone, current, first, following, getAvailableLocales, getAvailableULocales, getCharacterInstance, getCharacterInstance, getCharacterInstance, getLineInstance, getLineInstance, getLineInstance, getLocale, getSentenceInstance, getSentenceInstance, getSentenceInstance, getText, getTitleInstance, getTitleInstance, getTitleInstance, getWordInstance, getWordInstance, getWordInstance, isBoundary, last, next, next, preceding, previous, registerInstance, registerInstance, setText, setText, unregister

Constructor Details

DictionaryBasedBreakIterator

public DictionaryBasedBreakIterator(String description,
                                    InputStream dictionaryStream)
            throws IOException
Constructs a DictionaryBasedBreakIterator.
Parameters:
description - Same as the description parameter on RuleBasedBreakIterator_Old, except for the special meaning of DICTIONARY_VAR. This parameter is just passed through to RuleBasedBreakIterator_Old's constructor.
dictionaryStream - the stream containing the dictionary data

Method Details

first

public int first()
Sets the current iteration position to the beginning of the text (i.e., the CharacterIterator's starting offset).
Overrides:
first in class RuleBasedBreakIterator_Old
Returns:
The offset of the beginning of the text.

following

public int following(int offset)
Sets the current iteration position to the first boundary position after the specified position.
Overrides:
following in class RuleBasedBreakIterator_Old
Parameters:
offset - The position to begin searching forward from
Returns:
The position of the first boundary after "offset"
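
A short sketch of the following(int) contract. To stay runnable without ICU4J and a dictionary, it uses the JDK's java.text.BreakIterator, which declares the same following(int) method; DictionaryBasedBreakIterator itself is not exercised here, and the names FollowingSketch and firstBoundaryAfter are hypothetical.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Uses the JDK BreakIterator as a stand-in; not the ICU class documented here.
final class FollowingSketch {

    /** Returns the first word boundary strictly after the given offset. */
    static int firstBoundaryAfter(String text, int offset) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.ENGLISH);
        words.setText(text);
        // following() also moves the iterator's current position to the result.
        return words.following(offset);
    }
}
```

For the text "Hello world" the JDK word iterator reports boundaries at 0, 5, 6 and 11, so firstBoundaryAfter("Hello world", 5) returns 6: a boundary at the offset itself is skipped.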

handleNext

protected int handleNext()
This is the implementation function for next().
Overrides:
handleNext in class RuleBasedBreakIterator_Old

last

public int last()
Sets the current iteration position to the end of the text (i.e., the CharacterIterator's ending offset).
Overrides:
last in class RuleBasedBreakIterator_Old
Returns:
The text's past-the-end offset.

lookupCategory

protected int lookupCategory(char c)
Looks up a character category for a character.
Overrides:
lookupCategory in class RuleBasedBreakIterator_Old

makeBuilder

protected RuleBasedBreakIterator_Old.Builder makeBuilder()
Returns a Builder that is customized to build a DictionaryBasedBreakIterator. This is the same as RuleBasedBreakIterator_Old.Builder, except for the extra code to handle the DICTIONARY_VAR tag.
Overrides:
makeBuilder in class RuleBasedBreakIterator_Old

preceding

public int preceding(int offset)
Sets the current iteration position to the last boundary position before the specified position.
Overrides:
preceding in class RuleBasedBreakIterator_Old
Parameters:
offset - The position to begin searching from
Returns:
The position of the last boundary before "offset"

previous

public int previous()
Advances the iterator one step backwards.
Overrides:
previous in class RuleBasedBreakIterator_Old
Returns:
The last boundary position before the current iteration position
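
Backward iteration with last() and previous() can be sketched as follows. As an assumption made for runnability, the sketch uses the JDK's java.text.BreakIterator, which declares the same last() and previous() methods, rather than DictionaryBasedBreakIterator; the names BackwardWalkSketch and boundariesFromEnd are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Uses the JDK BreakIterator as a stand-in; not the ICU class documented here.
final class BackwardWalkSketch {

    /** Collects every word boundary, starting at the end of the text. */
    static List<Integer> boundariesFromEnd(String text) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.ENGLISH);
        words.setText(text);
        List<Integer> result = new ArrayList<>();
        // previous() returns DONE once the start of the text has been passed.
        for (int pos = words.last(); pos != BreakIterator.DONE; pos = words.previous()) {
            result.add(pos);
        }
        return result;
    }
}
```

For "Hello world" this collects 11, 6, 5 and 0, in that order.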

setText

public void setText(CharacterIterator newText)
Overrides:
setText in class RuleBasedBreakIterator_Old

writeTablesToFile

public void writeTablesToFile(FileOutputStream file,
                              boolean littleEndian)
            throws IOException
Overrides:
writeTablesToFile in class RuleBasedBreakIterator_Old

Copyright (c) 2006 IBM Corporation and others.