Package org.languagetool.tokenizers.br
Class BretonWordTokenizer
java.lang.Object
org.languagetool.tokenizers.WordTokenizer
org.languagetool.tokenizers.br.BretonWordTokenizer
All Implemented Interfaces:
org.languagetool.tokenizers.Tokenizer
public class BretonWordTokenizer
extends org.languagetool.tokenizers.WordTokenizer
-
Constructor Summary
Constructors
BretonWordTokenizer()
Method Summary
Methods inherited from class org.languagetool.tokenizers.WordTokenizer
getProtocols, getTokenizingCharacters, isEMail, isUrl, joinEMails, joinEMailsAndUrls, joinUrls
-
Constructor Details
-
BretonWordTokenizer
public BretonWordTokenizer()
-
-
Method Details
-
tokenize
Tokenizes just like WordTokenizer, except that "c’h" is not split. "C’h" is treated as a single letter in Breton (a trigraph) and occurs in many words, so the tokenizer must not split it. Elided forms such as "n’eo" are split into exactly two tokens: "n’" + "eo".
Specified by:
tokenize in interface org.languagetool.tokenizers.Tokenizer
Overrides:
tokenize in class org.languagetool.tokenizers.WordTokenizer
Parameters:
text - Text to tokenize
Returns:
List of tokens. Note: a special string ##BR_APOS## is used to replace apostrophes during tokenizing.
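The placeholder approach described for tokenize can be illustrated with a minimal, self-contained sketch. It is not the LanguageTool implementation. The class name BretonApostropheSketch, the delimiter set, and the two regular expressions are assumptions made for illustration only. For brevity the sketch also restores every apostrophe as ’ (U+2019), whichever form the input used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical illustration; not part of org.languagetool.
public class BretonApostropheSketch {

    // Placeholder string named in the Returns note above.
    private static final String APOS = "##BR_APOS##";

    // Assumed delimiter set, far smaller than what WordTokenizer uses.
    private static final String DELIMS = " \t\n.,;:!?";

    public static List<String> tokenize(String text) {
        String masked = text
            // 1. Hide the apostrophe inside the trigraph c'h (\u2019 is the typographic apostrophe).
            .replaceAll("([Cc])['\u2019]([Hh])", "$1" + APOS + "$2")
            // 2. Mask any other letter + apostrophe and insert a space to force a split after it.
            .replaceAll("(\\p{L})['\u2019]", "$1" + APOS + " ");

        List<String> tokens = new ArrayList<>();
        // returnDelims = true keeps whitespace and punctuation as tokens.
        StringTokenizer st = new StringTokenizer(masked, DELIMS, true);
        while (st.hasMoreTokens()) {
            String token = st.nextToken();
            if (token.endsWith(APOS) && st.hasMoreTokens()) {
                st.nextToken(); // drop the space inserted in step 2
            }
            // Restore the placeholder as a typographic apostrophe.
            tokens.add(token.replace(APOS, "\u2019"));
        }
        return tokens;
    }
}
```

Under these assumptions, tokenize("n’eo ket c’hoazh") yields the six tokens "n’", "eo", " ", "ket", " ", "c’hoazh". The elided form is split in two and the trigraph stays inside its word.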
-