Package org.languagetool.tokenizers.en
Class EnglishWordTokenizer
java.lang.Object
org.languagetool.tokenizers.WordTokenizer
org.languagetool.tokenizers.en.EnglishWordTokenizer
- All Implemented Interfaces:
org.languagetool.tokenizers.Tokenizer
public class EnglishWordTokenizer
extends org.languagetool.tokenizers.WordTokenizer
- Since:
2.5
Field Summary
Fields
Constructor Summary
Constructors
Method Summary
Methods inherited from class org.languagetool.tokenizers.WordTokenizer
getProtocols, isEMail, isUrl, joinEMails, joinEMailsAndUrls, joinUrls
Field Details
EXCEPTIONS
-
EXCEPTION_REPLACEMENT
Constructor Details
EnglishWordTokenizer
public EnglishWordTokenizer()
Method Details
getTokenizingCharacters
- Overrides:
getTokenizingCharacters
in class org.languagetool.tokenizers.WordTokenizer
-
tokenize
Tokenizes text. The English tokenizer differs from the standard one in two respects:
- it does not treat a hyphen as part of the word if the hyphen is at the end of the word;
- it treats the en dash as a tokenizing character, because in English it is used without surrounding whitespace.
- Specified by:
tokenize
in interface org.languagetool.tokenizers.Tokenizer
- Overrides:
tokenize
in class org.languagetool.tokenizers.WordTokenizer
- Parameters:
text - String of words to tokenize.
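The two differences described for tokenize can be illustrated with a small standalone sketch. It is not the LanguageTool implementation: the class name EnDashTokenizerSketch, the delimiter set, and the trailing-hyphen handling are assumptions made for illustration, using only java.util.StringTokenizer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical illustration only; not the actual EnglishWordTokenizer code.
public class EnDashTokenizerSketch {

    // Assumed delimiter set: whitespace, basic punctuation, and the en dash (U+2013).
    // The plain hyphen is deliberately absent, so "post-war" stays one token.
    private static final String DELIMITERS = " \t\n.,;:!?()\"\u2013";

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        // 'true' makes the tokenizer return delimiters as tokens as well.
        StringTokenizer st = new StringTokenizer(text, DELIMITERS, true);
        while (st.hasMoreTokens()) {
            String token = st.nextToken();
            if (token.length() > 1 && token.endsWith("-")) {
                // A hyphen at the end of a word becomes its own token.
                tokens.add(token.substring(0, token.length() - 1));
                tokens.add("-");
            } else {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("pages 10\u201320"));
        System.out.println(tokenize("pre- and post-war"));
    }
}
```

In this sketch the en dash in a range such as "10–20" separates the two numbers, while a word-internal hyphen is left alone and only a word-final hyphen is split off.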