Package org.languagetool.rules.en
Class GoogleStyleWordTokenizer
java.lang.Object
org.languagetool.tokenizers.WordTokenizer
org.languagetool.rules.en.GoogleStyleWordTokenizer
- All Implemented Interfaces:
org.languagetool.tokenizers.Tokenizer
public class GoogleStyleWordTokenizer
extends org.languagetool.tokenizers.WordTokenizer
Tokenizes sentences into tokens the way Google does for its ngram index. Note:
there does not seem to be any official documentation of how Google tokenizes
that data, so this is only an approximation.
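The general idea of delimiter-based word tokenization can be sketched as follows. This is a standalone, hypothetical illustration, not LanguageTool's implementation: the class name DelimiterTokenizerSketch and the DELIMITERS set are assumptions made for this example, and whether the real tokenizer treats the hyphen as a delimiter is likewise assumed here. Only java.util.StringTokenizer from the standard library is used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

/** Hypothetical sketch only; not the LanguageTool implementation. */
class DelimiterTokenizerSketch {

  // Assumed delimiter set for this sketch: whitespace, basic punctuation,
  // and the hyphen. The real set used by LanguageTool may differ.
  static final String DELIMITERS = " \t\n.,;:!?\"'()-";

  /** Splits text at every delimiter and keeps each delimiter as its own token. */
  static List<String> tokenize(String text) {
    List<String> tokens = new ArrayList<>();
    // returnDelims = true makes StringTokenizer return delimiters as tokens too
    StringTokenizer st = new StringTokenizer(text, DELIMITERS, true);
    while (st.hasMoreTokens()) {
      tokens.add(st.nextToken());
    }
    return tokens;
  }

  public static void main(String[] args) {
    // Hyphenated words are split into parts because '-' is in DELIMITERS
    System.out.println(tokenize("a well-known fact."));
  }
}
```

In this sketch, changing the delimiter set is the only thing needed to change where words are split, which mirrors the role an overridable method such as getTokenizingCharacters can play in a subclass.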
- Since:
3.2
Constructor Summary
Constructors
Method Summary
Methods inherited from class org.languagetool.tokenizers.WordTokenizer
getProtocols, isEMail, isUrl, joinEMails, joinEMailsAndUrls, joinUrls
Constructor Details
GoogleStyleWordTokenizer
public GoogleStyleWordTokenizer()
Method Details
getTokenizingCharacters
- Overrides:
getTokenizingCharacters in class org.languagetool.tokenizers.WordTokenizer
tokenize
- Specified by:
tokenize in interface org.languagetool.tokenizers.Tokenizer
- Overrides:
tokenize in class org.languagetool.tokenizers.WordTokenizer