Class GoogleStyleWordTokenizer

java.lang.Object
org.languagetool.tokenizers.WordTokenizer
org.languagetool.rules.en.GoogleStyleWordTokenizer
All Implemented Interfaces:
org.languagetool.tokenizers.Tokenizer

public class GoogleStyleWordTokenizer extends org.languagetool.tokenizers.WordTokenizer
Tokenizes sentences into tokens the way Google does for its ngram index. Note: there does not seem to be any official documentation of how Google tokenizes text for that index, so this is only an approximation.
Since:
3.2
  • Constructor Details

    • GoogleStyleWordTokenizer

      public GoogleStyleWordTokenizer()
  • Method Details

    • getTokenizingCharacters

      public String getTokenizingCharacters()
      Overrides:
      getTokenizingCharacters in class org.languagetool.tokenizers.WordTokenizer
    • tokenize

      public List<String> tokenize(String text)
      Specified by:
      tokenize in interface org.languagetool.tokenizers.Tokenizer
      Overrides:
      tokenize in class org.languagetool.tokenizers.WordTokenizer
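
The two overrides listed above follow a common pattern: a subclass widens the set of splitting characters and then post-processes the token list produced by its parent. The following is a minimal, self-contained sketch of that pattern only. `BaseTokenizer` and `HyphenSplittingTokenizer` are hypothetical stand-ins, not LanguageTool classes, and both the choice to split at hyphens and the choice to drop whitespace tokens are assumptions made for illustration, not documented behavior of GoogleStyleWordTokenizer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class TokenizerSketch {

  /** Hypothetical stand-in for a base word tokenizer (NOT LanguageTool's WordTokenizer). */
  static class BaseTokenizer {

    /** Characters at which text is split; each one is also returned as its own token. */
    public String getTokenizingCharacters() {
      return " .,;:!?\"()";
    }

    /** Splits text at the tokenizing characters, keeping the delimiters as tokens. */
    public List<String> tokenize(String text) {
      List<String> tokens = new ArrayList<>();
      StringTokenizer st = new StringTokenizer(text, getTokenizingCharacters(), true);
      while (st.hasMoreTokens()) {
        tokens.add(st.nextToken());
      }
      return tokens;
    }
  }

  /** Hypothetical subclass showing the two overrides; its behavior is assumed, not documented. */
  static class HyphenSplittingTokenizer extends BaseTokenizer {

    /** Assumption for illustration: additionally split at the hyphen. */
    @Override
    public String getTokenizingCharacters() {
      return super.getTokenizingCharacters() + "-";
    }

    /** Assumption for illustration: reuse the parent's split, then drop whitespace-only tokens. */
    @Override
    public List<String> tokenize(String text) {
      List<String> result = new ArrayList<>();
      for (String token : super.tokenize(text)) {
        if (!token.trim().isEmpty()) {
          result.add(token);
        }
      }
      return result;
    }
  }

  public static void main(String[] args) {
    String text = "A well-known fact.";
    // The base tokenizer keeps "well-known" as a single token.
    System.out.println(new BaseTokenizer().tokenize(text));
    // The subclass splits it at the hyphen and omits whitespace tokens.
    System.out.println(new HyphenSplittingTokenizer().tokenize(text));
  }
}
```

Because the parent's `tokenize` calls `getTokenizingCharacters()` polymorphically, overriding that one method is enough to change where the text is split; the `tokenize` override is only needed for adjustments that cannot be expressed as a character set.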