Class EnglishWordTokenizer

java.lang.Object
org.languagetool.tokenizers.WordTokenizer
org.languagetool.tokenizers.en.EnglishWordTokenizer
All Implemented Interfaces:
org.languagetool.tokenizers.Tokenizer

public class EnglishWordTokenizer extends org.languagetool.tokenizers.WordTokenizer
Since:
2.5
  • Field Details

    • EXCEPTIONS

      private static final String[] EXCEPTIONS
    • EXCEPTION_REPLACEMENT

      private static final String[] EXCEPTION_REPLACEMENT
  • Constructor Details

    • EnglishWordTokenizer

      public EnglishWordTokenizer()
  • Method Details

    • getTokenizingCharacters

      public String getTokenizingCharacters()
      Overrides:
      getTokenizingCharacters in class org.languagetool.tokenizers.WordTokenizer
    • tokenize

      public List<String> tokenize(String text)
      Tokenizes text. The English tokenizer differs from the standard one in two respects:
      1. it does not treat a hyphen as part of a word when the hyphen is at the end of that word;
      2. it includes the en dash as a tokenizing character, because in English it is used without surrounding whitespace.
      Specified by:
      tokenize in interface org.languagetool.tokenizers.Tokenizer
      Overrides:
      tokenize in class org.languagetool.tokenizers.WordTokenizer
      Parameters:
      text - String of words to tokenize.
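
      Example (illustrative sketch only):
      The two rules described for tokenize can be pictured with a small standalone program. This is not the LanguageTool implementation and does not call it. The class name DashAwareTokenizerSketch, the DELIMITERS set, and the sample sentence are assumptions made up for illustration, and the real tokenizer's character set and exception handling (see EXCEPTIONS and EXCEPTION_REPLACEMENT) are not reproduced here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical sketch: mimics the two documented rules with plain JDK classes.
// It is NOT org.languagetool.tokenizers.en.EnglishWordTokenizer.
final class DashAwareTokenizerSketch {

    // Assumed delimiter set, chosen for this example only: whitespace, some
    // punctuation, and the en dash (U+2013). The plain hyphen is deliberately
    // absent so that "post-war" stays a single token.
    private static final String DELIMITERS = " \t\n.,;:!?()\u2013";

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        // returnDelims = true keeps each delimiter as its own token.
        StringTokenizer st = new StringTokenizer(text, DELIMITERS, true);
        while (st.hasMoreTokens()) {
            String token = st.nextToken();
            if (token.length() > 1 && token.endsWith("-")) {
                // Rule 1: a single hyphen at the end of a word becomes a separate token.
                tokens.add(token.substring(0, token.length() - 1));
                tokens.add("-");
            } else {
                // Rule 2 is covered by the en dash being listed in DELIMITERS.
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "\u2013" is the en dash between the two years.
        System.out.println(tokenize("pre- and post-war 1990\u20132000"));
    }
}
```

      In this sketch "pre-" is split into "pre" and "-", "post-war" stays whole because its hyphen is word-internal, and the en dash separates "1990" from "2000" even though no whitespace surrounds it. Whitespace is returned as tokens too, which is a choice of this sketch rather than a documented property of the class.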