Class PolishWordTokenizer

java.lang.Object
org.languagetool.tokenizers.WordTokenizer
org.languagetool.tokenizers.pl.PolishWordTokenizer
All Implemented Interfaces:
org.languagetool.tokenizers.Tokenizer

public class PolishWordTokenizer extends org.languagetool.tokenizers.WordTokenizer
Since:
2.5
  • Field Details

    • plTokenizing

      private final String plTokenizing
    • tagger

      private org.languagetool.tagging.Tagger tagger
    • prefixes

      private static final Set<String> prefixes
  • Constructor Details

    • PolishWordTokenizer

      public PolishWordTokenizer()
  • Method Details

    • tokenize

      public List<String> tokenize(String text)
      Tokenizes text. The Polish tokenizer differs from the standard one in the following respects:
      1. it does not treat the hyphen as part of the word if the hyphen is at the end of the word;
      2. it treats the en dash and em dash as tokenizing characters, because they do not occur in the spelling dictionary;
      3. it splits two kinds of compound words containing a hyphen, such as dziecko-geniusz (two nouns), polsko-indonezyjski (an ad-adjectival adjective and an adjective), polsko-francusko-niemiecki (two ad-adjectival adjectives and an adjective), or osiemnaście-dwadzieścia (two numerals), but not words in which the hyphen precedes a morphological ending (such as SMS-y).
      Specified by:
      tokenize in interface org.languagetool.tokenizers.Tokenizer
      Overrides:
      tokenize in class org.languagetool.tokenizers.WordTokenizer
      Parameters:
      text - String of words to tokenize.
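The first two rules above can be illustrated with a small, self-contained sketch. This is not the LanguageTool implementation: the class name HyphenRuleSketch and the delimiter set are assumptions made for illustration, and the tagger-dependent compound splitting of rule 3 is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical sketch of rules 1 and 2 only; not the real PolishWordTokenizer.
class HyphenRuleSketch {

    // Assumed delimiter set: whitespace, basic punctuation,
    // plus the en dash (\u2013) and em dash (\u2014) from rule 2.
    private static final String DELIMITERS = " \t\n.,;:!?()\u2013\u2014";

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        // 'true' makes the tokenizer return the delimiters as tokens too.
        StringTokenizer st = new StringTokenizer(text, DELIMITERS, true);
        while (st.hasMoreTokens()) {
            String token = st.nextToken();
            if (token.length() > 1 && token.endsWith("-")) {
                // Rule 1: a trailing hyphen is not part of the word.
                tokens.add(token.substring(0, token.length() - 1));
                tokens.add("-");
            } else {
                // Word-internal hyphens (for example SMS-y) are left alone here.
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("polsko- i czesko\u2013niemiecki SMS-y"));
    }
}
```

In this sketch the trailing hyphen of "polsko-" becomes its own token, the en dash separates "czesko" from "niemiecki", and "SMS-y" stays in one piece.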
    • setTagger

      public void setTagger(org.languagetool.tagging.Tagger tagger)
      Sets the tagger to use in tokenizing. The constructor of the Polish class calls this method; if the tokenizer is used separately, this method has to be called after the constructor to enable the hybrid hyphen tokenizing.
      Parameters:
      tagger - The tagger to use (compatible only with the Polish BaseTagger that uses the bundled PoliMorfologik 2.1 or later).
      Since:
      2.5
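The set-after-construction pattern described for setTagger can be sketched in isolation as follows. Everything here is hypothetical: HybridHyphenSketch, setKnownWord, and splitCompound are illustrative names, a plain word-lookup predicate stands in for the tagger, and the splitting condition (whole token unknown, every part known) is a simplifying assumption rather than the part-of-speech logic the real class applies. Keeping hyphenated tokens whole when no lookup is set is likewise an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical sketch; a word-lookup predicate stands in for the tagger.
class HybridHyphenSketch {

    // Stays null until setKnownWord is called, like a tagger that was never set.
    private Predicate<String> knownWord;

    void setKnownWord(Predicate<String> knownWord) {
        this.knownWord = knownWord;
    }

    List<String> splitCompound(String token) {
        List<String> out = new ArrayList<>();
        String[] parts = token.split("-");
        // Assumed rule: split only if a lookup is set, the whole token is
        // unknown, and every hyphen-separated part is a known word.
        boolean split = knownWord != null && parts.length > 1 && !knownWord.test(token);
        if (split) {
            for (String part : parts) {
                if (!knownWord.test(part)) {
                    split = false;
                    break;
                }
            }
        }
        if (!split) {
            out.add(token);
            return out;
        }
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                out.add("-");
            }
            out.add(parts[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        HybridHyphenSketch sketch = new HybridHyphenSketch();
        // No lookup set yet: the compound is kept whole.
        System.out.println(sketch.splitCompound("dziecko-geniusz"));
        sketch.setKnownWord(Set.of("dziecko", "geniusz", "SMS")::contains);
        // Both parts are known words: the compound is split at the hyphen.
        System.out.println(sketch.splitCompound("dziecko-geniusz"));
        // "y" is an ending, not a known word: the token is kept whole.
        System.out.println(sketch.splitCompound("SMS-y"));
    }
}
```

The point of the sketch is the ordering: the lookup is optional state supplied after construction, so a tokenizer created on its own falls back to the simpler behavior until the setter is called.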