Package org.languagetool.tokenizers.pl
Class PolishWordTokenizer
java.lang.Object
org.languagetool.tokenizers.WordTokenizer
org.languagetool.tokenizers.pl.PolishWordTokenizer
- All Implemented Interfaces:
org.languagetool.tokenizers.Tokenizer
public class PolishWordTokenizer
extends org.languagetool.tokenizers.WordTokenizer
- Since:
2.5
-
Field Summary
Fields -
Constructor Summary
Constructors -
Method Summary
Methods inherited from class org.languagetool.tokenizers.WordTokenizer
getProtocols, getTokenizingCharacters, isEMail, isUrl, joinEMails, joinEMailsAndUrls, joinUrls
-
Field Details
-
plTokenizing
-
tagger
private org.languagetool.tagging.Tagger tagger
-
prefixes
-
-
Constructor Details
-
PolishWordTokenizer
public PolishWordTokenizer()
-
-
Method Details
-
tokenize
Tokenizes text. The Polish tokenizer differs from the standard one in the following respects:
- it does not treat the hyphen as part of the word if the hyphen is at the end of the word;
- it includes the en dash and em dash as tokenizing characters, as these are not included in the spelling dictionary;
- it splits compound words containing a hyphen, such as dziecko-geniusz (two nouns), polsko-indonezyjski (an ad-adjectival adjective and an adjective), polsko-francusko-niemiecki (two ad-adjectival adjectives and an adjective), or osiemnaście-dwadzieścia (two numerals), but not words in which the hyphen occurs before a morphological ending (such as SMS-y).
- Specified by:
tokenize
in interface org.languagetool.tokenizers.Tokenizer
- Overrides:
tokenize
in class org.languagetool.tokenizers.WordTokenizer
- Parameters:
text
- String of words to tokenize.
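The three rules listed for tokenize can be illustrated with a minimal, self-contained sketch. This is not the code of PolishWordTokenizer: the class name HyphenSplitSketch and the isKnownWord predicate are hypothetical, and the predicate only stands in for whatever lookup the tagger performs. The sketch assumes that a hyphenated word is split when every part is a known word on its own, which keeps forms such as SMS-y intact because the ending "y" is not a word.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical illustration of the hyphen rules; not the library implementation.
final class HyphenSplitSketch {

    // Stand-in for the tagger lookup (assumption): true if the word is in the dictionary.
    private final Predicate<String> isKnownWord;

    HyphenSplitSketch(Predicate<String> isKnownWord) {
        this.isKnownWord = isKnownWord;
    }

    List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        // Split before and after whitespace, en dash (U+2013) and em dash (U+2014),
        // keeping each delimiter as its own token.
        for (String chunk : text.split("(?<=[\\s\u2013\u2014])|(?=[\\s\u2013\u2014])")) {
            if (chunk.isEmpty()) {
                continue;
            }
            if (chunk.matches("[\\s\u2013\u2014]")) {
                tokens.add(chunk);
            } else {
                tokens.addAll(splitHyphenated(chunk));
            }
        }
        return tokens;
    }

    private List<String> splitHyphenated(String word) {
        List<String> result = new ArrayList<>();
        if (word.equals("-") || !word.contains("-")) {
            result.add(word);
            return result;
        }
        // Rule 1: a hyphen at the end of the word is not part of the word.
        if (word.endsWith("-")) {
            result.addAll(splitHyphenated(word.substring(0, word.length() - 1)));
            result.add("-");
            return result;
        }
        // Rule 3: split only if every part is a known word by itself.
        String[] parts = word.split("-");
        for (String part : parts) {
            if (part.isEmpty() || !isKnownWord.test(part)) {
                result.add(word); // e.g. SMS-y: "y" is an ending, not a word
                return result;
            }
        }
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                result.add("-");
            }
            result.add(parts[i]);
        }
        return result;
    }
}
```

With a predicate backed by a set containing dziecko and geniusz, tokenize("dziecko-geniusz") yields the three tokens dziecko, -, geniusz, while tokenize("SMS-y") returns the single token SMS-y. The real class relies on the tagger supplied through setTagger for this decision, so its results depend on the dictionary in use.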
-
setTagger
public void setTagger(org.languagetool.tagging.Tagger tagger)
Sets the tagger to use in tokenizing. This is called in the constructor of the Polish class, but if the tokenizer is used separately, it has to be called after the constructor to enable the hybrid hyphen tokenizing.
- Parameters:
tagger
- The tagger to use (compatible only with the PolishBaseTagger that uses the delivered PoliMorfologik 2.1 or later).
- Since:
2.5
-