Package org.languagetool.tokenizers.pt
Class PortugueseWordTokenizer
java.lang.Object
org.languagetool.tokenizers.WordTokenizer
org.languagetool.tokenizers.pt.PortugueseWordTokenizer
- All Implemented Interfaces:
org.languagetool.tokenizers.Tokenizer
public class PortugueseWordTokenizer
extends org.languagetool.tokenizers.WordTokenizer
Tokenizes a sentence into words. Punctuation marks and whitespace characters each get their own token.
- Since:
- 3.6
-
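The behaviour described above (each punctuation or whitespace character becomes a separate token) can be sketched with the JDK alone. This is a minimal illustration, not the LanguageTool implementation: the class name DelimiterTokenizerSketch and the short DELIMITERS string are hypothetical, and the real delimiter set is the one returned by the inherited getTokenizingCharacters().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

final class DelimiterTokenizerSketch {

    // Hypothetical, deliberately small delimiter set (assumption for this sketch).
    private static final String DELIMITERS = " \t\n.,;:!?()\"";

    /** Splits text so that every delimiter character becomes its own token. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        // returnDelims = true makes StringTokenizer emit each delimiter
        // as a one-character token instead of discarding it.
        StringTokenizer st = new StringTokenizer(text, DELIMITERS, true);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // One token per line, wrapped in brackets so whitespace tokens are visible.
        for (String token : tokenize("Bom dia, mundo!")) {
            System.out.println("[" + token + "]");
        }
    }
}
```

Because delimiters are kept, concatenating the tokens reproduces the input text unchanged.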
Field Summary
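Fields
Modifier and Type               Field
private static final Pattern    COLON_NUMBERS_PATTERN
private static final String     COLON_NUMBERS_REPL
private static final Pattern    DATE_PATTERN
private static final String     DATE_PATTERN_REPL
private static final Pattern    DECIMAL_COMMA_PATTERN
private static final String     DECIMAL_COMMA_REPL
private static final char       DECIMAL_COMMA_SUBST
private static final Pattern    DECIMAL_SPACE_PATTERN
private static final Pattern    DOTTED_NUMBERS_PATTERN
private static final String     DOTTED_NUMBERS_REPL
private static final Pattern    DOTTED_ORDINALS_PATTERN
private static final String     DOTTED_ORDINALS_REPL
private static final char       NON_BREAKING_COLON_SUBST
private static final char       NON_BREAKING_DOT_SUBST
private static final char       NON_BREAKING_SPACE_SUBST
private static final String     SPLIT_CHARS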
-
Constructor Summary
Constructors
PortugueseWordTokenizer()
-
Method Summary
Methods inherited from class org.languagetool.tokenizers.WordTokenizer
getProtocols, getTokenizingCharacters, isEMail, isUrl, joinEMails, joinEMailsAndUrls, joinUrls
-
Field Details
-
SPLIT_CHARS
- See Also:
-
DECIMAL_COMMA_SUBST
private static final char DECIMAL_COMMA_SUBST
- See Also:
-
NON_BREAKING_SPACE_SUBST
private static final char NON_BREAKING_SPACE_SUBST
- See Also:
-
NON_BREAKING_DOT_SUBST
private static final char NON_BREAKING_DOT_SUBST
- See Also:
-
NON_BREAKING_COLON_SUBST
private static final char NON_BREAKING_COLON_SUBST
- See Also:
-
DECIMAL_COMMA_PATTERN
-
DECIMAL_COMMA_REPL
- See Also:
-
DECIMAL_SPACE_PATTERN
-
DOTTED_NUMBERS_PATTERN
-
DOTTED_NUMBERS_REPL
- See Also:
-
COLON_NUMBERS_PATTERN
-
COLON_NUMBERS_REPL
- See Also:
-
DATE_PATTERN
-
DATE_PATTERN_REPL
- See Also:
-
DOTTED_ORDINALS_PATTERN
-
DOTTED_ORDINALS_REPL
- See Also:
-
-
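The *_PATTERN, *_REPL and *_SUBST field names suggest a protect-and-restore scheme: a character that would normally split a token (such as the comma in 3,50) is swapped for a placeholder before splitting and swapped back afterwards. This page does not show the field values, so the sketch below is an assumption throughout: the regular expression, the placeholder code point, the delimiter string and the class name DecimalCommaSketch are illustrative and are not taken from the LanguageTool source.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;
import java.util.regex.Pattern;

final class DecimalCommaSketch {

    // Assumed placeholder: a private-use code point unlikely to appear in real text.
    private static final char DECIMAL_COMMA_SUBST = '\uE001';
    // Assumed pattern: a comma with a digit directly on each side.
    private static final Pattern DECIMAL_COMMA_PATTERN = Pattern.compile("(\\d),(\\d)");
    // Replacement keeps both digits and puts the placeholder where the comma was.
    private static final String DECIMAL_COMMA_REPL = "$1" + DECIMAL_COMMA_SUBST + "$2";

    static List<String> tokenize(String text) {
        // 1. Protect: hide decimal commas from the splitter.
        String shielded = DECIMAL_COMMA_PATTERN.matcher(text).replaceAll(DECIMAL_COMMA_REPL);

        // 2. Split: every delimiter character becomes its own token.
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(shielded, " ,.", true);
        while (st.hasMoreTokens()) {
            // 3. Restore: turn the placeholder back into a comma.
            tokens.add(st.nextToken().replace(DECIMAL_COMMA_SUBST, ','));
        }
        return tokens;
    }
}
```

Under these assumptions, the number in "custa 3,50 euros, certo" stays a single token while the comma after "euros" is still split off. The same three steps would apply to the dot, colon and non-breaking-space placeholders, each with its own pattern.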
Constructor Details
-
PortugueseWordTokenizer
public PortugueseWordTokenizer()
-
-
Method Details
-
tokenize
public List<String> tokenize(String text)
- Specified by:
tokenize in interface org.languagetool.tokenizers.Tokenizer
- Overrides:
tokenize in class org.languagetool.tokenizers.WordTokenizer
-