Class PortugueseWordTokenizer

java.lang.Object
org.languagetool.tokenizers.WordTokenizer
org.languagetool.tokenizers.pt.PortugueseWordTokenizer
All Implemented Interfaces:
org.languagetool.tokenizers.Tokenizer

public class PortugueseWordTokenizer extends org.languagetool.tokenizers.WordTokenizer
Tokenizes a sentence into words. Punctuation and whitespace get their own tokens.
Since:
3.6
  • Field Details

    • SPLIT_CHARS

      private static final String SPLIT_CHARS
    • DECIMAL_COMMA_SUBST

      private static final char DECIMAL_COMMA_SUBST
    • NON_BREAKING_SPACE_SUBST

      private static final char NON_BREAKING_SPACE_SUBST
    • NON_BREAKING_DOT_SUBST

      private static final char NON_BREAKING_DOT_SUBST
    • NON_BREAKING_COLON_SUBST

      private static final char NON_BREAKING_COLON_SUBST
    • DECIMAL_COMMA_PATTERN

      private static final Pattern DECIMAL_COMMA_PATTERN
    • DECIMAL_COMMA_REPL

      private static final String DECIMAL_COMMA_REPL
    • DECIMAL_SPACE_PATTERN

      private static final Pattern DECIMAL_SPACE_PATTERN
    • DOTTED_NUMBERS_PATTERN

      private static final Pattern DOTTED_NUMBERS_PATTERN
    • DOTTED_NUMBERS_REPL

      private static final String DOTTED_NUMBERS_REPL
    • COLON_NUMBERS_PATTERN

      private static final Pattern COLON_NUMBERS_PATTERN
    • COLON_NUMBERS_REPL

      private static final String COLON_NUMBERS_REPL
    • DATE_PATTERN

      private static final Pattern DATE_PATTERN
    • DATE_PATTERN_REPL

      private static final String DATE_PATTERN_REPL
    • DOTTED_ORDINALS_PATTERN

      private static final Pattern DOTTED_ORDINALS_PATTERN
    • DOTTED_ORDINALS_REPL

      private static final String DOTTED_ORDINALS_REPL
  • Constructor Details

    • PortugueseWordTokenizer

      public PortugueseWordTokenizer()
  • Method Details

    • tokenize

      public List<String> tokenize(String text)
      Specified by:
      tokenize in interface org.languagetool.tokenizers.Tokenizer
      Overrides:
      tokenize in class org.languagetool.tokenizers.WordTokenizer
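
Example (illustrative sketch, not the LanguageTool source):
The *_PATTERN, *_REPL and *_SUBST field names above suggest a protect, split, restore scheme: separators that belong inside a token (such as a decimal comma) are swapped for a placeholder character before splitting and swapped back afterwards. The standalone class below shows that idea for decimal commas only. The class name DecimalCommaSketch and every constant value in it are assumptions made for illustration; the real constants are private and their values are not listed on this page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;
import java.util.regex.Pattern;

// Hypothetical stand-in class; it does not extend or call any LanguageTool code.
final class DecimalCommaSketch {

  // Assumed values, chosen only for this sketch.
  private static final String SPLIT_CHARS = " ,.;:!?()\"";
  // A private-use code point that should not occur in ordinary text.
  private static final char DECIMAL_COMMA_SUBST = '\uE001';
  // A comma with a digit on both sides; the lookahead keeps the second digit
  // available for the next match, so "1,2,3" is fully protected.
  private static final Pattern DECIMAL_COMMA_PATTERN = Pattern.compile("(\\d),(?=\\d)");
  private static final String DECIMAL_COMMA_REPL = "$1" + DECIMAL_COMMA_SUBST;

  static List<String> tokenize(String text) {
    // 1. Protect: hide decimal commas behind the placeholder.
    String guarded = DECIMAL_COMMA_PATTERN.matcher(text).replaceAll(DECIMAL_COMMA_REPL);

    // 2. Split: returnDelims = true, so each delimiter becomes its own token.
    List<String> tokens = new ArrayList<>();
    StringTokenizer splitter = new StringTokenizer(guarded, SPLIT_CHARS, true);
    while (splitter.hasMoreTokens()) {
      // 3. Restore: put the original comma back inside the token.
      tokens.add(splitter.nextToken().replace(DECIMAL_COMMA_SUBST, ','));
    }
    return tokens;
  }

  public static void main(String[] args) {
    System.out.println(tokenize("Custa 3,5 euros."));
  }
}
```

In this sketch "3,5" survives as a single token, while a comma that is not between digits is still split off on its own. The other *_SUBST fields (non-breaking space, dot, colon) could presumably be handled the same way, each with its own placeholder character, though this page does not confirm that.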