Class CatalanWordTokenizer

java.lang.Object
org.languagetool.tokenizers.WordTokenizer
org.languagetool.tokenizers.ca.CatalanWordTokenizer
All Implemented Interfaces:
org.languagetool.tokenizers.Tokenizer

public class CatalanWordTokenizer extends org.languagetool.tokenizers.WordTokenizer
Tokenizes a sentence into words. Punctuation and whitespace each get their own token. Hyphens and apostrophes receive special treatment for Catalan.
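The behaviour described above can be illustrated with a small, self-contained sketch. It is not the LanguageTool implementation: the class name CatalanTokenSketch and its single regular expression are hypothetical, and it assumes that an elided form such as l' or d' becomes a token of its own and that a word containing the middle dot (the ela geminada) stays in one piece. Hyphen handling and the dictionary lookup are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical, simplified stand-in; NOT the CatalanWordTokenizer source.
class CatalanTokenSketch {

    // Alternatives are tried in order at each position:
    //  1. an elided form (l', d', m', ...) directly followed by a letter
    //  2. a run of letters, digits and middle dots (U+00B7)
    //  3. a single whitespace character
    //  4. any other single character, typically punctuation
    private static final Pattern TOKEN = Pattern.compile(
            "[dlmnstDLMNST]['\\u2019](?=\\p{L})"
            + "|[\\p{L}\\p{N}\\u00B7]+"
            + "|\\s"
            + "|.",
            Pattern.DOTALL);

    // Returns every token in order; joining the tokens reproduces the input.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("L'home, ara."));
    }
}
```

For the input "L'home, ara." the sketch yields six tokens: L', home, the comma, the space, ara, and the full stop. Because whitespace and punctuation are kept as tokens, concatenating the list gives back the original sentence.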
  • Field Details

    • PF

      private static final String PF
    • maxPatterns

      private static final int maxPatterns
    • patterns

      private final Pattern[] patterns
    • DICT_FILENAME

      private static final String DICT_FILENAME
    • speller

      protected org.languagetool.rules.spelling.morfologik.MorfologikSpeller speller
    • ELA_GEMINADA

      private static final Pattern ELA_GEMINADA
    • ELA_GEMINADA_UPPERCASE

      private static final Pattern ELA_GEMINADA_UPPERCASE
    • APOSTROF_RECTE

      private static final Pattern APOSTROF_RECTE
    • APOSTROF_RODO

      private static final Pattern APOSTROF_RODO
    • APOSTROF_RECTE_1

      private static final Pattern APOSTROF_RECTE_1
    • APOSTROF_RODO_1

      private static final Pattern APOSTROF_RODO_1
    • NEARBY_HYPHENS

      private static final Pattern NEARBY_HYPHENS
    • HYPHENS

      private static final Pattern HYPHENS
    • DECIMAL_POINT

      private static final Pattern DECIMAL_POINT
    • DECIMAL_COMMA

      private static final Pattern DECIMAL_COMMA
    • SPACE_DIGITS0

      private static final Pattern SPACE_DIGITS0
    • SPACE_DIGITS

      private static final Pattern SPACE_DIGITS
    • SPACE_DIGITS2

      private static final Pattern SPACE_DIGITS2
  • Constructor Details

    • CatalanWordTokenizer

      public CatalanWordTokenizer()
  • Method Details

    • tokenize

      public List<String> tokenize(String text)
      Specified by:
      tokenize in interface org.languagetool.tokenizers.Tokenizer
      Overrides:
      tokenize in class org.languagetool.tokenizers.WordTokenizer
      Parameters:
      text - Text to tokenize
      Returns:
      The list of tokens. Note: the special strings CA_APOS and CA_HYPHEN are used to replace apostrophes and hyphens, respectively.
    • wordsToAdd

      private List<String> wordsToAdd(String s)
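The note on tokenize mentions special strings that replace apostrophes and hyphens. One common way to use such placeholders is sketched below, under stated assumptions: the class PlaceholderSketch, the placeholder values, the delimiter set, and the ELISION pattern are all hypothetical and only echo the names CA_APOS and CA_HYPHEN. The idea is to protect word-internal apostrophes and hyphens before a generic splitter runs, restore them afterwards, and only then split off an elided prefix.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of the placeholder technique; NOT the library code.
class PlaceholderSketch {

    // Assumed placeholder values; the real strings are not given in this page.
    static final String APOS = "xxCA_APOSxx";
    static final String HYPHEN = "xxCA_HYPHENxx";

    // Characters the generic splitter treats as separate tokens.
    static final String DELIMS = " \t\n.,;:!?'-";

    // An elided form (d', l', ...) followed by the rest of the word.
    private static final Pattern ELISION =
            Pattern.compile("^([dlmnstDLMNST]')(\\p{L}.*)$");

    static List<String> tokenize(String text) {
        // 1. Protect apostrophes and hyphens that sit between two letters.
        String protectedText = text
                .replaceAll("(?<=\\p{L})'(?=\\p{L})", APOS)
                .replaceAll("(?<=\\p{L})-(?=\\p{L})", HYPHEN);

        // 2. Split on delimiters, keeping each delimiter as its own token.
        List<String> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(protectedText, DELIMS, true);
        while (st.hasMoreTokens()) {
            // 3. Restore the original characters.
            String token = st.nextToken().replace(APOS, "'").replace(HYPHEN, "-");
            // 4. Split an elided prefix from the word that follows it.
            Matcher m = ELISION.matcher(token);
            if (m.matches()) {
                out.add(m.group(1));
                out.add(m.group(2));
            } else {
                out.add(token);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("d'un sud-est."));
    }
}
```

With the input "d'un sud-est." the sketch returns d', un, a space, sud-est, and the full stop. Without step 1 the same splitter would break sud-est into three tokens, which is the problem the placeholders avoid.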