Class BretonWordTokenizer

java.lang.Object
org.languagetool.tokenizers.WordTokenizer
org.languagetool.tokenizers.br.BretonWordTokenizer
All Implemented Interfaces:
org.languagetool.tokenizers.Tokenizer

public class BretonWordTokenizer extends org.languagetool.tokenizers.WordTokenizer
  • Constructor Summary

    Constructors
    Constructor              Description
    BretonWordTokenizer()
  • Method Summary

    Modifier and Type    Method                   Description
    List<String>         tokenize(String text)    Tokenizes just like WordTokenizer, except that "c’h" is not split.

    Methods inherited from class org.languagetool.tokenizers.WordTokenizer

    getProtocols, getTokenizingCharacters, isEMail, isUrl, joinEMails, joinEMailsAndUrls, joinUrls

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Constructor Details

    • BretonWordTokenizer

      public BretonWordTokenizer()
  • Method Details

    • tokenize

      public List<String> tokenize(String text)
      Tokenizes just like WordTokenizer, except that "c’h" is not split. "C’h" is treated as a single letter (a trigraph) in Breton and occurs in many words, so the tokenizer must not split it. In addition, elided forms such as "n’eo" are split into exactly two tokens: "n’" + "eo".
      Specified by:
      tokenize in interface org.languagetool.tokenizers.Tokenizer
      Overrides:
      tokenize in class org.languagetool.tokenizers.WordTokenizer
      Parameters:
      text - Text to tokenize
      Returns:
      List of tokens. Note: a special string ##BR_APOS## is used to replace apostrophes during tokenizing.
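
The placeholder technique described above can be illustrated with a small standalone sketch. This is not LanguageTool's implementation: the class name BretonTokenizerSketch, the delimiter set, and the internal split marker are assumptions made for illustration, and only the ##BR_APOS## placeholder comes from the note above. The sketch hides apostrophes behind the placeholder, tokenizes on a simple delimiter set, then restores the apostrophes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class BretonTokenizerSketch {
    // Placeholder named in the documentation; it contains no delimiter characters.
    private static final String APOS = "##BR_APOS##";
    // Hypothetical marker (assumption) that forces a token boundary after an elided prefix.
    private static final String SPLIT = "\u0001";
    // Hypothetical delimiter set (assumption); the real WordTokenizer uses its own set.
    private static final String DELIMITERS = " \t\n.,;:!?'\u2019\"()" + SPLIT;

    public static List<String> tokenize(String text) {
        // 1. Protect the apostrophe inside the trigraph c'h so it is never a split point.
        String s = text.replaceAll("([Cc])['\u2019]([Hh])", "$1" + APOS + "$2");
        // 2. Protect the apostrophe after any other letter (elision such as n'eo)
        //    and add the split marker so the prefix becomes its own token.
        s = s.replaceAll("(\\p{L})['\u2019]", "$1" + APOS + SPLIT);

        List<String> tokens = new ArrayList<>();
        // returnDelims = true keeps whitespace and punctuation as tokens.
        StringTokenizer st = new StringTokenizer(s, DELIMITERS, true);
        while (st.hasMoreTokens()) {
            String token = st.nextToken();
            if (token.equals(SPLIT)) {
                continue; // the marker only exists to create a boundary
            }
            // 3. Restore the apostrophe. Simplification: always restores the
            //    typographic form, even if the input used the ASCII apostrophe.
            tokens.add(token.replace(APOS, "\u2019"));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("n\u2019eo ket c\u2019hoazh"));
    }
}
```

In this sketch, the input "n’eo ket c’hoazh" yields six tokens: "n’", "eo", " ", "ket", " ", "c’hoazh". The trigraph stays inside its word, while the elided prefix is separated from the following word without introducing an extra token.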