Class RegexTokenizer

  • All Implemented Interfaces:
    Tokenizer<java.lang.CharSequence>

    final class RegexTokenizer
    extends java.lang.Object
    implements Tokenizer<java.lang.CharSequence>
    A simple word tokenizer that uses a regular expression to find words. It applies the regex (\w)+ to the input text and returns each match as a token.
    Since:
    1.0
    • Field Summary

      Fields 
      Modifier and Type Field Description
      private static java.util.regex.Pattern PATTERN
      The word pattern, (\w)+, used to find tokens.
    • Constructor Summary

      Constructors 
      Constructor Description
      RegexTokenizer()  
    • Method Summary

      All Methods Instance Methods Concrete Methods 
      Modifier and Type Method Description
      java.lang.CharSequence[] tokenize(java.lang.CharSequence text)
      Returns an array of tokens.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • PATTERN

        private static final java.util.regex.Pattern PATTERN
        The word pattern, (\w)+, used to find tokens.
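        To illustrate what the expression (\w)+ from the class description matches, here is a minimal sketch that uses only java.util.regex. The class WordPatternDemo and its matches helper are hypothetical names introduced for this example and are not part of the library. Note that in java.util.regex, \w covers only [a-zA-Z_0-9] unless Pattern.UNICODE_CHARACTER_CLASS is set, so punctuation and hyphens act as token boundaries while underscores do not.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; it only mirrors the expression quoted in the class description.
final class WordPatternDemo {

    // Same expression as in the class description: one or more word characters.
    private static final Pattern PATTERN = Pattern.compile("(\\w)+");

    // Collects every maximal run of word characters found in the input.
    static List<String> matches(CharSequence input) {
        List<String> found = new ArrayList<>();
        Matcher matcher = PATTERN.matcher(input);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // prints [snake_case, and, kebab, case, v2]
        System.out.println(matches("snake_case and kebab-case, v2"));
    }
}
```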
    • Constructor Detail

      • RegexTokenizer

        RegexTokenizer()
    • Method Detail

      • tokenize

        public java.lang.CharSequence[] tokenize(java.lang.CharSequence text)
        Returns an array of tokens.
        Specified by:
        tokenize in interface Tokenizer<java.lang.CharSequence>
        Parameters:
        text - input text
        Returns:
        array of tokens
        Throws:
        java.lang.IllegalArgumentException - if the input text is blank
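        Because RegexTokenizer is package-private, it cannot be called directly from outside its package. The contract documented above can be approximated by the following sketch, which is not the library's source. RegexTokenizerSketch is a hypothetical class, and treating null as blank, the whitespace test, and the exception message are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in that follows the documented contract of tokenize.
final class RegexTokenizerSketch {

    // The expression named in the class description.
    private static final Pattern WORD = Pattern.compile("(\\w)+");

    static CharSequence[] tokenize(CharSequence text) {
        // Assumption: "blank" means null, empty, or whitespace only.
        if (text == null || text.chars().allMatch(Character::isWhitespace)) {
            throw new IllegalArgumentException("Invalid text");
        }
        Matcher matcher = WORD.matcher(text);
        List<String> tokens = new ArrayList<>();
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        // A String[] is returned as CharSequence[] through array covariance.
        return tokens.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // prints Hello, world, It, s, 2024_ok (one token per line)
        for (CharSequence token : tokenize("Hello, world! It's 2024_ok")) {
            System.out.println(token);
        }
    }
}
```

        In this sketch, text that is not blank but contains no word characters (for example "!!!") yields an empty array and does not throw.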