org.apache.lucene.index.memory

Class PatternAnalyzer

public class PatternAnalyzer extends Analyzer

Efficient Lucene analyzer/tokenizer that preferably operates on a String rather than a java.io.Reader. It flexibly separates text into terms via a regular expression Pattern (with behaviour identical to String#split(String)), and it combines the functionality of LetterTokenizer, LowerCaseTokenizer, WhitespaceTokenizer and StopFilter into a single efficient multi-purpose class.

If you are unsure what exactly a regular expression should look like, consider prototyping by trying various expressions on some test texts via String#split(String). Once you are satisfied, give that regex to PatternAnalyzer. Also see the Java Regular Expression Tutorial.
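
The prototyping step suggested here needs nothing beyond the JDK. The following is a minimal sketch; the class name SplitPrototype, the helper trySplit and the sample texts are illustrative assumptions, not part of Lucene.

```java
import java.util.Arrays;

// Hypothetical helper for trying out candidate regexes; not part of Lucene.
class SplitPrototype {

    // Splits the text with String#split so the resulting pieces can be inspected.
    static String[] trySplit(String regex, String text) {
        return text.split(regex);
    }

    public static void main(String[] args) {
        // Candidate 1: split at runs of whitespace.
        System.out.println(Arrays.toString(
                trySplit("\\s+", "James is running round in the woods")));
        // prints [James, is, running, round, in, the, woods]

        // Candidate 2: split at runs of non-word characters.
        System.out.println(Arrays.toString(
                trySplit("\\W+", "Hello, world! How are you?")));
        // prints [Hello, world, How, are, you]
    }
}
```

Once a candidate regex yields the desired pieces, the same expression can be compiled with Pattern.compile and passed to the PatternAnalyzer constructor.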

This class can be considerably faster than the "normal" Lucene tokenizers. It can also serve as a building block in a compound Lucene TokenFilter chain, as in this stemming example:

 PatternAnalyzer pat = ...
 TokenStream tokenStream = new SnowballFilter(
     pat.tokenStream("content", "James is running round in the woods"),
     "English");
 

Author: whoschek.AT.lbl.DOT.gov

Field Summary
static PatternAnalyzer DEFAULT_ANALYZER
A lower-casing word analyzer with English stop words (can be shared freely across threads without harm); global per class loader.
static PatternAnalyzer EXTENDED_ANALYZER
A lower-casing word analyzer with extended English stop words (can be shared freely across threads without harm); global per class loader.
static Pattern NON_WORD_PATTERN
"\\W+"; Divides text at non-letters (NOT Character.isLetter(c))
static Pattern WHITESPACE_PATTERN
"\\s+"; Divides text at whitespaces (Character.isWhitespace(c))
Constructor Summary
PatternAnalyzer(Pattern pattern, boolean toLowerCase, Set stopWords)
Constructs a new instance with the given parameters.
Method Summary
boolean equals(Object other)
Indicates whether some other object is "equal to" this one.
int hashCode()
Returns a hash code value for the object.
TokenStream tokenStream(String fieldName, String text)
Creates a token stream that tokenizes the given string into token terms (aka words).
TokenStream tokenStream(String fieldName, Reader reader)
Creates a token stream that tokenizes all the text in the given Reader. This implementation forwards to tokenStream(String, String) and is less efficient than calling tokenStream(String, String) directly.

Field Detail

DEFAULT_ANALYZER

public static final PatternAnalyzer DEFAULT_ANALYZER
A lower-casing word analyzer with English stop words (can be shared freely across threads without harm); global per class loader.

EXTENDED_ANALYZER

public static final PatternAnalyzer EXTENDED_ANALYZER
A lower-casing word analyzer with extended English stop words (can be shared freely across threads without harm); global per class loader. The stop words are borrowed from http://thomas.loc.gov/home/stopwords.html; see also http://thomas.loc.gov/home/all.about.inquery.html

NON_WORD_PATTERN

public static final Pattern NON_WORD_PATTERN
"\\W+"; Divides text at non-letters (NOT Character.isLetter(c))

WHITESPACE_PATTERN

public static final Pattern WHITESPACE_PATTERN
"\\s+"; Divides text at whitespaces (Character.isWhitespace(c))

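The difference between the two patterns is easiest to see on text containing punctuation. The sketch below is an illustration only: it compiles its own patterns from the regex strings documented for NON_WORD_PATTERN and WHITESPACE_PATTERN rather than using the PatternAnalyzer constants, and the class name PatternComparison is a hypothetical one.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class; the patterns are compiled locally from the
// documented regex strings instead of being taken from PatternAnalyzer.
class PatternComparison {

    static final Pattern NON_WORD = Pattern.compile("\\W+");
    static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Splits the text at every match of the given delimiter pattern.
    static String[] split(Pattern pattern, String text) {
        return pattern.split(text);
    }

    public static void main(String[] args) {
        String text = "don't stop, now";

        // Whitespace splitting keeps punctuation attached to the tokens.
        System.out.println(Arrays.toString(split(WHITESPACE, text)));
        // prints [don't, stop,, now]

        // Non-word splitting also breaks at the apostrophe and the comma.
        System.out.println(Arrays.toString(split(NON_WORD, text)));
        // prints [don, t, stop, now]
    }
}
```
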
Constructor Detail

PatternAnalyzer

public PatternAnalyzer(Pattern pattern, boolean toLowerCase, Set stopWords)
Constructs a new instance with the given parameters.

Parameters:
pattern - a regular expression delimiting tokens
toLowerCase - if true, returns tokens after applying String.toLowerCase()
stopWords - if non-null, ignores all tokens that are contained in the given stop set (after previously having applied toLowerCase() if applicable). For example, created from a String[] and/or via WordlistLoader, as in WordlistLoader.getWordSet(new File("samples/fulltext/stopwords.txt")), or from other stop word lists.
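
How the three constructor parameters interact can be sketched with plain JDK classes: split at the pattern, optionally lower-case each token, then drop tokens found in the stop set. This is a rough illustration of the documented behaviour, not PatternAnalyzer's actual implementation. The class PatternAnalysisSketch and the method analyze are hypothetical names, and the use of Locale.ROOT for lower-casing is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical sketch of the documented semantics; not Lucene code.
class PatternAnalysisSketch {

    static List<String> analyze(Pattern pattern, boolean toLowerCase,
                                Set<String> stopWords, String text) {
        List<String> terms = new ArrayList<String>();
        for (String token : pattern.split(text)) {
            if (token.isEmpty()) {
                continue; // split can yield a leading empty string
            }
            if (toLowerCase) {
                // Assumption: a fixed locale, so results do not depend on the JVM default.
                token = token.toLowerCase(Locale.ROOT);
            }
            // The stop set is consulted after lower-casing, as the parameter docs describe.
            if (stopWords != null && stopWords.contains(token)) {
                continue;
            }
            terms.add(token);
        }
        return terms;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<String>(Arrays.asList("is", "in", "the"));
        System.out.println(analyze(Pattern.compile("\\W+"), true, stopWords,
                "James is running round in the woods"));
        // prints [james, running, round, woods]
    }
}
```

Because the stop set is checked after lower-casing, a stop word list intended for use with toLowerCase set to true should itself contain lower-case entries.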

Method Detail

equals

public boolean equals(Object other)
Indicates whether some other object is "equal to" this one.

Parameters:
other - the reference object with which to compare.

Returns: true if equal, false otherwise

hashCode

public int hashCode()
Returns a hash code value for the object.

Returns: the hash code.

tokenStream

public TokenStream tokenStream(String fieldName, String text)
Creates a token stream that tokenizes the given string into token terms (aka words).

Parameters:
fieldName - the name of the field to tokenize (currently ignored).
text - the string to tokenize

Returns: a new token stream

tokenStream

public TokenStream tokenStream(String fieldName, Reader reader)
Creates a token stream that tokenizes all the text in the given Reader. This implementation forwards to tokenStream(String, String) and is less efficient than calling tokenStream(String, String) directly.

Parameters:
fieldName - the name of the field to tokenize (currently ignored).
reader - the reader delivering the text

Returns: a new token stream
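
The extra cost of the Reader variant comes from having to drain the whole Reader into a String before the String-based tokenization can start. A minimal sketch of that draining step follows; the class ReaderToStringSketch, the method readAll and the buffer size are assumptions made for illustration, not the code PatternAnalyzer uses.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical sketch of "read everything, then tokenize the String"; not Lucene code.
class ReaderToStringSketch {

    // Reads the Reader to its end and returns the accumulated text.
    static String readAll(Reader reader) {
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[4096]; // assumed buffer size
        try {
            int n;
            while ((n = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The full text must be materialized before any splitting can happen.
        String text = readAll(new StringReader("James is running round in the woods"));
        System.out.println(Arrays.toString(text.split("\\s+")));
        // prints [James, is, running, round, in, the, woods]
    }
}
```

When the text is already available as a String, calling tokenStream(String, String) directly avoids this intermediate copy.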

Copyright © 2000-2007 Apache Software Foundation. All Rights Reserved.