org.apache.lucene.analysis

Class LetterTokenizer

Direct Known Subclasses:
LowerCaseTokenizer

public class LetterTokenizer
extends CharTokenizer

A LetterTokenizer is a tokenizer that divides text at non-letters. That is, it defines tokens as maximal strings of adjacent letters, as determined by the java.lang.Character.isLetter(char) predicate. Note: this works reasonably well for most European languages but poorly for some Asian languages, where words are not separated by spaces.
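
The splitting rule described above can be illustrated without Lucene on the classpath. The following is a minimal standalone sketch, not the library's implementation: the class name LetterRunsSketch and the method letterRuns are hypothetical, and the sketch ignores details such as the Reader input and token offsets.

```java
import java.util.ArrayList;
import java.util.List;

// Standalone illustration only; it does not use any Lucene classes.
class LetterRunsSketch {

    // Returns the maximal runs of adjacent letters in text,
    // where "letter" means Character.isLetter(char) returns true.
    static List<String> letterRuns(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(c);              // still inside a token
            } else if (current.length() > 0) {
                tokens.add(current.toString()); // a non-letter ends the token
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());     // flush the trailing token
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(letterRuns("Don't panic: 42 towels!"));
        // prints [Don, t, panic, towels]
    }
}
```

The apostrophe and the digits act as separators, so "Don't" yields two tokens and "42" yields none. A run of CJK ideographs with no spaces consists entirely of letters, so under this rule it comes out as a single token, which is the limitation the note above refers to.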

Field Summary

Fields inherited from class org.apache.lucene.analysis.Tokenizer

input

Constructor Summary

LetterTokenizer(Reader in)
Construct a new LetterTokenizer.

Method Summary

protected boolean
isTokenChar(char c)
Collects only characters which satisfy Character.isLetter(char).

Methods inherited from class org.apache.lucene.analysis.CharTokenizer

next, normalize

Methods inherited from class org.apache.lucene.analysis.Tokenizer

close

Methods inherited from class org.apache.lucene.analysis.TokenStream

close, next

Constructor Details

LetterTokenizer

public LetterTokenizer(Reader in)
Construct a new LetterTokenizer.

Method Details

isTokenChar

protected boolean isTokenChar(char c)
Collects only characters which satisfy Character.isLetter(char).
Overrides:
isTokenChar in class CharTokenizer
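
The override relationship can be sketched with a stand-in base class. Everything below is hypothetical and only mirrors the method names listed on this page (isTokenChar, normalize); it is not the Lucene source. In particular, the lower-casing subclass shows one way a subclass might use normalize, which is an assumption about the design rather than something this page states.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins; none of these are the real Lucene classes.
class IsTokenCharSketch {

    // Plays the role of the base tokenizer: subclasses choose the token characters.
    abstract static class BaseTokenizer {
        protected abstract boolean isTokenChar(char c);

        // Per-character hook; identity by default.
        protected char normalize(char c) {
            return c;
        }

        // Collects maximal runs of token characters, normalizing each one.
        List<String> tokenize(String text) {
            List<String> tokens = new ArrayList<>();
            StringBuilder current = new StringBuilder();
            for (char c : text.toCharArray()) {
                if (isTokenChar(c)) {
                    current.append(normalize(c));
                } else if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            }
            if (current.length() > 0) {
                tokens.add(current.toString());
            }
            return tokens;
        }
    }

    // Mirrors the override documented here: only letters are token characters.
    static class Letters extends BaseTokenizer {
        @Override
        protected boolean isTokenChar(char c) {
            return Character.isLetter(c);
        }
    }

    // Assumed design: a subclass keeps the letter rule and lower-cases each character.
    static class LowerCaseLetters extends Letters {
        @Override
        protected char normalize(char c) {
            return Character.toLowerCase(c);
        }
    }

    public static void main(String[] args) {
        System.out.println(new Letters().tokenize("Hello, World 2007"));
        // prints [Hello, World]
        System.out.println(new LowerCaseLetters().tokenize("Hello, World 2007"));
        // prints [hello, world]
    }
}
```

Keeping the character test in a single protected method means a subclass can change what counts as a token character, or how characters are normalized, without touching the scanning loop.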

Copyright © 2000-2007 Apache Software Foundation. All Rights Reserved.