Class CommonGramsFilter
- java.lang.Object
  - org.apache.lucene.util.AttributeSource
    - org.apache.lucene.analysis.TokenStream
      - org.apache.lucene.analysis.TokenFilter
        - org.apache.lucene.analysis.commongrams.CommonGramsFilter
All Implemented Interfaces: java.io.Closeable, java.lang.AutoCloseable, Unwrappable<TokenStream>
public final class CommonGramsFilter extends TokenFilter
Construct bigrams for frequently occurring terms while indexing. Single terms are still indexed too, with bigrams overlaid. This is achieved through the use of PositionIncrementAttribute.setPositionIncrement(int). Bigrams have a type of GRAM_TYPE.
Example:
- input: "the quick brown fox"
- output: |"the","the-quick"|"quick"|"brown"|"fox"|
- "the-quick" has a position increment of 0, so it is in the same position as "the"
- "the-quick" has a term.type() of "gram"
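The overlay described above can be illustrated with a small standalone sketch. It does not use the Lucene API: the class `CommonGramsOverlaySketch`, the `Tok` holder, and the `overlay` method are hypothetical names, the "-" joiner simply follows the notation used on this page, and the "word" type for unigrams is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class CommonGramsOverlaySketch {
    /** Hypothetical token holder: term text, position increment, and type. */
    public static final class Tok {
        public final String term;
        public final int posInc;
        public final String type;

        public Tok(String term, int posInc, String type) {
            this.term = term;
            this.posInc = posInc;
            this.type = type;
        }

        @Override
        public String toString() {
            return term + "(posInc=" + posInc + ",type=" + type + ")";
        }
    }

    /** Emits every word, plus a bigram at the same position when the word or its successor is common. */
    public static List<Tok> overlay(List<String> words, Set<String> common) {
        List<Tok> out = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            String word = words.get(i);
            // The unigram advances the position by one ("word" type is an assumption).
            out.add(new Tok(word, 1, "word"));
            if (i + 1 < words.size()) {
                String next = words.get(i + 1);
                if (common.contains(word) || common.contains(next)) {
                    // Increment 0 stacks the bigram on the unigram just emitted.
                    out.add(new Tok(word + "-" + next, 0, "gram"));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (Tok t : overlay(List.of("the", "quick", "brown", "fox"), Set.of("the"))) {
            System.out.println(t);
        }
    }
}
```

With "the" as the only common word, the sketch emits "the-quick" with increment 0 directly after "the", and every single word still appears with increment 1.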
Nested Class Summary
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
AttributeSource.State
Field Summary
Fields (modifier and type, field):
- private java.lang.StringBuilder buffer
- private CharArraySet commonWords
- static java.lang.String GRAM_TYPE
- private int lastStartOffset
- private boolean lastWasCommon
- private OffsetAttribute offsetAttribute
- private PositionIncrementAttribute posIncAttribute
- private PositionLengthAttribute posLenAttribute
- private AttributeSource.State savedState
- private static char SEPARATOR
- private CharTermAttribute termAttribute
- private TypeAttribute typeAttribute
Fields inherited from class org.apache.lucene.analysis.TokenFilter
input
Fields inherited from class org.apache.lucene.analysis.TokenStream
DEFAULT_TOKEN_ATTRIBUTE_FACTORY
Constructor Summary
Constructors (constructor, description):
- CommonGramsFilter(TokenStream input, CharArraySet commonWords): Construct a token stream filtering the given input using a Set of common words to create bigrams.
Method Summary
Methods (modifier and type, method, description):
- private void gramToken(): Constructs a compound token.
- boolean incrementToken(): Inserts bigrams for common words into a token stream.
- private boolean isCommon(): Determines if the current token is a common term.
- void reset(): This method is called by a consumer before it begins consumption using TokenStream.incrementToken().
- private void saveTermBuffer(): Saves this information to form the left part of a gram.
Methods inherited from class org.apache.lucene.analysis.TokenFilter
close, end, unwrap
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString
-
-
-
-
Field Detail
-
GRAM_TYPE
public static final java.lang.String GRAM_TYPE
- See Also:
- Constant Field Values
-
SEPARATOR
private static final char SEPARATOR
- See Also:
- Constant Field Values
-
commonWords
private final CharArraySet commonWords
-
buffer
private final java.lang.StringBuilder buffer
-
termAttribute
private final CharTermAttribute termAttribute
-
offsetAttribute
private final OffsetAttribute offsetAttribute
-
typeAttribute
private final TypeAttribute typeAttribute
-
posIncAttribute
private final PositionIncrementAttribute posIncAttribute
-
posLenAttribute
private final PositionLengthAttribute posLenAttribute
-
lastStartOffset
private int lastStartOffset
-
lastWasCommon
private boolean lastWasCommon
-
savedState
private AttributeSource.State savedState
-
-
Constructor Detail
-
CommonGramsFilter
public CommonGramsFilter(TokenStream input, CharArraySet commonWords)
Construct a token stream filtering the given input using a Set of common words to create bigrams. Outputs both unigrams, each with its own position increment, and bigrams with position increment 0 and type=gram, where one or both of the words in a potential bigram are in the set of common words.
- Parameters:
input - TokenStream input in filter chain
commonWords - The set of common words.
-
-
Method Detail
-
incrementToken
public boolean incrementToken() throws java.io.IOException
Inserts bigrams for common words into a token stream. For each input token, output the token. If the token and/or the following token are in the list of common words, also output a bigram with position increment 0 and type="gram".
TODO: Consider adding an option to not emit unigram stopwords, as in CDL XTF BigramStopFilter. CommonGramsQueryFilter would need to be changed to work with this.
TODO: Consider optimizing for the case of three commongrams, e.g. "man of the year" normally produces 3 bigrams: "man-of", "of-the", "the-year", but with proper management of positions we could eliminate the middle bigram "of-the" and save a disk seek and a whole set of position lookups.
- Specified by: incrementToken in class TokenStream
- Returns: false for end of stream; true otherwise
- Throws: java.io.IOException
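The one-token-per-call behaviour and the "man of the year" case can be sketched as a pull-style loop. This is only an illustration under stated assumptions, not the filter's actual code: `PullStyleGramSketch` and `drain` are hypothetical names, a one-word lookahead stands in for whatever the real filter keeps in fields such as savedState and lastWasCommon, and the "-" joiner follows the notation of this page.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

public class PullStyleGramSketch {
    private final Iterator<String> input;
    private final Set<String> common;
    private String current;      // next unigram to emit, or null at end of input
    private String pendingGram;  // bigram waiting to be emitted at the same position

    // Values exposed for the token produced by the latest incrementToken() call.
    public String term;
    public String type;
    public int posInc;

    public PullStyleGramSketch(List<String> words, Set<String> common) {
        this.input = words.iterator();
        this.common = common;
        this.current = input.hasNext() ? input.next() : null;
    }

    /** Advances to the next token; returns false at end of stream. */
    public boolean incrementToken() {
        if (pendingGram != null) {
            // Emit the buffered bigram on top of the unigram returned last time.
            term = pendingGram;
            type = "gram";
            posInc = 0;
            pendingGram = null;
            return true;
        }
        if (current == null) {
            return false;
        }
        String next = input.hasNext() ? input.next() : null;
        term = current;
        type = "word"; // assumed type for plain unigrams
        posInc = 1;
        if (next != null && (common.contains(current) || common.contains(next))) {
            pendingGram = current + "-" + next;
        }
        current = next;
        return true;
    }

    /** Hypothetical helper: consumes the whole stream and collects the term texts. */
    public static List<String> drain(List<String> words, Set<String> common) {
        PullStyleGramSketch sketch = new PullStyleGramSketch(words, common);
        List<String> terms = new ArrayList<>();
        while (sketch.incrementToken()) {
            terms.add(sketch.term);
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(drain(List.of("man", "of", "the", "year"), Set.of("of", "the")));
    }
}
```

With "of" and "the" as common words, the sketch yields the three bigrams "man-of", "of-the", and "the-year" interleaved with the four unigrams, which is the unoptimized case the second TODO refers to.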
-
reset
public void reset() throws java.io.IOException
Description copied from class: TokenFilter
This method is called by a consumer before it begins consumption using TokenStream.incrementToken().
Resets this stream to a clean state. Stateful implementations must implement this method so that they can be reused, just as if they had been created fresh.
If you override this method, always call super.reset(), otherwise some internal state will not be correctly reset (e.g., Tokenizer will throw IllegalStateException on further usage).
NOTE: The default implementation chains the call to the input TokenStream, so be sure to call super.reset() when overriding this method.
- Overrides: reset in class TokenFilter
- Throws: java.io.IOException
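The reset contract can be shown with a minimal standalone sketch. All classes here (`Source`, `BaseFilter`, `StatefulFilter`) are hypothetical stand-ins rather than Lucene types; the field names are borrowed from this page, but the values they are cleared to are assumptions.

```java
public class ResetContractSketch {
    /** Hypothetical upstream stage that counts how often it was reset. */
    public static class Source {
        public int resetCount = 0;

        public void reset() {
            resetCount++;
        }
    }

    /** Hypothetical base filter: its reset() chains the call to the wrapped input. */
    public static class BaseFilter {
        protected final Source input;

        public BaseFilter(Source input) {
            this.input = input;
        }

        public void reset() {
            input.reset();
        }
    }

    /** Hypothetical stateful filter with per-stream fields that must be cleared. */
    public static class StatefulFilter extends BaseFilter {
        public boolean lastWasCommon;
        public int lastStartOffset;

        public StatefulFilter(Source input) {
            super(input);
        }

        @Override
        public void reset() {
            super.reset();          // chain to the input first
            lastWasCommon = false;  // then clear this filter's own state (assumed values)
            lastStartOffset = 0;
        }
    }

    public static void main(String[] args) {
        Source source = new Source();
        StatefulFilter filter = new StatefulFilter(source);
        filter.lastWasCommon = true;   // pretend a previous stream left state behind
        filter.lastStartOffset = 7;
        filter.reset();
        System.out.println(source.resetCount + " " + filter.lastWasCommon + " " + filter.lastStartOffset);
    }
}
```

Omitting the `super.reset()` call in this sketch would leave `Source` untouched, which mirrors the situation the NOTE above warns about.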
-
isCommon
private boolean isCommon()
Determines if the current token is a common term.
- Returns: true if the current token is a common term, false otherwise
-
saveTermBuffer
private void saveTermBuffer()
Saves this information to form the left part of a gram
-
gramToken
private void gramToken()
Constructs a compound token.
-
-