- java.lang.Object
-
- org.apache.lucene.index.memory.MemoryIndex
-
public class MemoryIndex extends java.lang.Object
High-performance single-document main memory Apache Lucene fulltext search index.
Overview
This class is a replacement/substitute for RAM-resident
Directory
implementations. It is designed to enable maximum efficiency for on-the-fly matchmaking combining structured and fuzzy fulltext search in realtime streaming applications such as Nux XQuery based XML message queues, publish-subscribe systems for Blogs/newsfeeds, text chat, data acquisition and distribution systems, application level routers, firewalls, classifiers, etc. Rather than targeting fulltext search of infrequent queries over huge persistent data archives (historic search), this class targets fulltext search of huge numbers of queries over comparatively small transient realtime data (prospective search). For example as in float score = search(String text, Query query)
Each instance can hold at most one Lucene "document", with a document containing zero or more "fields", each field having a name and a fulltext value. The fulltext value is tokenized (split and transformed) into zero or more index terms (aka words) on addField(), according to the policy implemented by an Analyzer. For example, Lucene analyzers can split on whitespace, normalize to lower case for case insensitivity, ignore common terms with little discriminatory value such as "he", "in", "and" (stop words), reduce the terms to their natural linguistic root form such as "fishing" being reduced to "fish" (stemming), resolve synonyms/inflections/thesauri (upon indexing and/or querying), etc. For details, see Lucene Analyzer Intro.
Arbitrary Lucene queries can be run against this class - see Lucene Query Syntax as well as Query Parser Rules. Note that a Lucene query selects on the field names and associated (indexed) tokenized terms, not on the original fulltext(s) - the latter are not stored but rather thrown away immediately after tokenization.
For some interesting background information on search technology, see Bob Wyman's Prospective Search, Jim Gray's A Call to Arms - Custom subscriptions, and Tim Bray's On Search, the Series.
Example Usage
Analyzer analyzer = new SimpleAnalyzer(version);
MemoryIndex index = new MemoryIndex();
index.addField("content", "Readings about Salmons and other select Alaska fishing Manuals", analyzer);
index.addField("author", "Tales of James", analyzer);
QueryParser parser = new QueryParser(version, "content", analyzer);
float score = index.search(parser.parse("+author:james +salmon~ +fish* manual~"));
if (score > 0.0f) {
    System.out.println("it's a match");
} else {
    System.out.println("no match found");
}
System.out.println("indexData=" + index.toString());
Example XQuery Usage
(: An XQuery that finds all books authored by James that have something to do with "salmon fishing manuals", sorted by relevance :)
declare namespace lucene = "java:nux.xom.pool.FullTextUtil";
declare variable $query := "+salmon~ +fish* manual~"; (: any arbitrary Lucene query can go here :)

for $book in /books/book[author="James" and lucene:match(abstract, $query) > 0.0]
let $score := lucene:match($book/abstract, $query)
order by $score descending
return $book
Thread safety guarantees
MemoryIndex is not normally thread-safe for adds or queries. However, queries are thread-safe after freeze() has been called.
Performance Notes
Internally there's a new data structure geared towards efficient indexing and searching, plus the necessary support code to seamlessly plug into the Lucene framework.
This class performs very well for very small texts (e.g. 10 chars) as well as for large texts (e.g. 10 MB), and everything in between. Typically, it is about 10-100 times faster than a RAM-resident Directory.
Note that other Directory implementations have particularly large efficiency overheads for small to medium sized texts, both in time and space. Indexing a field with N tokens takes O(N) in the best case, and O(N log N) in the worst case.
Example throughput of many simple term queries over a single MemoryIndex: ~500000 queries/sec on a MacBook Pro, jdk 1.5.0_06, server VM. As always, your mileage may vary.
If you're curious about the whereabouts of bottlenecks, run java 1.5 with the non-perturbing '-server -agentlib:hprof=cpu=samples,depth=10' flags, then study the trace log and correlate its hotspot trailer with its call stack headers (see hprof tracing).
-
-
Nested Class Summary
Nested Classes
Modifier and Type    Class    Description
private static class    MemoryIndex.BinaryDocValuesProducer
private static class    MemoryIndex.BytesRefHashDocValuesProducer
private class    MemoryIndex.Info
    Index data structure for a field; contains the tokenized term texts and their positions.
private static class    MemoryIndex.MemoryDocValuesIterator
private class    MemoryIndex.MemoryIndexReader
    Search support for Lucene framework integration; implements all methods required by the Lucene IndexReader contracts.
private static class    MemoryIndex.NumericDocValuesProducer
private static class    MemoryIndex.SliceByteStartArray
(package private) static class    MemoryIndex.SlicedIntBlockPool
-
Field Summary
Fields
Modifier and Type    Field    Description
private ByteBlockPool    byteBlockPool
private Counter    bytesUsed
private static boolean    DEBUG
private FieldType    defaultFieldType
private java.util.SortedMap<java.lang.String,MemoryIndex.Info>    fields
    info for each field: Map<String fieldName, Info field>
private boolean    frozen
private Similarity    normSimilarity
private BytesRefArray    payloadsBytesRefs
private MemoryIndex.SlicedIntBlockPool.SliceWriter    postingsWriter
private MemoryIndex.SlicedIntBlockPool    slicedIntBlockPool
private boolean    storeOffsets
private boolean    storePayloads
-
Constructor Summary
Constructors
Constructor    Description
MemoryIndex()
    Constructs an empty instance that will not store offsets or payloads.
MemoryIndex(boolean storeOffsets)
    Constructs an empty instance that can optionally store the start and end character offset of each token term in the text.
MemoryIndex(boolean storeOffsets, boolean storePayloads)
    Constructs an empty instance with the option of storing offsets and payloads.
MemoryIndex(boolean storeOffsets, boolean storePayloads, long maxReusedBytes)
    Expert: This constructor accepts an upper limit for the number of bytes that should be reused if this instance is reset().
-
Method Summary
All Methods  Static Methods  Instance Methods  Concrete Methods
Modifier and Type    Method    Description
void    addField(java.lang.String fieldName, java.lang.String text, Analyzer analyzer)
    Convenience method; Tokenizes the given field text and adds the resulting terms to the index; Equivalent to adding an indexed non-keyword Lucene Field that is tokenized, not stored, termVectorStored with positions (or termVectorStored with positions and offsets).
void    addField(java.lang.String fieldName, TokenStream stream)
    Iterates over the given token stream and adds the resulting terms to the index; Equivalent to adding a tokenized, indexed, termVectorStored, unstored, Lucene Field.
void    addField(java.lang.String fieldName, TokenStream stream, int positionIncrementGap)
    Iterates over the given token stream and adds the resulting terms to the index; Equivalent to adding a tokenized, indexed, termVectorStored, unstored, Lucene Field.
void    addField(java.lang.String fieldName, TokenStream tokenStream, int positionIncrementGap, int offsetGap)
    Iterates over the given token stream and adds the resulting terms to the index; Equivalent to adding a tokenized, indexed, termVectorStored, unstored, Lucene Field.
void    addField(IndexableField field, Analyzer analyzer)
    Adds a Lucene IndexableField to the MemoryIndex using the provided analyzer.
private FieldInfo    createFieldInfo(java.lang.String fieldName, int ord, IndexableFieldType fieldType)
IndexSearcher    createSearcher()
    Creates and returns a searcher that can be used to execute arbitrary Lucene queries and to collect the resulting query results as hits.
void    freeze()
    Prepares the MemoryIndex for querying in a non-lazy way.
static MemoryIndex    fromDocument(java.lang.Iterable<? extends IndexableField> document, Analyzer analyzer)
    Builds a MemoryIndex from a Lucene Document using an analyzer.
static MemoryIndex    fromDocument(java.lang.Iterable<? extends IndexableField> document, Analyzer analyzer, boolean storeOffsets, boolean storePayloads)
    Builds a MemoryIndex from a Lucene Document using an analyzer.
static MemoryIndex    fromDocument(java.lang.Iterable<? extends IndexableField> document, Analyzer analyzer, boolean storeOffsets, boolean storePayloads, long maxReusedBytes)
    Builds a MemoryIndex from a Lucene Document using an analyzer.
private MemoryIndex.Info    getInfo(java.lang.String fieldName, IndexableFieldType fieldType)
<T> TokenStream    keywordTokenStream(java.util.Collection<T> keywords)
    Convenience method; Creates and returns a token stream that generates a token for each keyword in the given collection, "as is", without any transforming text analysis.
private static NumericDocValues    numericDocValues(long value)
private static SortedNumericDocValues    numericDocValues(long[] values, int count)
void    reset()
    Resets the MemoryIndex to its initial state and recycles all internal buffers.
float    search(Query query)
    Convenience method that efficiently returns the relevance score by matching this index against the given Lucene query expression.
void    setSimilarity(Similarity similarity)
    Set the Similarity to be used for calculating field norms.
private static SortedDocValues    sortedDocValues(BytesRef value)
private static SortedSetDocValues    sortedSetDocValues(BytesRefHash values, int[] bytesIds)
private void    storeDocValues(MemoryIndex.Info info, DocValuesType docValuesType, java.lang.Object docValuesValue)
private void    storePointValues(MemoryIndex.Info info, BytesRef pointValue)
private void    storeTerm(MemoryIndex.Info info, BytesRef term)
private void    storeTerms(MemoryIndex.Info info, TokenStream tokenStream, int positionIncrementGap, int offsetGap)
private void    storeValues(MemoryIndex.Info info, IndexableField field)
java.lang.String    toStringDebug()
    Returns a String representation of the index data for debugging purposes.
-
-
-
Field Detail
-
DEBUG
private static final boolean DEBUG
- See Also:
- Constant Field Values
-
fields
private final java.util.SortedMap<java.lang.String,MemoryIndex.Info> fields
info for each field: Map<String fieldName, Info field>
-
storeOffsets
private final boolean storeOffsets
-
storePayloads
private final boolean storePayloads
-
byteBlockPool
private final ByteBlockPool byteBlockPool
-
slicedIntBlockPool
private final MemoryIndex.SlicedIntBlockPool slicedIntBlockPool
-
postingsWriter
private final MemoryIndex.SlicedIntBlockPool.SliceWriter postingsWriter
-
payloadsBytesRefs
private final BytesRefArray payloadsBytesRefs
-
bytesUsed
private Counter bytesUsed
-
frozen
private boolean frozen
-
normSimilarity
private Similarity normSimilarity
-
defaultFieldType
private FieldType defaultFieldType
-
-
Constructor Detail
-
MemoryIndex
public MemoryIndex()
Constructs an empty instance that will not store offsets or payloads.
-
MemoryIndex
public MemoryIndex(boolean storeOffsets)
Constructs an empty instance that can optionally store the start and end character offset of each token term in the text. This can be useful for highlighting of hit locations with the Lucene highlighter package. But it will not store payloads; use another constructor for that.
- Parameters:
    storeOffsets - whether or not to store the start and end character offset of each token term in the text
-
MemoryIndex
public MemoryIndex(boolean storeOffsets, boolean storePayloads)
Constructs an empty instance with the option of storing offsets and payloads.
- Parameters:
    storeOffsets - store term offsets at each position
    storePayloads - store term payloads at each position
-
MemoryIndex
MemoryIndex(boolean storeOffsets, boolean storePayloads, long maxReusedBytes)
Expert: This constructor accepts an upper limit for the number of bytes that should be reused if this instance is reset(). The payload storage, if used, is unaffected by maxReusedBytes, however.
- Parameters:
    storeOffsets - true if offsets should be stored
    storePayloads - true if payloads should be stored
    maxReusedBytes - the number of bytes that should remain in the internal memory pools after reset() is called
-
-
Method Detail
-
addField
public void addField(java.lang.String fieldName, java.lang.String text, Analyzer analyzer)
Convenience method; Tokenizes the given field text and adds the resulting terms to the index; Equivalent to adding an indexed non-keyword Lucene Field that is tokenized, not stored, termVectorStored with positions (or termVectorStored with positions and offsets).
- Parameters:
    fieldName - a name to be associated with the text
    text - the text to tokenize and index
    analyzer - the analyzer to use for tokenization
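A minimal sketch of this overload (assuming the Lucene core and memory modules are on the classpath; StandardAnalyzer and TermQuery are standard Lucene classes used here for illustration):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.search.TermQuery;

public class AddFieldSketch {
    public static void main(String[] args) {
        MemoryIndex index = new MemoryIndex();
        StandardAnalyzer analyzer = new StandardAnalyzer();
        // The analyzer tokenizes and lowercases the text; only the
        // resulting terms are kept, the original fulltext is discarded.
        index.addField("content", "Readings about Salmons", analyzer);
        float score = index.search(new TermQuery(new Term("content", "salmons")));
        System.out.println(score > 0.0f ? "match" : "no match");
    }
}
```

Because StandardAnalyzer lowercases tokens, the query term must be lowercase to match.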
-
fromDocument
public static MemoryIndex fromDocument(java.lang.Iterable<? extends IndexableField> document, Analyzer analyzer)
Builds a MemoryIndex from a Lucene Document using an analyzer.
- Parameters:
    document - the document to index
    analyzer - the analyzer to use
- Returns:
    a MemoryIndex
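For illustration, a sketch of building an index from a regular Lucene Document (Document implements Iterable over its fields, so it can be passed directly; field names and text here are made up):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.search.TermQuery;

public class FromDocumentSketch {
    public static void main(String[] args) {
        Document doc = new Document();
        doc.add(new TextField("title", "Alaska fishing manual", Field.Store.NO));
        doc.add(new TextField("author", "James", Field.Store.NO));
        // Every indexed field of the document is added to the MemoryIndex.
        MemoryIndex index = MemoryIndex.fromDocument(doc, new StandardAnalyzer());
        float score = index.search(new TermQuery(new Term("author", "james")));
        System.out.println(score > 0.0f ? "match" : "no match");
    }
}
```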
-
fromDocument
public static MemoryIndex fromDocument(java.lang.Iterable<? extends IndexableField> document, Analyzer analyzer, boolean storeOffsets, boolean storePayloads)
Builds a MemoryIndex from a Lucene Document using an analyzer.
- Parameters:
    document - the document to index
    analyzer - the analyzer to use
    storeOffsets - true if offsets should be stored
    storePayloads - true if payloads should be stored
- Returns:
    a MemoryIndex
-
fromDocument
public static MemoryIndex fromDocument(java.lang.Iterable<? extends IndexableField> document, Analyzer analyzer, boolean storeOffsets, boolean storePayloads, long maxReusedBytes)
Builds a MemoryIndex from a Lucene Document using an analyzer.
- Parameters:
    document - the document to index
    analyzer - the analyzer to use
    storeOffsets - true if offsets should be stored
    storePayloads - true if payloads should be stored
    maxReusedBytes - the number of bytes that should remain in the internal memory pools after reset() is called
- Returns:
    a MemoryIndex
-
keywordTokenStream
public <T> TokenStream keywordTokenStream(java.util.Collection<T> keywords)
Convenience method; Creates and returns a token stream that generates a token for each keyword in the given collection, "as is", without any transforming text analysis. The resulting token stream can be fed into addField(String, TokenStream), perhaps wrapped into another TokenFilter, as desired.
- Parameters:
    keywords - the keywords to generate tokens for
- Returns:
    the corresponding token stream
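A sketch of feeding untokenized keywords into the index (the field name "tags" and the keyword values are illustrative):

```java
import java.util.Arrays;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.search.TermQuery;

public class KeywordSketch {
    public static void main(String[] args) {
        MemoryIndex index = new MemoryIndex();
        // Each keyword becomes exactly one term, verbatim:
        // no lowercasing, no stemming, no stop-word removal.
        index.addField("tags", index.keywordTokenStream(Arrays.asList("Lucene", "XQuery")));
        // Queries must therefore match the exact case: "Lucene", not "lucene".
        float score = index.search(new TermQuery(new Term("tags", "Lucene")));
        System.out.println(score > 0.0f ? "match" : "no match");
    }
}
```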
-
addField
public void addField(IndexableField field, Analyzer analyzer)
Adds a Lucene IndexableField to the MemoryIndex using the provided analyzer. Also stores doc values based on IndexableFieldType.docValuesType() if set.
- Parameters:
    field - the field to add
    analyzer - the analyzer to use for term analysis
-
addField
public void addField(java.lang.String fieldName, TokenStream stream)
Iterates over the given token stream and adds the resulting terms to the index; Equivalent to adding a tokenized, indexed, termVectorStored, unstored, Lucene Field. Finally closes the token stream. Note that untokenized keywords can be added with this method via keywordTokenStream(Collection), the Lucene KeywordTokenizer or similar utilities.
- Parameters:
    fieldName - a name to be associated with the text
    stream - the token stream to retrieve tokens from
-
addField
public void addField(java.lang.String fieldName, TokenStream stream, int positionIncrementGap)
Iterates over the given token stream and adds the resulting terms to the index; Equivalent to adding a tokenized, indexed, termVectorStored, unstored, Lucene Field. Finally closes the token stream. Note that untokenized keywords can be added with this method via keywordTokenStream(Collection), the Lucene KeywordTokenizer or similar utilities.
- Parameters:
    fieldName - a name to be associated with the text
    stream - the token stream to retrieve tokens from
    positionIncrementGap - the position increment gap if fields with the same name are added more than once
-
addField
public void addField(java.lang.String fieldName, TokenStream tokenStream, int positionIncrementGap, int offsetGap)
Iterates over the given token stream and adds the resulting terms to the index; Equivalent to adding a tokenized, indexed, termVectorStored, unstored, Lucene Field. Finally closes the token stream. Note that untokenized keywords can be added with this method via keywordTokenStream(Collection), the Lucene KeywordTokenizer or similar utilities.
- Parameters:
    fieldName - a name to be associated with the text
    tokenStream - the token stream to retrieve tokens from. It's guaranteed to be closed no matter what.
    positionIncrementGap - the position increment gap if fields with the same name are added more than once
    offsetGap - the offset gap if fields with the same name are added more than once
-
getInfo
private MemoryIndex.Info getInfo(java.lang.String fieldName, IndexableFieldType fieldType)
-
createFieldInfo
private FieldInfo createFieldInfo(java.lang.String fieldName, int ord, IndexableFieldType fieldType)
-
storePointValues
private void storePointValues(MemoryIndex.Info info, BytesRef pointValue)
-
storeValues
private void storeValues(MemoryIndex.Info info, IndexableField field)
-
storeDocValues
private void storeDocValues(MemoryIndex.Info info, DocValuesType docValuesType, java.lang.Object docValuesValue)
-
storeTerm
private void storeTerm(MemoryIndex.Info info, BytesRef term)
-
storeTerms
private void storeTerms(MemoryIndex.Info info, TokenStream tokenStream, int positionIncrementGap, int offsetGap)
-
setSimilarity
public void setSimilarity(Similarity similarity)
Set the Similarity to be used for calculating field norms.
- Parameters:
    similarity - instance with custom Similarity.computeNorm(org.apache.lucene.index.FieldInvertState) implementation
-
createSearcher
public IndexSearcher createSearcher()
Creates and returns a searcher that can be used to execute arbitrary Lucene queries and to collect the resulting query results as hits.
- Returns:
    a searcher
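When the convenience search(Query) method is not enough, the returned IndexSearcher supports the full Lucene search API. A sketch (field name and text are illustrative):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

public class SearcherSketch {
    public static void main(String[] args) throws Exception {
        MemoryIndex index = new MemoryIndex();
        index.addField("content", "quick brown fox", new StandardAnalyzer());
        IndexSearcher searcher = index.createSearcher();
        // The index holds exactly one virtual document, so a matching
        // query produces at most one hit (doc id 0).
        TopDocs hits = searcher.search(new TermQuery(new Term("content", "fox")), 1);
        System.out.println("totalHits=" + hits.totalHits);
    }
}
```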
-
freeze
public void freeze()
Prepares the MemoryIndex for querying in a non-lazy way. After calling this you can query the MemoryIndex from multiple threads, but you cannot subsequently add new data.
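A sketch of the intended freeze-then-query pattern (the thread-pool setup and query are illustrative, not part of the API):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class FreezeSketch {
    public static void main(String[] args) {
        MemoryIndex index = new MemoryIndex();
        index.addField("content", "quick brown fox", new StandardAnalyzer());
        index.freeze(); // no further addField calls; queries are now thread-safe

        Query query = new TermQuery(new Term("content", "fox"));
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < 4; i++) {
            // Concurrent searches against the frozen index are safe.
            pool.execute(() -> System.out.println("score=" + index.search(query)));
        }
        pool.shutdown();
    }
}
```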
-
search
public float search(Query query)
Convenience method that efficiently returns the relevance score by matching this index against the given Lucene query expression.
- Parameters:
    query - an arbitrary Lucene query to run against this index
- Returns:
    the relevance score of the matchmaking; A number in the range [0.0 .. 1.0], with 0.0 indicating no match. The higher the number, the better the match.
-
toStringDebug
public java.lang.String toStringDebug()
Returns a String representation of the index data for debugging purposes.
- Returns:
    the string representation
-
numericDocValues
private static SortedNumericDocValues numericDocValues(long[] values, int count)
-
numericDocValues
private static NumericDocValues numericDocValues(long value)
-
sortedDocValues
private static SortedDocValues sortedDocValues(BytesRef value)
-
sortedSetDocValues
private static SortedSetDocValues sortedSetDocValues(BytesRefHash values, int[] bytesIds)
-
reset
public void reset()
Resets the MemoryIndex to its initial state and recycles all internal buffers.
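This is the heart of the prospective-search pattern from the Overview: one long-lived instance is reused across many transient documents. A hypothetical sketch (the message list, query, and matching logic are placeholder application code, not part of the MemoryIndex API):

```java
import java.util.List;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class ResetSketch {
    public static void main(String[] args) {
        MemoryIndex index = new MemoryIndex();             // allocated once
        StandardAnalyzer analyzer = new StandardAnalyzer();
        Query query = new TermQuery(new Term("body", "alert"));
        List<String> messages = List.of("alert: disk full", "all systems nominal");
        for (String msg : messages) {
            index.reset();                                 // recycle internal buffers
            index.addField("body", msg, analyzer);         // index next transient message
            if (index.search(query) > 0.0f) {
                System.out.println("delivered: " + msg);
            }
        }
    }
}
```

Reusing one instance this way avoids reallocating the internal pools for every message, which is what makes the high query throughput cited in the Performance Notes achievable.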
-
-