Class UnifiedHighlighter


  • public class UnifiedHighlighter
    extends java.lang.Object
    A Highlighter that can get offsets from either postings (IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS), term vectors (FieldType.setStoreTermVectorOffsets(boolean)), or via re-analyzing text.

    This highlighter treats the single original document as the whole corpus, and then scores individual passages as if they were documents in this corpus. It uses a BreakIterator to find passages in the text; by default it breaks using getSentenceInstance(Locale.ROOT). It then iterates in parallel (merge sorting by offset) through the positions of all terms from the query, coalescing those hits that occur in a single passage into a Passage, and then scores each Passage using a separate PassageScorer. Passages are finally formatted into highlighted snippets with a PassageFormatter.
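    The passage-finding step described above can be sketched with the JDK alone. This is a simplified illustration, not the highlighter's actual implementation: the class PassageSketch and its nested Span type are hypothetical names, the hit offsets are assumed to be already merged and sorted, and scoring and formatting are left out.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Illustrative only: groups term-hit offsets into sentence-sized passages. */
final class PassageSketch {

  /** Hypothetical stand-in for a passage: a character span plus a hit count. */
  static final class Span {
    final int start;
    final int end;
    int hits;

    Span(int start, int end) {
      this.start = start;
      this.end = end;
    }
  }

  /**
   * Walks hit offsets in ascending order and opens a new span whenever a hit
   * falls past the end of the current one. Every offset must be less than
   * text.length().
   */
  static List<Span> coalesce(String text, int[] sortedHitOffsets) {
    BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ROOT);
    sentences.setText(text);
    List<Span> spans = new ArrayList<>();
    Span current = null;
    for (int offset : sortedHitOffsets) {
      if (current == null || offset >= current.end) {
        // preceding(offset + 1) yields the last boundary at or before the hit.
        int start = Math.max(sentences.preceding(offset + 1), 0);
        int end = sentences.following(offset);
        if (end == BreakIterator.DONE) {
          end = text.length();
        }
        current = new Span(start, end);
        spans.add(current);
      }
      current.hits++;
    }
    return spans;
  }
}
```

    Hits that land in the same sentence share one Span, which is the coalescing behavior the description refers to; a real scorer would then rank the spans.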

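    You can customize the behavior by calling some of the setters, or by subclassing and overriding some methods. Some important hooks are getBreakIterator(String), which controls how the text is divided into passages; getScorer(String), which controls how passages are ranked; and getFormatter(String), which controls how passages are formatted into snippets.

    This class is thread-safe, except for the setters.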

    • Field Detail

      • DEFAULT_CACHE_CHARS_THRESHOLD

        public static final int DEFAULT_CACHE_CHARS_THRESHOLD
        See Also:
        Constant Field Values
      • DEFAULT_ENABLE_MULTI_TERM_QUERY

        private static final boolean DEFAULT_ENABLE_MULTI_TERM_QUERY
        See Also:
        Constant Field Values
      • DEFAULT_ENABLE_HIGHLIGHT_PHRASES_STRICTLY

        private static final boolean DEFAULT_ENABLE_HIGHLIGHT_PHRASES_STRICTLY
        See Also:
        Constant Field Values
      • DEFAULT_ENABLE_WEIGHT_MATCHES

        private static final boolean DEFAULT_ENABLE_WEIGHT_MATCHES
        See Also:
        Constant Field Values
      • DEFAULT_ENABLE_RELEVANCY_OVER_SPEED

        private static final boolean DEFAULT_ENABLE_RELEVANCY_OVER_SPEED
        See Also:
        Constant Field Values
      • DEFAULT_BREAK_ITERATOR

        private static final java.util.function.Supplier<java.text.BreakIterator> DEFAULT_BREAK_ITERATOR
      • DEFAULT_PASSAGE_SCORER

        private static final PassageScorer DEFAULT_PASSAGE_SCORER
      • DEFAULT_PASSAGE_FORMATTER

        private static final PassageFormatter DEFAULT_PASSAGE_FORMATTER
      • DEFAULT_MAX_HIGHLIGHT_PASSAGES

        private static final int DEFAULT_MAX_HIGHLIGHT_PASSAGES
        See Also:
        Constant Field Values
      • indexAnalyzer

        protected final Analyzer indexAnalyzer
      • fieldInfos

        protected volatile FieldInfos fieldInfos
      • fieldMatcher

        private java.util.function.Predicate<java.lang.String> fieldMatcher
      • handleMultiTermQuery

        private boolean handleMultiTermQuery
      • highlightPhrasesStrictly

        private boolean highlightPhrasesStrictly
      • weightMatches

        private boolean weightMatches
      • passageRelevancyOverSpeed

        private boolean passageRelevancyOverSpeed
      • maxLength

        private int maxLength
      • breakIterator

        private java.util.function.Supplier<java.text.BreakIterator> breakIterator
      • maxNoHighlightPassages

        private int maxNoHighlightPassages
      • cacheFieldValCharsThreshold

        private int cacheFieldValCharsThreshold
    • Method Detail

      • setHandleMultiTermQuery

        @Deprecated
        public void setHandleMultiTermQuery​(boolean handleMtq)
        Deprecated.
      • setHighlightPhrasesStrictly

        @Deprecated
        public void setHighlightPhrasesStrictly​(boolean highlightPhrasesStrictly)
        Deprecated.
      • setPassageRelevancyOverSpeed

        @Deprecated
        public void setPassageRelevancyOverSpeed​(boolean passageRelevancyOverSpeed)
        Deprecated.
      • setMaxLength

        @Deprecated
        public void setMaxLength​(int maxLength)
        Deprecated.
      • setBreakIterator

        @Deprecated
        public void setBreakIterator​(java.util.function.Supplier<java.text.BreakIterator> breakIterator)
        Deprecated.
      • setScorer

        @Deprecated
        public void setScorer​(PassageScorer scorer)
        Deprecated.
      • setFormatter

        @Deprecated
        public void setFormatter​(PassageFormatter formatter)
        Deprecated.
      • setMaxNoHighlightPassages

        @Deprecated
        public void setMaxNoHighlightPassages​(int defaultMaxNoHighlightPassages)
        Deprecated.
      • setCacheFieldValCharsThreshold

        @Deprecated
        public void setCacheFieldValCharsThreshold​(int cacheFieldValCharsThreshold)
        Deprecated.
      • setFieldMatcher

        @Deprecated
        public void setFieldMatcher​(java.util.function.Predicate<java.lang.String> predicate)
        Deprecated.
      • setWeightMatches

        @Deprecated
        public void setWeightMatches​(boolean weightMatches)
        Deprecated.
      • shouldHandleMultiTermQuery

        @Deprecated
        protected boolean shouldHandleMultiTermQuery​(java.lang.String field)
        Deprecated.
        Returns whether MultiTermQuery derivatives will be highlighted. By default it's enabled. MTQ highlighting can be expensive, particularly when using offsets in postings.
      • shouldHighlightPhrasesStrictly

        @Deprecated
        protected boolean shouldHighlightPhrasesStrictly​(java.lang.String field)
        Deprecated.
        Returns whether position-sensitive queries (e.g. phrases and SpanQueries) should be highlighted strictly based on query matches (slower) versus any/all occurrences of the underlying terms. By default it's enabled, but there's no overhead if such queries aren't used.
      • shouldPreferPassageRelevancyOverSpeed

        @Deprecated
        protected boolean shouldPreferPassageRelevancyOverSpeed​(java.lang.String field)
        Deprecated.
      • extractTerms

        protected static java.util.Set<Term> extractTerms​(Query query)
        Extracts the matching terms from the given query.
      • getFieldMatcher

        protected java.util.function.Predicate<java.lang.String> getFieldMatcher​(java.lang.String field)
        Returns the predicate to use for extracting the query part that must be highlighted. By default only queries that target the current field are kept. (AKA requireFieldMatch)
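        As a rough illustration of the field-matching idea, the sketch below filters terms with a Predicate<String> over field names. FieldMatcherSketch, FieldTerm, requireFieldMatch, and keepMatching are hypothetical names that use only the JDK; they are not Lucene types or methods.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Illustrative only: filters query terms by the field they target. */
final class FieldMatcherSketch {

  /** Hypothetical stand-in for a query term: a field name plus its text. */
  static final class FieldTerm {
    final String field;
    final String text;

    FieldTerm(String field, String text) {
      this.field = field;
      this.text = text;
    }
  }

  /** Mirrors the described default: accept only the field being highlighted. */
  static Predicate<String> requireFieldMatch(String field) {
    return field::equals;
  }

  /** Returns the text of every term whose field the matcher accepts. */
  static List<String> keepMatching(List<FieldTerm> terms, Predicate<String> matcher) {
    List<String> kept = new ArrayList<>();
    for (FieldTerm term : terms) {
      if (matcher.test(term.field)) {
        kept.add(term.text);
      }
    }
    return kept;
  }
}
```

        A predicate that always returns true would instead keep terms from every field, which corresponds to turning requireFieldMatch off.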
      • getMaxLength

        public int getMaxLength()
        The maximum content size to process. Content will be truncated to this size before highlighting. Typically snippets closer to the beginning of the document better summarize its content.
      • getBreakIterator

        protected java.text.BreakIterator getBreakIterator​(java.lang.String field)
        Returns the BreakIterator to use for dividing text into passages. This returns BreakIterator.getSentenceInstance(Locale) by default; subclasses can override to customize.

        Note: this highlighter will call BreakIterator.preceding(int) and BreakIterator.next() many times on it. The default generic JDK implementation of preceding performs poorly.
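        Because setBreakIterator accepts a Supplier<BreakIterator>, one plausible way to customize passage boundaries is to supply a fresh iterator on each call. The sketch below uses only the JDK; BreakIteratorSupplierSketch and its methods are hypothetical names, and the choice of locale is an assumption.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Supplier;

/** Illustrative only: a supplier of sentence iterators and a boundary walk. */
final class BreakIteratorSupplierSketch {

  /** Supplies a new iterator per call; a BreakIterator holds mutable position state. */
  static Supplier<BreakIterator> sentenceSupplier(Locale locale) {
    return () -> BreakIterator.getSentenceInstance(locale);
  }

  /** Lists the boundaries found by repeated next() calls, starting from first(). */
  static List<Integer> boundaries(BreakIterator iterator, String text) {
    iterator.setText(text);
    List<Integer> result = new ArrayList<>();
    for (int b = iterator.first(); b != BreakIterator.DONE; b = iterator.next()) {
      result.add(b);
    }
    return result;
  }
}
```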

      • getScorer

        protected PassageScorer getScorer​(java.lang.String field)
        Returns the PassageScorer to use for ranking passages.
      • getFormatter

        protected PassageFormatter getFormatter​(java.lang.String field)
        Returns the PassageFormatter to use for formatting passages into highlighted snippets.
      • getMaxNoHighlightPassages

        protected int getMaxNoHighlightPassages​(java.lang.String field)
        Returns the number of leading passages (as delineated by the BreakIterator) to return when no highlights could be found. If it's less than 0 (the default) then this defaults to the maxPassages parameter given for each request. If this is 0 then the resulting highlight is null (not formatted).
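        The three cases above (negative, zero, positive) can be sketched as follows. NoHighlightFallbackSketch, resolve, and leadingPassages are hypothetical names built on the JDK only; the real highlighter passes its result through the PassageFormatter, which this sketch replaces with a plain substring.

```java
import java.text.BreakIterator;
import java.util.Locale;

/** Illustrative only: picks leading sentences when nothing was highlighted. */
final class NoHighlightFallbackSketch {

  /** A negative setting falls back to the per-request maxPassages value. */
  static int resolve(int maxNoHighlightPassages, int maxPassages) {
    return maxNoHighlightPassages < 0 ? maxPassages : maxNoHighlightPassages;
  }

  /** Returns the first count sentences of the text, or null when count is 0. */
  static String leadingPassages(String text, int count) {
    if (count == 0) {
      return null;
    }
    BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ROOT);
    sentences.setText(text);
    int end = sentences.first();
    for (int i = 0; i < count; i++) {
      int next = sentences.next();
      if (next == BreakIterator.DONE) {
        break; // fewer sentences than requested
      }
      end = next;
    }
    return text.substring(0, end);
  }
}
```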
      • getCacheFieldValCharsThreshold

        public int getCacheFieldValCharsThreshold()
        Limits the amount of field value pre-fetching: values are fetched until this threshold is passed. The highlighter internally highlights in batches of documents, sized by the summed field value length (in chars) of the fields to be highlighted (bounded by getMaxLength() for each field). Setting this to 0 forces documents to be fetched and highlighted one at a time, which you usually shouldn't do. The default is 524288 chars, which at two bytes per char is about a megabyte. Note, however, that the highlighter sometimes ignores this and highlights one document at a time (without caching a batch of documents in advance) when it detects there is no point in batching, for example when all fields will be highlighted via re-analysis.
      • getIndexSearcher

        public IndexSearcher getIndexSearcher()
        Returns the IndexSearcher that was passed to the constructor.
      • getIndexAnalyzer

        public Analyzer getIndexAnalyzer()
        Returns the index Analyzer that was passed to the constructor.
      • getFieldInfo

        protected FieldInfo getFieldInfo​(java.lang.String field)
        Called by the default implementation of getOffsetSource(String). If there is no searcher then we simply always return null.
      • highlight

        public java.lang.String[] highlight​(java.lang.String field,
                                            Query query,
                                            TopDocs topDocs)
                                     throws java.io.IOException
        Highlights the top passages from a single field.
        Parameters:
        field - field name to highlight. Must have a stored string value and also be indexed with offsets.
        query - query to highlight.
        topDocs - TopDocs containing the summary result documents to highlight.
        Returns:
        Array of formatted snippets corresponding to the documents in topDocs. If no highlights were found for a document, the first sentence for the field will be returned.
        Throws:
        java.io.IOException - if an I/O error occurred during processing
        java.lang.IllegalArgumentException - if field was indexed without IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS
      • highlight

        public java.lang.String[] highlight​(java.lang.String field,
                                            Query query,
                                            TopDocs topDocs,
                                            int maxPassages)
                                     throws java.io.IOException
        Highlights the top-N passages from a single field.
        Parameters:
        field - field name to highlight. Must have a stored string value.
        query - query to highlight.
        topDocs - TopDocs containing the summary result documents to highlight.
        maxPassages - The maximum number of top-N ranked passages used to form the highlighted snippets.
        Returns:
        Array of formatted snippets corresponding to the documents in topDocs. If no highlights were found for a document, the first maxPassages sentences from the field will be returned.
        Throws:
        java.io.IOException - if an I/O error occurred during processing
        java.lang.IllegalArgumentException - if field was indexed without IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS
      • highlightFields

        public java.util.Map<java.lang.String,​java.lang.String[]> highlightFields​(java.lang.String[] fields,
                                                                                        Query query,
                                                                                        TopDocs topDocs)
                                                                                 throws java.io.IOException
        Highlights the top passages from multiple fields.

        Conceptually, this behaves as a more efficient form of:

         Map<String, String[]> m = new HashMap<>();
         for (String field : fields) {
           m.put(field, highlight(field, query, topDocs));
         }
         return m;
         
        Parameters:
        fields - field names to highlight. Must have a stored string value.
        query - query to highlight.
        topDocs - TopDocs containing the summary result documents to highlight.
        Returns:
        Map keyed on field name, containing the array of formatted snippets corresponding to the documents in topDocs. If no highlights were found for a document, the first sentence from the field will be returned.
        Throws:
        java.io.IOException - if an I/O error occurred during processing
        java.lang.IllegalArgumentException - if field was indexed without IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS
      • highlightFields

        public java.util.Map<java.lang.String,​java.lang.String[]> highlightFields​(java.lang.String[] fields,
                                                                                        Query query,
                                                                                        TopDocs topDocs,
                                                                                        int[] maxPassages)
                                                                                 throws java.io.IOException
        Highlights the top-N passages from multiple fields.

        Conceptually, this behaves as a more efficient form of:

         Map<String, String[]> m = new HashMap<>();
         for (int i = 0; i < fields.length; i++) {
           m.put(fields[i], highlight(fields[i], query, topDocs, maxPassages[i]));
         }
         return m;
         
        Parameters:
        fields - field names to highlight. Must have a stored string value.
        query - query to highlight.
        topDocs - TopDocs containing the summary result documents to highlight.
        maxPassages - The maximum number of top-N ranked passages per-field used to form the highlighted snippets.
        Returns:
        Map keyed on field name, containing the array of formatted snippets corresponding to the documents in topDocs. If no highlights were found for a document, the first maxPassages sentences from the field will be returned.
        Throws:
        java.io.IOException - if an I/O error occurred during processing
        java.lang.IllegalArgumentException - if field was indexed without IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS
      • highlightFields

        public java.util.Map<java.lang.String,​java.lang.String[]> highlightFields​(java.lang.String[] fieldsIn,
                                                                                        Query query,
                                                                                        int[] docidsIn,
                                                                                        int[] maxPassagesIn)
                                                                                 throws java.io.IOException
        Highlights the top-N passages from multiple fields, for the provided int[] docids.
        Parameters:
        fieldsIn - field names to highlight. Must have a stored string value.
        query - query to highlight.
        docidsIn - array containing the document IDs to highlight.
        maxPassagesIn - The maximum number of top-N ranked passages per-field used to form the highlighted snippets.
        Returns:
        Map keyed on field name, containing the array of formatted snippets corresponding to the documents in docidsIn. If no highlights were found for a document, the first maxPassages sentences from the field will be returned.
        Throws:
        java.io.IOException - if an I/O error occurred during processing
        java.lang.IllegalArgumentException - if field was indexed without IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS
      • highlightFieldsAsObjects

        protected java.util.Map<java.lang.String,​java.lang.Object[]> highlightFieldsAsObjects​(java.lang.String[] fieldsIn,
                                                                                                    Query query,
                                                                                                    int[] docIdsIn,
                                                                                                    int[] maxPassagesIn)
                                                                                             throws java.io.IOException
        Expert: highlights the top-N passages from multiple fields, for the provided int[] docids, to custom Object as returned by the PassageFormatter. Use this API to render to something other than String.
        Parameters:
        fieldsIn - field names to highlight. Must have a stored string value.
        query - query to highlight.
        docIdsIn - array containing the document IDs to highlight.
        maxPassagesIn - The maximum number of top-N ranked passages per-field used to form the highlighted snippets.
        Returns:
        Map keyed on field name, containing the array of formatted snippets corresponding to the documents in docIdsIn. If no highlights were found for a document, the first maxPassages sentences from the field will be returned.
        Throws:
        java.io.IOException - if an I/O error occurred during processing
        java.lang.IllegalArgumentException - if field was indexed without IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS
      • calculateOptimalCacheCharsThreshold

        private int calculateOptimalCacheCharsThreshold​(int numTermVectors,
                                                        int numPostings)
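        When cacheCharsThreshold is 0, loadFieldValues() only fetches one document at a time. This method overrides it to 0 in two circumstances, one of which (see getCacheFieldValCharsThreshold()) is when all fields will be highlighted via re-analysis.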
      • copyAndSortFieldsWithMaxPassages

        private void copyAndSortFieldsWithMaxPassages​(java.lang.String[] fieldsIn,
                                                      int[] maxPassagesIn,
                                                      java.lang.String[] fields,
                                                      int[] maxPassages)
      • copyAndSortDocIdsWithIndex

        private void copyAndSortDocIdsWithIndex​(int[] docIdsIn,
                                                int[] docIds,
                                                int[] docInIndexes)
      • highlightWithoutSearcher

        public java.lang.Object highlightWithoutSearcher​(java.lang.String field,
                                                         Query query,
                                                         java.lang.String content,
                                                         int maxPassages)
                                                  throws java.io.IOException
        Highlights text passed as a parameter. This requires that the IndexSearcher provided to this highlighter be null. This use case is rarer. Naturally, the mode of operation will be UnifiedHighlighter.OffsetSource.ANALYSIS. The result of this method is whatever the PassageFormatter returns. For the DefaultPassageFormatter, and assuming content has non-zero length, the result will be a non-null string, so it's safe to call Object.toString() on it in that case.
        Parameters:
        field - field name to highlight (as found in the query).
        query - query to highlight.
        content - text to highlight.
        maxPassages - The maximum number of top-N ranked passages used to form the highlighted snippets.
        Returns:
        The result of the PassageFormatter, probably a String. Might be null.
        Throws:
        java.io.IOException - if an I/O error occurred during processing
      • getFieldHighlighter

        protected FieldHighlighter getFieldHighlighter​(java.lang.String field,
                                                       Query query,
                                                       java.util.Set<Term> allTerms,
                                                       int maxPassages)
      • getHighlightComponents

        protected UHComponents getHighlightComponents​(java.lang.String field,
                                                      Query query,
                                                      java.util.Set<Term> allTerms)
      • hasUnrecognizedQuery

        protected boolean hasUnrecognizedQuery​(java.util.function.Predicate<java.lang.String> fieldMatcher,
                                               Query query)
      • filterExtractedTerms

        protected static BytesRef[] filterExtractedTerms​(java.util.function.Predicate<java.lang.String> fieldMatcher,
                                                         java.util.Set<Term> queryTerms)
      • requiresRewrite

        protected java.lang.Boolean requiresRewrite​(SpanQuery spanQuery)
        When highlighting phrases accurately, we need to know which SpanQuery instances need to have Query.rewrite(IndexSearcher) called on them. It helps performance to avoid the rewrite if it's not needed. This method will be invoked on all SpanQuery instances recursively. If you have custom SpanQuery queries then override this to check instanceof and provide a definitive answer. If the query isn't your custom one, simply return null to have the default rules apply, which govern the ones included in Lucene.
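        The tri-state return value (a definitive Boolean, or null to defer) follows a common hook pattern, sketched below with stand-in types. RewriteDecisionSketch, SpanLike, MyCustomSpan, and the expandsAgainstIndex flag are all hypothetical and only model the control flow; they are not Lucene classes.

```java
/** Illustrative only: a hook that may answer definitively or abstain with null. */
final class RewriteDecisionSketch {

  /** Hypothetical stand-in for the span query base type. */
  static class SpanLike {}

  /** Hypothetical custom query that knows whether it needs rewriting. */
  static final class MyCustomSpan extends SpanLike {
    final boolean expandsAgainstIndex;

    MyCustomSpan(boolean expandsAgainstIndex) {
      this.expandsAgainstIndex = expandsAgainstIndex;
    }
  }

  /** Overridden hook: answers for the custom type, returns null for anything else. */
  static Boolean customDecision(SpanLike query) {
    if (query instanceof MyCustomSpan) {
      return ((MyCustomSpan) query).expandsAgainstIndex;
    }
    return null; // not ours: let the default rules decide
  }

  /** Caller side: uses the hook's answer, else falls back to the default rule. */
  static boolean decide(SpanLike query, boolean defaultRule) {
    Boolean custom = customDecision(query);
    return custom != null ? custom : defaultRule;
  }
}
```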
      • preSpanQueryRewrite

        protected java.util.Collection<Query> preSpanQueryRewrite​(Query query)
        When highlighting phrases accurately, we may need to handle custom queries that aren't supported in the WeightedSpanTermExtractor as called by the PhraseHelper. Should custom query types be needed, this method should be overridden to return a collection of queries if appropriate, or null if there is nothing to do. If the query is not custom, simply returning null will allow the default rules to apply.
        Parameters:
        query - Query to be highlighted
        Returns:
        A Collection of Query objects if the query needs to be rewritten, otherwise null.
      • asDocIdSetIterator

        private DocIdSetIterator asDocIdSetIterator​(int[] sortedDocIds)
      • loadFieldValues

        protected java.util.List<java.lang.CharSequence[]> loadFieldValues​(java.lang.String[] fields,
                                                                           DocIdSetIterator docIter,
                                                                           int cacheCharsThreshold)
                                                                    throws java.io.IOException
        Loads the String values for each docId by field to be highlighted. By default this loads from stored fields by the same name as given, but a subclass can change the source. The returned Strings must be identical to what was indexed (at least for postings or term-vectors offset sources). This method must load fields for at least one document from the given DocIdSetIterator but need not return all of them; by default the character lengths are summed and this method will return early when cacheCharsThreshold is exceeded. Specifically if that number is 0, then only one document is fetched no matter what. Values in the array of CharSequence will be null if no value was found.
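        The early-return rule described above can be sketched as a simple batching loop. This is an assumption-laden simplification: BatchLoadSketch and nextBatch are hypothetical names, each document is reduced to a single String, and an Iterator<String> stands in for the DocIdSetIterator and stored-field lookup.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Illustrative only: batches values until a character threshold is exceeded. */
final class BatchLoadSketch {

  /**
   * Pulls values until the summed length exceeds the threshold. At least one
   * value is returned when any remain, and a threshold of 0 always stops
   * after the first value.
   */
  static List<String> nextBatch(Iterator<String> values, int cacheCharsThreshold) {
    List<String> batch = new ArrayList<>();
    int sumChars = 0;
    while (values.hasNext()) {
      String value = values.next();
      batch.add(value);
      sumChars += value.length();
      if (cacheCharsThreshold == 0 || sumChars > cacheCharsThreshold) {
        break;
      }
    }
    return batch;
  }
}
```

        Calling nextBatch repeatedly on the same iterator walks all documents in threshold-sized groups, which is the batching that getCacheFieldValCharsThreshold() tunes.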
        Throws:
        java.io.IOException