- java.lang.Object
  - org.apache.lucene.search.spell.DirectSpellChecker
public class DirectSpellChecker extends java.lang.Object
Simple automaton-based spellchecker.
Candidates are presented directly from the term dictionary, based on Levenshtein distance. This is an alternative to SpellChecker if you are using an edit-distance-like metric such as Levenshtein or JaroWinklerDistance.
A practical benefit of this spellchecker is that it requires no additional data structures (neither in RAM nor on disk) to do its work.
- See Also:
LevenshteinAutomata, FuzzyTermsEnum
-
Nested Class Summary
Modifier and Type | Class | Description
protected static class | DirectSpellChecker.ScoreTerm | Holds a spelling correction for internal usage inside DirectSpellChecker.
-
Field Summary
Modifier and Type | Field | Description
private float | accuracy | minimum accuracy for a term to match
private java.util.Comparator<SuggestWord> | comparator | the comparator to use
private StringDistance | distance | the string distance to use
static StringDistance | INTERNAL_LEVENSHTEIN | The default StringDistance, Damerau-Levenshtein distance implemented internally via LevenshteinAutomata.
private boolean | lowerCaseTerms | true if the spellchecker should lowercase terms
private int | maxEdits | maximum edit distance for candidate terms
private int | maxInspections | maximum number of top-N inspections per suggestion
private float | maxQueryFrequency | value in [0..1] (or absolute number >= 1) representing the maximum number of documents (of the total) a query term can appear in to be corrected.
private int | maxQueryLength | maximum length of a query word to return suggestions
private int | minPrefix | minimum prefix for candidate terms
private int | minQueryLength | minimum length of a query word to return suggestions
private float | thresholdFrequency | value in [0..1] (or absolute number >= 1) representing the minimum number of documents (of the total) where a term should appear.
-
Constructor Summary
Constructor | Description
DirectSpellChecker() | Creates a DirectSpellChecker with default configuration values
-
Method Summary
Modifier and Type | Method | Description
float | getAccuracy() | Get the minimal accuracy from the StringDistance for a match
java.util.Comparator<SuggestWord> | getComparator() | Get the current comparator in use.
StringDistance | getDistance() | Get the string distance metric in use.
boolean | getLowerCaseTerms() | true if the spellchecker should lowercase terms
int | getMaxEdits() | Get the maximum number of Levenshtein edit-distances to draw candidate terms from.
int | getMaxInspections() | Get the maximum number of top-N inspections per suggestion
float | getMaxQueryFrequency() | Get the maximum threshold of documents a query term can appear in for suggestions to be provided.
int | getMaxQueryLength() | Get the maximum length of a query term to return suggestions
int | getMinPrefix() | Get the minimal number of characters that must match exactly
int | getMinQueryLength() | Get the minimum length of a query term needed to return suggestions
float | getThresholdFrequency() | Get the minimal threshold of documents a term must appear in for a match
void | setAccuracy(float accuracy) | Set the minimal accuracy required (default: 0.5f) from a StringDistance for a suggestion match.
void | setComparator(java.util.Comparator<SuggestWord> comparator) | Set the comparator for sorting suggestions.
void | setDistance(StringDistance distance) | Set the string distance metric.
void | setLowerCaseTerms(boolean lowerCaseTerms) | True if the spellchecker should lowercase terms (default: true)
void | setMaxEdits(int maxEdits) | Sets the maximum number of Levenshtein edit-distances to draw candidate terms from.
void | setMaxInspections(int maxInspections) | Set the maximum number of top-N inspections (default: 5) per suggestion.
void | setMaxQueryFrequency(float maxQueryFrequency) | Set the maximum threshold (default: 0.01f) of documents a query term can appear in for suggestions to be provided.
void | setMaxQueryLength(int maxQueryLength) | Set the maximum length of a query term to return suggestions.
void | setMinPrefix(int minPrefix) | Sets the minimal number of initial characters (default: 1) that must match exactly.
void | setMinQueryLength(int minQueryLength) | Set the minimum length of a query term (default: 4) needed to return suggestions.
void | setThresholdFrequency(float thresholdFrequency) | Set the minimal threshold of documents a term must appear in for a match.
SuggestWord[] | suggestSimilar(Term term, int numSug, IndexReader ir) |
protected java.util.Collection<DirectSpellChecker.ScoreTerm> | suggestSimilar(Term term, int numSug, IndexReader ir, int docfreq, int editDistance, float accuracy, CharsRefBuilder spare) | Provide spelling corrections based on several parameters.
SuggestWord[] | suggestSimilar(Term term, int numSug, IndexReader ir, SuggestMode suggestMode) |
SuggestWord[] | suggestSimilar(Term term, int numSug, IndexReader ir, SuggestMode suggestMode, float accuracy) | Suggest similar words.
-
-
-
Field Detail
-
INTERNAL_LEVENSHTEIN
public static final StringDistance INTERNAL_LEVENSHTEIN
The default StringDistance, Damerau-Levenshtein distance implemented internally via LevenshteinAutomata.
Note: this is the fastest distance metric, because Damerau-Levenshtein is already used to draw candidates from the term dictionary, so this metric just re-uses that scoring.
-
maxEdits
private int maxEdits
maximum edit distance for candidate terms
-
minPrefix
private int minPrefix
minimum prefix for candidate terms
-
maxInspections
private int maxInspections
maximum number of top-N inspections per suggestion
-
accuracy
private float accuracy
minimum accuracy for a term to match
-
thresholdFrequency
private float thresholdFrequency
value in [0..1] (or absolute number >= 1) representing the minimum number of documents (of the total) where a term should appear.
-
minQueryLength
private int minQueryLength
minimum length of a query word to return suggestions
-
maxQueryLength
private int maxQueryLength
maximum length of a query word to return suggestions
-
maxQueryFrequency
private float maxQueryFrequency
value in [0..1] (or absolute number >= 1) representing the maximum number of documents (of the total) a query term can appear in to be corrected.
-
lowerCaseTerms
private boolean lowerCaseTerms
true if the spellchecker should lowercase terms
-
comparator
private java.util.Comparator<SuggestWord> comparator
the comparator to use
-
distance
private StringDistance distance
the string distance to use
-
-
Method Detail
-
getMaxEdits
public int getMaxEdits()
Get the maximum number of Levenshtein edit-distances to draw candidate terms from.
-
setMaxEdits
public void setMaxEdits(int maxEdits)
Sets the maximum number of Levenshtein edit-distances to draw candidate terms from. This value can be 1 or 2. The default is 2.
Note: a large number of spelling errors occur with an edit distance of 1; by setting this value to 1 you can increase both performance and precision at the cost of recall.
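To make the difference between an edit distance of 1 and 2 concrete, here is a minimal plain-Java sketch of a restricted Damerau-Levenshtein distance (insert, delete, substitute, transpose adjacent characters). It is not Lucene's implementation, which works through LevenshteinAutomata; the class and method names are hypothetical and exist only for illustration.

```java
final class EditDistanceSketch {

    /** Restricted Damerau-Levenshtein distance between a and b (hypothetical helper). */
    public static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) {
            d[i][0] = i; // delete all of a's first i chars
        }
        for (int j = 0; j <= b.length(); j++) {
            d[0][j] = j; // insert all of b's first j chars
        }
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int best = Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1); // delete, insert
                best = Math.min(best, d[i - 1][j - 1] + cost);         // substitute or match
                boolean swapped = i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1);
                if (swapped) {
                    best = Math.min(best, d[i - 2][j - 2] + 1);        // adjacent transposition
                }
                d[i][j] = best;
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("lucene", "lucnee")); // 1: one transposition
        System.out.println(distance("lucene", "lucen"));  // 1: one deletion
        System.out.println(distance("lucene", "lusine")); // 2: two substitutions
    }
}
```

With maxEdits set to 1, a candidate like the third one would never be drawn, which is where the recall cost comes from.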
-
getMinPrefix
public int getMinPrefix()
Get the minimal number of characters that must match exactly
-
setMinPrefix
public void setMinPrefix(int minPrefix)
Sets the minimal number of initial characters (default: 1) that must match exactly.
This can improve both performance and accuracy of results, as misspellings are commonly not the first character.
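The effect of a minimum prefix can be pictured as a filter on candidates. The sketch below is an assumption about the semantics rather than Lucene's code: the helper name is hypothetical, and it counts the prefix in Unicode code points and clamps it to the query length.

```java
final class MinPrefixSketch {

    /** True if candidate starts with the first minPrefix code points of query (hypothetical helper). */
    public static boolean sharesPrefix(String query, String candidate, int minPrefix) {
        // Clamp so that a query shorter than minPrefix is compared in full.
        int n = Math.min(minPrefix, query.codePointCount(0, query.length()));
        if (n <= 0) {
            return true; // no prefix requirement
        }
        int end = query.offsetByCodePoints(0, n);
        return candidate.startsWith(query.substring(0, end));
    }

    public static void main(String[] args) {
        System.out.println(sharesPrefix("lucene", "lucine", 1)); // true
        System.out.println(sharesPrefix("lucene", "pucene", 1)); // false
        System.out.println(sharesPrefix("lucene", "pucene", 0)); // true
    }
}
```

A larger prefix shrinks the part of the term dictionary that has to be examined, at the price of missing corrections for errors in the first characters.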
-
getMaxInspections
public int getMaxInspections()
Get the maximum number of top-N inspections per suggestion
-
setMaxInspections
public void setMaxInspections(int maxInspections)
Set the maximum number of top-N inspections (default: 5) per suggestion.
Increasing this number can improve the accuracy of results, at the cost of performance.
-
getAccuracy
public float getAccuracy()
Get the minimal accuracy from the StringDistance for a match
-
setAccuracy
public void setAccuracy(float accuracy)
Set the minimal accuracy required (default: 0.5f) from a StringDistance for a suggestion match.
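As an illustration of how an accuracy cutoff interacts with an edit distance, the sketch below normalizes the number of edits by the length of the shorter string to get a score in [0, 1]. This normalization is an assumption chosen for the example; each StringDistance implementation defines its own score, and the class and method names here are hypothetical.

```java
final class AccuracySketch {

    /** Hypothetical normalization: 1 - edits / length of the shorter string, floored at 0. */
    public static float similarity(int edits, int queryLength, int candidateLength) {
        int shorter = Math.min(queryLength, candidateLength);
        if (shorter == 0) {
            return 0f; // nothing to compare against
        }
        return Math.max(0f, 1f - (float) edits / shorter);
    }

    /** A candidate is kept when its score reaches the configured minimum accuracy. */
    public static boolean accepted(float similarity, float accuracy) {
        return similarity >= accuracy;
    }

    public static void main(String[] args) {
        float defaultAccuracy = 0.5f; // default stated for setAccuracy
        System.out.println(accepted(similarity(1, 6, 5), defaultAccuracy)); // true
        System.out.println(accepted(similarity(2, 3, 3), defaultAccuracy)); // false
    }
}
```

Under this assumed normalization, two edits on a three-letter term fall below the default of 0.5f, while one edit on a six-letter term passes comfortably.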
-
getThresholdFrequency
public float getThresholdFrequency()
Get the minimal threshold of documents a term must appear in for a match
-
setThresholdFrequency
public void setThresholdFrequency(float thresholdFrequency)
Set the minimal threshold of documents a term must appear in for a match.
This can improve quality by only suggesting high-frequency terms. Note that very high values might decrease performance slightly, by forcing the spellchecker to draw more candidates from the term dictionary, but a practical value such as 1 can be very useful towards improving quality.
This can be specified as a relative fraction of documents, such as 0.5f, or it can be specified as an absolute whole document frequency, such as 4f. Absolute document frequencies may not be fractional.
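The two ways of specifying the threshold can be sketched as a conversion to a minimum document count. This is an illustrative reading of the rule above, not Lucene's internal arithmetic: the helper name is hypothetical and the rounding of the relative case (ceiling) is an assumption.

```java
final class ThresholdFrequencySketch {

    /** Converts a threshold to a minimum document count for an index of maxDoc documents. */
    public static int minDocFreq(float thresholdFrequency, int maxDoc) {
        if (thresholdFrequency < 0f) {
            throw new IllegalArgumentException("threshold must not be negative");
        }
        if (thresholdFrequency >= 1f) {
            // Absolute form: must be a whole number of documents.
            if (thresholdFrequency != (int) thresholdFrequency) {
                throw new IllegalArgumentException("absolute document frequencies may not be fractional");
            }
            return (int) thresholdFrequency;
        }
        // Relative form: a fraction of all documents (rounding up is an assumption).
        return (int) Math.ceil(thresholdFrequency * maxDoc);
    }

    public static void main(String[] args) {
        System.out.println(minDocFreq(0.5f, 1000)); // 500
        System.out.println(minDocFreq(4f, 1000));   // 4
    }
}
```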
-
getMinQueryLength
public int getMinQueryLength()
Get the minimum length of a query term needed to return suggestions
-
setMinQueryLength
public void setMinQueryLength(int minQueryLength)
Set the minimum length of a query term (default: 4) needed to return suggestions.
Very short query terms will often cause only bad suggestions with any distance metric.
-
getMaxQueryLength
public int getMaxQueryLength()
Get the maximum length of a query term to return suggestions
-
setMaxQueryLength
public void setMaxQueryLength(int maxQueryLength)
Set the maximum length of a query term to return suggestions.
Long queries can be expensive to process and/or trigger exceptions.
-
getMaxQueryFrequency
public float getMaxQueryFrequency()
Get the maximum threshold of documents a query term can appear in for suggestions to be provided.
-
setMaxQueryFrequency
public void setMaxQueryFrequency(float maxQueryFrequency)
Set the maximum threshold (default: 0.01f) of documents a query term can appear in for suggestions to be provided.
Very high-frequency terms are typically spelled correctly. Additionally, this can increase performance as it will do no work for the common case of correctly-spelled input terms.
This can be specified as a relative fraction of documents, such as 0.5f, or it can be specified as an absolute whole document frequency, such as 4f. Absolute document frequencies may not be fractional.
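The early exit for frequent query terms can be sketched as a single check on the query term's document frequency. The helper below is hypothetical and only illustrates the relative-versus-absolute rule described above; the exact rounding is an assumption.

```java
final class MaxQueryFrequencySketch {

    /** True if the query term is frequent enough that no suggestions should be attempted. */
    public static boolean tooFrequent(int queryDocFreq, float maxQueryFrequency, int maxDoc) {
        if (maxQueryFrequency >= 1f) {
            // Absolute form: compare against a whole number of documents.
            return queryDocFreq > (int) maxQueryFrequency;
        }
        // Relative form: compare against a fraction of all documents (ceiling is an assumption).
        return queryDocFreq > (int) Math.ceil(maxQueryFrequency * maxDoc);
    }

    public static void main(String[] args) {
        // With the default of 0.01f and 1000 documents, the cutoff is 10 documents.
        System.out.println(tooFrequent(11, 0.01f, 1000)); // true
        System.out.println(tooFrequent(10, 0.01f, 1000)); // false
        System.out.println(tooFrequent(5, 4f, 1000));     // true
    }
}
```

A term that passes this check is treated as probably correct, so the spellchecker can return immediately without touching the term dictionary.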
-
getLowerCaseTerms
public boolean getLowerCaseTerms()
true if the spellchecker should lowercase terms
-
setLowerCaseTerms
public void setLowerCaseTerms(boolean lowerCaseTerms)
True if the spellchecker should lowercase terms (default: true).
This is a convenience method. If your index field has more complicated analysis (such as StandardTokenizer removing punctuation), it's probably better to turn this off and instead run your query terms through your Analyzer first.
If this option is not on, case differences count as an edit!
-
getComparator
public java.util.Comparator<SuggestWord> getComparator()
Get the current comparator in use.
-
setComparator
public void setComparator(java.util.Comparator<SuggestWord> comparator)
Set the comparator for sorting suggestions. The default is SuggestWordQueue.DEFAULT_COMPARATOR.
-
getDistance
public StringDistance getDistance()
Get the string distance metric in use.
-
setDistance
public void setDistance(StringDistance distance)
Set the string distance metric. The default is INTERNAL_LEVENSHTEIN.
Note: because this spellchecker draws its candidates from the term dictionary using Damerau-Levenshtein, it works best with an edit-distance-like string metric. If you use a different metric than the default, you might want to consider increasing setMaxInspections(int) to draw more candidates for your metric to rank.
-
suggestSimilar
public SuggestWord[] suggestSimilar(Term term, int numSug, IndexReader ir) throws java.io.IOException
- Throws:
java.io.IOException
-
suggestSimilar
public SuggestWord[] suggestSimilar(Term term, int numSug, IndexReader ir, SuggestMode suggestMode) throws java.io.IOException
- Throws:
java.io.IOException
-
suggestSimilar
public SuggestWord[] suggestSimilar(Term term, int numSug, IndexReader ir, SuggestMode suggestMode, float accuracy) throws java.io.IOException
Suggest similar words.
Unlike SpellChecker, the similarity used to fetch the most relevant terms is an edit distance, therefore typically a low value for numSug will work very well.
- Parameters:
term - Term you want to spell check on
numSug - the maximum number of suggested words
ir - IndexReader to find terms from
suggestMode - specifies when to return suggested words
accuracy - return only suggested words that match with this similarity
- Returns:
sorted list of the suggested words according to the comparator
- Throws:
java.io.IOException - If there is a low-level I/O error.
-
suggestSimilar
protected java.util.Collection<DirectSpellChecker.ScoreTerm> suggestSimilar(Term term, int numSug, IndexReader ir, int docfreq, int editDistance, float accuracy, CharsRefBuilder spare) throws java.io.IOException
Provide spelling corrections based on several parameters.
- Parameters:
term - The term to suggest spelling corrections for
numSug - The maximum number of spelling corrections
ir - The index reader to fetch the candidate spelling corrections from
docfreq - The minimum document frequency a potential suggestion needs to have in order to be included
editDistance - The maximum edit distance candidates are allowed to have
accuracy - The minimum accuracy a suggested spelling correction needs to have in order to be included
spare - a chars scratch
- Returns:
a collection of spelling corrections sorted by ScoreTerm's natural order.
- Throws:
java.io.IOException - If I/O related errors occur
-
-