org.apache.lucene.index.memory

Class SynonymMap

public class SynonymMap extends Object

Loads the WordNet prolog file wn_s.pl into a thread-safe main-memory hash map that can be used for fast high-frequency lookups of synonyms for any given (lowercase) word string.

The synonym relation is symmetric: if B is a synonym for A (A -> B), then A is also a synonym for B (B -> A). It is not necessarily transitive: A -> B and B -> C do not imply A -> C.
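The following sketch illustrates why a relation of this kind can be symmetric without being transitive. It assumes that synonyms come from shared groups (synsets), which is how WordNet organizes words. The class name SynsetDemo, the build method, and the toy data are hypothetical and not part of the Lucene API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical illustration only; this is not the Lucene implementation.
class SynsetDemo {
    // Builds a word -> synonyms map in which any two distinct words
    // that share a synset become synonyms of each other.
    static Map<String, SortedSet<String>> build(List<List<String>> synsets) {
        Map<String, SortedSet<String>> map = new HashMap<>();
        for (List<String> synset : synsets) {
            for (String a : synset) {
                for (String b : synset) {
                    if (!a.equals(b)) {
                        map.computeIfAbsent(a, k -> new TreeSet<>()).add(b);
                    }
                }
            }
        }
        return map;
    }

    public static void main(String[] args) {
        // Toy data (an assumption): "forest" belongs to two different synsets.
        List<List<String>> synsets = new ArrayList<>();
        synsets.add(Arrays.asList("woods", "forest"));
        synsets.add(Arrays.asList("forest", "timber"));

        Map<String, SortedSet<String>> map = build(synsets);
        System.out.println("woods:" + map.get("woods"));   // woods:[forest]
        System.out.println("forest:" + map.get("forest")); // forest:[timber, woods]
        System.out.println("timber:" + map.get("timber")); // timber:[forest]
    }
}
```

Here woods -> forest and forest -> timber both hold in each direction, yet timber is not listed for woods because the two words never share a synset.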

Loading typically takes about 1.5 seconds, so it should be done only once per (server) program execution, for example using a singleton pattern. Once loaded, a synonym lookup via getSynonyms takes constant time, O(1). A loaded default synonym map consumes about 10 MB of main memory. An instance is immutable and hence thread-safe.
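One way to load only once is the initialization-on-demand holder idiom, sketched below. To keep the sketch self-contained, a small unmodifiable map stands in for the loaded SynonymMap; the class name SharedSynonyms, the loadCount counter, and the load() helper are hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical holder class; names and the stand-in loader are illustrative only.
class SharedSynonyms {
    // Counts how often the expensive load runs (for demonstration only).
    static final AtomicInteger loadCount = new AtomicInteger();

    private SharedSynonyms() {}

    // Stand-in for the expensive load. With the real class this would wrap
    // new SynonymMap(new FileInputStream("samples/fulltext/wn_s.pl")) and
    // rethrow the checked IOException as an unchecked exception.
    private static Map<String, String[]> load() {
        loadCount.incrementAndGet();
        Map<String, String[]> m = new HashMap<>();
        m.put("woods", new String[] {"forest", "wood"});
        return Collections.unmodifiableMap(m);
    }

    // The JVM initializes Holder lazily and exactly once, on first access,
    // so no explicit locking is needed.
    private static final class Holder {
        static final Map<String, String[]> INSTANCE = load();
    }

    // Returns the shared instance, loading it on the first call.
    static Map<String, String[]> get() {
        return Holder.INSTANCE;
    }
}
```

Every caller of SharedSynonyms.get() receives the same instance, and the load cost is paid only by the first call.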

This implementation borrows some ideas from the Lucene Syns2Index demo that Dave Spencer originally contributed to Lucene. Dave's approach involved a persistent Lucene index, which is suitable for occasional lookups or very large synonym tables but is considered unsuitable for high-frequency lookups of medium-sized synonym tables.

Example Usage:

 String[] words = new String[] { "hard", "woods", "forest", "wolfish", "xxxx"};
 SynonymMap map = new SynonymMap(new FileInputStream("samples/fulltext/wn_s.pl"));
 for (int i = 0; i < words.length; i++) {
     String[] synonyms = map.getSynonyms(words[i]);
     System.out.println(words[i] + ":" + java.util.Arrays.asList(synonyms).toString());
 }
 
 Example output:
 hard:[arduous, backbreaking, difficult, fermented, firmly, grueling, gruelling, heavily, heavy, intemperately, knockout, laborious, punishing, severe, severely, strong, toilsome, tough]
 woods:[forest, wood]
 forest:[afforest, timber, timberland, wood, woodland, woods]
 wolfish:[edacious, esurient, rapacious, ravening, ravenous, voracious, wolflike]
 xxxx:[]
 

Author: whoschek.AT.lbl.DOT.gov

See Also: prologdb man page, Dave's synonym demo site

Constructor Summary
SynonymMap(InputStream input)
Constructs an instance, loading WordNet synonym data from the given input stream.
Method Summary
protected String analyze(String word)
Analyzes/transforms the given word on input stream loading.
String[] getSynonyms(String word)
Returns the synonym set for the given word, sorted ascending.
String toString()
Returns a String representation of the index data for debugging purposes.

Constructor Detail

SynonymMap

public SynonymMap(InputStream input)
Constructs an instance, loading WordNet synonym data from the given input stream. The stream is closed once loading finishes. The words in the stream must be encoded in UTF-8 or a compatible subset (for example, ASCII).

Parameters: input the stream to read from (null indicates an empty synonym map)

Throws: IOException if an error occurred while reading the stream.

Method Detail

analyze

protected String analyze(String word)
Analyzes/transforms the given word on input stream loading. This default implementation simply lowercases the word. Override this method with a custom stemming algorithm or similar, if desired.

Parameters: word the word to analyze

Returns: the same word, or a different word (or null to indicate that the word should be ignored)
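The contract of this hook can be sketched with a small stand-in class. WordLoader below is hypothetical and mirrors only the behavior described above (lowercase by default, null means ignore); it is not the Lucene class. MinLengthWordLoader shows what an override might look like, here with an assumed minimum-length filter in place of a real stemmer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in that mirrors only the analyze() hook; NOT the Lucene class.
class WordLoader {
    // Default behavior as documented: simply lowercase the word.
    protected String analyze(String word) {
        return word.toLowerCase(Locale.ROOT);
    }

    // Applies analyze() to each raw word and skips words for which it returns null.
    final List<String> load(List<String> rawWords) {
        List<String> kept = new ArrayList<>();
        for (String raw : rawWords) {
            String analyzed = analyze(raw);
            if (analyzed != null) {
                kept.add(analyzed);
            }
        }
        return kept;
    }
}

// Example override (an assumption): ignore very short words, otherwise use the default.
class MinLengthWordLoader extends WordLoader {
    @Override
    protected String analyze(String word) {
        if (word.length() < 3) {
            return null; // null signals that the word should be ignored
        }
        return super.analyze(word);
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("Forest", "ox", "Wood");
        System.out.println(new MinLengthWordLoader().load(raw)); // [forest, wood]
    }
}
```

A subclass of the real SynonymMap would override analyze in the same way, with the transformation applied while the input stream is loaded.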

getSynonyms

public String[] getSynonyms(String word)
Returns the synonym set for the given word, sorted ascending.

Parameters: word the word to look up (must be in lowercase).

Returns: the synonyms; a set of zero or more words, sorted ascending, each word containing lowercase characters that satisfy Character.isLetter().
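Because the lookup key must be lowercase and an unknown word yields an empty result, a caller may want a small wrapper like the sketch below. The class QueryExpansion and its expand method are hypothetical; the lookup is passed in as a function so the sketch runs without the library, and with a loaded instance one could pass map::getSynonyms.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

// Hypothetical helper; not part of the Lucene API.
class QueryExpansion {
    // Returns the lowercased word followed by its synonyms, in lookup order.
    static List<String> expand(String word, Function<String, String[]> lookup) {
        String key = word.toLowerCase(Locale.ROOT); // lookup keys must be lowercase
        List<String> terms = new ArrayList<>();
        terms.add(key);
        // An unknown word yields a zero-length array, so nothing is added for it.
        terms.addAll(Arrays.asList(lookup.apply(key)));
        return terms;
    }
}
```

With the sample data shown earlier in this page, expand("Woods", map::getSynonyms) would look up "woods" and return the word followed by forest and wood.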

toString

public String toString()
Returns a String representation of the index data for debugging purposes.

Returns: a String representation

Copyright © 2000-2007 Apache Software Foundation. All Rights Reserved.