- java.lang.Object
-
- org.apache.lucene.analysis.WordlistLoader
-
public class WordlistLoader extends java.lang.Object
Loader for text files that represent a list of stopwords.
- See Also:
to obtain Reader instances
-
-
Field Summary
Modifier and Type Field Description
private static int
INITIAL_CAPACITY
-
Constructor Summary
Modifier Constructor Description
private
WordlistLoader()
no instance
-
Method Summary
Modifier and Type Method Description
private static java.io.BufferedReader
getBufferedReader(java.io.Reader reader)
static java.util.List<java.lang.String>
getLines(java.io.InputStream stream, java.nio.charset.Charset charset)
Reads the given stream using the given character encoding and returns the (non comment) lines containing data.
static CharArraySet
getSnowballWordSet(java.io.InputStream stream)
Reads stopwords from a stopword list in Snowball format.
static CharArraySet
getSnowballWordSet(java.io.InputStream stream, java.nio.charset.Charset charset)
Reads stopwords from a stopword list in Snowball format.
static CharArraySet
getSnowballWordSet(java.io.Reader reader)
Reads stopwords from a stopword list in Snowball format.
static CharArraySet
getSnowballWordSet(java.io.Reader reader, CharArraySet result)
Reads stopwords from a stopword list in Snowball format.
static CharArrayMap<java.lang.String>
getStemDict(java.io.Reader reader, CharArrayMap<java.lang.String> result)
Reads a stem dictionary.
static CharArraySet
getWordSet(java.io.InputStream stream)
Reads lines from an InputStream with UTF-8 charset and adds every line as an entry to a CharArraySet (omitting leading and trailing whitespace).
static CharArraySet
getWordSet(java.io.InputStream stream, java.lang.String comment)
Reads lines from an InputStream with UTF-8 charset and adds every non-comment line as an entry to a CharArraySet (omitting leading and trailing whitespace).
static CharArraySet
getWordSet(java.io.InputStream stream, java.nio.charset.Charset charset)
Reads lines from an InputStream with the given charset and adds every line as an entry to a CharArraySet (omitting leading and trailing whitespace).
static CharArraySet
getWordSet(java.io.InputStream stream, java.nio.charset.Charset charset, java.lang.String comment)
Reads lines from an InputStream with the given charset and adds every non-comment line as an entry to a CharArraySet (omitting leading and trailing whitespace).
static CharArraySet
getWordSet(java.io.Reader reader)
Reads lines from a Reader and adds every line as an entry to a CharArraySet (omitting leading and trailing whitespace).
static CharArraySet
getWordSet(java.io.Reader reader, java.lang.String comment)
Reads lines from a Reader and adds every non-comment line as an entry to a CharArraySet (omitting leading and trailing whitespace).
static CharArraySet
getWordSet(java.io.Reader reader, java.lang.String comment, CharArraySet result)
Reads lines from a Reader and adds every non-blank non-comment line as an entry to a CharArraySet (omitting leading and trailing whitespace).
static CharArraySet
getWordSet(java.io.Reader reader, CharArraySet result)
Reads lines from a Reader and adds every non-blank line as an entry to a CharArraySet (omitting leading and trailing whitespace).
-
-
-
Field Detail
-
INITIAL_CAPACITY
private static final int INITIAL_CAPACITY
- See Also:
- Constant Field Values
-
-
Method Detail
-
getWordSet
public static CharArraySet getWordSet(java.io.Reader reader, CharArraySet result) throws java.io.IOException
Reads lines from a Reader and adds every non-blank line as an entry to a CharArraySet (omitting leading and trailing whitespace). Every line of the Reader should contain only one word. The words need to be in lowercase if you make use of an Analyzer which uses LowerCaseFilter (like StandardAnalyzer).
- Parameters:
reader
- Reader containing the wordlist
result
- the CharArraySet to fill with the reader's words
- Returns:
- the given CharArraySet with the reader's words
- Throws:
java.io.IOException
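
The contract described above (one word per line, surrounding whitespace omitted, blank lines skipped, the supplied set returned) can be illustrated with a minimal plain-JDK sketch. The class `WordSetSketch` and the method `readWords` are hypothetical names invented for this illustration, and a `java.util.Set<String>` stands in for `CharArraySet`; this is not Lucene's implementation.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashSet;
import java.util.Set;

public class WordSetSketch {

    // Hypothetical helper (not a Lucene method): fills the given set with one trimmed word per line.
    static Set<String> readWords(Reader reader, Set<String> result) throws IOException {
        try (BufferedReader br = new BufferedReader(reader)) {
            String line;
            while ((line = br.readLine()) != null) {
                String word = line.trim();   // omit leading and trailing whitespace
                if (!word.isEmpty()) {       // skip blank lines
                    result.add(word);
                }
            }
        }
        return result;                       // the same set instance that was passed in
    }

    public static void main(String[] args) throws IOException {
        Set<String> words = readWords(new StringReader("the\n  and \n\nof\n"), new LinkedHashSet<>());
        System.out.println(words);           // prints [the, and, of]
    }
}
```

Because the sketch does not change case, the entries in the list would already need to be lowercase to match tokens produced by a lowercasing analysis chain.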
-
getWordSet
public static CharArraySet getWordSet(java.io.Reader reader) throws java.io.IOException
Reads lines from a Reader and adds every line as an entry to a CharArraySet (omitting leading and trailing whitespace). Every line of the Reader should contain only one word. The words need to be in lowercase if you make use of an Analyzer which uses LowerCaseFilter (like StandardAnalyzer).
- Parameters:
reader
- Reader containing the wordlist
- Returns:
- An unmodifiable CharArraySet with the reader's words
- Throws:
java.io.IOException
-
getWordSet
public static CharArraySet getWordSet(java.io.InputStream stream) throws java.io.IOException
Reads lines from an InputStream with UTF-8 charset and adds every line as an entry to a CharArraySet (omitting leading and trailing whitespace). Every line of the stream should contain only one word. The words need to be in lowercase if you make use of an Analyzer which uses LowerCaseFilter (like StandardAnalyzer).
- Parameters:
stream
- InputStream containing the wordlist
- Returns:
- An unmodifiable CharArraySet with the stream's words
- Throws:
java.io.IOException
-
getWordSet
public static CharArraySet getWordSet(java.io.InputStream stream, java.nio.charset.Charset charset) throws java.io.IOException
Reads lines from an InputStream with the given charset and adds every line as an entry to a CharArraySet (omitting leading and trailing whitespace). Every line of the stream should contain only one word. The words need to be in lowercase if you make use of an Analyzer which uses LowerCaseFilter (like StandardAnalyzer).
- Parameters:
stream
- InputStream containing the wordlist
charset
- Charset of the wordlist
- Returns:
- An unmodifiable CharArraySet with the stream's words
- Throws:
java.io.IOException
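
To show why the charset parameter matters and what an unmodifiable result means for callers, here is a minimal plain-JDK sketch. `CharsetWordSetSketch` and `readWords` are hypothetical names for this illustration, a `Set<String>` stands in for `CharArraySet`, and the ISO-8859-1 sample data is an assumption chosen only because it decodes differently under UTF-8.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

public class CharsetWordSetSketch {

    // Hypothetical helper (not a Lucene method): decodes the stream with the given
    // charset, collects one trimmed word per line, and returns a read-only view.
    static Set<String> readWords(InputStream stream, Charset charset) throws IOException {
        Set<String> result = new LinkedHashSet<>();
        try (BufferedReader br = new BufferedReader(new InputStreamReader(stream, charset))) {
            String line;
            while ((line = br.readLine()) != null) {
                String word = line.trim();
                if (!word.isEmpty()) {
                    result.add(word);
                }
            }
        }
        return Collections.unmodifiableSet(result);
    }

    public static void main(String[] args) throws IOException {
        // Sample word list encoded as ISO-8859-1 (assumed data for the illustration).
        byte[] data = "caf\u00e9\n  na\u00efve \n".getBytes(StandardCharsets.ISO_8859_1);
        Set<String> words = readWords(new ByteArrayInputStream(data), StandardCharsets.ISO_8859_1);
        System.out.println(words.contains("caf\u00e9"));   // prints true
        try {
            words.add("extra");
        } catch (UnsupportedOperationException e) {
            System.out.println("result is read-only");
        }
    }
}
```

Decoding the same bytes with a charset other than the one the file was written in would produce different strings, so lookups against the resulting set could silently fail.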
-
getWordSet
public static CharArraySet getWordSet(java.io.Reader reader, java.lang.String comment, CharArraySet result) throws java.io.IOException
Reads lines from a Reader and adds every non-blank non-comment line as an entry to a CharArraySet (omitting leading and trailing whitespace). Every line of the Reader should contain only one word. The words need to be in lowercase if you make use of an Analyzer which uses LowerCaseFilter (like StandardAnalyzer).
- Parameters:
reader
- Reader containing the wordlist
comment
- The string representing a comment.
result
- the CharArraySet to fill with the reader's words
- Returns:
- the given CharArraySet with the reader's words
- Throws:
java.io.IOException
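
The comment handling described above can be sketched with the JDK alone. `CommentedWordSetSketch` and `readWords` are hypothetical names, and a `Set<String>` stands in for `CharArraySet`. The sketch assumes a comment line is one that begins with the comment string at its very first character; the description above does not say how an indented marker is treated, so that choice is an assumption of this example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashSet;
import java.util.Set;

public class CommentedWordSetSketch {

    // Hypothetical helper (not a Lucene method): skips comment and blank lines, trims the rest.
    static Set<String> readWords(Reader reader, String comment, Set<String> result) throws IOException {
        try (BufferedReader br = new BufferedReader(reader)) {
            String line;
            while ((line = br.readLine()) != null) {
                if (line.startsWith(comment)) {
                    continue;                // assumption: marker must be at the start of the raw line
                }
                String word = line.trim();   // omit leading and trailing whitespace
                if (!word.isEmpty()) {       // skip blank lines
                    result.add(word);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        String list = "# stopwords\nthe\n\n and\n#of\n";
        Set<String> words = readWords(new StringReader(list), "#", new LinkedHashSet<>());
        System.out.println(words);           // prints [the, and]
    }
}
```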
-
getWordSet
public static CharArraySet getWordSet(java.io.Reader reader, java.lang.String comment) throws java.io.IOException
Reads lines from a Reader and adds every non-comment line as an entry to a CharArraySet (omitting leading and trailing whitespace). Every line of the Reader should contain only one word. The words need to be in lowercase if you make use of an Analyzer which uses LowerCaseFilter (like StandardAnalyzer).
- Parameters:
reader
- Reader containing the wordlist
comment
- The string representing a comment.
- Returns:
- An unmodifiable CharArraySet with the reader's words
- Throws:
java.io.IOException
-
getWordSet
public static CharArraySet getWordSet(java.io.InputStream stream, java.lang.String comment) throws java.io.IOException
Reads lines from an InputStream with UTF-8 charset and adds every non-comment line as an entry to a CharArraySet (omitting leading and trailing whitespace). Every line of the stream should contain only one word. The words need to be in lowercase if you make use of an Analyzer which uses LowerCaseFilter (like StandardAnalyzer).
- Parameters:
stream
- InputStream in UTF-8 encoding containing the wordlist
comment
- The string representing a comment.
- Returns:
- An unmodifiable CharArraySet with the stream's words
- Throws:
java.io.IOException
-
getWordSet
public static CharArraySet getWordSet(java.io.InputStream stream, java.nio.charset.Charset charset, java.lang.String comment) throws java.io.IOException
Reads lines from an InputStream with the given charset and adds every non-comment line as an entry to a CharArraySet (omitting leading and trailing whitespace). Every line of the stream should contain only one word. The words need to be in lowercase if you make use of an Analyzer which uses LowerCaseFilter (like StandardAnalyzer).
- Parameters:
stream
- InputStream containing the wordlist
charset
- Charset of the wordlist
comment
- The string representing a comment.
- Returns:
- An unmodifiable CharArraySet with the stream's words
- Throws:
java.io.IOException
-
getSnowballWordSet
public static CharArraySet getSnowballWordSet(java.io.Reader reader, CharArraySet result) throws java.io.IOException
Reads stopwords from a stopword list in Snowball format.
The Snowball format is the following:
- Lines may contain multiple words separated by whitespace.
- The comment character is the vertical line (|).
- Lines may contain trailing comments.
- Parameters:
reader
- Reader containing a Snowball stopword list
result
- the CharArraySet to fill with the reader's words
- Returns:
- the given CharArraySet with the reader's words
- Throws:
java.io.IOException
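
The three Snowball format rules listed above can be sketched with plain JDK classes. `SnowballListSketch` and `readSnowball` are hypothetical names for this illustration, and a `Set<String>` stands in for `CharArraySet`; treat it as one reading of the documented format rather than Lucene's own code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashSet;
import java.util.Set;

public class SnowballListSketch {

    // Hypothetical helper (not a Lucene method): parses a Snowball-style stopword list.
    static Set<String> readSnowball(Reader reader, Set<String> result) throws IOException {
        try (BufferedReader br = new BufferedReader(reader)) {
            String line;
            while ((line = br.readLine()) != null) {
                int bar = line.indexOf('|');           // the comment character
                if (bar >= 0) {
                    line = line.substring(0, bar);     // drop a whole-line or trailing comment
                }
                for (String word : line.split("\\s+")) {   // several words may share a line
                    if (!word.isEmpty()) {
                        result.add(word);
                    }
                }
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        String list = " | A Snowball stop word list\n"
                    + "i me       | first person\n"
                    + "my myself\n"
                    + "\n"
                    + "we         | plural\n";
        Set<String> words = readSnowball(new StringReader(list), new LinkedHashSet<>());
        System.out.println(words);                     // prints [i, me, my, myself, we]
    }
}
```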
-
getSnowballWordSet
public static CharArraySet getSnowballWordSet(java.io.Reader reader) throws java.io.IOException
Reads stopwords from a stopword list in Snowball format.
The Snowball format is the following:
- Lines may contain multiple words separated by whitespace.
- The comment character is the vertical line (|).
- Lines may contain trailing comments.
- Parameters:
reader
- Reader containing a Snowball stopword list
- Returns:
- An unmodifiable CharArraySet with the reader's words
- Throws:
java.io.IOException
-
getSnowballWordSet
public static CharArraySet getSnowballWordSet(java.io.InputStream stream) throws java.io.IOException
Reads stopwords from a stopword list in Snowball format.
The Snowball format is the following:
- Lines may contain multiple words separated by whitespace.
- The comment character is the vertical line (|).
- Lines may contain trailing comments.
- Parameters:
stream
- InputStream in UTF-8 encoding containing a Snowball stopword list
- Returns:
- An unmodifiable CharArraySet with the stream's words
- Throws:
java.io.IOException
-
getSnowballWordSet
public static CharArraySet getSnowballWordSet(java.io.InputStream stream, java.nio.charset.Charset charset) throws java.io.IOException
Reads stopwords from a stopword list in Snowball format.
The Snowball format is the following:
- Lines may contain multiple words separated by whitespace.
- The comment character is the vertical line (|).
- Lines may contain trailing comments.
- Parameters:
stream
- InputStream containing a Snowball stopword list
charset
- Charset of the stopword list
- Returns:
- An unmodifiable CharArraySet with the stream's words
- Throws:
java.io.IOException
-
getStemDict
public static CharArrayMap<java.lang.String> getStemDict(java.io.Reader reader, CharArrayMap<java.lang.String> result) throws java.io.IOException
Reads a stem dictionary. Each line contains:
word\tstem
(i.e. two tab-separated words)
- Returns:
- stem dictionary that overrules the stemming algorithm
- Throws:
java.io.IOException
- If there is a low-level I/O error.
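
The `word\tstem` line format can be illustrated with a minimal plain-JDK sketch. `StemDictSketch` and `readStemDict` are hypothetical names, and a `Map<String, String>` stands in for `CharArrayMap<String>`. Skipping lines that contain no tab is an assumption of this sketch; the description above does not say how malformed lines are handled.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

public class StemDictSketch {

    // Hypothetical helper (not a Lucene method): reads "word<TAB>stem" pairs into the given map.
    static Map<String, String> readStemDict(Reader reader, Map<String, String> result) throws IOException {
        try (BufferedReader br = new BufferedReader(reader)) {
            String line;
            while ((line = br.readLine()) != null) {
                String[] parts = line.split("\t", 2);   // split at the first tab only
                if (parts.length == 2) {                // assumption: ignore lines without a tab
                    result.put(parts[0], parts[1]);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        String dict = "running\trun\nmice\tmouse\n";
        Map<String, String> stems = readStemDict(new StringReader(dict), new LinkedHashMap<>());
        System.out.println(stems);                      // prints {running=run, mice=mouse}
    }
}
```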
-
getLines
public static java.util.List<java.lang.String> getLines(java.io.InputStream stream, java.nio.charset.Charset charset) throws java.io.IOException
Reads the given stream using the given character encoding and returns the (non comment) lines containing data.
A comment line is any line that starts with the character "#".
- Returns:
- a list of non-blank non-comment lines with whitespace trimmed
- Throws:
java.io.IOException
- If there is a low-level I/O error.
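
The filtering described above (lines starting with "#" dropped, whitespace trimmed, blank lines omitted) can be sketched with the JDK alone. `LinesSketch` and `readLines` are hypothetical names for this illustration, and the sketch checks for "#" on the raw line before trimming, which is an assumption about the order of the two steps.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class LinesSketch {

    // Hypothetical helper (not a Lucene method): returns trimmed, non-blank, non-comment lines.
    static List<String> readLines(InputStream stream, Charset charset) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(new InputStreamReader(stream, charset))) {
            String line;
            while ((line = br.readLine()) != null) {
                if (line.startsWith("#")) {
                    continue;                    // comment line
                }
                String trimmed = line.trim();
                if (!trimmed.isEmpty()) {        // skip blank lines
                    lines.add(trimmed);
                }
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "# comment\nalpha\n\n  beta  \n".getBytes(StandardCharsets.UTF_8);
        List<String> lines = readLines(new ByteArrayInputStream(data), StandardCharsets.UTF_8);
        System.out.println(lines);               // prints [alpha, beta]
    }
}
```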
-
getBufferedReader
private static java.io.BufferedReader getBufferedReader(java.io.Reader reader)
-
-