Package nltk_lite :: Package tokenize :: Module regexp

Function Summary

Function | Description
---|---
blankline(s) | Tokenize the text into paragraphs (separated by blank lines).
demo() | A demonstration that shows the output of several different tokenizers on the same string.
line(s) | Tokenize the text into lines.
regexp(text, pattern, gaps=False, advanced=False) | Tokenize the text according to the regular expression pattern.
shoebox(s) | Tokenize a Shoebox entry into its fields (separated by backslash markers).
token_split(text, pattern, advanced=False) | Return an iterator that generates tokens and the gaps between them.
treebank(s) | Tokenize a Treebank file into its tree strings.
whitespace(s) | Tokenize the text at whitespace.
wordpunct(s) | Tokenize the text into sequences of alphabetic and non-alphabetic characters.
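
A minimal usage sketch for the simpler tokenizers listed above, assuming the module is importable as nltk_lite.tokenize.regexp and that each function returns an iterator of token strings (as the token_split summary suggests):

```python
# A minimal sketch; assumes each tokenizer yields token strings.
from nltk_lite.tokenize import regexp as rtok

text = "Good muffins cost $3.88\nin New York.\n\nPlease buy me two."

print(list(rtok.whitespace(text)))  # tokens split at runs of whitespace
print(list(rtok.line(text)))        # one token per line
```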

Variable Summary

str BLANKLINE = '\\s*\\n\\s*\\n\\s*'
str NEWLINE = '\\n'
str SHOEBOXSEP = '^\\\\'
str TREEBANK = '^\\(.*?(?=^\\(|\\Z)'
str WHITESPACE = '\\s+'
str WORDPUNCT = '[a-zA-Z]+|[^a-zA-Z\\s]+'
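
The constants above are the raw patterns behind the corresponding tokenizers. As a rough illustration with the standard re module (not the module's own code), a separator-style pattern such as BLANKLINE pairs naturally with re.split:

```python
import re

# BLANKLINE as documented above, written here as a raw string.
BLANKLINE = r'\s*\n\s*\n\s*'

text = "Good muffins cost $3.88\nin New York.\n\nPlease buy me two."

# The pattern matches the blank-line gap between paragraphs.
print(re.split(BLANKLINE, text))
# ['Good muffins cost $3.88\nin New York.', 'Please buy me two.']
```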

Function Details

blankline(s): Tokenize the text into paragraphs (separated by blank lines).

demo(): A demonstration that shows the output of several different tokenizers on the same string.

line(s): Tokenize the text into lines.

regexp(text, pattern, gaps=False, advanced=False): Tokenize the text according to the regular expression pattern.
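
A hedged sketch of calling regexp() directly. Reading gaps=True as "the pattern matches the separators between tokens rather than the tokens themselves" is an inference from the other tokenizers on this page, not something the page states:

```python
# A sketch, not the module's own code; assumes the tokenizer yields strings
# and that gaps=True treats the pattern as a separator rather than a token.
from nltk_lite.tokenize import regexp as rtok

text = "apples, oranges, pears"

# Pattern describes the tokens themselves (gaps=False is the default).
print(list(rtok.regexp(text, pattern=r'[a-z]+')))
# expected under these assumptions: ['apples', 'oranges', 'pears']

# Pattern describes the material between tokens.
print(list(rtok.regexp(text, pattern=r',\s*', gaps=True)))
# expected under these assumptions: ['apples', 'oranges', 'pears']
```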

shoebox(s): Tokenize a Shoebox entry into its fields (separated by backslash markers).
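
For readers unfamiliar with the Shoebox lexicon format: an entry is a block of lines, each beginning with a backslash field marker such as \lx or \ge. A sketch of a call, where the sample entry is invented and the exact return format (an iterator of field strings) is an assumption:

```python
# A sketch only; the sample entry is invented and the return format of
# shoebox() (an iterator of field strings) is assumed, not documented here.
from nltk_lite.tokenize import regexp as rtok

entry = "\\lx kaakaaro\n\\ps N\n\\ge mixture\n"

for field in rtok.shoebox(entry):
    print(repr(field))
```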

token_split(text, pattern, advanced=False): Return an iterator that generates tokens and the gaps between them.

treebank(s): Tokenize a Treebank file into its tree strings.

whitespace(s): Tokenize the text at whitespace.

wordpunct(s): Tokenize the text into sequences of alphabetic and non-alphabetic characters. E.g. "She said 'hello.'" would be tokenized to ["She", "said", "'", "hello", ".'"].
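
The example output above can be reproduced with the WORDPUNCT pattern from the Variable Summary and the standard re module; this illustrates the pattern's behaviour rather than the module's actual implementation:

```python
import re

# WORDPUNCT as documented above, written here as a raw string.
WORDPUNCT = r'[a-zA-Z]+|[^a-zA-Z\s]+'

print(re.findall(WORDPUNCT, "She said 'hello.'"))
# ['She', 'said', "'", 'hello', ".'"]
```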

Generated by Epydoc 2.1 on Tue Sep 5 09:37:22 2006 | http://epydoc.sf.net