Package nltk_lite :: Package tokenize :: Module regexp

Module nltk_lite.tokenize.regexp

Functions for tokenizing a text, based on a regular expression which matches tokens or gaps.
Function Summary
  blankline(s)
Tokenize the text into paragraphs (separated by blank lines).
  demo()
A demonstration that shows the output of several different tokenizers on the same string.
  line(s)
Tokenize the text into lines.
  regexp(text, pattern, gaps, advanced)
Tokenize the text according to the regular expression pattern.
  shoebox(s)
Tokenize a Shoebox entry into its fields (separated by backslash markers).
  token_split(text, pattern, advanced)
Return an iterator that generates tokens and the gaps between them.
  treebank(s)
Tokenize a Treebank file into its tree strings.
  whitespace(s)
Tokenize the text at whitespace.
  wordpunct(s)
Tokenize the text into sequences of alphabetic and non-alphabetic characters.

Variable Summary
str BLANKLINE = '\\s*\\n\\s*\\n\\s*'
str NEWLINE = '\\n'
str SHOEBOXSEP = '^\\\\'
str TREEBANK = '^\\(.*?(?=^\\(|\\Z)'
str WHITESPACE = '\\s+'
str WORDPUNCT = '[a-zA-Z]+|[^a-zA-Z\\s]+'

Function Details

blankline(s)

Tokenize the text into paragraphs (separated by blank lines).
Parameters:
s - the string or string iterator to be tokenized
           (type=string or iter(string))
Returns:
An iterator over tokens
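The behaviour can be approximated with Python's standard re module, using the module's BLANKLINE pattern as a gap separator. This is only a sketch (the helper name blankline_sketch is hypothetical, and it returns a list rather than an iterator), not the library's actual implementation:

```python
import re

# The module's BLANKLINE pattern: runs of whitespace containing at
# least two newlines mark a paragraph boundary.
BLANKLINE = r'\s*\n\s*\n\s*'

def blankline_sketch(s):
    """Approximate blankline(): split s into paragraphs on blank lines."""
    return [tok for tok in re.split(BLANKLINE, s) if tok]

paragraphs = blankline_sketch("First para.\n\nSecond para.\nStill second.")
# → ['First para.', 'Second para.\nStill second.']
```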

demo()

A demonstration that shows the output of several different tokenizers on the same string.

line(s)

Tokenize the text into lines.
Parameters:
s - the string or string iterator to be tokenized
           (type=string or iter(string))
Returns:
An iterator over tokens

regexp(text, pattern, gaps=False, advanced=False)

Tokenize the text according to the regular expression pattern.
Parameters:
text - the string or string iterator to be tokenized
           (type=string or iter(string))
pattern - the regular expression
           (type=string)
gaps - set to True if the pattern matches material between tokens
           (type=boolean)
advanced - set to True if the pattern is complex, making use of () groups
           (type=boolean)
Returns:
An iterator over tokens
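The gaps flag distinguishes two modes: with gaps=False the pattern describes the tokens themselves, while with gaps=True it describes the material between tokens. A minimal sketch of this distinction using the standard re module (regexp_sketch is a hypothetical helper returning a list, not the module's code):

```python
import re

def regexp_sketch(text, pattern, gaps=False):
    """Sketch of regexp(): match tokens directly, or split on gaps."""
    if gaps:
        # Pattern matches separators; keep the non-empty pieces between them.
        return [tok for tok in re.split(pattern, text) if tok]
    # Pattern matches the tokens themselves.
    return re.findall(pattern, text)

# Match mode: the pattern describes the tokens.
match_mode = regexp_sketch("a1 b2", r'[a-z]\d')         # → ['a1', 'b2']
# Gap mode: the pattern describes the separators.
gap_mode = regexp_sketch("a1 b2", r'\s+', gaps=True)    # → ['a1', 'b2']
```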

shoebox(s)

Tokenize a Shoebox entry into its fields (separated by backslash markers).
Parameters:
s - the string or string iterator to be tokenized
           (type=string or iter(string))
Returns:
An iterator over tokens
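Since the SHOEBOXSEP pattern anchors a backslash to the start of a line, the splitting can be sketched with re.split in multiline mode (shoebox_sketch is a hypothetical helper; the real function returns an iterator):

```python
import re

SHOEBOXSEP = r'^\\'  # field markers start a line with a backslash

def shoebox_sketch(entry):
    """Sketch of shoebox(): split an entry at line-initial backslash markers."""
    return [f for f in re.split(SHOEBOXSEP, entry, flags=re.MULTILINE) if f]

fields = shoebox_sketch("\\lx kaa\n\\ps N\n\\ge gall\n")
# → ['lx kaa\n', 'ps N\n', 'ge gall\n']
```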

token_split(text, pattern, advanced=False)

Return an iterator that generates tokens and the gaps between them.
Parameters:
text - the string or string iterator to be tokenized
           (type=string or iter(string))
pattern - the regular expression
           (type=string)
advanced - set to True if the pattern is complex, making use of () groups
           (type=boolean)
Returns:
An iterator that generates tokens and the gaps between them

treebank(s)

Tokenize a Treebank file into its tree strings.
Parameters:
s - the string or string iterator to be tokenized
           (type=string or iter(string))
Returns:
An iterator over tokens
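The TREEBANK pattern matches from a line-initial open parenthesis up to (but not including) the next line-initial open parenthesis or the end of the string, so each top-level tree string becomes one token. A sketch with the standard re module (treebank_sketch is hypothetical; the flags are needed so ^ matches line starts and . matches newlines):

```python
import re

# From a line-initial '(' up to the next line-initial '(' or end of string.
TREEBANK = r'^\(.*?(?=^\(|\Z)'

def treebank_sketch(s):
    """Sketch of treebank(): extract each top-level tree string."""
    return re.findall(TREEBANK, s, flags=re.MULTILINE | re.DOTALL)

trees = treebank_sketch("(S (NP Alice) (VP runs))\n(S (NP Bob) (VP sits))\n")
# → ['(S (NP Alice) (VP runs))\n', '(S (NP Bob) (VP sits))\n']
```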

whitespace(s)

Tokenize the text at whitespace.
Parameters:
s - the string or string iterator to be tokenized
           (type=string or iter(string))
Returns:
An iterator over tokens

wordpunct(s)

Tokenize the text into sequences of alphabetic and non-alphabetic characters. E.g. "She said 'hello.'" would be tokenized to ["She", "said", "'", "hello", ".'"]
Parameters:
s - the string or string iterator to be tokenized
           (type=string or iter(string))
Returns:
An iterator over tokens
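The example in the description follows directly from the WORDPUNCT pattern: alphabetic runs and runs of non-alphabetic, non-whitespace characters alternate as tokens. A sketch using re.findall (wordpunct_sketch is a hypothetical name; the real function returns an iterator):

```python
import re

WORDPUNCT = r'[a-zA-Z]+|[^a-zA-Z\s]+'  # alphabetic runs, or punctuation runs

def wordpunct_sketch(s):
    """Sketch of wordpunct(): alternate alphabetic and punctuation tokens."""
    return re.findall(WORDPUNCT, s)

tokens = wordpunct_sketch("She said 'hello.'")
# → ['She', 'said', "'", 'hello', ".'"]
```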

Variable Details

BLANKLINE

Type:
str
Value:
'''\\s*\
\\s*\
\\s*'''                                                                

NEWLINE

Type:
str
Value:
'''\
'''                                                                    

SHOEBOXSEP

Type:
str
Value:
'^\\\\'                                                                

TREEBANK

Type:
str
Value:
'^\\(.*?(?=^\\(|\\Z)'                                                  

WHITESPACE

Type:
str
Value:
'\\s+'                                                                 

WORDPUNCT

Type:
str
Value:
'[a-zA-Z]+|[^a-zA-Z\\s]+'                                              

Generated by Epydoc 2.1 on Tue Sep 5 09:37:22 2006 http://epydoc.sf.net