Type ViterbiParse
object
--+
|
ParseI
--+
|
AbstractParse
--+
|
ViterbiParse
A bottom-up PCFG
parser that uses dynamic programming to
find the single most likely parse for a text. The
ViterbiParse
parser parses texts by filling in a most
likely constituent table. This table records the most probable tree
representation for any given span and node value. In particular, it has
an entry for every start index, end index, and node value, recording the
most likely subtree that spans from the start index to the end index, and
has the given node value.
The ViterbiParse
parser fills in this table
incrementally. It starts by filling in all entries for constituents that
span one element of text (i.e., entries where the end index is one
greater than the start index). After it has filled in all table entries
for constituents that span one element of text, it fills in the entries
for constitutants that span two elements of text. It continues filling in
the entries for constituents spanning larger and larger portions of the
text, until the entire table has been filled. Finally, it returns the
table entry for a constituent spanning the entire text, whose node value
is the grammar's start symbol.
In order to find the most likely constituent with a given span and
node value, the ViterbiParse
parser considers all
productions that could produce that node value. For each production, it
finds all children that collectively cover the span and have the node
values specified by the production's right hand side. If the probability
of the tree formed by applying the production to the children is greater
than the probability of the current entry in the table, then the table is
updated with this new tree.
A pseudo-code description of the algorithm used by
ViterbiParse
is:
-
Create an empty most likely constituent table, MLC.
-
For width in 1...len(text):
-
For start in 1...len(text)-width:
-
For prod in grammar.productions:
-
For each sequence of subtrees [t[1], t[2], ..., t[n]] in MLC, where
t[i].node==prod.rhs[i], and the sequence covers [start:start+width]:
-
old_p = MLC[start, start+width, prod.lhs]
-
new_p = P(t[1])*P(t[1])*...*P(t[n])*P(prod)
-
if new_p > old_p:
-
new_tree = Tree(prod.lhs, t[1], t[2],
..., t[n])
-
MLC[start, start+width, prod.lhs] = new_tree
-
Return MLC[0, len(text),
start_symbol]
Method Summary |
|
__init__ (self,
grammar,
trace)
Create a new ViterbiParse parser, that uses {grammar} to
parse texts. |
|
__repr__(self)
|
|
get_parse_list(self,
tokens)
|
None
|
trace (self,
trace)
Set the level of tracing output that should be generated when parsing
a text. |
None
|
_add_constituents_spanning (self,
span,
constituents,
tokens)
Find any constituents that might cover span , and add them
to the most likely constituents table. |
|
_find_instantiations (self,
span,
constituents)
Return a list of the production instantiations that cover a given span of the
text. |
list of (list of
ProbabilisticTree or Token )
|
_match_rhs (self,
rhs,
span,
constituents)
Return a set of all the lists of children that cover span and
that match rhs . |
|
_trace_lexical_insertion(self,
token,
index,
width)
|
None
|
_trace_production (self,
production,
p,
span,
width)
Print trace output indicating that a given production has been applied
at a given location. |
Inherited from AbstractParse :
get_parse ,
grammar ,
parse
Inherited from ParseI :
get_parse_probs
Inherited from object :
__delattr__ ,
__getattribute__ ,
__hash__ ,
__new__ ,
__reduce__ ,
__reduce_ex__ ,
__setattr__ ,
__str__
|
Instance Variable Summary |
pcfg.Grammar |
_grammar : The grammar used to parse sentences. |
int |
_trace : The level of tracing output that should be generated when parsing a
text. |
__init__(self,
grammar,
trace=0)
(Constructor)
Create a new ViterbiParse parser, that uses {grammar}
to parse texts.
-
- Parameters:
grammar -
The grammar used to parse texts.
(type=pcfg.Grammar )
trace -
The level of tracing that should be used when parsing a text.
0 will generate no tracing output; and higher
numbers will produce more verbose tracing output.
(type=int )
- Overrides:
nltk_lite.parse.AbstractParse.__init__
|
trace(self,
trace=2)
Set the level of tracing output that should be generated when
parsing a text.
-
- Parameters:
trace -
The trace level. A trace level of 0 will generate
no tracing output; and higher trace levels will produce more
verbose tracing output.
(type=int )
- Returns:
-
None
|
_add_constituents_spanning(self,
span,
constituents,
tokens)
Find any constituents that might cover span , and add
them to the most likely constituents table.
-
- Parameters:
span -
The section of the text for which we are trying to find
possible constituents. The span is specified as a pair of
integers, where the first integer is the index of the first token
that should be included in the constituent; and the second
integer is the index of the first token that should not be
included in the constituent. I.e., the constituent should cover
text[span[0]:span[1]] , where
text is the text that we are
parsing.
(type=(int, int) )
constituents -
The most likely constituents table. This table records the
most probable tree representation for any given span and node
value. In particular, constituents(s,e,nv) is the most
likely ProbabilisticTree that covers text[s:e] and has a node value nv.symbol() , where text is the text that we are parsing.
When _add_constituents_spanning is called,
constituents should contain all possible
constituents that are shorter than span .
(type=dictionary from
(int,int,Nonterminal) to
(ProbabilisticToken or
ProbabilisticTree ).)
tokens -
The text we are parsing. This is only used for trace
output.
(type=list of tokens)
- Returns:
-
None
|
_find_instantiations(self,
span,
constituents)
-
- Parameters:
span -
The section of the text for which we are trying to find
production instantiations. The span is specified as a pair of
integers, where the first integer is the index of the first token
that should be covered by the production instantiation; and the
second integer is the index of the first token that should not be
covered by the production instantiation.
(type=(int, int) )
constituents -
The most likely constituents table. This table records the
most probable tree representation for any given span and node
value. See the module documentation for more information.
(type=dictionary from
(int,int,Nonterminal) to
(ProbabilisticToken or
ProbabilisticTree ).)
- Returns:
-
a list of the production instantiations that cover a given
span of the text. A production instantiation is a tuple
containing a production and a list of children, where the
production's right hand side matches the list of children; and
the children cover
span . @rtype:
list of pair of
Production , (list of
(ProbabilisticTree or token.
|
_match_rhs(self,
rhs,
span,
constituents)
-
- Parameters:
rhs -
The list specifying what kinds of children need to cover
span . Each nonterminal in rhs specifies
that the corresponding child should be a tree whose node value is
that nonterminal's symbol. Each terminal in rhs
specifies that the corresponding child should be a token whose
type is that terminal.
(type=list of Nonterminal or
(any))
span -
The section of the text for which we are trying to find child
lists. The span is specified as a pair of integers, where the
first integer is the index of the first token that should be
covered by the child list; and the second integer is the index of
the first token that should not be covered by the child list.
(type=(int, int) )
constituents -
The most likely constituents table. This table records the
most probable tree representation for any given span and node
value. See the module documentation for more information.
(type=dictionary from
(int,int,Nonterminal) to
(ProbabilisticToken or
ProbabilisticTree ).)
- Returns:
-
a set of all the lists of children that cover
span and that match rhs .
(type=list of (list of
ProbabilisticTree or Token ))
|
_trace_production(self,
production,
p,
span,
width)
Print trace output indicating that a given production has been
applied at a given location.
-
- Parameters:
production -
The production that has been applied
(type=Production )
p -
The probability of the tree produced by the production.
(type=float )
span -
The span of the production
(type=tuple )
- Returns:
-
None
|
Instance Variable Details |
_grammar
The grammar used to parse sentences.
-
- Type:
-
pcfg.Grammar
|
_trace
The level of tracing output that should be generated when parsing a
text.
-
- Type:
-
int
|