Tuesday, January 28, 2014

Morpho-Syntactic Analysis


While traditional hand-built grammars often include
a rich semantics, we have found their
coverage inadequate for the logic puzzles task.
For example, the English Resource Grammar
(Copestake and Flickinger, 2000) fails to parse
any of the sentences in Figure 1 because it lacks coverage
of some words and of several different syntactic
structures, and the simplified versions of the text
that it can parse produce dozens of unranked
parse trees. For this reason, we use a broad-coverage
statistical parser (Klein and Manning,
2003) trained on the Penn Treebank. In addition
to robustness, treebank-trained statistical
parsers have the benefit of extensive research
on accurate ambiguity resolution. Qualitatively,
we have found that the output of the parser on
logic puzzles is quite good (see §10). After parsing,
each word in the resulting parse trees is
converted to base form by a stemmer.
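
As a rough illustration of this last step, here is a minimal Python sketch of base-form conversion over the leaves of a parse tree. The tuple encoding of trees, the `IRREGULAR` table, and the suffix rules are all assumptions made for this example; the text does not say which stemmer is used, and the parser call itself is left out.

```python
# Toy tree encoding (an assumption for this sketch):
#   leaf          = (tag, word)          e.g. ("NNS", "Sculptures")
#   internal node = (label, [children])  e.g. ("NP", [...])

# Hypothetical lookup table for a few irregular forms.
IRREGULAR = {"is": "be", "are": "be", "was": "be", "were": "be", "has": "have"}


def base_form(word, tag):
    """Return a crude base form for a word with a Penn Treebank tag."""
    if tag in ("NNP", "NNPS"):
        return word  # leave names such as "C" and "E" untouched
    lower = word.lower()
    if lower in IRREGULAR:
        return IRREGULAR[lower]
    if tag in ("NNS", "VBZ"):
        if lower.endswith("ies") and len(lower) > 4:
            return lower[:-3] + "y"
        if lower.endswith(("ses", "xes", "zes", "ches", "shes")):
            return lower[:-2]
        if lower.endswith("s") and not lower.endswith("ss"):
            return lower[:-1]
    if tag in ("VBD", "VBN") and lower.endswith("ed") and len(lower) > 4:
        # Crude: this over-strips verbs such as "placed".
        return lower[:-2]
    return lower


def lemmatize_tree(tree):
    """Return a copy of the tree with every leaf word in base form."""
    label, body = tree
    if isinstance(body, str):
        return (label, base_form(body, label))
    return (label, [lemmatize_tree(child) for child in body])


def leaves(tree):
    """Return the words of the tree, left to right."""
    label, body = tree
    if isinstance(body, str):
        return [body]
    return [word for child in body for word in leaves(child)]


example = ("S", [
    ("NP", [("NNS", "Sculptures"), ("NNP", "C"), ("CC", "and"), ("NNP", "E")]),
    ("VP", [("VBP", "are"), ("VP", [("VBN", "exhibited")])]),
])
print(leaves(lemmatize_tree(example)))
```

A real system would more likely use an established stemmer or lemmatizer; the point here is only that the conversion runs over the tagged leaves after parsing, so the part-of-speech tag is available to guide it.
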
A few tree-transformation rules are applied
to the parse trees to make them more convenient
for combinatorial semantics. Most of them
are general, e.g. imposing a binary branching
structure on verb phrases, and grouping expressions
like “more than”. A few of them correct
some parsing errors, such as nouns marked as
names and vice versa. There is growing awareness
in the probabilistic parsing literature that
mismatches between training and test set genre
can degrade parse accuracy, and that small
amounts of correct-genre data can be more important
than large amounts of wrong-genre data
(Gildea, 2001); we have found corroborating evidence
in misparsings of noun phrases common
in puzzle texts, such as “Sculptures C and E”,
which do not appear in the Wall Street Journal
corpus. Depending on the severity of this problem,
we may hand-annotate a small number of
puzzle texts to include in the parser's training data.
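
The two general tree-transformation rules mentioned above (binary branching on verb phrases, and grouping expressions like "more than") can be sketched in Python as follows. This is only an illustration under stated assumptions: the tuple tree encoding, the left-nesting direction of the binarization, and the `CMP` tag given to the merged leaf are choices made for this example, not details given in the text.

```python
# Toy tree encoding (an assumption for this sketch):
#   leaf          = (tag, word)
#   internal node = (label, [children])


def is_leaf(tree):
    return isinstance(tree[1], str)


def binarize_vp(tree):
    """Left-nest every VP that has more than two children."""
    if is_leaf(tree):
        return tree
    label, body = tree
    kids = [binarize_vp(kid) for kid in body]
    if label == "VP":
        # Fold the two leftmost children into a new VP until two remain.
        while len(kids) > 2:
            kids = [("VP", kids[:2])] + kids[2:]
    return (label, kids)


def group_phrase(tree, phrase=("more", "than"), tag="CMP"):
    """Merge adjacent sibling leaves spelling `phrase` into a single leaf."""
    if is_leaf(tree):
        return tree
    label, body = tree
    kids = [group_phrase(kid, phrase, tag) for kid in body]
    merged, i, n = [], 0, len(phrase)
    while i < len(kids):
        window = kids[i:i + n]
        if (len(window) == n
                and all(is_leaf(kid) for kid in window)
                and tuple(kid[1].lower() for kid in window) == phrase):
            merged.append((tag, "_".join(phrase)))  # hypothetical tag
            i += n
        else:
            merged.append(kids[i])
            i += 1
    return (label, merged)


flat_vp = ("VP", [
    ("VBZ", "gives"),
    ("NP", [("NNP", "Bob")]),
    ("NP", [("DT", "a"), ("NN", "book")]),
    ("PP", [("IN", "on"), ("NP", [("NNP", "Monday")])]),
])
print(binarize_vp(flat_vp))

quantifier = ("QP", [("JJR", "more"), ("IN", "than"), ("CD", "three")])
print(group_phrase(quantifier))
```

Both functions return new trees rather than modifying their input, so rules of this kind can be chained in any order.
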
