| 15-482 / 11-682 - Human Language Technologies (Fall 2006) Jamie Callan Alon Lavie Alan Black |
Due: Thursday, October 26, 2006 |
Your task in this homework assignment is to extend a unification-based grammar and the lexicon it works with, in order to support the parsing of natural language questions about movies (and other media types) that appear in the IMDB database. The target output representation is in the form of semantic frames, which can then be fed into a back-end query system that retrieves the requested information from the database.
To perform this task, you will be using a parser, a morphological processing utility and lexicon access functions which are all already implemented in Lisp, and which you are not allowed to modify. A basic grammar and lexicon, which already can parse a subset of the required sentences are also provided. You may only extend or modify the grammar and the lexicon in order to parse all the given sentences in the test set as required.
The provided basic grammar and lexicon are in the following files:
movie-grammar.gra
movie.dict
For the purpose of grammar development and testing, you should use the web
interface to the parser at:
http://boston.lti.cs.cmu.edu/classes/11-682/HW/HW2/Server/parse.html
The web interface supports loading your grammar and dictionary files, and
then running the parser on individual sentences or the entire example set. The
parser behind the web interface will load your grammar and lexicon, parse
the input, and return the output, as well as an indication of whether it
matches the correct parse for the sentence. A "debug" mode is also provided,
to help you isolate sources of error in your grammar (see more information
below).
Your grammar and lexicon will be evaluated for both coverage and ability to
generalize. It's possible to write a trivial input-matching "flat" grammar that
simply contains a single grammar rule for each of the possible given inputs,
but this grammar obviously has no ability to generalize beyond the examples.
Your solution will thus be evaluated based on correctness (correct parses are
produced for the example sentences), conciseness (minimizing number of grammar
rules required to cover the example set), and ability to generalize (ability
to parse a small set of unseen sentences which are not provided to you but used
as test questions for grading). You can check the performance of your grammar
and lexicon at:
http://boston.lti.cs.cmu.edu/classes/11-682/HW/HW2/Server/test.html
The development test-set you need to develop your grammar to parse correctly is accessible via the web interface, and is also in the test.txt file.
The natural language input is intended to be parsed into a symbolic semantic
(rather than syntactic) representation, in the form of a simple
feature-structure consisting of slot-value pairs. The following is a
specification of the supported semantic representation forms:
<form> := (<pairs>) <pairs> := <pair> <pairs> | e <pair> := (<slot> <value>) <slot> := actor | director | title | year | medium | genre <value> := <string> | *query | *count
A valid form contains a single slot-value pair with the value of either *query or *count, and at least one additional slot-value pair. A "query" type form is intended to query the database for values that fill the slot in question, and satisfy the remaining specified slot-value information in the form. A "count" type form is intended to query the database and return a count of the number of unique values that fill the slot in question and satisfy the remaining specified slot-value information in the form. Permuting the pairs in a list doesn't change the semantic value.
For the purposes of this assignment:
The "genre" slot can be one of: sci-fi, comedy, action, thriller, romance, drama,
crime.
The "medium" slot can be one of: tv, movie.
(1) What actors were in Star_Wars? ((title "Star_Wars") (actor *query))
(2) What year was Jaws released? ((year *query) (title "Jaws"))
(3) What comedies starred Jim_Carrey? ((genre "comedy") (medium "movie") (actor "Jim_Carrey") (title *query))
(4) What movies was Gwyneth_Paltrow in? ((actor "Gwyneth_Paltrow") (medium "movie") (title *query))
(5) Who directed [the movie] Casablanca? ((director *query) (title "Casablanca"))
(6) Who directed [the actor] Peter_Sellers? ((director *query) (actor "Peter_Sellers"))
(7) How many movies were directed by Stanley_Kubrick? ((director "Stanley_Kubrick") (medium "movie") (title *count))
(8) How many action movies starred Arnold_Schwarzenegger? ((title *count) (genre "action") (medium "movie") (actor "Arnold_Schwarzenegger"))
(9) How many thrillers were in 1989? ((genre "thriller") (medium "movie") (year "1989") (title *count))
(10) What types of movies starred Christopher_Lee? ((genre *query) (medium "movie") (actor "Christopher_Lee"))
Your starting point is a basic grammar and lexicon that already provide some
partial coverage of the natural language questions that you need to cover.
The grammar is written as a semantic grammar - the non-terminals in the
grammar represent semantic rather than syntactic categories, which are used
to cluster and represent phrases that have a similar semantic meaning
relevant to the task. For example, the rule:
(<ACT-Q> <--> (<WH-ACT> <ACT-V> in <M-NAME>)
(((x0 actor) = *query)
((x0 title) = (x4 title))))
reflects the fact that we can express an "actor question" (<ACT-Q>) by an
"actor-type" wh-phrase (<WH-ACT>), followed by an "actor-type" verb
(<ACT-V>), the word "in", and then a movie-name (<M-NAME>). The unification
operations associated with the feature constraints in the rule assign values
to the feature structure associated with the left-hand-side non-terminal in
the rule, as described in class.
The lexicon is accessed via a provided morphological analysis procedure,
which analyzes the input word into it's root form, and then matches the root
form with an entry in the lexicon. The way to use this lexical access from
within the grammar is demonstrated in the following grammar rule:
(<N> <--> (%)
((x0 <= (parse-word (x1 value)))
((x0 cat) =c N)))
The "wild-card" symbol "%" matches any input word. The special unification
constraint form on the second line invokes the "parse-word" function on the
input word, and the result is stored in the "x0" feature-structure
associated with the left-hand-side non-terminal. "parse-word"
morphologically analyzes the input word, looks up the root form in the
lexicon, and returns a feature structure based on the lexicon entry for the
root form of the word. The second unification constraint then requires that
"x0" have a feature "cat" with an assigned value of "N". For example, the
above rule will succeed in parsing the input word "comedies", provided the
following entry for its root form appears in the lexicon:
("comedy" N U C)
The given basic lexicon contains examples of some more complex entries that provide additional feature information about the root form of certain words.
A special function is provided for parsing "open-class" values such as movie
title names and actor names. The "get-name" function returns a string
containing the all-lower-cased version of the word passed to it as a
parameter. It can be used inside a grammar rule in conjunction with the
"wild-card" input matching token "%". To limit the applicability of such a
rule, we require names in the input to start with an upper-case letter and
be single tokens. A pre-processor function identifies the Capitalization and
adds a token "@CAP" into the input before the name word. This allows us to
then parse movie title names with the following grammar rule:
;; movie name
(<M-NAME> <--> (@CAP %)
(((x0 title) <= (get-name (x2 value)))))
The input word "Star_Wars" would be modified into the two-word sequence "@CAP Star_Wars", and when parsed by the above rule, the function "get-name" would return the value "star_wars".
The parser web interface provided supports a debug mode that you will find
useful to debug your grammar as you extend it for increased coverage. In
debug mode, the parser prints out the rules that succeed or fail as each
word of the input is parsed. Here is an example:
>(pstr "what movie starred Jim_Carrey") WHAT rule # 28 MOVIE-GRAMMARF-72 <WH-WORD>(1) --> WHAT killed - rule # 21 MOVIE-GRAMMARF-53 <WH-ACT> --> <WH-WORD>(1) killed - rule # 23 MOVIE-GRAMMARF-57 <WH-TIME> --> <WH-WORD>(1) rule # 25 MOVIE-GRAMMARF- <WH-TITLE>f) --> <WH-WORD>(1) MOVIE killed - rule # 18 MOVIE-GRAMMARF-61 <TITLE-V> --> MOVIE(4) rule # 32 MOVIE-GRAMMARF-83 <N>(5) --> MOVIE rule # 31 MOVIE-GRAMMARF-81 <WH-NP>?(6) --> <WH-WORD>(5) <N>(5) killed - rule # 22 MOVIE-GRAMMARF-55 <WH-ACT> --> <WH-NP>(6) killed - rule # 24 MOVIE-GRAMMARF-59 <WH-TIME> --> <WH-NP>(6) rule # 26 MOVIE-GRAMMARF-68 <WH-TITLE>I) --> <WH-NP>(7) STARRED rule # 18 MOVIE-GRAMMARF-47 <TITLE-V>(9) --> STARRED @CAP JIM_CARREY rule # 20 MOVIE-GRAMMARF-51 <ACT-NAME>(12) --> @CAP JIM_CARREY rule # 8 MOVIE-GRAMMARF-16 <TITLE-Q>(13) --> <WH-TITLE>(7) <TITLE-V>(9) <ACT-NAME>(12) rule # 4 MOVIE-GRAMMARF-8 <SENT>(14) --> <TITLE-Q>(13) rule # 1 MOVIE-GRAMMARF-2 <START>(15) --> <SENT>(14)
Your report must include your grammar and lexicon, an explanation of how the grammar rules and lexicon entries are designed, examples of constructions your rules are supposed to handle, and the answers to the following questions:
1. One problem in the grammar that you have developed is that name values
are not completely converted into a canonical form in the output
representation. For example, the following two input sentences would not
result in identical output representations:
1) Which movies starred Jim_Carrey
2) Which movies starred James_Carrey
Without transforming names into canonical forms, the database lookup may not
succeed in retrieving all correct answers for a given query. Suggest two
possible ways you could address this problem within the general parsing task
(either within the grammar or in pre- or post- parsing processing). Indicate
which of the two possible methods you think would be preferable and why. You
are not required to implement either of these methods!
2. Another problem with our grammar is that it is not very effective in handling ambiguity. For example, the sentence "who directed Star_Wars" results in two different parses, one in which "Star_Wars" is interpreted as a movie title and the other where it's considered an actor name. Suggest two possible ways you could address this problem within the general parsing task (either within the grammar, during parsing, or in pre- or post-parsing processing). Indicate which of the two possible methods you think would be preferable and why. You are not required to implement either of these methods!
The requirements of your grammar and lexicon files are:
1. Use comments to distinguish the grammar rules and lexicon entries you develop from those provided in the basic grammar and lexicon files.
2. Separate the grammar rules by general type to make them easier to find.
3. Organize the rules and comment them to make it clear how rules work together.
4. Include your name or andrew ID in the comments of your grammar and lexicon files.
If you have questions, see the FAQ.