15-482 / 11-682 - Human Language Technologies (Fall 2006)
Jamie Callan
Alon Lavie
Alan Black
Due: Thursday, October 26, 2006

FALL 2006

HOMEWORK 2: Natural Language Processing



If you have questions, try the FAQ and some helpful things to know.

Introduction

Your task in this homework assignment is to extend a unification-based grammar and the lexicon it works with, in order to support the parsing of natural language questions about movies (and other media types) that appear in the IMDB database. The target output representation is in the form of semantic frames, which can then be fed into a back-end query system that retrieves the requested information from the database.

To perform this task, you will be using a parser, a morphological processing utility and lexicon access functions which are all already implemented in Lisp, and which you are not allowed to modify. A basic grammar and lexicon, which already can parse a subset of the required sentences are also provided. You may only extend or modify the grammar and the lexicon in order to parse all the given sentences in the test set as required.

The provided basic grammar and lexicon are in the following files:

movie-grammar.gra
movie.dict

For the purpose of grammar development and testing, you should use the web interface to the parser at:

http://boston.lti.cs.cmu.edu/classes/11-682/HW/HW2/Server/parse.html

The web interface supports loading your grammar and dictionary files, and then running the parser on individual sentences or the entire example set. The parser behind the web interface will load your grammar and lexicon, parse the input, and return the output, as well as an indication of whether it matches the correct parse for the sentence. A "debug" mode is also provided, to help you isolate sources of error in your grammar (see more information below).

Your grammar and lexicon will be evaluated for both coverage and ability to generalize. It's possible to write a trivial input-matching "flat" grammar that simply contains a single grammar rule for each of the possible given inputs, but this grammar obviously has no ability to generalize beyond the examples. Your solution will thus be evaluated based on correctness (correct parses are produced for the example sentences), conciseness (minimizing number of grammar rules required to cover the example set), and ability to generalize (ability to parse a small set of unseen sentences which are not provided to you but used as test questions for grading). You can check the performance of your grammar and lexicon at:

http://boston.lti.cs.cmu.edu/classes/11-682/HW/HW2/Server/test.html

The development test-set you need to develop your grammar to parse correctly is accessible via the web interface, and is also in the test.txt file.

 

Semantic Representation

The natural language input is intended to be parsed into a symbolic semantic (rather than syntactic) representation, in the form of a simple feature-structure consisting of slot-value pairs. The following is a specification of the supported semantic representation forms:

<form> := (<pairs>)
<pairs> := <pair> <pairs> | e 
<pair> := (<slot> <value>) 
<slot> := actor | director | title | year | medium | genre 
<value> := <string> | *query | *count 

A valid form contains a single slot-value pair with the value of either *query or *count, and at least one additional slot-value pair. A "query" type form is intended to query the database for values that fill the slot in question, and satisfy the remaining specified slot-value information in the form. A "count" type form is intended to query the database and return a count of the number of unique values that fill the slot in question and satisfy the remaining specified slot-value information in the form. Permuting the pairs in a list doesn't change the semantic value.

For the purposes of this assignment:
The "genre" slot can be one of: sci-fi, comedy, action, thriller, romance, drama, crime.
The "medium" slot can be one of: tv, movie.

 

Some Sample Sentences and Feature Structures

(1) What actors were in Star_Wars? 
((title "Star_Wars") (actor *query)) 

(2) What year was Jaws released? ((year *query) (title "Jaws"))

(3) What comedies starred Jim_Carrey? ((genre "comedy") (medium "movie") (actor "Jim_Carrey") (title *query))

(4) What movies was Gwyneth_Paltrow in? ((actor "Gwyneth_Paltrow") (medium "movie") (title *query))

(5) Who directed [the movie] Casablanca? ((director *query) (title "Casablanca"))

(6) Who directed [the actor] Peter_Sellers? ((director *query) (actor "Peter_Sellers"))

(7) How many movies were directed by Stanley_Kubrick? ((director "Stanley_Kubrick") (medium "movie") (title *count))

(8) How many action movies starred Arnold_Schwarzenegger? ((title *count) (genre "action") (medium "movie") (actor "Arnold_Schwarzenegger"))

(9) How many thrillers were in 1989? ((genre "thriller") (medium "movie") (year "1989") (title *count))

(10) What types of movies starred Christopher_Lee? ((genre *query) (medium "movie") (actor "Christopher_Lee"))

The Grammar and Lexicon

Your starting point is a basic grammar and lexicon that already provide some partial coverage of the natural language questions that you need to cover. The grammar is written as a semantic grammar - the non-terminals in the grammar represent semantic rather than syntactic categories, which are used to cluster and represent phrases that have a similar semantic meaning relevant to the task. For example, the rule:

(<ACT-Q> <--> (<WH-ACT> <ACT-V> in <M-NAME>)
        (((x0 actor) = *query)
        ((x0 title) = (x4 title))))
reflects the fact that we can express an "actor question" (<ACT-Q>) by an "actor-type" wh-phrase (<WH-ACT>), followed by an "actor-type" verb (<ACT-V>), the word "in", and then a movie-name (<M-NAME>). The unification operations associated with the feature constraints in the rule assign values to the feature structure associated with the left-hand-side non-terminal in the rule, as described in class.

The lexicon is accessed via a provided morphological analysis procedure, which analyzes the input word into it's root form, and then matches the root form with an entry in the lexicon. The way to use this lexical access from within the grammar is demonstrated in the following grammar rule:

(<N> <--> (%)
        ((x0 <= (parse-word (x1 value)))
        ((x0 cat) =c N)))
The "wild-card" symbol "%" matches any input word. The special unification constraint form on the second line invokes the "parse-word" function on the input word, and the result is stored in the "x0" feature-structure associated with the left-hand-side non-terminal. "parse-word" morphologically analyzes the input word, looks up the root form in the lexicon, and returns a feature structure based on the lexicon entry for the root form of the word. The second unification constraint then requires that "x0" have a feature "cat" with an assigned value of "N". For example, the above rule will succeed in parsing the input word "comedies", provided the following entry for its root form appears in the lexicon:
("comedy" N U C)

The given basic lexicon contains examples of some more complex entries that provide additional feature information about the root form of certain words.

 

Parsing Names

A special function is provided for parsing "open-class" values such as movie title names and actor names. The "get-name" function returns a string containing the all-lower-cased version of the word passed to it as a parameter. It can be used inside a grammar rule in conjunction with the "wild-card" input matching token "%". To limit the applicability of such a rule, we require names in the input to start with an upper-case letter and be single tokens. A pre-processor function identifies the Capitalization and adds a token "@CAP" into the input before the name word. This allows us to then parse movie title names with the following grammar rule:

;; movie name
(<M-NAME> <--> (@CAP %)
        (((x0 title) <= (get-name (x2 value)))))

The input word "Star_Wars" would be modified into the two-word sequence "@CAP Star_Wars", and when parsed by the above rule, the function "get-name" would return the value "star_wars".

 

Debugging Your Grammar

The parser web interface provided supports a debug mode that you will find useful to debug your grammar as you extend it for increased coverage. In debug mode, the parser prints out the rules that succeed or fail as each word of the input is parsed. Here is an example:

>(pstr "what movie starred Jim_Carrey")
WHAT
rule # 28 MOVIE-GRAMMARF-72 <WH-WORD>(1) --> WHAT
killed - rule # 21 MOVIE-GRAMMARF-53 <WH-ACT> --> <WH-WORD>(1)
killed - rule # 23 MOVIE-GRAMMARF-57 <WH-TIME> --> <WH-WORD>(1)
rule # 25 MOVIE-GRAMMARF- <WH-TITLE>f) --> <WH-WORD>(1) MOVIE
killed - rule # 18 MOVIE-GRAMMARF-61 <TITLE-V> --> MOVIE(4)
rule # 32 MOVIE-GRAMMARF-83 <N>(5) --> MOVIE
rule # 31 MOVIE-GRAMMARF-81 <WH-NP>?(6) --> <WH-WORD>(5) <N>(5)
killed - rule # 22 MOVIE-GRAMMARF-55 <WH-ACT> --> <WH-NP>(6)
killed - rule # 24 MOVIE-GRAMMARF-59 <WH-TIME> --> <WH-NP>(6)
rule # 26 MOVIE-GRAMMARF-68 <WH-TITLE>I) --> <WH-NP>(7) STARRED
rule # 18 MOVIE-GRAMMARF-47 <TITLE-V>(9) --> STARRED @CAP JIM_CARREY
rule # 20 MOVIE-GRAMMARF-51 <ACT-NAME>(12) --> @CAP JIM_CARREY
rule # 8 MOVIE-GRAMMARF-16 <TITLE-Q>(13) --> <WH-TITLE>(7) <TITLE-V>(9) <ACT-NAME>(12)
rule # 4 MOVIE-GRAMMARF-8 <SENT>(14) --> <TITLE-Q>(13) 
rule # 1 MOVIE-GRAMMARF-2 <START>(15) --> <SENT>(14)

 

Report

Your report must include your grammar and lexicon, an explanation of how the grammar rules and lexicon entries are designed, examples of constructions your rules are supposed to handle, and the answers to the following questions:

1. One problem in the grammar that you have developed is that name values are not completely converted into a canonical form in the output representation. For example, the following two input sentences would not result in identical output representations:

1) Which movies starred Jim_Carrey
2) Which movies starred James_Carrey

Without transforming names into canonical forms, the database lookup may not succeed in retrieving all correct answers for a given query. Suggest two possible ways you could address this problem within the general parsing task (either within the grammar or in pre- or post- parsing processing). Indicate which of the two possible methods you think would be preferable and why. You are not required to implement either of these methods!

2. Another problem with our grammar is that it is not very effective in handling ambiguity. For example, the sentence "who directed Star_Wars" results in two different parses, one in which "Star_Wars" is interpreted as a movie title and the other where it's considered an actor name. Suggest two possible ways you could address this problem within the general parsing task (either within the grammar, during parsing, or in pre- or post-parsing processing). Indicate which of the two possible methods you think would be preferable and why. You are not required to implement either of these methods!

The requirements of your grammar and lexicon files are:

1. Use comments to distinguish the grammar rules and lexicon entries you develop from those provided in the basic grammar and lexicon files.

2. Separate the grammar rules by general type to make them easier to find.

3. Organize the rules and comment them to make it clear how rules work together.

4. Include your name or andrew ID in the comments of your grammar and lexicon files.

 

FAQ

If you have questions, see the FAQ.


Copyright 2005, Carnegie Mellon University.
Last Updated on October 4, 2006.