15-482 / 11-682 - Human Language Technologies (Fall 2006)
Jamie Callan, Alon Lavie, Alan Black
Due: Sunday, Nov. 19, 2006
Your task in this assignment is to develop and implement basic algorithms for word-aligning a given sentence-aligned parallel corpus between French and English, and for extracting a bilingual word translation lexicon from the word-aligned corpus. The quality of the bilingual lexicon that you develop will be evaluated by automatically comparing it to a given "gold-standard" high-quality lexicon.
Implement a basic word-alignment algorithm and an algorithm for extracting
a word translation lexicon from the sentence-aligned corpus.
Training: Your developed module will be trained on a corpus of sentence-aligned
parallel text. The corpus is a collection of pairs of files, where in each pair,
one file contains sentences of language L1 and the other file contains the corresponding
sentences for language L2. Each sentence starts on a new line. For each pair
of corresponding files, the number of lines is the same.
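The file-pair format described above can be read one sentence pair at a time, so that the whole corpus never has to sit in memory. The following is a minimal sketch, not a required design: the function name read_sentence_pairs, the latin-1 default encoding, the lowercasing, and the whitespace tokenization are all assumptions made for illustration.

```python
from itertools import zip_longest
from typing import Iterator, List, Tuple


def read_sentence_pairs(
    l1_path: str, l2_path: str, encoding: str = "latin-1"
) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (L1 tokens, L2 tokens) for each pair of corresponding lines.

    Assumptions (not stated in the assignment): the files are latin-1
    encoded, tokens are separated by whitespace, and case is ignored.
    """
    with open(l1_path, encoding=encoding) as f1, open(l2_path, encoding=encoding) as f2:
        # zip_longest pads the shorter file with None, which lets us detect
        # a pair of files whose line counts differ.
        for line_no, (s1, s2) in enumerate(zip_longest(f1, f2), start=1):
            if s1 is None or s2 is None:
                raise ValueError(f"line count mismatch at line {line_no}")
            yield s1.lower().split(), s2.lower().split()
```

Because the function is a generator, a caller can loop over several file pairs in turn and feed every sentence pair to the same counting code.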
Training Output: A bilingual word translation lexicon. The format of the lexicon
file should be one "entry" per line, where each entry consists of
a source-language word, a "tab", followed by a target language word
or multi-word translation. If a word has multiple possible translations, each
should appear on a separate line.
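As an illustration of the lexicon file format just described, here is a small sketch that writes and reads such a file. The names write_lexicon and read_lexicon and the utf-8 default encoding are hypothetical choices; only the one-entry-per-line, tab-separated layout comes from the assignment.

```python
from typing import Iterable, Set, Tuple

Entry = Tuple[str, str]  # (source-language word, target-language translation)


def write_lexicon(entries: Iterable[Entry], path: str, encoding: str = "utf-8") -> None:
    """Write one 'source<TAB>target' entry per line, without duplicates."""
    with open(path, "w", encoding=encoding) as out:
        for source, target in sorted(set(entries)):
            out.write(f"{source}\t{target}\n")


def read_lexicon(path: str, encoding: str = "utf-8") -> Set[Entry]:
    """Read a lexicon file back into a set of (source, target) entries."""
    entries: Set[Entry] = set()
    with open(path, encoding=encoding) as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            # Split on the first tab only, so a multi-word translation
            # stays together as a single target string.
            source, target = line.split("\t", 1)
            entries.add((source, target))
    return entries
```

A word with several translations simply appears as several entries with the same source word, each on its own line.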
Retrieve your training data from the following URL:
http://www.isi.edu/natural-language/download/hansard/index.html
Download the following training and testing corpora:
Training: The Senate Debates Training Set (182K sentence pairs)
Testing: The Senate Debates Testing Sets (25K sentence pairs)
Combine these two sets into a single bilingual corpus. This is your training corpus.
Train your algorithms on your training corpus, and use that information to create a (large) French-to-English bilingual word translation lexicon.
Note: You are not expected to develop a very complex word-alignment
algorithm for this assignment. You may wish to start with a simple
approach based on counts (or relative counts) of French/English word
pairs that co-occur in the parallel sentences. It is important,
however, that you develop an implementation that can handle the large
amounts of training data, and that you find effective solutions to
the memory and runtime issues you encounter when handling such large
amounts of data.
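One possible instance of the count-based approach mentioned in the note is sketched below. It scores each French/English pair with the Dice coefficient over sentence-level co-occurrence counts and keeps the best-scoring English word(s) for each French word. This is only an illustrative sketch: the choice of the Dice coefficient, the function name build_lexicon, and the parameters min_count and top_k are assumptions, not part of the assignment.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

SentencePair = Tuple[List[str], List[str]]  # (French tokens, English tokens)


def build_lexicon(
    pairs: Iterable[SentencePair], min_count: int = 2, top_k: int = 1
) -> List[Tuple[str, str]]:
    """Return (French, English) entries ranked by Dice co-occurrence score."""
    f_count: Counter = Counter()   # sentences containing each French word
    e_count: Counter = Counter()   # sentences containing each English word
    fe_count: Counter = Counter()  # sentence pairs containing both words

    for f_tokens, e_tokens in pairs:
        # Count each word at most once per sentence.
        f_set, e_set = set(f_tokens), set(e_tokens)
        f_count.update(f_set)
        e_count.update(e_set)
        fe_count.update((f, e) for f in f_set for e in e_set)

    # Group scored candidates by French word, dropping rare pairs.
    candidates: Dict[str, List[Tuple[float, str]]] = {}
    for (f, e), c in fe_count.items():
        if c < min_count:
            continue
        dice = 2.0 * c / (f_count[f] + e_count[e])
        candidates.setdefault(f, []).append((dice, e))

    lexicon: List[Tuple[str, str]] = []
    for f, scored in candidates.items():
        # Highest score first; ties broken alphabetically for determinism.
        scored.sort(key=lambda item: (-item[0], item[1]))
        lexicon.extend((f, e) for _, e in scored[:top_k])
    return sorted(lexicon)
```

Note that the pair table fe_count is the memory hot spot in this sketch, since every sentence pair contributes one entry per French/English word combination. On the full corpus you will likely need some remedy of your own, for example restricting the vocabulary or pruning low-count pairs periodically.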
Write a program that can extract from your bilingual lexicon the subset of entries whose French word appears in a given list of words. The words will be provided in a text file, one word per line.
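The extraction step can be sketched as a simple streaming filter over the lexicon lines. The function names load_word_list and filter_lexicon are hypothetical, and the sketch assumes the tab-separated lexicon format described earlier, with the French word in the first column.

```python
from typing import Iterable, Iterator, Set


def load_word_list(lines: Iterable[str]) -> Set[str]:
    """Collect the requested words (one per line), ignoring blank lines."""
    return {line.strip() for line in lines if line.strip()}


def filter_lexicon(lexicon_lines: Iterable[str], wanted: Set[str]) -> Iterator[str]:
    """Yield only the lexicon lines whose French word is in `wanted`."""
    for line in lexicon_lines:
        # The French word is everything before the first tab.
        source = line.split("\t", 1)[0]
        if source in wanted:
            yield line
```

Both functions accept any iterable of lines, so open file objects can be passed in directly and the lexicon is never loaded into memory as a whole.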
In order to compute a quality score for your lexicon, write a program that
given a "gold-standard" lexicon, calculates aggregate Precision, Recall
and F1 measures for your lexicon. The measures are to be calculated as follows:
Calculate the total number of entries in your lexicon - CL
Calculate the total number of entries in the "gold-standard" lexicon - CG
Calculate the total number of entries that are identical in both lexicons - C
Precision = C/CL ; Recall = C/CG ; F1 = 2*P*R/(P+R), where P is Precision and R is Recall
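The three measures can be computed directly from two sets of entries, as in the sketch below. The function name score_lexicon is hypothetical, and returning 0.0 when a denominator is zero is an assumption, since the assignment does not say how to handle an empty lexicon.

```python
from typing import Set, Tuple

Entry = Tuple[str, str]  # (French word, English translation)


def score_lexicon(mine: Set[Entry], gold: Set[Entry]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) of `mine` against `gold`."""
    c = len(mine & gold)  # C: entries identical in both lexicons
    precision = c / len(mine) if mine else 0.0  # C / CL
    recall = c / len(gold) if gold else 0.0     # C / CG
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, with CL = 4, CG = 3 and C = 2, this gives Precision = 2/4 = 0.5, Recall = 2/3, and F1 = 2 * 0.5 * (2/3) / (0.5 + 2/3) = 4/7, or about 0.571.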
A devtest set "gold-standard" lexicon (with answers) will be provided to you, and the final test set of French words alone will be provided to you one week before the homework is due (Nov 7).
Use your generated lexicon to produce a lexicon equivalent to the devtest gold-standard (tab-separated French and English words). This is what you will use to calculate the precision, recall, and F1 measures. You may want to first extract the French words from the devtest lexicon to do this step.
Use your generated lexicon to create a bilingual lexicon for the list of provided French words (tab-separated French and English words).
Finally, submit all of the above, along with a write-up explaining in detail how you did the work (and any intermediate results you found, e.g. on the test set), as well as design decisions you made, and how you dealt with the challenges of a large data set.
You should submit the following items:
1. Your source code.
2. Your extracted bilingual word translation lexicon.
3. Your write-up explaining what you did, your reasons for doing things the way you did, how you dealt with the large data set, along with your precision, recall and F1 results on the devtest lexicon.
There are also some hints for how to approach this assignment if you're having trouble.