15-482 / 11-682 - Human Language Technologies (Fall 2006)
Jamie Callan
Alon Lavie
Alan Black
Due: Sunday, Nov. 19, 2006


HOMEWORK 3: Word Alignment and Bilingual Translation Lexicon Construction



Introduction

Your task in this assignment is to develop and implement basic algorithms for word-aligning a given sentence-aligned parallel corpus between French and English, and for extracting a bilingual word translation lexicon from the word-aligned corpus. The quality of the bilingual lexicon that you develop will be evaluated by automatically comparing it to a given "gold-standard" high-quality lexicon.

Detailed Instructions

  1. Implement a basic word-alignment algorithm and an algorithm for extracting a word translation lexicon from the sentence-aligned corpus.
    Training: Your developed module will be trained on a corpus of sentence-aligned parallel text. The corpus is a collection of pairs of files, where in each pair, one file contains sentences of language L1 and the other file contains the corresponding sentences for language L2. Each sentence starts on a new line. For each pair of corresponding files, the number of lines is the same.
    Training Output: A bilingual word translation lexicon. The format of the lexicon file should be one "entry" per line, where each entry consists of a source-language word, a "tab", followed by a target language word or multi-word translation. If a word has multiple possible translations, each should appear on a separate line.
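
The input and output formats described in item 1 might be handled along the following lines. This is only a minimal sketch: the function names read_parallel and write_lexicon are hypothetical, and whitespace tokenization is an assumption that the assignment does not prescribe.

```python
import sys
from itertools import zip_longest


def read_parallel(src_lines, tgt_lines):
    """Yield (source_tokens, target_tokens) for each pair of aligned lines.

    Both arguments are iterables of lines (e.g. open file objects).
    Tokenization by whitespace is an assumption, not a requirement.
    """
    pairs = zip_longest(src_lines, tgt_lines)
    for line_no, (src, tgt) in enumerate(pairs, start=1):
        if src is None or tgt is None:
            # The handout states that paired files have the same number of lines.
            raise ValueError(f"line count mismatch at line {line_no}")
        yield src.split(), tgt.split()


def write_lexicon(entries, out):
    """Write one entry per line: source word, a tab, then the translation.

    A word with several translations simply appears on several lines.
    """
    for src_word, translation in sorted(entries):
        out.write(f"{src_word}\t{translation}\n")


if __name__ == "__main__":
    # Toy data for illustration only.
    write_lexicon([("chat", "cat"), ("chien", "dog")], sys.stdout)
```

Passing open file objects instead of lists keeps only one line pair in memory at a time, which may matter for a corpus of this size.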

  2. Retrieve your training data from the following URL:
    http://www.isi.edu/natural-language/download/hansard/index.html

  3. Obtain the training data:
    1. Download the following training and testing corpora:
      Training: The Senate Debates Training Set (182K sentence pairs)
      Testing: The Senate Debates Testing Sets (25K sentence pairs)

    2. Combine these two sets into a single bilingual corpus. This is your training corpus.

  4. Train your algorithms on your training corpus, and use that information to create a (large) French-to-English bilingual word translation lexicon.
    Note: You are not expected to develop a very complex word-alignment algorithm for this assignment. You may wish to start with a simple approach based on counts (or relative counts) of French/English word pairs that co-occur in the parallel sentences. It is important, however, that you develop an implementation that can handle the large amounts of training data, and that you find effective solutions to the memory and runtime issues you encounter when handling such large amounts of data.
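
One possible count-based starting point of the kind the note mentions is sketched below. It scores each French/English pair with the Dice coefficient over sentence-level co-occurrence counts and keeps the best-scoring English word per French word. The choice of Dice, the function name cooccurrence_lexicon, and the min_pair_count threshold are all assumptions made for illustration, not the required method.

```python
from collections import Counter
from itertools import product


def cooccurrence_lexicon(sentence_pairs, min_pair_count=2):
    """Build a {french_word: english_word} lexicon from co-occurrence counts.

    sentence_pairs yields (french_tokens, english_tokens) tuples.
    Pairs seen fewer than min_pair_count times are ignored (an assumed
    pruning heuristic that also discards much of the noise).
    """
    f_count, e_count, pair_count = Counter(), Counter(), Counter()
    for f_tokens, e_tokens in sentence_pairs:
        # Use sets so each word counts once per sentence pair.
        f_set, e_set = set(f_tokens), set(e_tokens)
        f_count.update(f_set)
        e_count.update(e_set)
        pair_count.update(product(f_set, e_set))

    best = {}  # french word -> (english word, dice score)
    for (f, e), n in pair_count.items():
        if n < min_pair_count:
            continue
        dice = 2.0 * n / (f_count[f] + e_count[e])
        if f not in best or dice > best[f][1]:
            best[f] = (e, dice)
    return {f: e for f, (e, _) in best.items()}
```

Note that pair_count grows with the number of distinct word pairs, so on the full corpus it may not fit comfortably in memory; dropping very rare or very frequent words before counting, or counting in several passes over subsets of the vocabulary, are options worth considering.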

  5. Write a program that can extract from your bilingual lexicon the subset of entries consisting of only the French words that appear in a given list of words. The words will be provided in a text file, one word per line.
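
The extraction step in item 5 could look roughly like this. The function name filter_lexicon is hypothetical, and the sketch assumes the tab-separated lexicon format described in item 1.

```python
def filter_lexicon(lexicon_lines, word_list_lines):
    """Return the lexicon lines whose French (first) field is in the word list.

    lexicon_lines: iterable of "french<TAB>english" lines.
    word_list_lines: iterable of lines, one French word per line.
    """
    wanted = {w.strip() for w in word_list_lines if w.strip()}
    kept = []
    for line in lexicon_lines:
        french = line.split("\t", 1)[0].strip()
        if french in wanted:
            kept.append(line)
    return kept
```

Holding the word list in a set makes each lookup constant time, so the lexicon can be streamed through in a single pass.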

  6. In order to compute a quality score for your lexicon, write a program that, given a "gold-standard" lexicon, calculates aggregate Precision, Recall and F1 measures for your lexicon. The measures are to be calculated as follows:
    Calculate the total number of entries in your lexicon - CL
    Calculate the total number of entries in the "gold-standard" lexicon - CG
    Calculate the total number of entries that are identical in both lexicons - C
    Precision (P) = C/CL ; Recall (R) = C/CG ; F1 = 2*P*R/(P+R)
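
The calculation in item 6 can be sketched as follows. The names load_entries and score are hypothetical. Treating each lexicon as a set of (source, translation) pairs is an assumption: it means duplicate lines are counted once, and "identical" is taken to mean an exact string match on both fields.

```python
def load_entries(lines):
    """Parse "source<TAB>translation" lines into a set of (source, translation)."""
    entries = set()
    for line in lines:
        line = line.rstrip("\n")
        if "\t" not in line:
            continue  # skip blank or malformed lines
        src, tgt = line.split("\t", 1)
        entries.add((src.strip(), tgt.strip()))
    return entries


def score(mine, gold):
    """Return (precision, recall, f1) for two sets of lexicon entries."""
    c = len(mine & gold)   # C: entries identical in both lexicons
    cl = len(mine)         # CL: entries in your lexicon
    cg = len(gold)         # CG: entries in the gold-standard lexicon
    p = c / cl if cl else 0.0
    r = c / cg if cg else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

For example, a lexicon of 4 entries compared with a gold standard of 2 entries, with 1 entry in common, gives P = 1/4, R = 1/2 and F1 = 2*(1/4)*(1/2)/(3/4) = 1/3.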

  7. Evaluate your lexicon:
    1. A devtest set "gold-standard" lexicon (with answers) will be provided to you, and the final test set of French words alone will be provided to you one week before the homework is due (Nov 7).

    2. Use your generated lexicon to produce an equivalent lexicon to the devtest gold-standard (tab-separated French and English words). This is what you will use to calculate the precision, recall, and F1 measures. You may want to first extract the French words from the devtest lexicon to do this step.

    3. Use your generated lexicon to create a bilingual lexicon for the list of provided French words (tab-separated French and English words).

    4. Finally, submit all of the above, along with a write-up explaining in detail how you did the work (and any intermediate results you found, e.g. on the test set), as well as design decisions you made, and how you dealt with the challenges of a large data set.

General Instructions and Comments

1. This assignment is to be performed individually.
2. You may implement your algorithms in any programming language/environment of your choosing.
3. Do not share your code with others or use code developed by others.
4. Do not use any external bilingual resources other than the training corpus provided.
5. Consult with the instructor(s) in case of any doubts about these instructions.
6. Your grade on the assignment will be weighted as: 80% based on successfully completing the assignment, 20% based on the F1 score of your lexicon.

You should submit the following items:
1. Your source code.
2. Your extracted bilingual word translation lexicon.
3. Your write-up explaining what you did, your reasons for doing things the way you did, and how you dealt with the large data set, along with your precision, recall and F1 results on the devtest lexicon.


There are also some hints for how to approach this assignment if you're having trouble.


Copyright 2006, Carnegie Mellon University.