Text Analytics:
95-865 (A)

HW1: Frequency and Co-occurrence
Due Apr 4, 11:59pm

The purpose of this assignment is to gain hands-on experience with several ways of doing frequency and co-occurrence analysis using news and web data, and with an open-source text analysis application. It consists of four major parts:

  1. Install the Sifaka text analysis application on your computer;
  2. Use Sifaka to conduct three frequency and co-occurrence analysis experiments;
  3. Use Google to conduct two co-occurrence analysis experiments; and
  4. Write a report that discusses your findings and experience with this assignment.

The report is an important part of your grade. Leave enough time to do a good job on it.



The first three experiments give you experience using text analytics software to learn about a new corpus that you know nothing about. The documents are Wall Street Journal articles, but they cover a timespan that you may not know much about. Your task is to learn more about the topics covered by the corpus, and then to drill down to learn more about some of the people and companies that it covers.

This assignment uses a dataset of Wall Street Journal news documents from the late 1980s and early 1990s. The documents are provided in a zip file that contains a Lucene search engine index. The dataset is large (1.6 GB), so don't wait until the last minute to download it.



Sifaka is open-source text analysis software developed by the Lemur Project that is available for Windows, Mac, and Linux operating systems. Sifaka supports a variety of search, frequency, co-occurrence, and feature vector exporting capabilities within a common GUI.

See the Sifaka Tutorial to learn how the software works. However, download this version of the Sifaka Java archive (jar) file, because it is a slightly newer version. It does not matter where you store the software on your computer.

Note: Sifaka requires Java 8 to be installed on your computer. You can check your version of Java by entering the command "java -version" in a command-line window.


Academic Integrity

This is not a group project. You must do this work on your own. You may discuss the project in general terms with other students; however, you may not share data or analyses of any kind.

See the academic integrity lecture notes for more information about what is and is not allowed.


Experiment 1: Topic Investigation

Use the Frequency tab to generate a list of the top 10,000 noun phrases, ranked by ctf (collection term frequency) and by df (document frequency); examine both rankings. Identify ten topics that are covered by the Wall Street Journal. Typically a topic is supported by at least 5 phrases. Try to pick topics from different parts of the frequency spectrum, i.e., not just at the top of the list.

Repeat this analysis using one of the other entity types (person, location, or organization); you may choose the type that interests you most.

In your report, you will be asked to discuss your findings from this experiment; see the report template for the specific questions.


Experiment 2: People Investigation

The second experiment gathers information about people. You will investigate three different methods.

2.1: Associated Topics

Use the Frequency tab to generate a list of the 10,000 most frequent people, ranked by df or ctf (your choice, informed by Experiment 1). Select a person ("Person1") who looks interesting to you; results will be better if the name contains at least two terms.

Hint: There may be a temptation to focus on famous people, companies, or events that you already know about, for example President Ronald Reagan. That seems easier, because you already know something about them. However, famous entities are associated with many events, people, and organizations, so it will be more difficult for you (and the tools that you use) to identify important patterns. You are likely to have better results if you pick people and companies that are mid-frequency.

Many people discussed in the corpus were somewhat famous when the documents were written, but have become more famous as their careers have progressed. For example, Joe Biden was a Congressman from Delaware; now he is the Vice President of the United States. Investigating a person's earlier career is fine. However, if they were already very famous when they were young (e.g., Michael Jackson), it is probably better to pick someone else.

Search for documents about Person1. You can right-click on the name in the Frequency tab, but results will probably be better if you use the Search tab and enclose the name in quotes (e.g., "Joe Biden"). Read a few documents to get a sense of what this person was known for.

Use the Co-occurrence tab to find noun phrases that co-occur with Person1. Use the name in a phrase query (i.e., use quotation marks around the name). Set the search depth to 100, the minimum frequency to 2, the number of results to 1000, and use PMI and term frequency.

Examine the results. Consider whether they reveal new questions or aspects of Person1 that you did not encounter in the documents that you sampled.

Repeat this analysis for two other people (Person2 - Person3).

2.2: Associated People

Use the Co-occurrence tab to find people that co-occur with Person1. Select a name from this list (Person1,1). Use the Search tab to find information about the relationship between Person1 and Person1,1 (e.g., using a query such as "Clark Kent" AND "Lois Lane"). Read a few documents to learn something about the relationship between these two people.

Repeat this process for four other people associated with Person1 (i.e., Person1,2 - Person1,5).

Do the same analysis for Person2 and Person3 that was done for Person1.

2.3: Associated Organizations

Use the process developed for studying associated people to study associated organizations. Cover at least five organizations associated with each of Person1 - Person3.


Experiment 3: Organization Investigation

Repeat the three types of analysis done in Experiment 2 for three organizations (Organization1 - Organization3).


Experiment 4: Using a Search Engine to Calculate PMI

Use Google to collect frequency information and calculate PMI and phi-square values for the following pairs of entities.

Use the number of matching documents provided by Google to calculate PMI and phi-square for each pair of entities. When calculating PMI, assume that the size of the English Web is 75 billion documents.

To help you debug your calculations, sample results are shown below.

  Freq("Bruce Wayne")   Freq("Jason Todd")   Freq("Bruce Wayne" AND "Jason Todd")   PMI           Phi-square
  937,000               722,000              558,000                                4.791418674   0.460242982

Don't worry if your Google frequencies are different from ours. Just make sure that your calculations of PMI and Phi-square are correct.

Note: These values were obtained using the log10 function. Use log10 instead of ln.
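As a sanity check for your calculations, the sample row above can be reproduced with a few lines of Python, using N = 75 billion documents and base-10 logarithms as specified:

```python
from math import log10

N = 75e9  # assumed size of the English Web, per the assignment

# Google hit counts from the sample results above
f_x  = 937_000   # Freq("Bruce Wayne")
f_y  = 722_000   # Freq("Jason Todd")
f_xy = 558_000   # Freq("Bruce Wayne" AND "Jason Todd")

# Convert raw document counts to probabilities
p_x, p_y, p_xy = f_x / N, f_y / N, f_xy / N

# PMI with log base 10, as the assignment specifies (not ln)
pmi = log10(p_xy / (p_x * p_y))

# Phi-square: squared difference between the observed and expected
# co-occurrence probability, normalized by the marginals
phi2 = (p_xy - p_x * p_y) ** 2 / (p_x * p_y * (1 - p_x) * (1 - p_y))

print(pmi)   # approximately 4.791418674
print(phi2)  # approximately 0.460242982
```

If your own code reproduces these two values for the sample counts, you can be reasonably confident that your PMI and phi-square formulas are correct before plugging in your own Google frequencies.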


Experiment 5: People Investigation Using Google

For three of the people (Personi) studied in Experiment 2, submit their name as a phrase query to Google (e.g., "Jamie Callan"). Select a document at random from the 1st, 4th, 8th, 12th, 16th, and 20th search result page (i.e., 6 documents that are sampled relatively uniformly from the top 200 documents). Select the names of 1-2 people from each sampled document to create a list of 5 names that are associated with your Personi. Each of these individuals is the Personi,j in a new experiment.

For each Personi, submit the following three queries to Google.

Use the number of matching documents provided by Google to calculate PMI and phi-square for the 5 people associated with each Personi. When calculating PMI, assume that the size of the English Web is 75 billion documents.
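The PMI and phi-square calculations here are the same as in Experiment 4, applied once per associated person. A minimal sketch that batches them over several pairs (the hit counts below are hypothetical placeholders, not real Google results; replace them with the counts from your three queries):

```python
from math import log10

N = 75e9  # assumed size of the English Web, per the assignment

def pmi(f_x, f_y, f_xy, n=N):
    # Pointwise mutual information with base-10 logs, as the assignment specifies.
    return log10((f_xy * n) / (f_x * f_y))

def phi_square(f_x, f_y, f_xy, n=N):
    # Squared difference between observed and expected co-occurrence
    # probability, normalized by the marginal probabilities.
    p_x, p_y, p_xy = f_x / n, f_y / n, f_xy / n
    return (p_xy - p_x * p_y) ** 2 / (p_x * p_y * (1 - p_x) * (1 - p_y))

# Hypothetical hit counts: (Freq(Person_i), Freq(Person_i,j), Freq(both)).
pairs = {
    ("Person1", "Person1,1"): (500_000, 300_000, 40_000),
    ("Person1", "Person1,2"): (500_000, 90_000, 1_200),
}

for (a, b), (f_x, f_y, f_xy) in pairs.items():
    print(f"{a} / {b}: PMI={pmi(f_x, f_y, f_xy):.3f}, "
          f"phi-square={phi_square(f_x, f_y, f_xy):.3e}")
```

Tabulating all five pairs this way makes it easy to compare how strongly each Person_i,j is associated with Person_i under the two measures.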


What to Submit

You must submit a report. If you are unable to access Blackboard, you may submit your files to the TAs by email; however, this option is intended only as a last resort. Use Blackboard if at all possible.

A report template is provided in Microsoft Word and PDF formats. Your report must follow this template, and be in PDF format.

Name your report AndrewID-HW1.pdf or AndrewID-HW1.docx.




Copyright 2016, Carnegie Mellon University.
Updated on March 31, 2017
Jamie Callan