The purpose of this assignment is to gain hands-on experience with several methods of frequency and co-occurrence analysis on news and web data, and with an open-source text analysis application. It consists of four major parts:
The report is an important part of your grade. Leave enough time to do a good job on it.
The first three experiments give you experience using text analytics software to learn about a new corpus that you know nothing about. The documents are Wall Street Journal articles, but they cover a timespan that you may not know much about. Your task is to learn more about the topics covered by the corpus, and then to drill down to learn more about some of the people and companies that it covers.
This assignment uses a dataset of Wall Street Journal news documents from the late 1980s and early 1990s. The documents are provided in a zip file that contains a Lucene search engine index. The dataset is large (1.6 GB), so don't wait until the last minute to download it.
Sifaka is open-source text analysis software developed by the Lemur Project that is available for Windows, Mac, and Linux operating systems. Sifaka supports a variety of search, frequency, co-occurrence, and feature vector exporting capabilities within a common GUI.
See the Sifaka tutorial to learn how the software works. However, download this version of the Sifaka Java archive (jar) file, because it is a slightly newer version. It does not matter where you store the software on your computer.
Note: Sifaka requires Java 8 to be installed on your computer. You can check your version of Java by entering the command "java -version" in a command-line window.
This is not a group project. You must do this work on your own. You may discuss the project in general terms with other students; however, you may not share data or analyses of any kind.
See the academic integrity lecture notes for more information about what is and is not allowed.
Use the Frequency tab to generate a list of the top 10,000 noun phrases, ranked by ctf and df (examine both). Identify ten topics that are covered by the Wall Street Journal. Typically a topic is supported by at least 5 phrases. Try to pick topics from different parts of the frequency spectrum, i.e., not just at the top of the list.
Repeat this analysis using one of the other entity types (person, location, or organization); you may choose the type that interests you most.
In your report, you will be asked to discuss:
The second experiment gathers information about people. You will investigate three different methods.
Use the Frequency tab to generate a list of the most frequent 10,000 people, ranked by df or ctf (your choice, informed by Experiment 1). Select a person ("Person1") that looks interesting to you; results will be better if the name contains at least two terms.
Hint: There may be a temptation to focus on famous people, companies, or events that you know about already, for example President Ronald Reagan. That seems easier, because you already know something about them. However, famous entities are associated with many events, people, and organizations, thus it will be more difficult for you (and the tools that you use) to identify important patterns. You are likely to have better results if you pick people and companies that are 'mid frequency'.
Many people discussed in the corpus were somewhat famous when the documents were written, but have become more famous as their careers have progressed. For example, Joe Biden was a Congressman from Delaware; now he is the Vice President of the United States. Investigating a person's earlier career is fine. However, if they were already very famous when they were young (e.g., Michael Jackson), it is probably better to pick someone else.
Search for documents about Person1. You can right-click on the name in the Frequency tab, but results will probably be better if you use the Search tab and enclose the name in quotes (e.g., "Joe Biden"). Read a few documents to get a sense of what this person was known for.
Use the Co-occurrence tab to find noun-phrases that co-occur with Person1. Use the name in a phrase query (i.e., use quotation marks around the name). Set the search depth to 100, the minimum frequency to 2, the number of results to 1000, and use PMI and term frequency.
Examine the results. Consider whether they reveal new questions or aspects of Person1 that you did not encounter in the documents that you sampled.
Repeat this analysis for two other people (Person2 - Person3).
Use the Co-occurrence tab to find people that co-occur with Person1. Select a name from this list (Person1,1). Use the Search tab to find information about the relationship between Person1 and Person1,1 (e.g., using a query such as "Clark Kent" AND "Lois Lane"). Read a few documents to learn something about the relationship between these two people.
Repeat this process for four other people associated with Person1 (i.e., Person1,2 - Person1,5).
Do the same analysis for Person2 and Person3 that was done for Person1.
Use the process developed for studying associated people to study associated organizations. Cover at least five organizations associated with each of Person1 - Person3.
Repeat the three types of analysis done in Experiment 2 for three organizations (Organization1 - Organization3).
Use Google to collect frequency information and calculate PMI and phi-square values for the following pairs of entities.
Use the number of matching documents provided by Google to calculate PMI and phi-square for each pair of entities. When calculating PMI, assume that the size of the English Web is 75 billion documents.
To help you debug your calculations, sample results are shown below.
Freq ("Bruce Wayne") | Freq ("Jason Todd") | Freq ("Bruce Wayne" AND "Jason Todd") | PMI | Phi-square |
937,000 | 722,000 | 558,000 | 4.791418674 | 0.460242982 |
Don't worry if your Google frequencies are different from ours. Just make sure that your calculations of PMI and Phi-square are correct.
Note: These values were obtained using the log10 function. Use log10 instead of ln.
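If you want a quick check on your spreadsheet or script, the short Python sketch below shows one common way to compute both measures from Google hit counts, using log10 for PMI and a standard 2x2 contingency table for phi-square. The function name and structure here are only illustrative, not part of the assignment, but the sketch reproduces the sample row above.

    import math

    # Assumed size of the English web, as stated in the assignment.
    N = 75e9

    def pmi_and_phi_square(freq_x, freq_y, freq_xy, n=N):
        """PMI (base 10) and phi-square from matching-document counts."""
        # PMI = log10( P(x,y) / (P(x) P(y)) ) = log10( freq_xy * n / (freq_x * freq_y) )
        pmi = math.log10(freq_xy * n / (freq_x * freq_y))

        # Phi-square from the 2x2 contingency table of document counts.
        n11 = freq_xy              # documents matching both phrases
        n10 = freq_x - freq_xy     # x but not y
        n01 = freq_y - freq_xy     # y but not x
        n00 = n - n11 - n10 - n01  # neither phrase
        phi_square = (n11 * n00 - n10 * n01) ** 2 / (
            (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
        return pmi, phi_square

    # Reproduces the sample row above: PMI is about 4.7914, phi-square about 0.4602.
    print(pmi_and_phi_square(937000, 722000, 558000))

If your numbers differ slightly, check that you are using log10 (not ln) and that you subtracted the joint count when filling in the contingency table.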
For three of the people (Personi) studied in Experiment 2, submit their name as a phrase query to Google (e.g., "Jamie Callan"). Select a document at random from the 1st, 4th, 8th, 12th, 16th, and 20th search result page (i.e., 6 documents that are sampled relatively uniformly from the top 200 documents). Select the names of 1-2 people from each sampled document to create a list of 5 names that are associated with your Personi. Each of these individuals is the Personi,j in a new experiment.
For each Personi, submit the following three queries to Google.
Use the number of matching documents provided by Google to calculate PMI and phi-square for the 5 people associated with each Personi. When calculating PMI, assume that the size of the English Web is 75 billion documents.
You must submit a report. If you are unable to access Blackboard, you may submit your files to the TAs by email; however, this option is intended as a last resort. Use Blackboard if at all possible.
A report template is provided in Microsoft Word and pdf formats. Your report must follow this template, and be in pdf format.
Name your report AndrewID-HW1.pdf or AndrewID-HW1.docx.
There may be different entities that have the same name, for example, "Michael Jordan" the basketball player and "Michael Jordan" the machine learning professor. How should I handle this?
Typically this problem is ignored. PMI and phi-square help you understand whether two strings (e.g., "Michael Jordan", "Bugs Bunny") occur together more often than you would expect from chance alone. It is the human analyst's job to explain why that co-occurrence is observed, or whether it is a meaningful or informative co-occurrence.
Having multiple entities with the same name can make significant co-occurrences more difficult to observe. For example, if one "Michael Jordan" entity is very common, then it is not surprising that the string "Michael Jordan" co-occurs with almost everything. Thus, it becomes a little harder to recognize important but low-frequency co-occurrences that involve one of the less famous "Michael Jordan" entities.
I can open an index with Sifaka and see index properties, but when I try to generate a frequency list, it takes forever (e.g., more than an hour). I have Java version 8 installed.
Perhaps the JVM is not allocating enough memory to the process. If you have a 32-bit version of Java, try upgrading to a 64-bit version of Java. Use Java command-line parameters to allocate at least 1 gigabyte of RAM to Sifaka. If your laptop has a lot of RAM, try allocating 2 gigabytes to Sifaka.
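For example, if the jar file you downloaded is named sifaka.jar (substitute the actual file name), a command along these lines starts Sifaka with a 2 gigabyte heap:

    java -Xmx2g -jar sifaka.jar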