The purpose of this assignment is to gain hands-on experience with several methods of frequency and co-occurrence analysis on news and web data, and with an open-source text analysis application. It consists of four major parts:
The report is an important part of your grade. Leave enough time to do a good job on it.
The first three experiments give you experience using text analytics software to learn about a new corpus that you know nothing about. The documents are Wall Street Journal articles, but they cover a timespan that you may not know much about. Your task is to learn more about the topics covered by the corpus, and then to drill down to learn more about some of the people and companies that it covers.
This assignment uses a dataset of Wall Street Journal news documents from the late 1980s and early 1990s. The documents are provided in a zip file that contains a Lucene search engine index. The dataset is large (1.6 GB), so don't wait until the last minute to download it.
Sifaka is open-source text analysis software developed by the Lemur Project that is available for Windows, Mac, and Linux operating systems. Sifaka supports a variety of search, frequency, co-occurrence, and feature vector exporting capabilities within a common GUI.
See the Sifaka tutorial to learn how the software works. However, download this version of the Sifaka Java archive (jar) file, because it is a slightly newer version. It does not matter where you store the software on your computer.
Note: Sifaka requires Java 8 to be installed on your computer. You can check your version of Java by entering the command "java -version" in a command-line window.
This is not a group project. You must do this work on your own. You may discuss the project in general terms with other students; however, you may not share data or analyses of any kind.
See the academic integrity lecture notes for more information about what is and is not allowed.
Use the Frequency tab to generate a list of the top 10,000 noun phrases, ranked by ctf and df (examine both). Identify ten topics that are covered by the Wall Street Journal. Typically a topic is supported by at least 5 phrases. Try to pick topics from different parts of the frequency spectrum, i.e., not just at the top of the list.
Repeat this analysis using one of the other entity types (person, location, or organization); you may choose the type that interests you most.
In your report, you will be asked to discuss:
The second experiment gathers information about people. You will investigate three different methods.
Use the Frequency tab to generate a list of the most frequent 10,000 people, ranked by df or ctf (your choice, informed by Experiment 1). Select a person ("Person1") that looks interesting to you; results will be better if the name contains at least two terms.
Hint: There may be a temptation to focus on famous people, companies, or events that you know about already, for example President Ronald Reagan. That seems easier, because you already know something about them. However, famous entities are associated with many events, people, and organizations, thus it will be more difficult for you (and the tools that you use) to identify important patterns. You are likely to have better results if you pick people and companies that are 'mid frequency'.
Many people discussed in the corpus were somewhat famous when the documents were written, but have become more famous as their careers have progressed. For example, Joe Biden was a Congressman from Delaware; now he is the Vice President of the United States. Investigating a person's earlier career is fine. However, if they were already very famous when they were young (e.g., Michael Jackson), it is probably better to pick someone else.
Search for documents about Person1. You can right-click on the name in the Frequency tab, but results will probably be better if you use the Search tab and enclose the name in quotes (e.g., "Joe Biden"). Read a few documents to get a sense of what this person was known for.
Use the Co-occurrence tab to find noun-phrases that co-occur with Person1. Use the name in a phrase query (i.e., use quotation marks around the name). Set the search depth to 100, the minimum frequency to 2, the number of results to 1000, and use PMI and term frequency.
Examine the results. Consider whether they reveal new questions or aspects of Person1 that you did not encounter in the documents that you sampled.
Repeat this analysis for two other people (Person2 - Person3).
Use the Co-occurrence tab to find people that co-occur with Person1. Select a name from this list (Person1,1). Use the Search tab to find information about the relationship between Person1 and Person1,1 (e.g., using a query such as "Clark Kent" AND "Lois Lane"). Read a few documents to learn something about the relationship between these two people.
Repeat this process for four other people associated with Person1 (i.e., Person1,2 - Person1,5).
Do the same analysis for Person2 and Person3 that was done for Person1.
Use the process developed for studying associated people to study associated organizations. Cover at least five organizations associated with each of Person1 - Person3.
Repeat the three types of analysis done in Experiment 2 for three organizations (Organization1 - Organization3).
Use Google to collect frequency information and calculate PMI and phi-square values for the following pairs of entities.
Use the number of matching documents provided by Google to calculate PMI and phi-square for each pair of entities. When calculating PMI, assume that the size of the English Web is 75 billion documents.
To help you debug your calculations, sample results are shown below.
Freq ("Bruce Wayne") | Freq ("Jason Todd") | Freq ("Bruce Wayne" AND "Jason Todd") | PMI | Phi-square |
937,000 | 722,000 | 558,000 | 4.791418674 | 0.460242982 |
Don't worry if your Google frequencies are different from ours. Just make sure that your calculations of PMI and Phi-square are correct.
Note: These values were obtained using the log10 function. Use log10 instead of ln.
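If you want a quick check on your spreadsheet or script, the short Python sketch below shows one common way to compute both measures from Google hit counts, using log10 for PMI and a standard 2x2 contingency table for phi-square. The function name and structure here are only illustrative, not part of the assignment, but the sketch reproduces the sample row above.

    import math

    # Assumed size of the English web, as stated in the assignment.
    N = 75e9

    def pmi_and_phi_square(freq_x, freq_y, freq_xy, n=N):
        """PMI (base 10) and phi-square from matching-document counts."""
        # PMI = log10( P(x,y) / (P(x) P(y)) ) = log10( freq_xy * n / (freq_x * freq_y) )
        pmi = math.log10(freq_xy * n / (freq_x * freq_y))

        # Phi-square from the 2x2 contingency table of document counts.
        n11 = freq_xy              # documents matching both phrases
        n10 = freq_x - freq_xy     # x but not y
        n01 = freq_y - freq_xy     # y but not x
        n00 = n - n11 - n10 - n01  # neither phrase
        phi_square = (n11 * n00 - n10 * n01) ** 2 / (
            (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
        return pmi, phi_square

    # Reproduces the sample row above: PMI is about 4.7914, phi-square about 0.4602.
    print(pmi_and_phi_square(937000, 722000, 558000))

If your numbers differ slightly, check that you are using log10 (not ln) and that you subtracted the joint count when filling in the contingency table.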
For three of the people (Personi) studied in Experiment 2, submit their name as a phrase query to Google (e.g., "Jamie Callan"). Select a document at random from the 1st, 4th, 8th, 12th, 16th, and 20th search result page (i.e., 6 documents that are sampled relatively uniformly from the top 200 documents). Select the names of 1-2 people from each sampled document to create a list of 5 names that are associated with your Personi. Each of these individuals is the Personi,j in a new experiment.
For each Personi, submit the following three queries to Google.
Use the number of matching documents provided by Google to calculate PMI and phi-square for the 5 people associated with each Personi. When calculating PMI, assume that the size of the English Web is 75 billion documents.
You must submit a report. If you are unable to access Blackboard, you may submit your files to the TAs by email; however, this option is intended as a last resort. Use Blackboard if at all possible.
A report template is provided in Microsoft Word and pdf formats. Your report must follow this template, and be in pdf format.
Name your report AndrewID-HW1.pdf or AndrewID-HW1.docx.
There may be different entities that have the same name, for example, "Michael Jordan" the basketball player and "Michael Jordan" the machine learning professor. How should I handle this?
Typically this problem is ignored. PMI and phi-square help you understand whether two strings (e.g., "Michael Jordan", "Bugs Bunny") occur together more often than you would expect from chance alone. It is the human analyst's job to explain why that co-occurrence is observed, or whether it is a meaningful or informative co-occurrence.
Having multiple entities with the same name can make significant co-occurrences more difficult to observe. For example, if one "Michael Jordan" entity is very common, then it is not surprising that the string "Michael Jordan" co-occurs with almost everything. Thus, it becomes a little harder to recognize important but low-frequency co-occurrences that involve one of the less famous "Michael Jordan" entities.
I can open an index with Sifaka and see index properties, but when I try to generate a frequency list, it takes forever (e.g., more than an hour). I have Java version 8 installed.
Perhaps the JVM is not allocating enough memory to the process. If you have a 32-bit version of Java, try upgrading to a 64-bit version of Java. Use Java command-line parameters to allocate at least 1 gigabyte of RAM to Sifaka. If your laptop has a lot of RAM, try allocating 2 gigabytes to Sifaka.
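For example, if the jar file you downloaded is named sifaka.jar (substitute the actual file name), a command along these lines starts Sifaka with a 2 gigabyte heap:

    java -Xmx2g -jar sifaka.jar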