Frequency and Co-occurrence

Text Analytics:
95-865 (K)

HW1: Frequency and Co-occurrence
Due Mar 18, 11:59pm (Adelaide timezone)

This assignment gives you hands-on experience with several ways of doing frequency and co-occurrence analysis using web data. It consists of three major parts:

Obtain frequency information from Google about a variety of topics and entities;
Use the frequency information and a spreadsheet (e.g., Excel) to calculate co-occurrence metrics for pairs of entities (entity_i, entity_j) or topics and entities (topic_i, entity_j);
Write a report that discusses your findings and experience with this assignment.

The report is an important part of your grade. Leave enough time to do a good job on it.

Introduction

This assignment allows you to investigate some of the frequency and co-occurrence techniques presented in the third lecture. You will use a commercial search engine for this assignment instead of your own software, which has advantages and disadvantages. The advantages are that you do not need to install software or write programs, and that you draw on the data of a massive search engine. The disadvantage is that the search engine offers only simple search capabilities, so it will be imprecise for some queries, and you can only do simple analyses.

Entities

If you were using your own search engine for this assignment, you might annotate documents using a named entity tagger (as discussed in class), and then use the search engine's query language to find documents that contain a specific named entity annotation (e.g., shell.company or COMPANY::shell). However, for this assignment you are using a search engine that does not provide this feature, so you must use the phrase operator instead. For example, to find documents about your professor, use the query "Jamie Callan" (the double quotes are part of the query). This will be reasonably effective for entities containing 2 or more words (e.g., "Jamie Callan"), but less effective for entities containing just 1 word (e.g., "shell").

Frequency Counts

Obtain frequency counts from Google's Australian search engine (google.com.au). You should have all forms of personalization deactivated. For example, don't use Chrome, don't be logged into Gmail or other Google services, use private browsing or incognito mode if your browser provides them, etc. You should be as anonymous as possible so that you and the TA get the same results from Google.

For example, Google reports that about 31,800 documents match the phrase query "Jamie Callan" (note that the double quote marks are part of the query). These counts aren't very accurate, but they will be sufficient for this assignment.

Google often returns different frequency counts for entity pairs depending upon their order. For example, Google reports that about 179,000 documents match the query "Rudy Giuliani", "Bill De Blasio", but 142,000 documents match the query "Bill De Blasio", "Rudy Giuliani". Obtain counts for both orders, and use the minimum value in your calculations.
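The minimum-of-both-orders rule can be sketched in Python as follows. This is only an illustration: the function name pair_frequency is an assumption, and the counts are typed in by hand from the example above rather than fetched from Google.

```python
def pair_frequency(count_ij: int, count_ji: int) -> int:
    """Return the co-occurrence count to use for a pair of entities.

    count_ij and count_ji are the result counts reported for the two
    query orders; the assignment asks for the smaller of the two.
    """
    return min(count_ij, count_ji)


# Counts quoted above for "Rudy Giuliani" and "Bill De Blasio".
print(pair_frequency(179_000, 142_000))  # 142000
```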

Contingency Tables

The pointwise mutual information (PMI) and phi-square metrics require a contingency table of the following form.

	entity_i	not entity_i
entity_j	a	b
not entity_j	c	d

As discussed in class, you can fill the contingency table by running 3 queries:

"entity_i" AND "entity_j"
"entity_i"
"entity_j"

The number of results for the first query gives you the value for a. The number of results for the second query gives you the value for a+c. The number of results for the third query gives you the value for a+b.

The size of the corpus is a + b + c + d. Assume that Google's search engine contains about 30 billion text documents (this is a guess, but it won't affect your results much). Assume that about 27% of the documents on the Internet are written in English.¹ Thus, the size of the corpus is 8.1 billion documents.
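As a rough sketch, the four cells can be derived from the three query counts and the corpus size as shown below. The names Contingency and contingency are illustrative assumptions, not part of the assignment. The example counts are the "Clark Kent" / "Lois Lane" values from the spreadsheet check in the Experiments section, with "Clark Kent" treated as entity_i.

```python
from typing import NamedTuple


class Contingency(NamedTuple):
    a: int  # documents containing both entity_i and entity_j
    b: int  # documents containing entity_j but not entity_i
    c: int  # documents containing entity_i but not entity_j
    d: int  # documents containing neither


def contingency(freq_i: int, freq_j: int, freq_both: int,
                corpus_size: int) -> Contingency:
    """Fill the table from the three query counts and the corpus size."""
    a = freq_both
    c = freq_i - a               # the "entity_i" query returns a + c
    b = freq_j - a               # the "entity_j" query returns a + b
    d = corpus_size - a - b - c  # the corpus size is a + b + c + d
    return Contingency(a, b, c, d)


# 27% of an assumed 30 billion documents, as stated above.
CORPUS_SIZE = 30_000_000_000 * 27 // 100  # 8,100,000,000

table = contingency(freq_i=4_190_000, freq_j=3_260_000,
                    freq_both=1_250_000, corpus_size=CORPUS_SIZE)
print(table)
# Contingency(a=1250000, b=2010000, c=2940000, d=8093800000)
```

The same subtractions can be done in three spreadsheet cells; the code only makes explicit which query feeds which cell.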

Some terms, for example 'Facebook' and 'Google', are popular in multiple languages, so their counts can exceed 8.1 billion, which will produce incorrect PMI values. You can avoid this problem by using 25 billion as your estimate of the corpus size for Experiment #2. Be sure to explain what corpus size you used.
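A quick sanity check along these lines can flag counts that do not fit the assumed corpus. This is a minimal sketch: the function name fits_corpus is an assumption, and the 9 billion count is a made-up value used only to show the check, not a real Google result.

```python
ENGLISH_CORPUS = 8_100_000_000   # 27% of an assumed 30 billion documents
LARGE_CORPUS = 25_000_000_000    # alternative estimate for Experiment #2


def fits_corpus(freq: int, corpus_size: int) -> bool:
    """True if a document count can be a fraction of the corpus.

    A count larger than the corpus size implies a probability above 1,
    so PMI and phi-square computed from it are not meaningful.
    """
    return 0 <= freq <= corpus_size


hypothetical_count = 9_000_000_000  # made-up value, NOT a real count
print(fits_corpus(hypothetical_count, ENGLISH_CORPUS))  # False
print(fits_corpus(hypothetical_count, LARGE_CORPUS))    # True
```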

Experiments

Run the experiments described below. These experiments will be relatively simple to accomplish if you start by creating a spreadsheet that i) converts the query results into contingency table values, and ii) uses the contingency table values to calculate PMI and phi-square. Check your spreadsheet! You should get the following values:

Freq ("Clark Kent")	Freq ("Lois Lane")	Freq ("Lois Lane" AND "Clark Kent")	PMI	Phi-square
4,190,000	3,260,000	1,250,000	2.869963	0.114187

Note: These values were obtained using the base-10 log function (LOG in Excel). Use log instead of ln (the natural logarithm).
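If you want to cross-check your spreadsheet, the sketch below computes both metrics from the three query counts. It assumes the usual contingency-table definitions, PMI = log10(a * N / ((a + b)(a + c))) and phi-square = (ad - bc)^2 / ((a + b)(a + c)(b + d)(c + d)), with N = 8.1 billion. Use the definitions given in class if they differ. The function names are illustrative.

```python
import math

CORPUS_SIZE = 8_100_000_000  # N: assumed number of English documents


def pmi(freq_i: int, freq_j: int, freq_both: int,
        corpus_size: int = CORPUS_SIZE) -> float:
    """Pointwise mutual information with a base-10 logarithm.

    Equals log10(P(i, j) / (P(i) * P(j))) with each probability
    estimated as a document count divided by the corpus size.
    """
    return math.log10(freq_both * corpus_size / (freq_i * freq_j))


def phi_square(freq_i: int, freq_j: int, freq_both: int,
               corpus_size: int = CORPUS_SIZE) -> float:
    """Phi-square computed from the 2x2 contingency table."""
    a = freq_both
    b = freq_j - a               # entity_j without entity_i
    c = freq_i - a               # entity_i without entity_j
    d = corpus_size - a - b - c  # neither entity
    numerator = (a * d - b * c) ** 2
    denominator = (a + b) * (a + c) * (b + d) * (c + d)
    return numerator / denominator


# Check row from the table above: "Clark Kent" and "Lois Lane".
clark, lois, both = 4_190_000, 3_260_000, 1_250_000
print(f"PMI        = {pmi(clark, lois, both):.6f}")
print(f"Phi-square = {phi_square(clark, lois, both):.6f}")
```

The printed values should agree with the check row (PMI 2.869963, phi-square 0.114187). If yours do not, the most likely causes are a natural log in place of a base-10 log or a different corpus size.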

Experiment #1: Baseline 1

Present a table that contains frequency, PMI, and phi-square values for the following pairs of entities.

Jamie Callan, Grace Hui Yang
K Callan, Bruce Croft
Jamie Callan, Dean Cain
Teri Hatcher, Dean Cain
Cylvia Hayes, Margi Hoffmann

Experiment #2: Baseline 2

This experiment is similar to the first experiment, except that in this experiment you replace "entity_i" with "topic_i". The topic is "mobile advertising".

The goal of this experiment is to assess the strength of the association between the topic mobile advertising and five organizations that are associated with mobile advertising. Present a table that contains frequency, PMI, and phi-square values for the following entities.

Google
Facebook
Madvertise Media
Sizmek
Aditic

Experiment #3: Assessing Relationships Between People

Select three individuals of varying prominence on the web. Don't choose celebrities, politicians, and other very famous individuals, and don't choose CMU faculty. Instead, choose people of moderate fame that interest you, for example, executives of small or medium-sized companies, less prominent local government officials, etc. Each individual must generate at least 200 search results (typically 20 search result pages) when run as a phrase query (e.g., "Jamie Callan"). Each of these individuals will be "entity_i" in an experiment.

For each "entity_i" individual that you selected, submit their name as a phrase query to Google (e.g., "Jamie Callan"). Select a document at random from the 1st, 4th, 8th, 12th, 16th, and 20th search result page (i.e., 6 documents that are sampled relatively uniformly from the top 200 documents). Select the names of 1-2 people from each sampled document to create a list of 10 names that are associated with your "entity_i" individual. Each of these individuals is the "entity_j" individual in an experiment.
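One way to keep the sampling honest is to let a random number generator choose the result position on each of the six pages before you look at them. The sketch below assumes 10 results per page (200 results over 20 pages); the function name pick_sample is illustrative.

```python
import random

SAMPLED_PAGES = [1, 4, 8, 12, 16, 20]
RESULTS_PER_PAGE = 10  # assumption: 200 results spread over 20 pages


def pick_sample(rng: random.Random) -> list[tuple[int, int]]:
    """Return one (page, position) pair for each sampled result page.

    Position is the 1-based rank of the document within that page.
    """
    return [(page, rng.randint(1, RESULTS_PER_PAGE))
            for page in SAMPLED_PAGES]


# A fixed seed makes the choice reproducible for the report.
for page, position in pick_sample(random.Random(2024)):
    print(f"page {page}: open result #{position}")
```

If the chosen document names nobody useful, draw a new position for that page rather than picking by eye.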

Use the experimental methodology from your first baseline experiment to compute the frequency, PMI, and phi-square for the 10 people associated with your "entity_i" individual. Present the results in a table.

Repeat this process for each "entity_i" individual. The result is three tables.

Experiment #4: Assessing Relationships Between Topics and Companies

Select three business-related topics of varying prominence on the web. You may have more success if your topics are specific instead of general. Each topic must generate at least 200 search results. Each of these topics will be "topic_i" in an experiment.

For each "topic_i" that you selected, submit it as a phrase query to Google. Select a document at random from the 1st, 4th, 8th, 12th, 16th, and 20th search result page (i.e., 6 documents that are sampled relatively uniformly from the top 200 documents). Select the names of 1-2 companies from each sampled document to create a list of 10 names that are associated with your "topic_i". Each of these companies is the "entity_j" in an experiment.

Use the experimental methodology from your second baseline experiment to compute the frequency, PMI, and phi-square for the 10 companies associated with your "topic_i". Present the results in a table.

Repeat this process for each "topic_i". The result is three tables.

Experiment #5: An Experiment of Your Own Design

Design your own experiment that uses frequency, PMI, and phi-square to investigate the relationships between entities and topics, or entities and other entities. Now that you are familiar with these techniques, how would you use them for a business-related task? Remember to select individuals, organizations, or topics of moderate prominence on the web. Don't choose celebrities, very famous companies (e.g., Twitter), or very general topics; and don't choose CMU faculty.

What to Submit

You must submit a report, a .csv file, and an .xlsx file via Blackboard before the deadline, as described below. If you are unable to access Blackboard, you may submit your files to the TA by email; however, this option is intended as a last resort. Use Blackboard if at all possible.

The Report: You must describe your work and your analysis of the experimental results in a written report. Your analysis is a significant part of the grade, so be sure to leave enough time to do a good job.

A report template is provided in Microsoft Word and pdf formats. Your report must follow this template, and be in pdf format.

Name your report AndrewID-HW1.pdf or AndrewID-HW1.docx.

The CSV File: The CSV file should contain the results of Experiment #1. Use this template and replace the 0 values with your results. The goal of this file is to verify that your calculations are correct. Be careful with the formatting, and include all of the precision that Excel provides.

Name your CSV file AndrewID-HW1.csv.

The XLSX File: The XLSX file should contain the results of all your experiments. Use this template as a starting point. The TA has provided this template as a time-saving device. It generates clickable HTML links that contain the search queries. These links open in your operating system's default browser. As described above, you should make yourself anonymous before conducting the searches to ensure consistent results. While we ask that you turn in this file, you are free to modify it to suit your needs.

Name your XLSX file AndrewID-HW1.xlsx.

FAQ

There may be different entities that have the same name, for example, "Michael Jordan" the basketball player and "Michael Jordan" the machine learning professor. How should I handle this?
Typically this problem is ignored. PMI and phi-square help you understand whether two strings (e.g., "Michael Jordan", "Bugs Bunny") occur more often than you would expect from chance alone. It is the human analyst's job to explain why that co-occurrence is observed, or whether it is a meaningful or informative co-occurrence.

Having multiple entities with the same name can make significant co-occurrences more difficult to observe. For example, if one "Michael Jordan" entity is very common, then it is not surprising that the string "Michael Jordan" co-occurs with almost everything. Thus, it becomes a little harder to recognize important but low-frequency co-occurrences that involve one of the less famous "Michael Jordan" entities.

Jamie Callan
