This assignment gives you hands-on experience with several ways of forming text representations, three common types of opinionated text data, and the use of text categorization for sentiment analysis. The three datasets provide experience with different types of social media content. This assignment also gives you practice with a type of question that you will see on the final exam.
This assignment consists of four major parts:
Use the same installations of LightSIDE and Weka that you used for HW2. No changes are necessary. You should have fewer memory and running-time problems with this homework assignment because the datasets are smaller and have fewer categories.
You will be working with preprocessed forms of three datasets, as described below. Each dataset is provided in a CSV format that can be imported into LightSIDE.
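If you want to sanity-check a dataset outside of LightSIDE, the CSV files are easy to inspect with a few lines of Python. A minimal sketch; the file and column names ("class", "text") are assumptions, so check the actual CSV header after unzipping:

```python
import pandas as pd

# Column names ("class", "text") are assumptions -- verify them against
# the actual CSV header after unzipping the dataset.
df = pd.read_csv("movie-pang02.csv")

print(df.shape)                     # documents x columns
print(df["class"].value_counts())   # documents per category (e.g., Pos/Neg)
print(df["text"].iloc[0][:200])     # start of the first review
```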
Pang and Lee's Movie Review Data was one of the first widely available sentiment analysis datasets. It contains 1,000 positive and 1,000 negative movie reviews from IMDB, which is now considered too small for serious research and development purposes. However, the text is similar to movie reviews on IMDB today.
The file movie-pang02.zip contains a copy of Pang and Lee's Movie Review Data in a CSV format that can be imported directly into LightSIDE. It has two categories: Pos (reviews that express a positive or favorable sentiment) and Neg (reviews that express a negative or unfavorable sentiment). For this assignment, we will assume that all reviews are either positive or negative; there are no neutral reviews.
Epinions.com is a website where people can post reviews of products and services. It covers a wide variety of topics. For this assignment, we downloaded a set of 691 posts that recommend Ford automobiles, and 691 posts that do not recommend Ford automobiles. Although the text is several years old, it is similar to comments found on Epinions.com today.
The file epinions3.zip contains the 1,382 posts in a CSV format that can be imported directly into LightSIDE. It has two categories: Pos (reviews that express a positive or favorable sentiment) and Neg (reviews that express a negative or unfavorable sentiment). For this assignment, we will assume that all reviews are either positive or negative; there are no neutral reviews.
Twitter is a popular microblog service where people can post information and opinions on any topic. For this assignment, we will use tweets about Apple that were extracted from a Twitter dataset created by Sanders Analytics. There are two subsets.
The file twitter-sanders-apple2.zip contains 479 tweets in a CSV format that can be imported directly into LightSIDE. It has two categories: Pos (163 tweets that express a positive or favorable sentiment) and Neg (316 tweets that express a negative or unfavorable sentiment).
The file twitter-sanders-apple3.zip contains 988 tweets in a CSV format that can be imported directly into LightSIDE. It has three categories: Pos (163 tweets that express a positive or favorable sentiment), Neg (316 tweets that express a negative or unfavorable sentiment), and Neutral (509 tweets that do not express a sentiment).
Create two baseline representations and one custom representation for the movie, epinions, and twitter-sanders-apple2 datasets. The representations are defined as follows; a rough scikit-learn sketch of these settings appears after the list. Note that these are the same baseline settings that you used for experiment #1 in HW2, except that in this experiment you have a lower threshold (because the datasets are smaller).
baseline 1: unigrams, binary features, threshold=3.
baseline 2: unigrams, binary features, threshold=3, just the top 40 features as determined by kappa.
custom baseline: define your own representation, based on your experience with HW2. You may decide how it is created, how many features it contains, etc. You may make different choices for each dataset, if you wish. This will be your baseline representation in the rest of your experiments below.
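For reference, here is a rough scikit-learn equivalent of the two baselines. This is a sketch under assumptions, not what LightSIDE does internally: `min_df` counts documents rather than raw occurrences, and scikit-learn has no kappa-based feature ranker, so chi-squared is shown as a stand-in for baseline 2's top-40 selection.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# texts: list of document strings; labels: parallel list of Pos/Neg labels.
# Baseline 1: binary unigram features appearing in at least 3 documents
# (an approximation of LightSIDE's rare-feature threshold).
vec = CountVectorizer(binary=True, min_df=3)
X = vec.fit_transform(texts)

# Baseline 2: keep only the 40 strongest features. LightSIDE ranks by
# kappa; chi-squared is a stand-in here, so the selected sets will differ.
X_top40 = SelectKBest(chi2, k=40).fit_transform(X, labels)
```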
Export the baseline representations to ARFF files.
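LightSIDE's export menu handles this step for you. If you ever need to write an ARFF file by hand (for example, to check exactly what Weka receives), a minimal sketch, assuming binary 0/1 features and a nominal class attribute:

```python
def write_arff(path, feature_names, rows, labels, relation="sentiment"):
    """Write a dense binary feature matrix as a minimal ARFF file.
    rows: iterable of 0/1 feature vectors; labels: parallel class labels.
    Feature names containing spaces or commas would need quoting."""
    classes = sorted(set(labels))
    with open(path, "w") as f:
        f.write(f"@RELATION {relation}\n\n")
        for name in feature_names:
            f.write(f"@ATTRIBUTE {name} {{0,1}}\n")
        f.write(f"@ATTRIBUTE class {{{','.join(classes)}}}\n\n@DATA\n")
        for row, label in zip(rows, labels):
            f.write(",".join(str(int(v)) for v in row) + f",{label}\n")
```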
Test your baseline representations using the default configurations for Bayes/NaiveBayes and LibSVM with the linear kernel. Report Precision, Recall, and F1 for each category (Pos and Neg) under each baseline representation.
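If you want to cross-check LightSIDE's numbers, the same evaluation can be approximated in scikit-learn. A sketch assuming `X` and `labels` from the earlier snippet; `MultinomialNB` and `LinearSVC` stand in for Weka's NaiveBayes and LibSVM-with-linear-kernel, and the 10-fold setting is an assumption intended to match the tools' defaults:

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import classification_report

for clf in (MultinomialNB(), LinearSVC()):
    # 10-fold cross-validated predictions over the whole dataset
    pred = cross_val_predict(clf, X, labels, cv=10)
    print(clf.__class__.__name__)
    print(classification_report(labels, pred))  # per-class P, R, F1
```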
Explain your custom representation, and why you chose it. Discuss the differences among the three baselines and the different algorithms. Pay particular attention to differences in accuracy, and what may have caused them.
LightSIDE has limited capabilities for creating phrase features. This experiment explores those capabilities by creating two new representations.
First, add bigram features to your custom representation from Experiment #1. In LightSIDE you will check the "Unigrams" and "Bigrams" boxes in the "Configure Basic Features" pane.
Second, create a new representation using words tagged with their parts of speech. In LightSIDE you will check the "Word/POS Pairs" box in the "Configure Basic Features" pane. You will not check the "Unigrams" box.
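Outside of LightSIDE, both representations can be approximated in Python. A sketch using NLTK for tagging (the NLTK tokenizer and tagger models must be downloaded first, and the exact feature counts will differ from LightSIDE's):

```python
import nltk
from sklearn.feature_extraction.text import CountVectorizer

# Unigrams + bigrams, analogous to checking both boxes in LightSIDE.
bigram_vec = CountVectorizer(binary=True, min_df=3, ngram_range=(1, 2))
X_bi = bigram_vec.fit_transform(texts)

# Word/POS pairs: "great movie" becomes "great_JJ movie_NN", then
# unigram features are extracted over the retagged tokens.
# Requires nltk.download("punkt") and nltk.download("averaged_perceptron_tagger").
def word_pos_pairs(text):
    tokens = nltk.word_tokenize(text)
    return " ".join(f"{word}_{tag}" for word, tag in nltk.pos_tag(tokens))

pos_vec = CountVectorizer(binary=True, min_df=3)
X_pos = pos_vec.fit_transform(word_pos_pairs(t) for t in texts)
```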
Test these two representations using the datasets (movie, epinions, twitter) and algorithms (NB, SVM) tested in experiment #1.
Discuss your results. Do "phrase" features help? Is there any difference in accuracy and/or the number of features generated? Does each dataset and/or learning algorithm behave the same way?
Apply your custom representation to the twitter-sanders-apple3 dataset. Test it using the Naive Bayes and SVM learning algorithms.
Compare the results of this 3-class version of the Twitter Apple data to the 2-class version of the Twitter Apple data. Discuss your results. How does the addition of an extra class affect the learning algorithms? Which pairs of classes are most often confused? Why might that be?
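LightSIDE and Weka both display a confusion matrix directly. If you prefer to compute one yourself from cross-validated predictions (with `labels` and `pred` as in the earlier sketch, run on the 3-class data), the off-diagonal cells show which class pairs are confused:

```python
from sklearn.metrics import confusion_matrix

# Rows are true classes, columns are predicted classes, in the order
# given by the labels= argument.
cm = confusion_matrix(labels, pred, labels=["Pos", "Neg", "Neutral"])
print(cm)
```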
(FYI, on an exam I would expect students to spend 15-20 minutes on this question.)
Suppose that you work for Virgin Mobile Australia. The company is able to monitor the web browsing behavior of its mobile phone customers. Your job is to use this information to build a profile of each customer's interests across about 100 broad topic categories that Virgin Mobile can use for marketing and advertising purposes. You decide that your topic categories will be the top two levels of the Open Directory Project (DMOZ). Describe how you would accomplish this task. Be clear about what data is used, how you obtain the data, how the data is used to create a profile of each customer, and what a customer profile looks like. Your solution must be scalable to a large population of customers.
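To make the expected deliverable concrete, here is one purely illustrative shape a customer profile might take: a normalized distribution over DMOZ top-two-level categories, built by classifying each browsed page into a category. This is not a complete answer to the question, and the category names below are examples, not the real taxonomy labels:

```python
from collections import Counter

# Hypothetical output of a page classifier: one DMOZ-style category
# per page visit for a single customer. Names are illustrative only.
page_categories = ["Arts/Music", "Arts/Music", "Sports/Soccer",
                   "Shopping/Vehicles"]

counts = Counter(page_categories)
total = sum(counts.values())
profile = {category: n / total for category, n in counts.items()}
print(profile)  # {'Arts/Music': 0.5, 'Sports/Soccer': 0.25, 'Shopping/Vehicles': 0.25}
```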
You must describe your work and your analysis of the experimental results in a written report. Your analysis is a significant part of the grade, so be sure to leave enough time to do a good job.
A report template is provided in Microsoft Word and PDF formats. Your report must follow this template, and be in PDF format. Name your report AndrewID-HW3.pdf.
The template provides specific instructions about what information to provide for each experiment. However, generally speaking, you should discuss any trends that you observe about what works well or doesn't; dataset-specific characteristics; or algorithm-specific characteristics. Discuss whether the different choices work as you expected, or whether there were surprises. If things didn't work as you expected, what might be the causes?
Submit your report via Blackboard before the deadline.
The DNDW script generates a "results.csv" file, located in the same directory as the script, that logs your experiment results. Remember to replace your HW2 results.csv with a new copy from here before you begin. Turn this file in on Blackboard.