Homework: Text Categorization

Text Analytics:
95-865 (A)

Homework #2: Text Categorization
Due Apr 18, 11:59pm

This assignment gives you hands-on experience with several ways of forming text representations, two popular categorization algorithms, and several common types of data. The datasets provide exposure to newswire, medical abstract, and social media content.

This assignment consists of four major parts:

Install Weka on your computer;
Use Sifaka to construct several text representations for each dataset;
Use Weka to do evaluate several different machine learning algorithms for each text representation; and
Write a report that discusses your findings and experience with this assignment.

The report is an important part of your grade. Leave enough time to do a good job on it.

Software

In Homework 1 you installed Sifaka on your computer. For this assignment you need to install Weka. Weka is a popular open-source software suite for text and data mining that is available for Windows, Mac, and Linux operating systems. Weka supports a variety of categorization and clustering algorithms within a common GUI (and programmable API), which makes it extremely convenient.

Download Weka and install it on your computer. If you already have an older version of Weka that doesn't contain the LibLINEAR package, you will need to upgrade it for this assignment.

Read the Weka tutorial to familiarize yourself with using it to do text classification.

Test your installation as described below.

Navigate to Weka's Explorer application.
Load this sample data file (on the Preprocess tab, Open file).
Choose the Multinomial Naive Bayes classifier (on the Classify tab, Choose->bayes/NaiveBayesMultinomial).
Under test options, choose Cross-validation with 10 folds
In the drop-down menu, make sure the last feature "(Nom) class" is selected.
Click start

Verify that you got the expected results:

    Correctly Classified Instances    5141  85.6833 %
    Incorrectly Classified Instances   859  14.3167 %

Datasets

This assignment investigates two types of classification: i) topic classification, and ii) sentiment classification. Thus, there are two types of datasets, as described below. Each dataset is provided in a Lucene index that can be imported into Sifaka.

Topic Classification Datasets

Reuters: We use a subset of Reuters-21578, a well-known news dataset. The text and categories are similar to text and categories used in industry. This assignment uses 6 categories of varying size: heat, housing, coffee, gold, acq, and earn. Download reuters-19042.zip.
OHSUMED: We use a subset of OHSUMED, a well-known medical abstracts dataset. The text and categories are similar to text and categories used in medical research. This assignment uses 6 categories of varying size: Mitosis, Pediatrics, Necrosis, Hyperplasia, Pregnancy, Rats.csv. Download OHSUMED_10k.zip.

Report Precision, Recall, and F-measure for the topic-oriented categories only (i.e., do not report values for neg). Use all of the categories except neg for macro-averaging.

Sentiment Classification Datasets

We use two of Julian McAuley's Amazon product datasets.

Apps for Android: We use a subset of the Android Apps dataset. The subset contains 25,000 product reviews from Amazon. This dataset has three categories: Good (4-5 stars), Ok (3 stars), and Poor (1-2 stars). Download AppsForAndroid.zip.
Cell Phones and Accessories: We use a subset of the Cell Phones and Accessories dataset. The subset contains 27,777 product reviews from Amazon. This dataset has three categories: Good (4-5 stars), Ok (3 stars), and Poor (1-2 stars). Download CellPhonesAndAccessories.zip.

Report Precision, Recall, and F-measure for all three sentiment-oriented categories (i.e., good, ok, poor). Use all three categories for macro-averaging.

Experiments

This assignment can be viewed as creating a set of classifiers that decide whether or not document d should be assigned to category c. In these datasets, a person has already assigned each document to one or more categories ("ground truth" or "gold standard" labels). The machine learning software will use some of these documents as training data to learn a classifier; the remaining documents will be used to test the accuracy of the learned classifiers.

When you report experimental results, if your values are in the range 0-100, only provide 1 decimal point of precision in your measurements (e.g., 96.1); if your values are in the range 0-1, only provide 3 decimal points of precision in your measurements (e.g., 0.961). Greater accuracy is unnecessary for this task.

Note that when reporting Precision, Recall and F-measure for your experiments, the neg category is ignored for the topic-oriented datasets (reuters-19042 and OHSUMED_10k). However the poor category is used for sentiment-oriented datasets (AppsForAndroid and CellPhonesAndAccessories). Make sure that you report the correct Precision and Recall values.

Learning Algorithms

Conduct your experiments with the following two learning algorithms provided by Weka.

Multinomial Naive Bayes: Bayes/NaiveBayesMultinomial
A linear version of SVM: functions/LibLINEAR. This package is a version of SVM optimized for linear classification with text data. You could also use LibSVM with a linear kernel, but it is much slower.

Experiment #1: Topical Baselines

Create two baseline representations for each topical dataset. The baseline representations are defined as follows.

baseline 1: terms, frequency threshold=10, all features

If this baseline is too big for Weka to handle on your laptop, you can reduce its size, for example, by keeping just 500 features for each class. If you prune your baseline feature set, be sure to discuss this in your report.
baseline 2: terms, frequency threshold=10, just the top 10 features per class, as determined by kappa (Sifaka's default term score).

Export the baseline representations to arff files.

Test your baseline representations using Multinomial Naive Bayes and LibLINEAR (SVM). Report Precision, Recall and F-measure (for the positive categories) obtained with each baseline representation.

Experiment #2: Feature selection

Test the effects of different numbers of features in baseline #1. Try five different sizes (numbers of features). Test these new representations using the datasets and learning algorithms tested in experiment #1.

Experiment #3: Your Representations

Develop and test two custom representations of your own design for the topical datasets. You can try combining different choices that worked well in Experiments 1-2, and/or you can explore different Sifaka options that weren't explored in Experiments 1-2 (e.g., using noun phrases). Test these new representations using the datasets and algorithms tested in Experiment #2.

Experiment #4: Sentiment Baselines

Create two baseline representations and two custom representations for the AppsForAndroid and CellPhonesAndAccessories datasets. The two baseline representations should use the same parameter settings used for Experiment 1. Use your experience in Experiments 1-3 to select parameter settings for the custom representations. Test these new representations using both algorithms.

What to Turn In

You must submit a report that describes your work and your analysis of the experimental results in a written report. . If you are unable to access Blackboard, you may submit your files to the TAs by email, however this option is intended as a last-resort method. Use Blackboard if at all possible.

Your analysis is a significant part of the grade, so be sure to leave enough time to do a good job.

A report template is provided in Microsoft Word and pdf formats. Your report must follow this template, and be in pdf format.

The template provides specific instructions about what information to provide for each experiment. However, generally speaking, you should discuss any trends that you observe about what works well or doesn't; dataset-specific characteristics; or algorithm-specific characteristics. Discuss whether the different choices work as you expected, or whether there surprises. If things didn't work as you expected, what might be the causes?

Name your report AndrewID-HW2.pdf or AndrewID-HW1.docx.

Having Trouble? See the FAQ.

Jamie Callan

Homework #2: Text Categorization Due Apr 18, 11:59pm