Text Analytics:
95-865 (A)

Homework #2: Text Categorization
Due Apr 12, 11:59pm

This assignment gives you hands-on experience with several ways of forming text representations, two popular categorization algorithms, and several common types of data. The datasets provide exposure to newswire, medical abstract, and social media content.

This assignment consists of five major parts:

  1. Install LightSIDE on your computer;
  2. Install Weka on your computer;
  3. Use LightSIDE to construct several text representations for each dataset;
  4. Use Weka to evaluate several different machine learning algorithms for each text representation; and
  5. Write a report that discusses your findings and experience with this assignment.

The report is an important part of your grade. Leave enough time to do a good job on it.



You must install two software applications on your laptop.


LightSIDE

LightSIDE is an open-source software suite for developing and testing text representations. It is available for Windows, Mac, and Linux. LightSIDE supports several common methods of forming text features (e.g., unigrams, bigrams, trigrams, phrases, stemming). It also includes an integrated version of Weka for testing text representations; however, we won't be using that for this homework assignment.

Download LightSIDE and install it on your computer.

Read the LightSIDE Researcher's Manual that comes with the software to familiarize yourself with creating text representations for the sample datasets included with LightSIDE.



Weka

Weka is a popular open-source software suite for text and data mining that is available for Windows, Mac, and Linux. Weka supports a variety of categorization and clustering algorithms within a common GUI (and programmable API), which makes it extremely convenient.

Download Weka and install it on your computer.

Read the Weka tutorial to familiarize yourself with using it to do text classification.

Test your installation as described below.

  1. Navigate to Weka's Explorer application.
  2. Load this sample data file (on the Preprocess tab, Open file).
  3. Choose the Naive Bayes classifier (on the Classify tab, Choose -> bayes/NaiveBayes).
  4. Under Test options, choose Cross-validation with 10 folds.
  5. In the drop-down menu, make sure the last feature, "(Nom) class", is selected.
  6. Click Start.
  7. Verify that you get the expected results:
        Correctly Classified Instances    4538  75.6333 %
        Incorrectly Classified Instances  1462  24.3667 %
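To make the cross-validation step concrete, here is a minimal pure-Python sketch of the fold assignment that 10-fold cross-validation performs: each instance goes into one of 10 folds, and the classifier is trained on 9 folds and tested on the held-out fold, 10 times. The round-robin assignment below is an illustration, not Weka's exact procedure; the instance count of 6,000 comes from the expected results above (4538 + 1462).

```python
# Illustrative sketch of "Cross-validation with 10 folds": assign each
# instance to a fold; train on k-1 folds, test on the held-out fold.
def fold_indices(n, k=10):
    """Assign each of n instances to one of k folds, round-robin."""
    return [i % k for i in range(n)]

folds = fold_indices(6000, k=10)   # the sample file has 6,000 instances
held_out = [i for i, f in enumerate(folds) if f == 0]
print(len(held_out))  # 600 instances are tested in fold 0
```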


This assignment investigates two types of classification: i) topic classification, and ii) sentiment classification. Thus, there are two types of datasets, as described below. Each dataset is provided in a CSV format that can be imported into LightSIDE.

Topic Classification Datasets

  1. Reuters: We use a subset of Reuters-21578, a well-known news dataset. The text and categories are similar to text and categories used in industry. This assignment uses 6 categories of varying size: heat, housing, coffee, gold, acq, and earn. Download reuters-allcat-6.zip. It contains 1 csv file that covers all 6 classes.

  2. OHSUMED: We use a subset of OHSUMED, a well-known medical abstracts dataset. The text and categories are similar to text and categories used in medical research. This assignment uses 6 categories of varying size: Mitosis, Pediatrics, Necrosis, Hyperplasia, Pregnancy, and Rats. Download ohsumed-allcats-6.zip. It contains 1 csv file that covers all 6 classes.

  3. Epinions topical datasets: Epinions.com is a website where people can post reviews of products and services. The text and categories are similar to text and categories used in social media. This assignment uses two topical Epinions datasets:

    1. Epinions-2: This assignment uses 2 categories of equal size: cars (3,000) and cameras (3,000). Download epinions-2.zip. It contains 1 csv file that covers both classes.
    2. Epinions-ford: A dataset that contains 6,000 posts about Ford cars (3,000) and other cars (3,000). Download epinions-ford.zip. It contains 1 csv file that covers both classes.

Sentiment Classification Datasets

  1. Movie: We use Pang and Lee's Movie Review Data. It contains 2,000 movie reviews from IMDB. The text is similar to movie reviews on IMDB today. It has two categories: Pos (1,000) and Neg (1,000); there are no neutral reviews. Download movie-pang02.zip.

  2. Epinions sentiment: The dataset contains 1,382 Epinions.com posts that express opinions about Ford automobiles. It has two categories: Pos (691) and Neg (691); there are no neutral reviews. Download epinions-likeford.zip.

  3. Twitter-sanders datasets: Twitter is a popular microblog service where people can post information and opinions on any topic. This assignment uses tweets about Apple corporation that were extracted from a Twitter dataset created by Sanders Analytics. There are two subsets.

    1. Twitter-apple2: A dataset that contains two categories: Pos (163 positive tweets) and Neg (316 tweets). Download twitter-sanders-apple2.zip.
    2. Twitter-apple3: A dataset that contains three categories: Pos (163 tweets), Neg (316 tweets), and Neutral (509). Download twitter-sanders-apple3.zip.
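Since each dataset is provided as a CSV file, you can sanity-check one before importing it into LightSIDE. The sketch below assumes a two-column layout with a class label and a text field; the actual column names in the provided files may differ, so treat "class" and "text" as illustrative.

```python
# Read a tiny in-memory CSV shaped like the assignment datasets.
# The "class" and "text" column names are assumptions for illustration.
import csv
import io

sample = 'class,text\nPos,"great car, would buy again"\nNeg,"terrible car"\n'
rows = list(csv.DictReader(io.StringIO(sample)))
labels = [r["class"] for r in rows]
print(labels)  # ['Pos', 'Neg']
```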



This assignment can be viewed as creating a set of classifiers that decide whether or not document d should be assigned to category c. In these datasets, a person has already assigned each document to one or more categories ("ground truth" or "gold standard" labels). The machine learning software will use some of these documents as training data to learn a classifier; the remaining documents will be used to test the accuracy of the learned classifiers.

When you report experimental results, if your values are in the range 0-100, provide only one decimal place of precision (e.g., 96.1); if your values are in the range 0-1, provide only three decimal places (e.g., 0.961). Greater precision is unnecessary for this task.

In all of your experiments, be sure to report Precision, Recall, and F-measure for the positive category (e.g., for the Reuters dataset: acq, coffee, earn, etc.). For small categories, the classifier may be better at recognizing what is not in the category than what is in it. Make sure that you are reporting the correct Precision and Recall values.
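As a reminder of how these values relate, here is a small sketch that computes Precision, Recall, and F-measure for a positive category from true-positive, false-positive, and false-negative counts; the counts used below are made-up example values, not results you should expect.

```python
# Precision/Recall/F-measure for the positive category, from counts.
# tp, fp, fn are hypothetical example values for illustration only.
def prf(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f = prf(tp=90, fp=10, fn=30)
print(f"{p:.3f} {r:.3f} {f:.3f}")  # 0.900 0.750 0.818  (3 decimal places)
```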

Learning Algorithms

Conduct your experiments with the following two learning algorithms provided by Weka: Naive Bayes (bayes/NaiveBayes) and SVM (LibSVM with the linear kernel), using their default configurations.


This homework requires you to train a large number of classifiers. You must do Experiment #1 manually so that you gain some experience with Weka. However, you may use scripts (e.g., dndw_v1.zip) to automate Experiments #2 and #3, which will allow you to spend most of your time analyzing experimental results instead of running Weka. If you try to run all of the experiments manually, it will be tedious and time-consuming.

See the Weka Automation Workflow Guide and the Brief LightSide Tutorial for more information. Questions about the automation scripts should be posted on Piazza. Although we tested the scripts on various computer setups before releasing them, it is still possible that something might go wrong. For this reason, we strongly recommend starting this homework early so that any technical difficulties can be resolved in time.

Experiment #1: Topical Baselines

Create two baseline representations for each topical dataset. The baseline representations are defined as follows.

  1. baseline 1: unigrams, binary features (the default), frequency threshold=10.

    If this baseline is too big for weka to handle on your laptop, you can reduce its size, for example, by pruning out any feature with a kappa value of 0.001 or less. If you prune your baseline feature set, be sure to discuss this in your report.

  2. baseline 2: unigrams, binary features, frequency threshold=10, just the top 10 features per class, as determined by kappa. See Chapter 5 (Data Restructuring) in the Lightside Researcher's Manual to learn about filtering features.
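To make the baseline definitions concrete, the sketch below shows what "unigrams, binary features, frequency threshold" produces on a toy corpus with threshold=2. It interprets the threshold as document frequency; LightSIDE's exact counting may differ, so treat this only as an illustration.

```python
from collections import Counter

def binary_unigram_features(docs, threshold):
    """Keep unigrams appearing in at least `threshold` documents;
    represent each document as a binary presence vector."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))
    vocab = sorted(w for w, c in df.items() if c >= threshold)
    vectors = [[int(w in set(doc.lower().split())) for w in vocab]
               for doc in docs]
    return vocab, vectors

docs = ["gold prices rise", "gold falls", "coffee prices fall"]
vocab, X = binary_unigram_features(docs, threshold=2)
print(vocab, X)  # ['gold', 'prices'] [[1, 1], [1, 0], [0, 1]]
```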

Export the baseline representations to arff files.
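If you want to understand what the exported .arff files contain, this sketch writes a minimal ARFF document: a header of @attribute declarations followed by one comma-separated @data row per document. The attribute names and layout here are illustrative; LightSIDE's actual export includes more detail.

```python
# Build a minimal Weka ARFF file: binary feature attributes plus a
# nominal class attribute, then one @data row per document.
def to_arff(relation, vocab, rows, labels, classes):
    lines = ["@relation " + relation, ""]
    for w in vocab:
        lines.append("@attribute %s {0,1}" % w)
    lines.append("@attribute class {" + ",".join(classes) + "}")
    lines.append("")
    lines.append("@data")
    for row, label in zip(rows, labels):
        lines.append(",".join(str(v) for v in row) + "," + label)
    return "\n".join(lines)

arff = to_arff("reuters", ["gold", "coffee"],
               [[1, 0], [0, 1]], ["gold", "coffee"], ["gold", "coffee"])
print(arff)
```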

Test your baseline representations using Naive Bayes and SVM. Report Precision, Recall and F-measure (for the positive categories) obtained with each baseline representation.

Experiment #2: Feature selection

Test the effects of different numbers of features in baseline #1. Try five different sizes (numbers of features). Test these new representations using the datasets and learning algorithms tested in experiment #1.
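One simple way to vary the number of features is to rank them by a score and keep the top k. LightSIDE ranks by kappa; the scores below are made-up stand-ins, so treat this only as a sketch of the selection step.

```python
def top_k_features(scores, k):
    """scores: feature name -> score; return the k highest-scoring names."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]

# Hypothetical per-feature scores (stand-ins for LightSIDE's kappa values).
scores = {"gold": 0.61, "coffee": 0.55, "the": 0.01, "price": 0.20}
print(top_k_features(scores, k=2))  # ['gold', 'coffee']
```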

Experiment #3: Your Representations

Develop and test two custom representations of your own design for the topical datasets. You can try combining different choices that worked well in experiments 1-2, and/or you can explore different LightSIDE options that weren't explored in experiments 1-2. Test these new representations using the datasets and algorithms tested in experiment #2.

Experiment #4: Sentiment Baselines

Create two baseline representations and one custom representation for the movie, epinions, and twitter-sanders-apple2 datasets. The representations are defined as follows. Note that these are the same baseline settings that you used for Experiment #1, except that in this experiment you use a lower frequency threshold (because the datasets are smaller).

  1. baseline 1: unigrams, binary features, threshold=3.

  2. baseline 2: unigrams, binary features, threshold=3, just the top 40 features as determined by kappa.

  3. custom baseline: define your own representation, based on your experience with Experiments 1-3. You may decide how it is created, how many features it contains, etc. You may make different choices for each dataset, if you wish. This will be your baseline representation in the rest of your experiments below.

Test your baseline representations for each category (Pos, Neg) using the default configurations for Bayes/NaiveBayes and LibSVM with the linear kernel. Report Precision, Recall, and F1 for each category (e.g., Pos and Neg) in each baseline representation.

Experiment #5: Two Classes vs. Three Classes

Apply your custom representation (Experiment #4) to the twitter-sanders-apple3 dataset. Test it using the Naive Bayes and SVM learning algorithms.


What to Turn In

Use Blackboard to submit a single zip file named AndrewId-HW2.zip. The zip file must contain the two files described below.


Report

You must describe your work and your analysis of the experimental results in a written report. Your analysis is a significant part of the grade, so be sure to leave enough time to do a good job.

A report template is provided in Microsoft Word and pdf formats. Your report must follow this template, and be in pdf format. Name your report AndrewID-HW2.pdf.

The template provides specific instructions about what information to provide for each experiment. Generally speaking, however, you should discuss any trends that you observe about what works well or doesn't, dataset-specific characteristics, or algorithm-specific characteristics. Discuss whether the different choices worked as you expected, or whether there were surprises. If things didn't work as you expected, what might be the causes?

CSV File

Using the DNDW script, you will have generated a "results.csv" file that logs your experimental results. It can be found in the same directory as the DNDW script. Rename the file AndrewId.csv before you turn it in.




Copyright 2015, Carnegie Mellon University.
Updated on April 07, 2016
Jamie Callan