Text Analytics:
95-865 (K)

Homework #2: Text Categorization
Due Apr 1, 9:59pm (Adelaide time)

This assignment gives you hands-on experience with several ways of forming text representations, several popular categorization algorithms, and three common types of data. The three datasets provide exposure to newswire, medical abstract, and social media content.

This assignment consists of five major parts:

  1. Install LightSIDE on your computer;
  2. Install weka on your computer;
  3. Use LightSIDE to construct several text representations for each dataset;
  4. Use weka to evaluate several different machine learning algorithms for each text representation; and
  5. Write a report that discusses your findings and experience with this assignment.

The report is an important part of your grade. Leave enough time to do a good job on it.



LightSIDE is an open-source software suite for developing and testing text representations. It is available for Windows, Mac, and Linux operating systems. LightSIDE supports several common methods of forming text features (e.g., unigrams, bigrams, trigrams, phrases, stemming). It also includes an integrated version of weka for testing text representations; however, we won't be using that for this homework assignment.

Download LightSIDE and install it on your computer.

Read the LightSIDE Researcher's Manual that comes with the software to familiarize yourself with creating text representations for the sample datasets included with LightSIDE.



weka is a popular open-source software suite for text and data mining that is available for Windows, Mac, and Linux operating systems. Weka supports a variety of categorization and clustering algorithms within a common GUI (and programmable API), which makes it extremely convenient.

Download weka and install it on your computer.

Read the weka tutorial to familiarize yourself with using it to do text classification.

Test your installation as described below.

  1. Navigate to weka's Explorer application.
  2. Load this sample data file (on the Preprocess tab, Open file).
  3. Choose the Naive Bayes classifier (on the Classify tab, Choose->bayes/NaiveBayes).
  4. Under test options, choose Cross-validation with 10 folds.
  5. In the drop-down menu, make sure the last feature "(Nom) class" is selected.
  6. Click Start.
  7. Verify that you got the expected results:
        Correctly Classified Instances    4538  75.6333 %
        Incorrectly Classified Instances  1462  24.3667 %
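
As a quick sanity check on the expected output, the two percentages should follow directly from the two instance counts. The short Python sketch below only redoes that arithmetic; it does not call weka, and the variable names are illustrative.

```python
# Counts copied from the expected weka output above.
correct, incorrect = 4538, 1462

# In cross-validation every instance is tested exactly once,
# so the two counts sum to the number of instances in the file.
total = correct + incorrect

pct_correct = 100 * correct / total
pct_incorrect = 100 * incorrect / total

print(total, f"{pct_correct:.4f}", f"{pct_incorrect:.4f}")
```

If your own run reports different counts, the same arithmetic shows how far your accuracy is from the expected value.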


You will be working with preprocessed forms of three datasets, as described below. Each dataset is provided in a CSV format that can be imported into LightSIDE.


Reuters-21578 is a well-known newswire dataset. It contains 21,578 newswire documents, so it is now considered too small for serious research and development purposes. However, the text and categories are similar to text and categories used in industry.

There are many Reuters categories. In this assignment you will consider 6 categories. Two are 'big' categories (many positive documents), two are 'medium' categories, and two are 'small' categories (few positive documents).

Download reuters-allcat-6.zip. It contains 1 csv file that covers all 6 classes.


OHSUMED is a well-known medical abstracts dataset. It contains 348,566 references, and is still used for research and development.

There are many OHSUMED categories. In this assignment you will consider 6 categories. Two are 'big' categories (many positive documents), two are 'medium' categories, and two are 'small' categories (few positive documents).

Download ohsumed-allcats-6.zip. It contains 1 csv file that covers all 6 classes.


Epinions.com is a website where people can post reviews of products and services. It covers a wide variety of topics. For this homework assignment, we downloaded a set of 12,000 posts about digital cameras and cars.

epinions.zip contains a directory with two csv files.


Learning Algorithms

Conduct your experiments with the following three learning algorithms provided by Weka: J48, Naive Bayes, and SVM.


This homework requires you to train a large number of classifiers. You must do Experiment #1 manually so that you gain some experience with Weka. However, you may use scripts (e.g., dndw_v1.zip) to automate Experiments #2 and #3, which will allow you to spend most of your time analyzing experimental results instead of running Weka. If you try to run all of the experiments manually, it will be tedious and time-consuming.

See the Weka Automation Workflow Guide and the Brief LightSide Tutorial for more information. Questions about the automation scripts should be posted on Piazza. Although we tested the automation scripts on various computer setups before releasing them, it is still possible that something might go wrong. For this reason, we strongly recommend starting on this homework early so that any technical difficulties can be resolved in time.


This assignment can be viewed as creating a set of classifiers that decide whether or not document d should be assigned to category c. Each classifier is solving a two-class problem, where the classes are "positive" (assign to category) and "negative" (don't assign to category). This is typical for problems that truly are two-category problems (e.g., the epinions data) and problems where multiple categories can be assigned to each document (e.g., the Reuters and OHSUMED data).

In these datasets, a person has already assigned each document to one or more categories ("ground truth" or "gold standard" labels). The machine learning software will use some of these documents as training data to learn a classifier; the remaining documents will be used to test the accuracy of the learned classifiers.
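
The two-class framing can be illustrated with a minimal Python sketch. It assumes each document's gold-standard labels are available as a set of category names; the function name and the toy data are hypothetical and are not part of the provided datasets or scripts.

```python
def binary_labels(doc_categories, category):
    """Turn multi-category gold labels into a two-class problem for one category.

    doc_categories: list of sets, one set of category names per document.
    category: the category c that this classifier decides about.
    """
    return ["positive" if category in cats else "negative"
            for cats in doc_categories]


# Toy gold-standard labels for three documents (illustrative only).
gold = [{"acq"}, {"earn", "acq"}, {"coffee"}]

print(binary_labels(gold, "acq"))     # one classifier per category
print(binary_labels(gold, "coffee"))
```

A document with several categories is "positive" for each of them, which is why one two-class classifier is built per category.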

The first experiment creates a set of baseline classifiers for the three datasets. The second experiment tests the effect of varying the number of features. The third experiment allows you to test your own ideas about how to form features. Each experiment is described in more detail below.

When you report experimental results, if your values are in the range 0-100, provide only 1 decimal place (e.g., 96.1); if your values are in the range 0-1, provide only 3 decimal places (e.g., 0.961). Greater precision is unnecessary for this task.
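
One way to apply this rule consistently is a small formatting helper, sketched below in Python. The function name and the `scale` parameter are hypothetical; the scale is passed explicitly because a value such as 0.5 could belong to either range.

```python
def format_metric(value, scale="fraction"):
    """Format a measurement for the report.

    scale="percent":  value is in the range 0-100, keep 1 decimal place.
    scale="fraction": value is in the range 0-1, keep 3 decimal places.
    """
    if scale == "percent":
        return f"{value:.1f}"
    if scale == "fraction":
        return f"{value:.3f}"
    raise ValueError(f"unknown scale: {scale!r}")


print(format_metric(96.1234, scale="percent"))
print(format_metric(0.96123))
```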

In all of your experiments, be sure to report Precision, Recall and F-measure for the positive category (e.g., for the Reuters dataset, acq, coffee, earn, etc). For small categories, the classifier may be better at recognizing what is not in the category than what is in the category. Make sure that you are reporting the correct Precision and Recall values.
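
The difference between the positive and negative rows can be seen in a small worked example. The Python sketch below assumes that F-measure means the balanced F1 score (the harmonic mean of Precision and Recall); the counts are made up for illustration and do not come from any of the assignment datasets.

```python
def prf(tp, fp, fn):
    """Precision, Recall and F1 for one class, from its confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical small category: 16 positive documents out of 1000.
tp, fp, fn, tn = 8, 2, 8, 982

positive = prf(tp, fp, fn)
# For the negative class the roles swap: its "true positives" are tn,
# its false positives are fn, and its false negatives are fp.
negative = prf(tn, fn, fp)

print("positive: P=%.3f R=%.3f F=%.3f" % positive)
print("negative: P=%.3f R=%.3f F=%.3f" % negative)
```

In this example the negative class scores above 0.99 on every measure while the positive class reaches only P=0.800, R=0.500, F=0.615, so reading the wrong row would badly overstate how well the small category is recognized.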

Experiment #1: Baselines

Create two baseline representations for each dataset. The baseline representations are defined as follows.

  1. baseline 1: unigrams, binary features (the default), frequency threshold=10.

    If this baseline is too big for weka to handle on your laptop, you can reduce its size, for example, by pruning out any feature with a kappa value of 0.001 or less. If you prune your baseline feature set, be sure to discuss this in your report.

  2. baseline 2: unigrams, binary features, frequency threshold=10, just the top 10 features per class, as determined by kappa. See Chapter 5 (Data Restructuring) in the LightSIDE Researcher's Manual to learn about filtering features.
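
To make the two baseline definitions concrete, here is a rough Python sketch of the same ideas outside LightSIDE. It rests on several assumptions that are not confirmed by this handout: tokens are split on whitespace, the frequency threshold is read as a minimum number of documents containing the token, and kappa is computed as Cohen's kappa between "feature present" and "document is positive". All function names are hypothetical, and LightSIDE's own computation may differ.

```python
from collections import Counter


def binary_unigrams(texts, min_doc_freq):
    """Binary unigram features: keep tokens found in >= min_doc_freq documents."""
    token_sets = [set(t.lower().split()) for t in texts]
    doc_freq = Counter(tok for s in token_sets for tok in s)
    vocab = sorted(tok for tok, df in doc_freq.items() if df >= min_doc_freq)
    keep = set(vocab)
    return vocab, [s & keep for s in token_sets]


def kappa(present, positive):
    """Cohen's kappa between a binary feature and a binary class label."""
    n = len(positive)
    a = sum(f and y for f, y in zip(present, positive))      # present, positive
    b = sum(f and not y for f, y in zip(present, positive))  # present, negative
    c = sum(not f and y for f, y in zip(present, positive))  # absent, positive
    d = n - a - b - c                                        # absent, negative
    observed = (a + d) / n
    expected = ((a + b) * (a + c) + (c + d) * (b + d)) / (n * n)
    return (observed - expected) / (1 - expected) if expected != 1 else 0.0


def top_k_by_kappa(vocab, rows, positive, k):
    """Rank features by kappa (ties broken alphabetically) and keep the top k."""
    scored = [(kappa([tok in r for r in rows], positive), tok) for tok in vocab]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [tok for _, tok in scored[:k]]


# Toy corpus; a threshold of 2 stands in for the assignment's threshold of 10.
texts = ["coffee prices rise", "coffee exports fall",
         "bank profits rise", "bank shares fall"]
is_coffee = [True, True, False, False]

vocab, rows = binary_unigrams(texts, min_doc_freq=2)
print(vocab)
print(top_k_by_kappa(vocab, rows, is_coffee, k=1))
```

Baseline 1 corresponds to keeping the whole thresholded vocabulary, while baseline 2 corresponds to keeping only the highest-kappa features for each class.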

Export the baseline representations to arff files.
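
For orientation, an arff file is a plain-text format with a header that declares each attribute, followed by one comma-separated row per document. The Python sketch below writes a minimal file of that shape for binary features. It is only an illustration: the function name is hypothetical, it assumes feature names contain no spaces or special characters and that no feature is itself named "class", and the files LightSIDE exports may be laid out differently.

```python
def to_arff(relation, vocab, rows, labels):
    """Render binary features plus a nominal class attribute as arff text.

    vocab:  list of feature names.
    rows:   list of sets, the features present in each document.
    labels: list of "positive"/"negative" strings, one per document.
    """
    lines = [f"@relation {relation}", ""]
    lines += [f"@attribute {tok} {{0,1}}" for tok in vocab]
    lines += ["@attribute class {negative,positive}", "", "@data"]
    for row, label in zip(rows, labels):
        values = ["1" if tok in row else "0" for tok in vocab]
        lines.append(",".join(values + [label]))
    return "\n".join(lines) + "\n"


arff_text = to_arff(
    "toy_coffee",
    ["bank", "coffee"],
    [{"coffee"}, {"bank"}],
    ["positive", "negative"],
)
print(arff_text)
```

The class attribute is written last, which matches the installation test above, where the last feature "(Nom) class" is the one to predict.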

Test your baseline representations using J48, Naive Bayes, and SVM. Report Precision, Recall and F-measure (for the positive categories) obtained with each baseline representation. Also report the average time required to build models in weka for each dataset for each baseline representation. Measurements of running time do not need to be super precise. The goal is for you to notice the differences in running times for the different algorithms.

Discuss the differences between the two baselines and among the different algorithms. Pay particular attention to differences in accuracy and efficiency, and what may have caused them.

Experiment #2: Feature selection

Test the effects of different numbers of features in baseline #1. Try five different sizes (numbers of features). Test these new representations using the datasets and learning algorithms tested in experiment #1. Are small representations more effective, or are large representations more effective? Does each dataset and/or learning algorithm behave the same way? Report your results in a tabular format. Discuss your results. Does having more features make a difference? If so, is it an important difference? Are the effects similar for small and large classes?

Experiment #3: Your Representations

Develop and test two custom representations of your own design. You can try combining different choices that worked well in experiments 1-2, and/or you can explore different LightSIDE options that weren't explored in experiments 1-2. Test these new representations using the datasets and algorithms tested in experiment #2. Discuss your reasons for developing each representation, and how well they worked.



You must describe your work and your analysis of the experimental results in a written report. Your analysis is a significant part of the grade, so be sure to leave enough time to do a good job.

A report template is provided in Microsoft Word and pdf formats. Your report must follow this template, and be in pdf format. Name your report AndrewID-HW2.pdf.

The template provides specific instructions about what information to provide for each experiment. However, generally speaking, you should discuss any trends that you observe about what works well or doesn't; dataset-specific characteristics; or algorithm-specific characteristics. Discuss whether the different choices worked as you expected, or whether there were surprises. If things didn't work as you expected, what might be the causes?

CSV File

Through the use of the DNDW script you will have generated a "results.csv" file that logs your experiment results. It can be found in the same directory as the DNDW script. Turn this file in on Blackboard.


Submit your report and CSV file via Blackboard before the deadline.




Copyright 2015, Carnegie Mellon University.
Updated on March 08, 2016
Jamie Callan