Homework: Text Categorization

Text Analytics:
95-865 (K)

Homework #2: Text Categorization
Due Apr 1, 9:59pm (Adelaide time)

This assignment gives you hands-on experience with several ways of forming text representations, several popular categorization algorithms, and three common types of data. The three datasets provide exposure to newswire, medical abstract, and social media content.

This assignment consists of five major parts:

Install LightSIDE on your computer;
Install weka on your computer;
Use LightSIDE to construct several text representations for each dataset;
Use weka to do evaluate several different machine learning algorithms for each text representation; and
Write a report that discusses your findings and experience with this assignment.

The report is an important part of your grade. Leave enough time to do a good job on it.

LightSIDE

LightSIDE is an open-source software suite for developing and testing text representations. It is available for Windows, Mac, and Linux operating systems. LightSIDE supports several common methods of forming text features (e.g., unigram, bigram, trigram, phrases, stemming). It also includes an integrated version of weka for testing text representations, however we won't be using that for this homework assignment.

Download LightSIDE and install it on your computer.

Read the LightSIDE Researcher's Manual that comes with the software to familiarize yourself with creating text representations for the sample datasets included with LightSIDE.

Weka

weka is a popular open-source software suite for text and data mining that is available for Windows, Mac, and Linux operating systems. Weka supports a variety of categorization and clustering algorithms within a common GUI (and programmable API), which makes it extremely convenient.

Download weka and install it on your computer.

Read the weka tutorial to familiarize yourself with using it to do text classification.

Test your installation as described below.

Navigate to weka's Explorer application.
Load this sample data file (on the Preprocess tab, Open file).
Choose the Naive Bayes classifier. (on the Classify tab, Choose->bayes/NaiveBayes)
Under test options, choose Cross-validation with 10 folds
In the drop-down menu, make sure the last feature "(Nom) class" is selected.
Click start

Verify that you got the expected results:

    Correctly Classified Instances    4538  75.6333 %
    Incorrectly Classified Instances  1462  24.3667 %

Datasets

You will be working with preprocessed forms of three datasets, as described below. Each dataset is provided in a CSV format that can be imported into LightSIDE.

Reuters-21578

Reuters-21578 is a well-known newswire dataset. It contains 21,578 newswire documents, so it is now considered too small for serious research and development purposes. However, the text and categories are similar to text and categories used in industry.

There are many Reuters categories. In this assignment you will consider 6 categories. Two are 'big' categories (many positive documents), two are 'medium' categories, and two are 'small' categories (few positive documents).

Small categories: heat.csv, housing.csv
Medium categories: coffee.csv, gold.csv
Big categories: acq.csv, earn.csv

Download reuters-allcat-6.zip. It contains 1 csv file that covers all 6 classes.

OHSUMED

OHSUMED is a well-known medical abstracts dataset. It contains 348,566 references, and is still used for research and development.

There are many OHSUMED categories. In this assignment you will consider 6 categories. Two are 'big' categories (many positive documents), two are 'medium' categories, and two are 'small' categories (few positive documents).

Small categories: Mitosis.csv, Pediatrics.csv
Medium categories: Necrosis.csv, Hyperplasia.csv
Big categories: Pregnancy.csv, Rats.csv

Download ohsumed-allcats-6.zip. It contains 1 csv file that covers all 6 classes.

Epinions

Epinions.com is a website where people can post reviews of products and services. It covers a wide variety of topics. For this homework assignment, we downloaded a set of 12,000 posts about digital cameras and cars.

epinions.zip contains a directory with two csv files

Car category: epinions-1.csv. This dataset contains 6,000 messages. 50% are about automobiles. 50% are about another topic.
Ford category: epinions-2.csv. This dataset contains 6,000 messages. 50% are about Ford automobiles. 50% are about other automobiles.

Learning Algorithms

Conduct your experiments with the following three learning algorithms provided by Weka.

J48 decision trees: trees/J48
Naive Bayes: Bayes/NaiveBayes
SVM: functions/LibSVM. The default configuration uses a RBF kernel. Change the configuration to use a linear kernel. Click on the line that shows the LibSVM parameters:
LibSVM -S 0 -K 2 -D 3 ...
Change kernelType to "linear", and click OK. You will see that the LibSVM parameter -K is now set to 0. This has already been done for you in the automation scripts provided, but you will still need to do this for the Weka GUI program.LibSVM doesn't come with Weka by default, so you will need to download our automation script below and follow the automation guide to install it.

Automation

This homework requires you to train a large number of classifiers. You must do Experiment #1 manually so that you gain some experience with Weka. However, you may use scripts (e.g., dndw_v1.zip) to automate Homeworks #2 and #3, which will allow you to spend most of your time analyzing experimental results instead of running Weka. If you try to run all of the experiments manually, it will be tedious and time-consuming.

See the Weka Automation Workflow Guide and the Brief LightSide Tutorial for more information. Questions about the automation scripts should be posted on Piazza. While we have done testing on various computer setups before delivering this automation solution, it is still possible that something might go wrong. For this reason, we strongly recommend starting on this homework earlier so that any technical difficulties can be resolved early.

Experiments

This assignment can be viewed as creating a set of classifiers that decide whether or not document d should be assigned to category c. Each classifier is solving a two-class problem, where the classes are "positive" (assign to category) and "negative" (don't assign to category). This is typical for problems that truly are two-category problems (e.g., the epinions data) and problems where multiple categories can be assigned to each document (e.g., the Reuters and OHSUMED data).

In these datasets, a person has already assigned each document to one or more categories ("ground truth" or "gold standard" labels). The machine learning software will use some of these documents as training data to learn a classifier; the remaining documents will be used to test the accuracy of the learned classifiers.

The first experiment creates a set of baseline classifiers for three datasets. The second experiment tests the effect of varying the numbers of features. The third experiment allows you to test your own ideas about how to form features. Each experiment is described in more detail, below.

When you report experimental results, if your values are in the range 0-100, only provide 1 decimal point of precision in your measurements (e.g., 96.1); if your values are in the range 0-1, only provide 3 decimal points of precision in your measurements (e.g., 0.961). Greater accuracy is unnecessary for this task.

In all of your experiments, be sure to report Precision, Recall and F-measure for the positive category (e.g., for the Reuters dataset, acq, coffee, earn, etc). For small categories, the classifier may be better at recognizing what is not in the category than what is in the category. Make sure that you are reporting the correct Precision and Recall values.

Experiment #1: Baselines

Create two baseline representations for each dataset. The baseline representations are defined as follows.

baseline 1: unigrams, binary features (the default), frequency threshold=10.

If this baseline is too big for weka to handle on your laptop, you can reduce its size, for example, by pruning out any feature with a kappa value of 0.001 or less. If you prune your baseline feature set, be sure to discuss this in your report.
baseline 2: unigrams, binary features, frequency threshold=10, just the top 10 features per class, as determined by kappa. See Chapter 5 (Data Restructuring) in the Lightside Researcher's Manual to learn about filtering features.

Export the baseline representations to arff files.

Test your baseline representations using J48, Naive Bayes, and SVM. Report Precision, Recall and F-measure (for the positive categories) obtained with each baseline representation. Also report the average time required to build models in weka for each dataset for each baseline representation. Measurements of running time do not need to be super precise. The goal is for you to notice the differences in running times for the different algorithms.

Discuss the differences among the two baselines and the different algorithms. Pay particular attention to differences in accuracy and efficiency, and what may have caused them.

Experiment #2: Feature selection

Test the effects of different numbers of features in baseline #1. Try five different sizes (numbers of features). Test these new representations using the datasets and learning algorithms tested in experiment #1. Are small representations more effective, or are large representations more effective? Does each dataset and/or learning algorithm behave the same way? Report your results in a tabular format. Discuss your results. Does having more features make a difference? If so, is it an important difference? Are the effects similar for small and large classes?

Experiment #3: Your Representations

Develop and test two custom representations of your own design. You can try combining different choices that worked well in experiments 1-2, and/or you can explore different LightSIDE options that weren't explored in experiments 1-2. Test these new representations using the datasets and algorithms tested in experiment #2. Discuss your reasons for developing each representation, and how well they worked.

Report

You must describe your work and your analysis of the experimental results in a written report. Your analysis is a significant part of the grade, so be sure to leave enough time to do a good job.

A report template is provided in Microsoft Word and pdf formats. Your report must follow this template, and be in pdf format. Name your report AndrewID-HW2.pdf.

The template provides specific instructions about what information to provide for each experiment. However, generally speaking, you should discuss any trends that you observe about what works well or doesn't; database-specific characteristics; or algorithm-specific characteristics. Discuss whether the different choices work as you expected, or whether there surprises. If things didn't work as you expected, what might be the causes?

CSV File

Through the use of the DNDW script you will have generated a "results.csv" file that logs your experiment results. It can be found in the same directory as the DNDW script. Turn this file in on Blackboard.

Submission

Submit your report and CSV file by Blackboard before the deadline.

FAQ

This seems like a lot of datasets. Do I really need to use them all?
The goal is to let you see how text categorization behaves under different situations. There is a lot of variation in text categorization experiments. You can only see that by using several datasets, several learning algorithms, and several experiments. Fortunately, you only need to make a few clicks to convert text into features, load features into weka, and run an experiment in weka. Then, you read the values out of weka and enter them in a table. Easy!
There are several different types of Naive Bayes and Decision Tree algorithms. Which should I use? Can I make other choices?
You must submit results for the basic Naive Bayes classifier and the J48 Decision Tree package. You are welcome (and encouraged) to try other algorithms, as long as you explain and justify your choice.
Some of the experiments are taking a really long time. Is this normal?
Yes. Weka is very convenient, but it is not especially efficient, and you are running it on a laptop computer, which is not especially powerful. Naive Bayes should run the quickest, followed by decision trees.

If speed is becoming a major problem for you, you may generate a "reduced" baseline 1, for example using 25-50% of the features. Be sure to discuss this in your report.
I'm running out of memory for some experiments.
The default maximum memory for the JVM is rather small; you may need to increase the memory available for Java. Please refer to this page for help.

These experiments should run on a machine with 2GB of RAM. If Weka refuses to launch with the new memory settings, shut down other programs that might be consuming memory. This should free up more memory that you can allocate to the Java virtual machine. It may also be helpful to shut down weka between experiments, so that it is forced to free memory from old experiments.

Weka and LightSide are not very good about memory management. Shut down other applications while you are running Weka and LightSide. If you continue to have problems, exit Weka and LightSide between experiments.

Finally, as a last resort, you may edit the .ARFF file to produce a smaller dataset that will fit in memory on your PC (fewer instances). This will affect the quality of your experimental results, so do this only as a last resort.

Be sure to mention any reduction of the data set in your write-up, so that we understand how your experimental results were obtained.
I have lots of memory on my computer, can I make weka go faster?
The default settings for the Weka instances that run through DNDW is set to be at a 2GB max heap size. You may edit the configuration.bat file inside DNDW to increase the SET maxheap=2048M value to something larger than 2GB (note the new number has to be a power of 2 eg. 4096M).
The default install of Weka 3.6.8 doesn't seem to include LibSVM.
Download LibSVM. Copy the jar file into your Weka directory. Make sure that your CLASSPATH environment variable includes that directory.
The epinions data seems to have four categories (auto, camera, Ford, OtherCar), but the tables in the template have only two rows for categories. I am confused about what values to report.
The instructions say to use the auto and Ford categories; these are the positive classes for Homework 2. You don't report results for the negative classes. In this case the negative classes correspond to meaningful classes, but we are ignoring that for Homework 2.
For Weka on Windows, I'm having trouble installing LibSVM.
If changes in your RunWeka.bat file are not having an effect, try modifying the RunWeka.ini file. You may also find it useful to consult Weka's LibSVM wikispaces page LibSVM, which addresses common problems and offers useful advice.

Jamie Callan

Homework #2: Text Categorization Due Apr 1, 9:59pm (Adelaide time)