Using LingPipe to Classify Text


Learn how to use LingPipe for text classification in this tutorial by Richard M Reese, the author of several Java books and a C Pointer book.
Various Natural Language Processing (NLP) APIs can be used to perform text classification, which assigns text to predefined categories. You can use OpenNLP, the Stanford API, or LingPipe to demonstrate the various classification approaches. However, this article will focus on LingPipe as it offers several different classification approaches.
You’ll learn how to use LingPipe to demonstrate a number of classification tasks, including general text classification using trained models, sentiment analysis, and language identification. The necessary code files for this article can be found at https://github.com/PacktPublishing/Natural-Language-Processing-with-Java-Second-Edition/tree/master/Chapter08.
LingPipe comes with training data for several categories. The categories array contains the names of the categories packaged with LingPipe:

String[] categories = {"soc.religion.christian", "talk.religion.misc", "alt.atheism", "misc.forsale"};

The DynamicLMClassifier class is used to perform the actual classification. It is created using the categories array, giving it the names of the categories to use. The nGramSize value specifies the number of contiguous items in a sequence used in the model for classification purposes:

int nGramSize = 6;
DynamicLMClassifier<NGramProcessLM> classifier = DynamicLMClassifier.createNGramProcess(categories, nGramSize);
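
To make the nGramSize parameter concrete, the following standalone sketch lists the n-grams of a short string. It assumes that the "items" are characters, which is how LingPipe's NGramProcessLM models text. The NGramDemo class and its charNGrams method are illustrative names and are not part of LingPipe:

```java
import java.util.ArrayList;
import java.util.List;

class NGramDemo {

    // Returns every run of n contiguous characters in text, in order.
    // A text shorter than n yields an empty list.
    static List<String> charNGrams(String text, int n) {
        List<String> grams = new ArrayList<>();
        for (int start = 0; start + n <= text.length(); start++) {
            grams.add(text.substring(start, start + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        // prints [for sa, or sal, r sale]
        System.out.println(charNGrams("for sale", 6));
    }
}
```

With a size of 6, each n-gram spans spaces and punctuation as well as letters, so the model sees fragments that cross word boundaries.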

Training text using the Classified class
General text classification using LingPipe involves training the DynamicLMClassifier class using training files and then using the class to perform the actual classification. LingPipe comes with several training datasets, as found in the LingPipe directory named demos/data/fourNewsGroups/4news-train. You’ll use these datasets to illustrate the training process. This example is a simplified version of the process found at http://alias-i.com/lingpipe/demos/tutorial/classify/read-me.html. Start by declaring the trainingDirectory:

String directory = ".../demos";
File trainingDirectory = new File(directory + "/data/fourNewsGroups/4news-train");

In the trainingDirectory, there are four subdirectories whose names are listed in the categories array. In each subdirectory, there is a series of files with numeric names. These files contain newsgroup data (http://qwone.com/~jason/20Newsgroups/) on the topic named by their subdirectory.
The process of training the model involves using each file and category with the DynamicLMClassifier class’s handle method. The method will use the file to create a training instance for the category and then augment the model with this instance. The process uses nested for loops.
The outer for loop creates a File object using the directory’s name and then applies the list method against it. The list method returns a list of the files in the directory. The names of these files are stored in the trainingFiles array, which will be used in the inner for loop:

for (int i = 0; i < categories.length; ++i) {
    File classDir = new File(trainingDirectory, categories[i]);
    String[] trainingFiles = classDir.list();
    // Inner for-loop
}

The inner for loop, as shown in the following code, will open each file and read the text from the file. The Classification class represents a classification with a specified category. It is used with the text to create a Classified instance. The DynamicLMClassifier class’s handle method updates the model with the new information:

for (int j = 0; j < trainingFiles.length; ++j) {
    try {
        File file = new File(classDir, trainingFiles[j]);
        String text = Files.readFromFile(file, "ISO-8859-1");
        Classification classification = new Classification(categories[i]);
        Classified<CharSequence> classified = new Classified<>(text, classification);
        classifier.handle(classified);
    } catch (IOException ex) {
        // Handle exceptions
    }
}

The Files class used here must be com.aliasi.util.Files rather than a standard Java class such as java.nio.file.Files; otherwise, the readFromFile method will not be available.
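
The directory walk performed by the two loops can be tried on its own, without LingPipe. The sketch below is a stand-in under stated assumptions: it reads files with the standard java.nio.file.Files class instead of LingPipe's readFromFile, and the TrainingData class and its load method are hypothetical names. It also tolerates a missing category subdirectory, a case in which File.list returns null and the original loop would fail:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class TrainingData {

    // Maps each category to the texts of the regular files in trainingDir/<category>/.
    // A category without a subdirectory maps to an empty list.
    static Map<String, List<String>> load(Path trainingDir, String[] categories)
            throws IOException {
        Map<String, List<String>> textsByCategory = new LinkedHashMap<>();
        for (String category : categories) {
            List<String> texts = new ArrayList<>();
            Path classDir = trainingDir.resolve(category);
            if (Files.isDirectory(classDir)) {
                List<Path> files;
                try (Stream<Path> stream = Files.list(classDir)) {
                    // Sort so that the order is the same on every run.
                    files = stream.sorted().collect(Collectors.toList());
                }
                for (Path file : files) {
                    if (Files.isRegularFile(file)) {
                        byte[] bytes = Files.readAllBytes(file);
                        texts.add(new String(bytes, StandardCharsets.ISO_8859_1));
                    }
                }
            }
            textsByCategory.put(category, texts);
        }
        return textsByCategory;
    }

    public static void main(String[] args) throws IOException {
        // The default path is a placeholder; pass the 4news-train directory as an argument.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        String[] categories = {"soc.religion.christian", "talk.religion.misc",
            "alt.atheism", "misc.forsale"};
        load(dir, categories).forEach((category, texts) ->
            System.out.println(category + ": " + texts.size() + " files"));
    }
}
```

Each text returned by load would then be wrapped in a Classified instance and passed to the classifier's handle method, as in the inner loop.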
The classifier can be serialized for later use. The AbstractExternalizable class is a utility class that supports the serialization of objects. It has a static compileTo method that accepts a Compilable instance and a File object and writes the object to the file. Here, the call is AbstractExternalizable.compileTo(classifier, new File("classifier.model")), placed in a try block that catches IOException.

The loading of the classifier will be illustrated in the Classifying text using LingPipe section.

Using other training categories
Other newsgroup data can be found at http://qwone.com/~jason/20Newsgroups/. These collections of data can be used to train other models, as listed in the following table. Although there are only 20 categories, they can be useful for training models. Three different downloads are available. Some have been sorted, and in others, duplicate data has been removed.

Classifying text using LingPipe
To classify text, use the DynamicLMClassifier class’s classify method. Demonstrate its use with two different text sequences:

  • forSale: This is from http://www.homes.com/for-sale/, where you use the first complete sentence
  • martinLuther: This is from http://en.wikipedia.org/wiki/Martin_Luther, where you use the first sentence of the second paragraph
These strings are declared here:


String forSale =
    "Finding a home for sale has never been "
     + "easier. With Homes.com, you can search new "
     + "homes, foreclosures, multi-family homes, "
     + "as well as condos and townhouses for sale. "
     + "You can even search our real estate agent "
     + "directory to work with a professional "
     + "Realtor and find your perfect home.";
String martinLuther =
    "Luther taught that salvation and subsequently "
    + "eternity in heaven is not earned by good deeds "
    + "but is received only as a free gift of God's "
    + "grace through faith in Jesus Christ as redeemer "
    + "from sin and subsequently eternity in Hell.";

To reuse the classifier that is serialized in the previous section, use the AbstractExternalizable class’s readObject method, as shown in the following code. Use the LMClassifier class instead of the DynamicLMClassifier class. They both support the classify method but the DynamicLMClassifier class cannot be readily serialized:

LMClassifier classifier = null;
try {
    classifier = (LMClassifier)
        AbstractExternalizable.readObject(new File("classifier.model"));
} catch (IOException | ClassNotFoundException ex) {
    // Handle exceptions
}

In the following code sequence, apply the LMClassifier class’s classify method. This returns a JointClassification instance, which you can use to determine the best match:

String text = forSale; // or martinLuther
JointClassification classification = classifier.classify(text);
System.out.println("Text: " + text);
String bestCategory = classification.bestCategory();
System.out.println("Best Category: " + bestCategory);

For the forSale text, you’ll get the following output:

Text: Finding a home for sale has never been easier. With Homes.com, you can search new homes, foreclosures, multi-family homes, as well as condos and townhouses for sale. You can even search our real estate agent directory to work with a professional Realtor and find your perfect home.
Best Category: misc.forsale

For the martinLuther text, you’ll get the following output:

Text: Luther taught that salvation and subsequently eternity in heaven is not earned by good deeds but is received only as a free gift of God’s grace through faith in Jesus Christ as redeemer from sin and subsequently eternity in Hell.
Best Category: soc.religion.christian

Both texts were classified correctly.
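
For intuition about how a classifier of this kind arrives at a best category, here is a toy that can be run without LingPipe. It is not LingPipe's algorithm: it swaps in a much simpler add-one-smoothed bag of character n-grams, and every name in it (ToyNGramClassifier, train, score, bestCategory) is made up for this sketch. The two one-line training texts are also invented for illustration:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class ToyNGramClassifier {
    private final int n;
    private final Map<String, Map<String, Integer>> countsByCategory = new HashMap<>();
    private final Map<String, Integer> totalByCategory = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();

    ToyNGramClassifier(int n) {
        this.n = n;
    }

    // Adds the character n-grams of one training text to its category's counts.
    void train(String category, String text) {
        Map<String, Integer> counts =
            countsByCategory.computeIfAbsent(category, c -> new HashMap<>());
        for (int i = 0; i + n <= text.length(); i++) {
            String gram = text.substring(i, i + n);
            counts.merge(gram, 1, Integer::sum);
            totalByCategory.merge(category, 1, Integer::sum);
            vocabulary.add(gram);
        }
    }

    // Sum of the add-one-smoothed log probabilities of the text's n-grams.
    double score(String category, String text) {
        Map<String, Integer> counts =
            countsByCategory.getOrDefault(category, new HashMap<>());
        double denominator =
            totalByCategory.getOrDefault(category, 0) + vocabulary.size() + 1;
        double logProbability = 0.0;
        for (int i = 0; i + n <= text.length(); i++) {
            int count = counts.getOrDefault(text.substring(i, i + n), 0);
            logProbability += Math.log((count + 1) / denominator);
        }
        return logProbability;
    }

    // Returns the trained category with the highest score, or null if untrained.
    String bestCategory(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String category : countsByCategory.keySet()) {
            double s = score(category, text);
            if (s > bestScore) {
                bestScore = s;
                best = category;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        ToyNGramClassifier toy = new ToyNGramClassifier(3);
        toy.train("misc.forsale", "bike for sale, good price, sale ends soon");
        toy.train("soc.religion.christian", "faith and grace in the church");
        // prints misc.forsale
        System.out.println(toy.bestCategory("sale price"));
    }
}
```

The category whose training text shares the most n-grams with the input gets the highest score, which is the same intuition behind picking the best category from a JointClassification.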

Sentiment analysis using LingPipe

Sentiment analysis is performed in a very similar manner to general text classification. One difference is that it uses only two categories: positive and negative.
You need to use data files to train your model. Use a simplified version of the sentiment analysis performed at http://alias-i.com/lingpipe/demos/tutorial/sentiment/read-me.html using sentiment data that was developed for movies (http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz). This data was developed from 1,000 positive reviews and 1,000 negative reviews of movies that are in IMDb’s movie archives.
These reviews need to be downloaded and extracted. A txt_sentoken directory will be extracted along with its two subdirectories: neg and pos. Both subdirectories contain movie reviews. Although some of these files can be held in reserve to evaluate the model that was created, you can use all of them to simplify the explanation.
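
If you do want to hold some files in reserve, one simple option is to set aside every k-th file. The sketch below shows that idea only; HoldOutSplit and split are hypothetical names, the choice of k is yours, and the caller is assumed to have sorted the file names first, because File.listFiles does not guarantee any order:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class HoldOutSplit {

    // Returns two lists: index 0 holds the training items, index 1 the held-out items.
    // Every k-th item (the k-th, 2k-th, and so on) is held out.
    static <T> List<List<T>> split(List<T> items, int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        List<T> train = new ArrayList<>();
        List<T> test = new ArrayList<>();
        for (int i = 0; i < items.size(); i++) {
            if ((i + 1) % k == 0) {
                test.add(items.get(i));
            } else {
                train.add(items.get(i));
            }
        }
        return Arrays.asList(train, test);
    }

    public static void main(String[] args) {
        // prints [[a, b, c, d], [e]]
        System.out.println(split(Arrays.asList("a", "b", "c", "d", "e"), 5));
    }
}
```

The held-out files would be classified after training and their predicted categories compared with the names of the directories they came from.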
Start by reinitializing the variables declared in the Using LingPipe to classify text section. The categories array is set to a two-element array to hold the two categories. The classifier variable is assigned a new DynamicLMClassifier instance using the new categories array and an nGramSize of 8:

categories = new String[2];
categories[0] = "neg";
categories[1] = "pos";
nGramSize = 8;
classifier = DynamicLMClassifier.createNGramProcess(categories, nGramSize);

Create a series of instances based on the content found in the training files. Here, there are only two categories to process:

String directory = "...";
File trainingDirectory = new File(directory, "txt_sentoken");
for (int i = 0; i < categories.length; ++i) {
   Classification classification = new Classification(categories[i]);
   File file = new File(trainingDirectory, categories[i]);
   File[] trainingFiles = file.listFiles();
   for (int j = 0; j < trainingFiles.length; ++j) {
      try {
         String review = Files.readFromFile(trainingFiles[j], "ISO-8859-1");
         Classified<CharSequence> classified = new Classified<>(review, classification);
         classifier.handle(classified);
      } catch (IOException ex) {
         ex.printStackTrace();
      }
   }
}

The model is now ready to be used. Test it with a review of the movie Forrest Gump:

String review = "An overly sentimental film with a somewhat "
+ "problematic message, but its sweetness and charm "
+ "are occasionally enough to approximate true depth "
+ "and grace. ";

Use the classify method to perform the actual work. It returns a Classification instance whose bestCategory method returns the best category, as shown here:

Classification classification = classifier.classify(review);
String bestCategory = classification.bestCategory();
System.out.println("Best Category: " + bestCategory);

When executed, you get the following output:

Best Category: pos

This approach will also work well for other categories of text.

Language identification using LingPipe

LingPipe comes with a model called langid-leipzig.classifier, which is trained for several languages and is found in the demos/models directory. The following table contains a list of supported languages. This model was developed using training data derived from the Leipzig Corpora Collection (http://corpora.uni-leipzig.de/).

To use this model, you can use the same code used in the Classifying text using LingPipe section. Start with the same movie review of Forrest Gump:

String text = "An overly sentimental film with a somewhat "
+ "problematic message, but its sweetness and charm "
+ "are occasionally enough to approximate true depth "
+ "and grace. ";
System.out.println("Text: " + text);

The LMClassifier instance is created using the langid-leipzig.classifier file:

LMClassifier classifier = null;
try {
    classifier = (LMClassifier)
        AbstractExternalizable.readObject(new File(".../langid-leipzig.classifier"));
} catch (IOException | ClassNotFoundException ex) {
    // Handle exceptions
}

The classify method is used, followed by the application of the bestCategory method, to obtain the best language fit, as shown here:

Classification classification = classifier.classify(text);
String bestCategory = classification.bestCategory();
System.out.println("Best Language: " + bestCategory);

The output is as follows, with English being chosen as the language:

Text: An overly sentimental film with a somewhat problematic message, but its sweetness and charm are occasionally enough to approximate true depth and grace.
Best Language: en

The following code example uses the first sentence of the Swedish-language Wikipedia entry on the Swedish language (http://sv.wikipedia.org/wiki/Svenska) for the text:

text = "Svenska är ett östnordiskt språk som talas av cirka "
+ "tio miljoner personer[1], främst i Finland "
+ "och Sverige.";

The output, as shown here, correctly selects the Swedish language:

Text: Svenska är ett östnordiskt språk som talas av cirka tio miljoner personer[1], främst i Finland och Sverige.
Best Language: se

Training can be conducted using the same method as the previous LingPipe models. Another consideration when performing language identification is that the text may be written in multiple languages, which can further complicate the language detection process.
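
One way to cope with mixed-language text, offered here only as a sketch, is to classify each sentence separately. The code below assumes a deliberately naive sentence splitter and takes the classifier as a function, so it runs without LingPipe; PerSentenceLanguage and identify are hypothetical names, and the lambda in main is a stand-in for a real model:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

class PerSentenceLanguage {

    // Splits text after '.', '!' or '?' followed by whitespace (a naive rule)
    // and maps each sentence to the label returned by the supplied classifier.
    // Identical sentences share a single map entry.
    static Map<String, String> identify(String text, Function<String, String> classifier) {
        Map<String, String> languageBySentence = new LinkedHashMap<>();
        for (String sentence : text.trim().split("(?<=[.!?])\\s+")) {
            if (!sentence.isEmpty()) {
                languageBySentence.put(sentence, classifier.apply(sentence));
            }
        }
        return languageBySentence;
    }

    public static void main(String[] args) {
        // Stand-in classifier for demonstration only; it is not a language model.
        Function<String, String> fake = s -> s.startsWith("Svenska") ? "se" : "en";
        System.out.println(identify("Swedish is a language. Svenska talas i Sverige.", fake));
    }
}
```

With the LMClassifier loaded earlier, the function could be s -> classifier.classify(s).bestCategory(). Short sentences give the model less evidence to work with, so per-sentence results tend to be less reliable than a result for a whole document.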
If you found this article interesting, you can read Richard M Reese’s Natural Language Processing with Java to explore various approaches to organizing and extracting useful text from unstructured data using Java. This book will help you automatically organize text using approaches such as full-text search, proper name recognition, clustering, tagging, information extraction, and summarization.
