Second main course: Classifiers

Classifiers allow us to classify, i.e., label, previously uncharacterized data. In Jstacs, we provide the abstract class AbstractClassifier, which declares three important methods among several others.

The first method trains a classifier, i.e., it adjusts the classifier to the training data:

public void train( DataSet... s ) throws Exception {

The second method classifies a given Sequence:

public abstract byte classify( Sequence seq ) throws Exception;

If we want to classify, for instance, the first sequence of a data set, we might use

System.out.println( cl.classify( data[0].getElementAt(0) ) );

In addition to this method, another method classify(DataSet) exists that performs a classification for all Sequences in a DataSet.

The third method allows for assessing the performance. Typically, this is done on test data:

public final ResultSet evaluate( PerformanceMeasureParameterSet params, boolean exceptionIfNotComputeable, DataSet... s ) throws Exception {

where params is a ParameterSet of performance measures (cf. subsection #Performance measures), exceptionIfNotComputeable indicates whether an exception should be thrown if a performance measure could not be computed, and s is an array of data sets, where entry i contains the data of class i.

The abstract sub-class AbstractScoreBasedClassifier of AbstractClassifier adds an additional method for computing a joint score for an input Sequence and a given class:

public double getScore( Sequence seq, int i ) throws Exception {

Similar to the classify method, a variant for complete data sets exists: for two-class problems, the method

public double[] getScores( DataSet s ) throws Exception {

allows for computing the score differences between the foreground and the background class for all Sequences in the DataSet s. Such scores are typically the sum of the a-priori class log-score or log-probability and the score returned by getLogScoreFor of SequenceScore or getLogProbFor of StatisticalModel.
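The relation between joint scores, classification, and score differences can be sketched in plain Java. This is a minimal illustration and not Jstacs code: the class ScoreSketch and its methods are hypothetical, the per-class log-scores are made-up numbers, and we assume that class 0 is the foreground and class 1 the background.

```java
/** Hypothetical stand-alone sketch; not part of Jstacs. */
class ScoreSketch {

    /** Joint log-score of class i: log class prior plus the model's log-score. */
    static double jointScore(double[] logClassPrior, double[] modelLogScore, int i) {
        return logClassPrior[i] + modelLogScore[i];
    }

    /** Index of the class with the highest joint score (the first class wins on ties). */
    static byte classify(double[] logClassPrior, double[] modelLogScore) {
        int best = 0;
        for (int i = 1; i < modelLogScore.length; i++) {
            if (jointScore(logClassPrior, modelLogScore, i)
                    > jointScore(logClassPrior, modelLogScore, best)) {
                best = i;
            }
        }
        return (byte) best;
    }

    /** Two-class score difference: class 0 (assumed foreground) minus class 1. */
    static double scoreDifference(double[] logClassPrior, double[] modelLogScore) {
        return jointScore(logClassPrior, modelLogScore, 0)
                - jointScore(logClassPrior, modelLogScore, 1);
    }

    public static void main(String[] args) {
        double[] logPrior = { Math.log(0.5), Math.log(0.5) };
        double[] logScore = { -12.0, -15.5 }; // made-up per-class log-scores of one sequence
        System.out.println("predicted class: " + classify(logPrior, logScore));
        System.out.println("score difference: " + scoreDifference(logPrior, logScore));
    }
}
```

A positive score difference corresponds to a prediction of class 0 in this sketch, so thresholding the differences at other values trades sensitivity against specificity.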

Sometimes data is not split into test and training data, for instance because only a limited amount of data is available. In such cases, it is recommended to use a repeated procedure that splits the data, trains on one part, and classifies the other part. In Jstacs, we provide the abstract class ClassifierAssessment that allows for implementing such procedures. In subsection #Assessment, we describe how to use ClassifierAssessment and its extensions.

But first, we focus on classifiers. Any classifier in Jstacs is an extension of AbstractClassifier. In this section, we present two concrete implementations, namely TrainSMBasedClassifier (cf. subsection #TrainSMBasedClassifier) and GenDisMixClassifier (cf. subsection #GenDisMixClassifier).

TrainSMBasedClassifier

The class TrainSMBasedClassifier implements a classifier based on TrainableStatisticalModels, i.e., for each class the classifier holds a TrainableStatisticalModel.

If we want to build a binary classifier using a PWM for each class, we first create a PWM that is a TrainableStatisticalModel.


TrainableStatisticalModel pwm = TrainableStatisticalModelFactory.createPWM( alphabet, 10, 4.0 );


Then we can use this instance to create the classifier using


AbstractClassifier cl = new TrainSMBasedClassifier( pwm, pwm );


Thereby, we do not need to clone the PWM instance, as this is done internally for safety reasons. If we want to build a classifier that distinguishes between [math]N[/math] classes, we use the same constructor but provide [math]N[/math] TrainableStatisticalModels.

If we train a TrainSMBasedClassifier, the train method of each internally used TrainableStatisticalModel is called. For classifying a sequence, the TrainSMBasedClassifier calls getLogProbFor of the internally used TrainableStatisticalModels and incorporates a class weight.
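To make the train-then-classify idea concrete, the following plain-Java sketch estimates one PWM per class from counts plus pseudocounts and classifies by the largest sum of log class weight and log-probability. It is only an illustration under stated assumptions: PwmSketch is a hypothetical class, not the Jstacs implementation, symbols are encoded as integers, and the equivalent sample size is assumed to be spread uniformly over the symbols of each position.

```java
import java.util.Arrays;

/** Hypothetical stand-alone PWM sketch; it does not use or mirror Jstacs classes. */
class PwmSketch {

    private final double[][] logProb; // logProb[position][symbol]

    /** Estimates position-wise symbol probabilities from counts plus pseudocounts. */
    PwmSketch(int[][] sequences, int alphabetSize, double ess) {
        int length = sequences[0].length;
        double pseudoCount = ess / alphabetSize; // assumption: ESS spread uniformly over symbols
        double total = sequences.length + ess;
        logProb = new double[length][alphabetSize];
        for (int pos = 0; pos < length; pos++) {
            double[] counts = new double[alphabetSize];
            Arrays.fill(counts, pseudoCount);
            for (int[] seq : sequences) {
                counts[seq[pos]] += 1.0;
            }
            for (int a = 0; a < alphabetSize; a++) {
                logProb[pos][a] = Math.log(counts[a] / total);
            }
        }
    }

    /** Log-probability of a sequence: sum of the position-wise log-probabilities. */
    double logProbFor(int[] seq) {
        double sum = 0.0;
        for (int pos = 0; pos < seq.length; pos++) {
            sum += logProb[pos][seq[pos]];
        }
        return sum;
    }

    /** Returns the class whose log class weight plus model log-probability is largest. */
    static int classify(PwmSketch[] models, double[] logClassWeight, int[] seq) {
        int best = 0;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < models.length; i++) {
            double score = logClassWeight[i] + models[i].logProbFor(seq);
            if (score > bestScore) {
                bestScore = score;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Symbols encoded as 0=A, 1=C, 2=G, 3=T; toy data of length 2.
        int[][] foreground = { {0, 1}, {0, 1}, {0, 1}, {0, 2} };
        int[][] background = { {3, 3}, {2, 3}, {3, 0}, {1, 3} };
        PwmSketch[] models = {
            new PwmSketch(foreground, 4, 4.0), new PwmSketch(background, 4, 4.0)
        };
        double[] logClassWeight = { Math.log(0.5), Math.log(0.5) };
        System.out.println(classify(models, logClassWeight, new int[] {0, 1}));
        System.out.println(classify(models, logClassWeight, new int[] {3, 3}));
    }
}
```

Because each class model is estimated only from the data of its own class, this kind of training is generative; the class weights enter only at classification time.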

GenDisMixClassifier

The class GenDisMixClassifier implements a classifier using the unified generative-discriminative learning principle to train the internally used DifferentiableStatisticalModels. In analogy to the TrainSMBasedClassifier, the GenDisMixClassifier holds a DifferentiableStatisticalModel for each class.

If we want to build a GenDisMixClassifier, we have to provide the parameters for this classifier:


GenDisMixClassifierParameterSet ps = new GenDisMixClassifierParameterSet( alphabet, 10, (byte) 10, 1E-6, 1E-9, 1, false, KindOfParameter.PLUGIN, true, 2 );


This line of code generates a ParameterSet for a GenDisMixClassifier. It states the used AlphabetContainer, the sequence length, an indicator for the numerical algorithm that is used during training, an epsilon for stopping the numerical optimization, a line epsilon for stopping the line search within the numerical optimization, a start distance for the line search, a switch that indicates whether only the free parameters or all parameters should be used, an enum that indicates the kind of class parameter initialization, a switch that indicates whether normalization should be used during optimization, and the number of threads used during numerical optimization.

If we want to build a binary classifier using a PWM for each class, we create a PWM that is a DifferentiableStatisticalModel.


DifferentiableStatisticalModel pwm2 = new BayesianNetworkDiffSM( alphabet, 10, 4.0, true, new InhomogeneousMarkov(0) );


Now, we are able to build a GenDisMixClassifier that uses the maximum likelihood learning principle.


cl = new GenDisMixClassifier(ps, DoesNothingLogPrior.defaultInstance, LearningPrinciple.ML, pwm2, pwm2 );


In close analogy, we can build a GenDisMixClassifier that uses the maximum conditional likelihood learning principle, if we use LearningPrinciple.MCL.

However, if we want to use a Bayesian learning principle, we have to specify a prior that represents our prior knowledge. One of the most popular priors is the product Dirichlet prior. We can create an instance of this prior using


LogPrior prior = new CompositeLogPrior();


This class utilizes methods of DifferentiableStatisticalModel (cf. getLogPriorTerm() and addGradientOfLogPriorTerm(double[], int)) to provide the correct prior.

Given a prior, we can build a GenDisMixClassifier using, for instance, the maximum supervised posterior learning principle:


cl = new GenDisMixClassifier(ps, prior, LearningPrinciple.MSP, pwm2, pwm2 );


Again in close analogy, we can build a GenDisMixClassifier that uses the maximum a-posteriori learning principle, if we use LearningPrinciple.MAP.

Alternatively, we can build a GenDisMixClassifier that uses the unified generative-discriminative learning principle. To do so, we have to provide a weighting that sums to 1 and represents the weights for the conditional likelihood, the likelihood, and the prior.


cl = new GenDisMixClassifier(ps, prior, new double[]{0.4,0.1,0.5}, pwm2, pwm2 );
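The role of the three weights can be illustrated with a small plain-Java sketch. We assume here that the unified objective is a weighted sum of the log conditional likelihood, the log likelihood, and the log prior, in the order given above; UnifiedObjectiveSketch and its method are hypothetical and do not reproduce the internals of GenDisMixClassifier, and the three log terms in the example are made-up numbers.

```java
/** Hypothetical sketch of a weighted objective; not the Jstacs implementation. */
class UnifiedObjectiveSketch {

    /**
     * Combines the three log terms with the weights
     * beta = {conditional likelihood, likelihood, prior}.
     */
    static double objective(double[] beta, double logCondLik, double logLik, double logPrior) {
        if (beta.length != 3) {
            throw new IllegalArgumentException("exactly three weights are expected");
        }
        double sum = 0.0;
        for (double b : beta) {
            if (b < 0.0) {
                throw new IllegalArgumentException("weights must be non-negative");
            }
            sum += b;
        }
        if (Math.abs(sum - 1.0) > 1e-9) {
            throw new IllegalArgumentException("weights must sum to 1");
        }
        return beta[0] * logCondLik + beta[1] * logLik + beta[2] * logPrior;
    }

    public static void main(String[] args) {
        // Made-up values of the three log terms for some fixed parameter vector.
        double logCondLik = -10.0, logLik = -100.0, logPrior = -2.0;
        System.out.println(objective(new double[] {0.4, 0.1, 0.5}, logCondLik, logLik, logPrior));
        // With all weight on the likelihood, only the likelihood term remains.
        System.out.println(objective(new double[] {0.0, 1.0, 0.0}, logCondLik, logLik, logPrior));
    }
}
```

Under this assumption, the weights {0, 1, 0} leave only the likelihood term and {1, 0, 0} only the conditional likelihood term, while {0, 0.5, 0.5} and {0.5, 0, 0.5} have the same maximizers as the posterior and the supervised posterior, respectively, since a common factor does not change the location of the optimum.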


Performance measures

If we want to assess the performance of any classifier, we have to use the method evaluate (see the beginning of this section). The first argument of this method is a PerformanceMeasureParameterSet that holds the performance measures to be computed. The simplest way to create an instance is


PerformanceMeasureParameterSet measures = PerformanceMeasureParameterSet.createFilledParameters( false, 0.999, 0.95, 0.95, 1 );


which yields an instance with all standard performance measures of Jstacs and the specified parameters. The first argument states that all performance measures should be included. If we changed this argument to true, only numerical performance measures would be included and the returned instance would be a NumericalPerformanceMeasureParameterSet. The other four arguments are parameters for some of the performance measures.

Another way of creating a PerformanceMeasureParameterSet is to directly use performance measures extending the class AbstractPerformanceMeasure. For instance, if we want to use the area under the curve (AUC) for the ROC and the PR curve, we create


AbstractPerformanceMeasure[] m = {new AucROC(), new AucPR()};


Based on this array, we can create a PerformanceMeasureParameterSet that only contains these performance measures.


measures = new PerformanceMeasureParameterSet( m );
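To give an intuition for what the area under the ROC curve measures, the following plain-Java sketch computes it directly from classification scores as the fraction of foreground/background pairs that are ranked correctly, with ties counted as one half. AucSketch is a hypothetical class with made-up scores; the Jstacs class AucROC may compute the value differently, for instance by integrating the curve.

```java
/** Hypothetical sketch of the pairwise (rank-based) AUC-ROC; not the Jstacs AucROC class. */
class AucSketch {

    /**
     * Fraction of (foreground, background) score pairs in which the foreground
     * score is larger; ties contribute one half.
     */
    static double aucRoc(double[] foregroundScores, double[] backgroundScores) {
        double wins = 0.0;
        for (double f : foregroundScores) {
            for (double b : backgroundScores) {
                if (f > b) {
                    wins += 1.0;
                } else if (f == b) {
                    wins += 0.5;
                }
            }
        }
        return wins / ((double) foregroundScores.length * backgroundScores.length);
    }

    public static void main(String[] args) {
        double[] fg = { 3.0, 2.0, 0.5 };  // made-up scores of foreground sequences
        double[] bg = { 1.0, 0.0, -1.0 }; // made-up scores of background sequences
        System.out.println(aucRoc(fg, bg));
    }
}
```

The quadratic loop keeps the sketch short; for large data sets, a sort-based computation would be the more usual choice.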


Assessment

If we want to assess the performance of any classifier based on an array of data sets that is not split into test and training data, we have to use a repeated procedure. In Jstacs, we provide the class ClassifierAssessment, which is the abstract super class of any such procedure. We have already implemented the most widely used procedures (cf. KFoldCrossValidation and RepeatedHoldOutExperiment).

Before performing a ClassifierAssessment, we have to define a set of numerical performance measures. The performance measures have to be numerical to allow for averaging. The simplest way to create such a set is

NumericalPerformanceMeasureParameterSet numMeasures = PerformanceMeasureParameterSet.createFilledParameters();

However, you can choose other measures as described in the previous subsection.

In this subsection, we show by example how to perform a k-fold cross validation in Jstacs. First, we have to create an instance of KFoldCrossValidation. There are several constructors to do so. Here, we use the constructor that accepts AbstractClassifiers.


ClassifierAssessment assessment = new KFoldCrossValidation( cl );


Second, we have to specify the parameters of the KFoldCrossValidation.


KFoldCrossValidationAssessParameterSet params = new KFoldCrossValidationAssessParameterSet( PartitionMethod.PARTITION_BY_NUMBER_OF_ELEMENTS, cl.getLength(), true, 10 );


These parameters are the partition method, i.e., the way entries are counted during partitioning, the sequence length for the test data, a switch indicating whether an exception should be thrown if a performance measure could not be computed (cf. evaluate in AbstractClassifier), and the number [math]k[/math] of folds, i.e., of train/test repeats.

Now, we are able to perform a ClassifierAssessment just by calling the method assess.


System.out.println( assessment.assess( numMeasures, params, data ) );


We print the result (cf. ListResult) of this assessment to standard out. If we want to perform other ClassifierAssessments, for instance a RepeatedHoldOutExperiment, we have to use the corresponding specific ParameterSet (cf. KFoldCrossValidation and KFoldCrossValidationAssessParameterSet).
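The partitioning idea behind a k-fold cross validation can be sketched in plain Java. The sketch counts elements (sequences) when forming the parts and operates on a single list of indices; it is an assumption-laden illustration and not the code of KFoldCrossValidation, and the class KFoldSketch, its method folds, and the seed are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical sketch of k-fold partitioning by number of elements; not Jstacs code. */
class KFoldSketch {

    /** Shuffles the indices 0..n-1 and deals them round-robin into k folds. */
    static List<List<Integer>> folds(int n, int k, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            indices.add(i);
        }
        Collections.shuffle(indices, new Random(seed));
        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < k; f++) {
            folds.add(new ArrayList<>());
        }
        for (int i = 0; i < n; i++) {
            folds.get(i % k).add(indices.get(i));
        }
        return folds;
    }

    public static void main(String[] args) {
        List<List<Integer>> folds = folds(10, 3, 42L);
        for (int test = 0; test < folds.size(); test++) {
            // In each repeat, one fold is held out for testing and the rest is used for training.
            int trainSize = 0;
            for (int f = 0; f < folds.size(); f++) {
                if (f != test) {
                    trainSize += folds.get(f).size();
                }
            }
            System.out.println("repeat " + test + ": train on " + trainSize
                    + " elements, test on " + folds.get(test).size());
        }
    }
}
```

In each of the k repeats, a classifier would be trained on the training part and evaluated on the held-out fold, and the numerical performance measures would then be averaged over the repeats.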