de.jstacs.utils
Class StatisticalModelTester

java.lang.Object
  extended by de.jstacs.utils.StatisticalModelTester

public class StatisticalModelTester
extends Object

This class provides tests for arbitrary (discrete) models. It implements several statistics (log-likelihood, Shannon entropy, AIC, BIC, ...) that can be used to compare models.

Author:
Jens Keilwagen
See Also:
StatisticalModel

Constructor Summary
StatisticalModelTester()
           
 
Method Summary
static double getKLDivergence(StatisticalModel m1, StatisticalModel m2, int length)
          Returns the Kullback-Leibler-divergence D(p_m1||p_m2).
static double getLogLikelihood(StatisticalModel m, DataSet data)
          Returns the log-likelihood of a DataSet data for a given model m.
static double getLogLikelihood(StatisticalModel m, DataSet data, double[] weights)
          Returns the log-likelihood of a DataSet data for a given model m.
static double getMarginalDistribution(StatisticalModel m, int[] constraint)
          This method computes the marginal distribution for any discrete model m and all sequences that fulfill the constraint, if possible.
static double getMaxOfDeviation(StatisticalModel m1, StatisticalModel m2, int length)
          This method computes the maximum deviation between the probabilities of all sequences of the given length for the discrete models m1 and m2.
static Sequence getMostProbableSequence(SequenceScore m, int length)
          Returns one most probable sequence for the discrete model m.
static double getShannonEntropy(StatisticalModel m, int length)
          This method computes the Shannon entropy for any discrete model m and all sequences of the given length, if possible.
static double getShannonEntropyInBits(StatisticalModel m, int length)
          This method computes the Shannon entropy in bits for any discrete model m and all sequences of the given length, if possible.
static double getSumOfDeviation(StatisticalModel m1, StatisticalModel m2, int length)
          This method computes the sum of deviations between the probabilities of all sequences of the given length for the discrete models m1 and m2.
static double getSumOfDistribution(StatisticalModel m, int length)
          This method computes the sum of the probabilities of all sequences of the given length for any discrete model m, if possible.
static double getSymKLDivergence(StatisticalModel m1, StatisticalModel m2, int length)
          Returns the symmetric Kullback-Leibler-divergence, i.e. D(p_m1||p_m2) + D(p_m2||p_m1).
static double getValueOfAIC(StatisticalModel m, DataSet s, int k)
          This method computes the value of Akaike's Information Criterion (AIC).
static double getValueOfBIC(StatisticalModel m, DataSet s, int k)
          This method computes the value of the Bayesian Information Criterion (BIC).
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

StatisticalModelTester

public StatisticalModelTester()
Method Detail

getKLDivergence

public static double getKLDivergence(StatisticalModel m1,
                                     StatisticalModel m2,
                                     int length)
                              throws Exception
Returns the Kullback-Leibler-divergence D(p_m1||p_m2).

Computes \sum_x p(x|m1) * \log \frac{p(x|m1)}{p(x|m2)}.

Parameters:
m1 - one discrete model
m2 - another discrete model
length - the length of the sequence (for inhomogeneous models length has to be SequenceScore.getLength())
Returns:
the Kullback-Leibler-divergence
Throws:
Exception - if something went wrong
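
As a rough illustration of the formula above, the following sketch evaluates the sum over two explicit probability tables. It is not the Jstacs implementation: the class name KlDivergenceSketch, the array representation of the two distributions, and the use of the natural logarithm are assumptions made for this example.

```java
final class KlDivergenceSketch {

    /**
     * D(p||q) = sum_x p[x] * ln(p[x] / q[x]), measured in nats.
     * Terms with p[x] == 0 contribute 0 by the usual convention.
     */
    static double klDivergence(double[] p, double[] q) {
        if (p.length != q.length) {
            throw new IllegalArgumentException("p and q must cover the same sequences");
        }
        double d = 0.0;
        for (int x = 0; x < p.length; x++) {
            if (p[x] > 0.0) {
                d += p[x] * Math.log(p[x] / q[x]);
            }
        }
        return d;
    }

    public static void main(String[] args) {
        double[] p = {0.5, 0.5};
        double[] q = {0.25, 0.75};
        System.out.println(klDivergence(p, q)); // positive, because p differs from q
        System.out.println(klDivergence(p, p)); // prints 0.0
    }
}
```

Note that the divergence is not symmetric in its arguments, and it becomes infinite if q assigns probability 0 to a sequence that p considers possible.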

getSymKLDivergence

public static double getSymKLDivergence(StatisticalModel m1,
                                        StatisticalModel m2,
                                        int length)
                                 throws Exception
Returns the symmetric Kullback-Leibler-divergence, i.e. the sum D(p_m1||p_m2) + D(p_m2||p_m1).

Computes \sum_x (p(x|m1)-p(x|m2)) * \log \frac{p(x|m1)}{p(x|m2)}.

Parameters:
m1 - one discrete model
m2 - another discrete model
length - the length of the sequence (for inhomogeneous models length has to be SequenceScore.getLength())
Returns:
the symmetric Kullback-Leibler-divergence
Throws:
Exception - if something went wrong
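
Expanding the formula \sum_x (p(x|m1)-p(x|m2)) * \log \frac{p(x|m1)}{p(x|m2)} term by term gives D(p_m1||p_m2) + D(p_m2||p_m1), so the result is symmetric in m1 and m2. The sketch below checks this identity numerically on explicit probability tables. The class name SymKlSketch and the array representation are assumptions for this example, and all probabilities are assumed to be strictly positive.

```java
final class SymKlSketch {

    /** D(p||q) = sum_x p[x] * ln(p[x] / q[x]); assumes strictly positive entries. */
    static double kl(double[] p, double[] q) {
        double d = 0.0;
        for (int x = 0; x < p.length; x++) {
            d += p[x] * Math.log(p[x] / q[x]);
        }
        return d;
    }

    /** sum_x (p[x] - q[x]) * ln(p[x] / q[x]); assumes strictly positive entries. */
    static double symKl(double[] p, double[] q) {
        double s = 0.0;
        for (int x = 0; x < p.length; x++) {
            s += (p[x] - q[x]) * Math.log(p[x] / q[x]);
        }
        return s;
    }

    public static void main(String[] args) {
        double[] p = {0.5, 0.3, 0.2};
        double[] q = {0.2, 0.2, 0.6};
        // Both lines print the same value up to rounding error.
        System.out.println(symKl(p, q));
        System.out.println(kl(p, q) + kl(q, p));
    }
}
```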

getLogLikelihood

public static double getLogLikelihood(StatisticalModel m,
                                      DataSet data)
                               throws Exception
Returns the log-likelihood of a DataSet data for a given model m.

Parameters:
m - the given model
data - the DataSet
Returns:
the log-likelihood of data
Throws:
Exception - if something went wrong

getLogLikelihood

public static double getLogLikelihood(StatisticalModel m,
                                      DataSet data,
                                      double[] weights)
                               throws Exception
Returns the log-likelihood of a DataSet data for a given model m.

Parameters:
m - the given model
data - the DataSet
weights - the weight for each element of the DataSet
Returns:
the log-likelihood of data
Throws:
Exception - if something went wrong
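
The source does not state how the weights enter the computation. A common convention, assumed here, is that each sequence's log-probability is multiplied by its weight before summing, so that a weight of 2 behaves like seeing the sequence twice. The sketch below shows that convention on precomputed log-probabilities; the class name WeightedLogLikelihoodSketch and the treatment of null weights as all ones are assumptions for this example, not documented Jstacs behavior.

```java
final class WeightedLogLikelihoodSketch {

    /**
     * Returns sum_i weights[i] * logProbs[i].
     * A null weights array is treated as "every weight is 1" (assumption).
     */
    static double logLikelihood(double[] logProbs, double[] weights) {
        if (weights != null && weights.length != logProbs.length) {
            throw new IllegalArgumentException("one weight per sequence is required");
        }
        double ll = 0.0;
        for (int i = 0; i < logProbs.length; i++) {
            double w = (weights == null) ? 1.0 : weights[i];
            ll += w * logProbs[i];
        }
        return ll;
    }

    public static void main(String[] args) {
        // Log-probabilities of two sequences under some model.
        double[] logProbs = {Math.log(0.5), Math.log(0.25)};
        System.out.println(logLikelihood(logProbs, null));
        System.out.println(logLikelihood(logProbs, new double[] {2.0, 1.0}));
    }
}
```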

getMarginalDistribution

public static double getMarginalDistribution(StatisticalModel m,
                                             int[] constraint)
                                      throws Exception
This method computes the marginal distribution for any discrete model m and all sequences that fulfill the constraint, if possible.

Parameters:
m - a discrete model
constraint - constraint[i] < 0 marks position i as irrelevant (any character is allowed); constraint[i] = c with 0 <= c < m.getAlphabets()[(m.getLength()==0)?0:i].getAlphabetLength() fixes position i to the encoded character c
Returns:
the marginal distribution for a discrete model
Throws:
Exception - if something went wrong
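
To make the constraint encoding concrete, the sketch below sums the probabilities of all sequences that match a constraint array, using a toy model with independent positions. It only illustrates the idea of brute-force marginalization: the class name MarginalSketch and the table prob[i][c] are hypothetical and do not reflect how Jstacs models are represented.

```java
final class MarginalSketch {

    /** Toy model: prob[i][c] is the probability of encoded character c at position i. */
    static double probability(double[][] prob, int[] seq) {
        double p = 1.0;
        for (int i = 0; i < seq.length; i++) {
            p *= prob[i][seq[i]];
        }
        return p;
    }

    /** Sums the probabilities of all sequences that agree with the fixed positions. */
    static double marginal(double[][] prob, int[] constraint) {
        int n = constraint.length;
        int[] seq = new int[n];
        for (int i = 0; i < n; i++) {
            seq[i] = (constraint[i] < 0) ? 0 : constraint[i];
        }
        double sum = 0.0;
        while (true) {
            sum += probability(prob, seq);
            // Advance the free positions like an odometer; fixed positions are skipped.
            int i = n - 1;
            while (i >= 0) {
                if (constraint[i] < 0) {
                    if (++seq[i] < prob[i].length) {
                        break;
                    }
                    seq[i] = 0;
                }
                i--;
            }
            if (i < 0) {
                return sum; // every free position has wrapped around
            }
        }
    }

    public static void main(String[] args) {
        double[][] prob = {{0.2, 0.8}, {0.5, 0.5}, {0.1, 0.9}};
        // Position 1 is fixed to character 1; positions 0 and 2 are irrelevant.
        System.out.println(marginal(prob, new int[] {-1, 1, -1}));
    }
}
```

The number of enumerated sequences grows exponentially with the number of irrelevant positions, which is presumably why the description says "if possible".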

getMaxOfDeviation

public static double getMaxOfDeviation(StatisticalModel m1,
                                       StatisticalModel m2,
                                       int length)
                                throws Exception
This method computes the maximum deviation between the probabilities of all sequences of the given length for the discrete models m1 and m2.

Parameters:
m1 - one discrete model
m2 - another discrete model
length - the length of the sequence (for inhomogeneous models length has to be SequenceScore.getLength())
Returns:
the maximum deviation between the probabilities
Throws:
Exception - if something went wrong

getMostProbableSequence

public static Sequence getMostProbableSequence(SequenceScore m,
                                               int length)
                                        throws Exception
Returns one most probable sequence for the discrete model m. If several sequences share the highest probability, only one of them is returned.

This is only a standard implementation. For some special models, such as Markov models, a most probable sequence can be found much faster with a dynamic-programming algorithm.

Parameters:
m - the discrete model
length - the length of the sequence (for inhomogeneous models length has to be SequenceScore.getLength())
Returns:
one most probable sequence
Throws:
Exception - if something went wrong
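
A generic approach that works for any discrete model is to score every sequence of the given length and keep the best one. The sketch below shows such an exhaustive search; whether the Jstacs method works exactly this way is an assumption. The class name MostProbableSequenceSketch, the int[] encoding of sequences, and the score callback are hypothetical stand-ins for SequenceScore and Sequence.

```java
import java.util.function.ToDoubleFunction;

final class MostProbableSequenceSketch {

    /**
     * Enumerates all alphabetSize^length sequences and returns one with maximal score.
     * Ties are resolved in favor of the sequence that is enumerated first.
     */
    static int[] mostProbable(ToDoubleFunction<int[]> score, int alphabetSize, int length) {
        int[] seq = new int[length];
        int[] best = seq.clone();
        double bestScore = score.applyAsDouble(seq);
        while (true) {
            // Odometer-style increment of the current sequence.
            int i = length - 1;
            while (i >= 0 && ++seq[i] == alphabetSize) {
                seq[i] = 0;
                i--;
            }
            if (i < 0) {
                return best; // all sequences have been visited
            }
            double s = score.applyAsDouble(seq);
            if (s > bestScore) {
                bestScore = s;
                best = seq.clone();
            }
        }
    }

    public static void main(String[] args) {
        double[][] prob = {{0.2, 0.8}, {0.6, 0.4}, {0.1, 0.9}};
        ToDoubleFunction<int[]> score = s -> {
            double p = 1.0;
            for (int i = 0; i < s.length; i++) {
                p *= prob[i][s[i]];
            }
            return p;
        };
        System.out.println(java.util.Arrays.toString(mostProbable(score, 2, 3))); // prints [1, 0, 1]
    }
}
```

The running time is exponential in the sequence length, which is why a model-specific dynamic-programming algorithm is preferable where one exists.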

getShannonEntropy

public static double getShannonEntropy(StatisticalModel m,
                                       int length)
                                throws Exception
This method computes the Shannon entropy for any discrete model m and all sequences of the given length, if possible.

Parameters:
m - the discrete model
length - the length of the sequence (for inhomogeneous models length has to be SequenceScore.getLength())
Returns:
the Shannon entropy for a discrete model
Throws:
Exception - if something went wrong

getShannonEntropyInBits

public static double getShannonEntropyInBits(StatisticalModel m,
                                             int length)
                                      throws Exception
This method computes the Shannon entropy in bits for any discrete model m and all sequences of the given length, if possible.

Parameters:
m - the discrete model
length - the length of the sequence (for inhomogeneous models length has to be SequenceScore.getLength())
Returns:
the Shannon entropy in bits for a discrete model
Throws:
Exception - if something went wrong
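
The two entropy variants differ only in the base of the logarithm. The sketch below computes H = -\sum_x p(x) * \log p(x) on an explicit probability table and converts the result to bits by dividing by ln 2. That getShannonEntropy uses the natural logarithm is an assumption (the source does not say), and the class name ShannonEntropySketch is hypothetical.

```java
final class ShannonEntropySketch {

    /** H(p) = -sum_x p[x] * ln(p[x]); terms with p[x] == 0 contribute 0. */
    static double entropyInNats(double[] p) {
        double h = 0.0;
        for (double px : p) {
            if (px > 0.0) {
                h -= px * Math.log(px);
            }
        }
        return h;
    }

    /** Same quantity with base-2 logarithm: H_bits = H_nats / ln 2. */
    static double entropyInBits(double[] p) {
        return entropyInNats(p) / Math.log(2.0);
    }

    public static void main(String[] args) {
        // Uniform distribution over the four sequences of length 1 over a 4-letter alphabet.
        double[] uniform = {0.25, 0.25, 0.25, 0.25};
        System.out.println(entropyInNats(uniform)); // ln 4
        System.out.println(entropyInBits(uniform)); // 2 bits, up to rounding
    }
}
```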

getSumOfDeviation

public static double getSumOfDeviation(StatisticalModel m1,
                                       StatisticalModel m2,
                                       int length)
                                throws Exception
This method computes the sum of deviations between the probabilities of all sequences of the given length for the discrete models m1 and m2.

Parameters:
m1 - one discrete model
m2 - another discrete model
length - the length of the sequence (for inhomogeneous models length has to be SequenceScore.getLength())
Returns:
the sum of deviations between the probabilities
Throws:
Exception - if something went wrong
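
The source does not define "deviation". The sketch below assumes it means the absolute difference |p(x|m1) - p(x|m2)| and shows both the sum used here and the maximum used by getMaxOfDeviation, on explicit probability tables. The class name DeviationSketch and the array representation are hypothetical.

```java
final class DeviationSketch {

    /** max_x |p1[x] - p2[x]| (assumed definition of the maximum deviation). */
    static double maxOfDeviation(double[] p1, double[] p2) {
        double max = 0.0;
        for (int x = 0; x < p1.length; x++) {
            max = Math.max(max, Math.abs(p1[x] - p2[x]));
        }
        return max;
    }

    /** sum_x |p1[x] - p2[x]| (assumed definition of the sum of deviations). */
    static double sumOfDeviation(double[] p1, double[] p2) {
        double sum = 0.0;
        for (int x = 0; x < p1.length; x++) {
            sum += Math.abs(p1[x] - p2[x]);
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] p1 = {0.5, 0.3, 0.2};
        double[] p2 = {0.4, 0.4, 0.2};
        System.out.println(maxOfDeviation(p1, p2));
        System.out.println(sumOfDeviation(p1, p2));
    }
}
```

Under this definition both values are 0 exactly when the two models assign identical probabilities to every sequence, and the sum can never exceed 2.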

getSumOfDistribution

public static double getSumOfDistribution(StatisticalModel m,
                                          int length)
                                   throws Exception
This method computes the sum of the probabilities of all sequences of the given length for any discrete model m, if possible. It can therefore be used as a hint whether a model defines a proper distribution or whether its implementation contains mistakes.

Ideally this method returns 1.0, but because of the limited floating-point precision in Java the result will rarely be exactly 1.0.

Math.abs( 1.0d - getSumOfDistribution( m, length ) ) should be smaller than 1E-10.

Parameters:
m - the discrete model
length - the length of the sequence (for inhomogeneous models length has to be SequenceScore.getLength())
Returns:
the sum of the probabilities of all sequences of the given length for a discrete model
Throws:
Exception - if something went wrong
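
The sketch below applies the same sanity check to a toy homogeneous Markov chain: it sums the probabilities of all sequences of a given length and compares the result with 1.0 using the tolerance 1E-10 mentioned above. The class name SumOfDistributionSketch, the start and transition tables, and the helper looksNormalized are hypothetical and only serve this illustration.

```java
final class SumOfDistributionSketch {

    /** Sums P(x) over all sequences x of the given length (length >= 1). */
    static double sumOfDistribution(double[] start, double[][] trans, int length) {
        double total = 0.0;
        for (int c = 0; c < start.length; c++) {
            total += extend(trans, c, start[c], length - 1);
        }
        return total;
    }

    /** Recursively appends every possible character to the current prefix. */
    private static double extend(double[][] trans, int last, double prefixProb, int remaining) {
        if (remaining == 0) {
            return prefixProb;
        }
        double total = 0.0;
        for (int c = 0; c < trans[last].length; c++) {
            total += extend(trans, c, prefixProb * trans[last][c], remaining - 1);
        }
        return total;
    }

    /** The check suggested in the text: the sum should be within 1E-10 of 1.0. */
    static boolean looksNormalized(double sum) {
        return Math.abs(1.0d - sum) < 1E-10;
    }

    public static void main(String[] args) {
        double[] start = {0.3, 0.7};
        double[][] good = {{0.9, 0.1}, {0.4, 0.6}};
        double[][] broken = {{0.9, 0.2}, {0.4, 0.6}}; // first row sums to 1.1
        System.out.println(looksNormalized(sumOfDistribution(start, good, 5)));   // prints true
        System.out.println(looksNormalized(sumOfDistribution(start, broken, 2))); // prints false
    }
}
```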

getValueOfAIC

public static double getValueOfAIC(StatisticalModel m,
                                   DataSet s,
                                   int k)
                            throws Exception
This method computes the value of Akaike's Information Criterion (AIC). It uses the formula: AIC = 2 * log L(t,x) - 2*k, where L(t,x) is the likelihood of the DataSet and k is the number of parameters in the model.

The value of the AIC can be used for model selection.

Parameters:
m - a trained model
s - the DataSet for the test
k - the number of parameters of the model m
Returns:
the value of AIC
Throws:
Exception - if something went wrong

getValueOfBIC

public static double getValueOfBIC(StatisticalModel m,
                                   DataSet s,
                                   int k)
                            throws Exception
This method computes the value of the Bayesian Information Criterion (BIC). It uses the formula: BIC = 2 * log L(t,x) - k * log n, where L(t,x) is the likelihood of the DataSet, k is the number of parameters in the model and n is the number of sequences in the DataSet.

The value of the BIC can be used for model selection.

Parameters:
m - a trained model
s - the DataSet for the test
k - the number of parameters of the model m
Returns:
the value of BIC
Throws:
Exception - if something went wrong
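
Both criteria can be evaluated directly from a log-likelihood, as the sketch below shows for the formulas AIC = 2 * log L(t,x) - 2*k and BIC = 2 * log L(t,x) - k * log n stated in this class. With this sign convention, which is the negative of the more common textbook form, larger values indicate the preferable model. The class name InformationCriteriaSketch and the example numbers are hypothetical; the log-likelihood is assumed to use the natural logarithm.

```java
final class InformationCriteriaSketch {

    /** AIC = 2 * log L - 2 * k, with k the number of model parameters. */
    static double aic(double logLikelihood, int k) {
        return 2.0 * logLikelihood - 2.0 * k;
    }

    /** BIC = 2 * log L - k * log n, with n the number of sequences in the data set. */
    static double bic(double logLikelihood, int k, int n) {
        return 2.0 * logLikelihood - k * Math.log(n);
    }

    public static void main(String[] args) {
        int n = 50; // number of sequences (made-up value)
        // A small model and a larger model that fits slightly better (made-up values).
        double small = bic(-100.0, 5, n);
        double large = bic(-98.0, 12, n);
        System.out.println(small > large ? "prefer the small model" : "prefer the large model");
    }
}
```

For n larger than e^2 (about 7.4 sequences), BIC penalizes each parameter more heavily than AIC, so it tends to select smaller models.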