de.jstacs.sequenceScores.statisticalModels.trainable.mixture
Class MixtureTrainSM

java.lang.Object
  extended by de.jstacs.sequenceScores.statisticalModels.trainable.AbstractTrainableStatisticalModel
      extended by de.jstacs.sequenceScores.statisticalModels.trainable.mixture.AbstractMixtureTrainSM
          extended by de.jstacs.sequenceScores.statisticalModels.trainable.mixture.MixtureTrainSM
All Implemented Interfaces:
SequenceScore, StatisticalModel, TrainableStatisticalModel, Storable, Cloneable
Direct Known Subclasses:
SharedStructureMixture

public class MixtureTrainSM
extends AbstractMixtureTrainSM

A mixture model whose components can be arbitrary TrainableStatisticalModels.

If you use Gibbs sampling, temporary files are created in the Java temp folder. These files are deleted once no reference to the current instance exists and the garbage collector runs. It is therefore recommended to call the garbage collector explicitly at the end of any application.
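
The recommendation above might look as follows in an application's entry point. This is a minimal sketch and not Jstacs code: GibbsCleanupSketch and runThenRequestGc are hypothetical names, and the Runnable stands in for whatever training the application performs. Note that System.gc() is only a request to the JVM, so the temporary files are not guaranteed to be removed at that moment.

```java
final class GibbsCleanupSketch {

    /**
     * Runs the application logic and afterwards asks the JVM to collect garbage.
     * System.gc() is a hint only; the JVM may ignore it.
     */
    static void runThenRequestGc(Runnable application) {
        try {
            application.run();
        } finally {
            // At this point the mixture model instances should be unreachable,
            // e.g. because they were local variables of the application logic.
            System.gc();
        }
    }

    public static void main(String[] args) {
        runThenRequestGc(() -> System.out.println("training with Gibbs sampling would happen here"));
    }
}
```

Keeping the model instances local to the application logic is the design choice that matters here: the garbage collector can only reclaim instances that are no longer referenced.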

Author:
Jens Keilwagen, Berit Haldemann

Nested Class Summary
 
Nested classes/interfaces inherited from class de.jstacs.sequenceScores.statisticalModels.trainable.mixture.AbstractMixtureTrainSM
AbstractMixtureTrainSM.Algorithm, AbstractMixtureTrainSM.Parameterization
 
Field Summary
 
Fields inherited from class de.jstacs.sequenceScores.statisticalModels.trainable.mixture.AbstractMixtureTrainSM
algorithm, algorithmHasBeenRun, alternativeModel, best, burnInTest, componentHyperParams, compProb, counter, dimension, estimateComponentProbs, file, filereader, filewriter, initialIteration, logWeights, model, optimizeModel, sample, samplingIndex, seqWeights, sostream, starts, stationaryIteration, weights
 
Fields inherited from class de.jstacs.sequenceScores.statisticalModels.trainable.AbstractTrainableStatisticalModel
alphabets, length
 
Constructor Summary
  MixtureTrainSM(int length, TrainableStatisticalModel[] models, double[] weights, int starts, double alpha, TerminationCondition tc, AbstractMixtureTrainSM.Parameterization parametrization)
          Creates an instance using EM and fixed component probabilities.
  MixtureTrainSM(int length, TrainableStatisticalModel[] models, double[] weights, int starts, int initialIteration, int stationaryIteration, BurnInTest burnInTest)
          Creates an instance using Gibbs Sampling and fixed component probabilities.
protected MixtureTrainSM(int length, TrainableStatisticalModel[] models, int starts, boolean estimateComponentProbs, double[] componentHyperParams, double[] weights, AbstractMixtureTrainSM.Algorithm algorithm, double alpha, TerminationCondition tc, AbstractMixtureTrainSM.Parameterization parametrization, int initialIteration, int stationaryIteration, BurnInTest burnInTest)
          Creates a new MixtureTrainSM.
  MixtureTrainSM(int length, TrainableStatisticalModel[] models, int starts, double[] componentHyperParams, double alpha, TerminationCondition tc, AbstractMixtureTrainSM.Parameterization parametrization)
          Creates an instance using EM and estimating the component probabilities.
  MixtureTrainSM(int length, TrainableStatisticalModel[] models, int starts, double[] componentHyperParams, int initialIteration, int stationaryIteration, BurnInTest burnInTest)
          Creates an instance using Gibbs Sampling and sampling the component probabilities.
  MixtureTrainSM(StringBuffer xml)
          The constructor for the interface Storable.
 
Method Summary
 double[][] doFirstIteration(DataSet data, double[] dataWeights, double[][] partitioning)
          This method enables you to train a mixture model with a fixed start partitioning.
protected  double[][] doFirstIteration(double[] dataWeights, MultivariateRandomGenerator m, MRGParams[] params)
          This method performs the first step of the training algorithm for the current model on the internal data set.
protected  Sequence[] emitDataSetUsingCurrentParameterSet(int n, int... lengths)
          The method returns an array of sequences using the current parameter set.
protected  double getLogProbUsingCurrentParameterSetFor(int component, Sequence s, int start, int end)
          Returns the logarithmic probability for the sequence and the given component using the current parameter set.
protected  double getNewWeights(double[] dataWeights, double[] w, double[][] seqweights)
          Computes sequence weights and returns the score.
protected  void setTrainData(DataSet data)
          This method is invoked by the train method and determines, for a given data set, the data set that is actually used for training.
 String toString(NumberFormat nf)
          This method returns a String representation of the instance.
 
Methods inherited from class de.jstacs.sequenceScores.statisticalModels.trainable.mixture.AbstractMixtureTrainSM
algorithmHasBeenRun, checkLength, checkModelsForGibbsSampling, clone, continueIterations, continueIterations, createSeqWeightsArray, doFirstIteration, doFirstIteration, draw, emitDataSet, extendSampling, extractFurtherInformation, finalize, fromXML, getCharacteristics, getFurtherInformation, getIndexOfMaximalComponentFor, getInstanceName, getLogPriorTerm, getLogPriorTermForComponentProbs, getLogProbFor, getLogProbFor, getLogScoreFor, getModel, getModels, getMRG, getMRGParams, getNameOfAlgorithm, getNewComponentProbs, getNewParameters, getNewParametersForModel, getNumberOfComponents, getNumericalCharacteristics, getScoreForBestRun, getWeights, initModelForSampling, initWithPrior, isInitialized, isInSamplingMode, iterate, iterate, max, modifyWeights, parseNextParameterSet, parseParameterSet, samplingStopped, setAlpha, setOutputStream, setWeights, swap, toXML, train
 
Methods inherited from class de.jstacs.sequenceScores.statisticalModels.trainable.AbstractTrainableStatisticalModel
check, getAlphabetContainer, getLength, getLogProbFor, getLogProbFor, getLogScoreFor, getLogScoreFor, getLogScoreFor, getLogScoreFor, getMaximalMarkovOrder, toString, train
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

MixtureTrainSM

protected MixtureTrainSM(int length,
                         TrainableStatisticalModel[] models,
                         int starts,
                         boolean estimateComponentProbs,
                         double[] componentHyperParams,
                         double[] weights,
                         AbstractMixtureTrainSM.Algorithm algorithm,
                         double alpha,
                         TerminationCondition tc,
                         AbstractMixtureTrainSM.Parameterization parametrization,
                         int initialIteration,
                         int stationaryIteration,
                         BurnInTest burnInTest)
                  throws IllegalArgumentException,
                         WrongAlphabetException,
                         CloneNotSupportedException
Creates a new MixtureTrainSM. This constructor can be used for any algorithm since it takes all necessary values as parameters.

Parameters:
length - the length used in this model
models - the component models of the MixtureTrainSM; if the model is trained using AbstractMixtureTrainSM.Algorithm.GIBBS_SAMPLING, the models that will be adjusted have to implement SamplingComponent
starts - the number of times the algorithm will be started in the train-method, at least 1
estimateComponentProbs - whether the component probabilities are estimated by the algorithm (true) or held fixed (false); if they are held fixed, the values of weights are used, otherwise componentHyperParams are incorporated in the estimation
componentHyperParams - the hyperparameters for the component assignment prior
  • will only be used if estimateComponentProbs == true
  • the array has to be null or has to have length models.length
  • if null or an array with all values zero (0), ML is used
  • otherwise (all values positive) a prior is used (MAP, MP, ...)
  • depends on the parameterization
weights - null or the weights for the components (then weights.length == models.length)
algorithm - either AbstractMixtureTrainSM.Algorithm.EM or AbstractMixtureTrainSM.Algorithm.GIBBS_SAMPLING
alpha - only for AbstractMixtureTrainSM.Algorithm.EM
the positive parameter for the Dirichlet distribution which is used when you invoke train to initialize the gammas. It is recommended to use alpha = 1 (uniform distribution on a simplex).
tc - only for AbstractMixtureTrainSM.Algorithm.EM
the TerminationCondition for stopping the EM algorithm; TerminationCondition.isSimple() has to return true for tc
parametrization - only for AbstractMixtureTrainSM.Algorithm.EM
the type of the component probability parameterization
initialIteration - only for AbstractMixtureTrainSM.Algorithm.GIBBS_SAMPLING
the positive length of the initial sampling phase (at least 1, at most stationaryIteration/starts)
stationaryIteration - only for AbstractMixtureTrainSM.Algorithm.GIBBS_SAMPLING
the positive length of the stationary phase (at least 1, summed over all starts), i.e. the number of parameter sets that are used for approximation
burnInTest - only for AbstractMixtureTrainSM.Algorithm.GIBBS_SAMPLING
the test that will be used to determine the length of the burn-in phase
Throws:
IllegalArgumentException - if
  • the models are not able to score the sequence of length length
  • dimension < 1
  • weights != null && weights.length != dimension
  • weights != null and there exists an i where weights[i] < 0
  • starts < 1
  • componentHyperParams are not correct
  • the algorithm-specific parameters are not correct
WrongAlphabetException - if not all models work on the same alphabet
CloneNotSupportedException - if the models cannot be cloned
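
The documented conditions on weights and componentHyperParams can be sketched as plain argument checks. This is an illustration under stated assumptions and not the library's implementation: MixtureArgumentCheckSketch, checkWeights and usesPrior are hypothetical names, dimension is taken to be the number of components, and rejecting hyperparameters that are neither all zero nor all positive is an assumed reading of "componentHyperParams are not correct".

```java
final class MixtureArgumentCheckSketch {

    /** Mirrors the documented conditions on weights; dimension is the number of components. */
    static void checkWeights(double[] weights, int dimension) {
        if (dimension < 1) {
            throw new IllegalArgumentException("dimension < 1");
        }
        if (weights == null) {
            return; // null is explicitly allowed
        }
        if (weights.length != dimension) {
            throw new IllegalArgumentException("weights.length != dimension");
        }
        for (int i = 0; i < weights.length; i++) {
            if (weights[i] < 0) {
                throw new IllegalArgumentException("weights[" + i + "] < 0");
            }
        }
    }

    /**
     * Returns false if the hyperparameters request ML (null or all values zero)
     * and true if they describe a prior (all values positive).
     * Any other combination is rejected (assumption, see lead-in).
     */
    static boolean usesPrior(double[] componentHyperParams, int dimension) {
        if (componentHyperParams == null) {
            return false;
        }
        if (componentHyperParams.length != dimension) {
            throw new IllegalArgumentException("componentHyperParams.length != dimension");
        }
        int zeros = 0;
        int positives = 0;
        for (double h : componentHyperParams) {
            if (h == 0) {
                zeros++;
            } else if (h > 0) {
                positives++;
            }
        }
        if (zeros == dimension) {
            return false;
        }
        if (positives == dimension) {
            return true;
        }
        throw new IllegalArgumentException("hyperparameters must be all zero or all positive");
    }
}
```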

MixtureTrainSM

public MixtureTrainSM(int length,
                      TrainableStatisticalModel[] models,
                      int starts,
                      double[] componentHyperParams,
                      double alpha,
                      TerminationCondition tc,
                      AbstractMixtureTrainSM.Parameterization parametrization)
               throws IllegalArgumentException,
                      WrongAlphabetException,
                      CloneNotSupportedException
Creates an instance using EM and estimating the component probabilities.

Parameters:
length - the length used in this model
models - the component models of the MixtureTrainSM; if the model is trained using AbstractMixtureTrainSM.Algorithm.GIBBS_SAMPLING, the models that will be adjusted have to implement SamplingComponent
starts - the number of times the algorithm will be started in the train-method, at least 1
componentHyperParams - the hyperparameters for the component assignment prior
  • will only be used if estimateComponentProbs == true
  • the array has to be null or has to have length models.length
  • if null or an array with all values zero (0), ML is used
  • otherwise (all values positive) a prior is used (MAP, MP, ...)
  • depends on the parameterization
alpha - only for AbstractMixtureTrainSM.Algorithm.EM
the positive parameter for the Dirichlet distribution which is used when you invoke train to initialize the gammas. It is recommended to use alpha = 1 (uniform distribution on a simplex).
tc - only for AbstractMixtureTrainSM.Algorithm.EM
the TerminationCondition for stopping the EM algorithm; TerminationCondition.isSimple() has to return true for tc
parametrization - only for AbstractMixtureTrainSM.Algorithm.EM
the type of the component probability parameterization
Throws:
IllegalArgumentException - if
  • the models are not able to score the sequence of length length
  • dimension < 1
  • weights != null && weights.length != dimension
  • weights != null and there exists an i where weights[i] < 0
  • starts < 1
  • componentHyperParams are not correct
  • the algorithm-specific parameters are not correct
WrongAlphabetException - if not all models work on the same alphabet
CloneNotSupportedException - if the models cannot be cloned
See Also:
MixtureTrainSM(int, de.jstacs.sequenceScores.statisticalModels.trainable.TrainableStatisticalModel[], int, boolean, double[], double[], de.jstacs.sequenceScores.statisticalModels.trainable.mixture.AbstractMixtureTrainSM.Algorithm, double, TerminationCondition, de.jstacs.sequenceScores.statisticalModels.trainable.mixture.AbstractMixtureTrainSM.Parameterization, int, int, de.jstacs.sampling.BurnInTest), AbstractMixtureTrainSM.Algorithm.EM
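
The recommended value alpha = 1 corresponds to drawing the initial gammas uniformly from the simplex. The following minimal sketch shows one standard way to do that in plain Java, by normalizing independent Exponential(1) draws. It is not the generator used by the library: UniformSimplexInitSketch and drawUniformOnSimplex are hypothetical names, the sketch covers alpha = 1 only, and it assumes that the gammas are the component membership weights of a single sequence.

```java
import java.util.Arrays;
import java.util.Random;

final class UniformSimplexInitSketch {

    /** Draws one weight vector uniformly from the simplex, i.e. from Dirichlet(1, ..., 1). */
    static double[] drawUniformOnSimplex(int numComponents, Random rng) {
        double[] gamma = new double[numComponents];
        double sum = 0;
        for (int j = 0; j < numComponents; j++) {
            double u;
            do {
                u = rng.nextDouble(); // uniform in [0, 1)
            } while (u == 0.0);       // avoid log(0)
            gamma[j] = -Math.log(u);  // Exponential(1), i.e. Gamma(1, 1), variate
            sum += gamma[j];
        }
        for (int j = 0; j < numComponents; j++) {
            // Normalizing independent Gamma(1, 1) draws yields a Dirichlet(1, ..., 1) draw.
            gamma[j] /= sum;
        }
        return gamma;
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        for (int i = 0; i < 3; i++) {
            System.out.println(Arrays.toString(drawUniformOnSimplex(3, rng)));
        }
    }
}
```

Each returned vector is positive and sums to 1, so it can serve as a soft assignment of one sequence to the components.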

MixtureTrainSM

public MixtureTrainSM(int length,
                      TrainableStatisticalModel[] models,
                      double[] weights,
                      int starts,
                      double alpha,
                      TerminationCondition tc,
                      AbstractMixtureTrainSM.Parameterization parametrization)
               throws IllegalArgumentException,
                      WrongAlphabetException,
                      CloneNotSupportedException
Creates an instance using EM and fixed component probabilities.

Parameters:
length - the length used in this model
models - the component models of the MixtureTrainSM; if the model is trained using AbstractMixtureTrainSM.Algorithm.GIBBS_SAMPLING, the models that will be adjusted have to implement SamplingComponent
starts - the number of times the algorithm will be started in the train-method, at least 1
weights - null or the weights for the components (then weights.length == models.length)
alpha - only for AbstractMixtureTrainSM.Algorithm.EM
the positive parameter for the Dirichlet distribution which is used when you invoke train to initialize the gammas. It is recommended to use alpha = 1 (uniform distribution on a simplex).
tc - only for AbstractMixtureTrainSM.Algorithm.EM
the TerminationCondition for stopping the EM algorithm; TerminationCondition.isSimple() has to return true for tc
parametrization - only for AbstractMixtureTrainSM.Algorithm.EM
the type of the component probability parameterization
Throws:
IllegalArgumentException - if
  • the models are not able to score the sequence of length length
  • dimension < 1
  • weights != null && weights.length != dimension
  • weights != null and there exists an i where weights[i] < 0
  • starts < 1
  • componentHyperParams are not correct
  • the algorithm-specific parameters are not correct
WrongAlphabetException - if not all models work on the same alphabet
CloneNotSupportedException - if the models cannot be cloned
See Also:
MixtureTrainSM(int, de.jstacs.sequenceScores.statisticalModels.trainable.TrainableStatisticalModel[], int, boolean, double[], double[], de.jstacs.sequenceScores.statisticalModels.trainable.mixture.AbstractMixtureTrainSM.Algorithm, double, TerminationCondition, de.jstacs.sequenceScores.statisticalModels.trainable.mixture.AbstractMixtureTrainSM.Parameterization, int, int, de.jstacs.sampling.BurnInTest), AbstractMixtureTrainSM.Algorithm.EM

MixtureTrainSM

public MixtureTrainSM(int length,
                      TrainableStatisticalModel[] models,
                      int starts,
                      double[] componentHyperParams,
                      int initialIteration,
                      int stationaryIteration,
                      BurnInTest burnInTest)
               throws IllegalArgumentException,
                      WrongAlphabetException,
                      CloneNotSupportedException
Creates an instance using Gibbs Sampling and sampling the component probabilities.

Parameters:
length - the length used in this model
models - the component models of the MixtureTrainSM; if the model is trained using AbstractMixtureTrainSM.Algorithm.GIBBS_SAMPLING, the models that will be adjusted have to implement SamplingComponent
starts - the number of times the algorithm will be started in the train-method, at least 1
componentHyperParams - the hyperparameters for the component assignment prior
  • will only be used if estimateComponentProbs == true
  • the array has to be null or has to have length models.length
  • if null or an array with all values zero (0), ML is used
  • otherwise (all values positive) a prior is used (MAP, MP, ...)
  • depends on the parameterization
initialIteration - only for AbstractMixtureTrainSM.Algorithm.GIBBS_SAMPLING
the positive length of the initial sampling phase (at least 1, at most stationaryIteration/starts)
stationaryIteration - only for AbstractMixtureTrainSM.Algorithm.GIBBS_SAMPLING
the positive length of the stationary phase (at least 1, summed over all starts), i.e. the number of parameter sets that are used for approximation
burnInTest - only for AbstractMixtureTrainSM.Algorithm.GIBBS_SAMPLING
the test that will be used to determine the length of the burn-in phase
Throws:
IllegalArgumentException - if
  • the models are not able to score the sequence of length length
  • dimension < 1
  • weights != null && weights.length != dimension
  • weights != null and there exists an i where weights[i] < 0
  • starts < 1
  • componentHyperParams are not correct
  • the algorithm-specific parameters are not correct
WrongAlphabetException - if not all models work on the same alphabet
CloneNotSupportedException - if the models cannot be cloned
See Also:
MixtureTrainSM(int, de.jstacs.sequenceScores.statisticalModels.trainable.TrainableStatisticalModel[], int, boolean, double[], double[], de.jstacs.sequenceScores.statisticalModels.trainable.mixture.AbstractMixtureTrainSM.Algorithm, double, TerminationCondition, de.jstacs.sequenceScores.statisticalModels.trainable.mixture.AbstractMixtureTrainSM.Parameterization, int, int, de.jstacs.sampling.BurnInTest), AbstractMixtureTrainSM.Algorithm.GIBBS_SAMPLING
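
The documented bounds on the Gibbs sampling parameters can be written down as a small check. This is a sketch under assumptions and not the library's validation code: GibbsIterationCheckSketch and checkIterations are hypothetical names, and stationaryIteration/starts is read as integer division.

```java
final class GibbsIterationCheckSketch {

    /**
     * Checks the documented bounds:
     * starts >= 1, stationaryIteration >= 1 and
     * 1 <= initialIteration <= stationaryIteration / starts (integer division, assumed).
     */
    static void checkIterations(int starts, int initialIteration, int stationaryIteration) {
        if (starts < 1) {
            throw new IllegalArgumentException("starts < 1");
        }
        if (stationaryIteration < 1) {
            throw new IllegalArgumentException("stationaryIteration < 1");
        }
        if (initialIteration < 1 || initialIteration > stationaryIteration / starts) {
            throw new IllegalArgumentException(
                "initialIteration has to be in [1, stationaryIteration/starts]");
        }
    }

    public static void main(String[] args) {
        // 3 starts sharing a stationary phase of 300 parameter sets in total:
        // the initial phase of each start may be at most 100 iterations long.
        checkIterations(3, 100, 300);
        System.out.println("parameters accepted");
    }
}
```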

MixtureTrainSM

public MixtureTrainSM(int length,
                      TrainableStatisticalModel[] models,
                      double[] weights,
                      int starts,
                      int initialIteration,
                      int stationaryIteration,
                      BurnInTest burnInTest)
               throws IllegalArgumentException,
                      WrongAlphabetException,
                      CloneNotSupportedException
Creates an instance using Gibbs Sampling and fixed component probabilities.

Parameters:
length - the length used in this model
models - the component models of the MixtureTrainSM; if the model is trained using AbstractMixtureTrainSM.Algorithm.GIBBS_SAMPLING, the models that will be adjusted have to implement SamplingComponent
starts - the number of times the algorithm will be started in the train-method, at least 1
weights - null or the weights for the components (then weights.length == models.length)
initialIteration - only for AbstractMixtureTrainSM.Algorithm.GIBBS_SAMPLING
the positive length of the initial sampling phase (at least 1, at most stationaryIteration/starts)
stationaryIteration - only for AbstractMixtureTrainSM.Algorithm.GIBBS_SAMPLING
the positive length of the stationary phase (at least 1, summed over all starts), i.e. the number of parameter sets that are used for approximation
burnInTest - only for AbstractMixtureTrainSM.Algorithm.GIBBS_SAMPLING
the test that will be used to determine the length of the burn-in phase
Throws:
IllegalArgumentException - if
  • the models are not able to score the sequence of length length
  • dimension < 1
  • weights != null && weights.length != dimension
  • weights != null and there exists an i where weights[i] < 0
  • starts < 1
  • componentHyperParams are not correct
  • the algorithm-specific parameters are not correct
WrongAlphabetException - if not all models work on the same alphabet
CloneNotSupportedException - if the models cannot be cloned
See Also:
MixtureTrainSM(int, de.jstacs.sequenceScores.statisticalModels.trainable.TrainableStatisticalModel[], int, boolean, double[], double[], de.jstacs.sequenceScores.statisticalModels.trainable.mixture.AbstractMixtureTrainSM.Algorithm, double, TerminationCondition, de.jstacs.sequenceScores.statisticalModels.trainable.mixture.AbstractMixtureTrainSM.Parameterization, int, int, de.jstacs.sampling.BurnInTest), AbstractMixtureTrainSM.Algorithm.GIBBS_SAMPLING

MixtureTrainSM

public MixtureTrainSM(StringBuffer xml)
               throws NonParsableException
The constructor for the interface Storable. Creates a new MixtureTrainSM from its XML representation.

Parameters:
xml - the XML representation of the model as StringBuffer
Throws:
NonParsableException - if the StringBuffer is not parsable
Method Detail

emitDataSetUsingCurrentParameterSet

protected Sequence[] emitDataSetUsingCurrentParameterSet(int n,
                                                         int... lengths)
                                                  throws Exception
Description copied from class: AbstractMixtureTrainSM
The method returns an array of sequences using the current parameter set.

Specified by:
emitDataSetUsingCurrentParameterSet in class AbstractMixtureTrainSM
Parameters:
n - the number of sequences to be sampled
lengths - the corresponding lengths
Returns:
an array of sequences
Throws:
Exception - if it was impossible to sample the sequences
See Also:
StatisticalModel.emitDataSet(int, int...)

doFirstIteration

protected double[][] doFirstIteration(double[] dataWeights,
                                      MultivariateRandomGenerator m,
                                      MRGParams[] params)
                               throws Exception
Description copied from class: AbstractMixtureTrainSM
This method performs the first step of the training algorithm for the current model on the internal data set. The initialization is done by randomly assigning the component memberships. This is useful when nothing is known about the problem.

Specified by:
doFirstIteration in class AbstractMixtureTrainSM
Parameters:
dataWeights - null or the weights of each element of the data set
m - the multivariate random generator
params - the parameters for the multivariate random generator
Returns:
the weighting array used for initialization; this array can be reused in the following iterations
Throws:
Exception - if something went wrong

doFirstIteration

public double[][] doFirstIteration(DataSet data,
                                   double[] dataWeights,
                                   double[][] partitioning)
                            throws Exception
This method enables you to train a mixture model with a fixed start partitioning. This is useful for comparing implementations or if one has a hypothesis about what the components should look like.

Parameters:
data - the data set of sequences
dataWeights - null or the weights of each element of the data set
partitioning - a kind of partitioning
  1. partitioning.length has to be data.getNumberOfElements()
  2. for all i: partitioning[i].length has to be getNumberOfComponents()
  3. $\forall i:\; \sum_j \mathrm{partitioning}[i][j] \stackrel{!}{=} 1$
Returns:
the weighting array used for initialization; this array can be reused in the following iterations
Throws:
Exception - if something went wrong or if the number of components is 1
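
The three conditions on partitioning can be checked before the array is passed on. The following minimal sketch builds a hard partitioning from component labels and validates it. PartitioningSketch, hardPartitioning and isValid are hypothetical helpers and not part of the library, and the numerical tolerance of 1e-9 for the row sums is an assumption.

```java
final class PartitioningSketch {

    /** Builds a hard partitioning: sequence i belongs entirely to component labels[i]. */
    static double[][] hardPartitioning(int[] labels, int numComponents) {
        double[][] partitioning = new double[labels.length][numComponents];
        for (int i = 0; i < labels.length; i++) {
            partitioning[i][labels[i]] = 1.0;
        }
        return partitioning;
    }

    /**
     * Checks the three documented conditions: one row per sequence,
     * one column per component, and every row sums to 1.
     */
    static boolean isValid(double[][] partitioning, int numElements, int numComponents) {
        if (partitioning.length != numElements) {
            return false;
        }
        for (double[] row : partitioning) {
            if (row.length != numComponents) {
                return false;
            }
            double sum = 0;
            for (double p : row) {
                sum += p;
            }
            if (Math.abs(sum - 1.0) > 1e-9) { // tolerance is an assumption
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        double[][] partitioning = hardPartitioning(new int[] { 0, 1, 1 }, 2);
        System.out.println(isValid(partitioning, 3, 2));
    }
}
```

A soft partitioning such as { 0.7, 0.3 } per row passes the same check, which is how a hypothesis with uncertain memberships could be encoded.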

getLogProbUsingCurrentParameterSetFor

protected double getLogProbUsingCurrentParameterSetFor(int component,
                                                       Sequence s,
                                                       int start,
                                                       int end)
                                                throws Exception
Description copied from class: AbstractMixtureTrainSM
Returns the logarithmic probability for the sequence and the given component using the current parameter set.

Specified by:
getLogProbUsingCurrentParameterSetFor in class AbstractMixtureTrainSM
Parameters:
component - the index of the component
s - the sequence
start - the start position in the sequence
end - the end position in the sequence
Returns:
log P(s,component) = log P(s|component) + log P(component)
Throws:
Exception - if not trained yet or something else went wrong
See Also:
AbstractMixtureTrainSM.getNumberOfComponents()
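
The returned quantity, log P(s,component) = log P(s|component) + log P(component), can be illustrated with plain numbers. The sketch below is not the library's code: JointLogProbSketch, logJoint and indexOfMaximalComponent are hypothetical names, and the probabilities in main are made-up illustration values.

```java
final class JointLogProbSketch {

    /** log P(s, component) = log P(s | component) + log P(component). */
    static double logJoint(double logProbGivenComponent, double componentProb) {
        return logProbGivenComponent + Math.log(componentProb);
    }

    /** Returns the index of the component with the largest joint log probability. */
    static int indexOfMaximalComponent(double[] logProbGivenComponent, double[] componentProbs) {
        int best = 0;
        double bestValue = Double.NEGATIVE_INFINITY;
        for (int c = 0; c < componentProbs.length; c++) {
            double value = logJoint(logProbGivenComponent[c], componentProbs[c]);
            if (value > bestValue) {
                bestValue = value;
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Illustration values only: P(s | c) for two components.
        double[] logCond = { Math.log(0.2), Math.log(0.4) };
        // A strong component probability can outweigh a smaller conditional probability.
        System.out.println(indexOfMaximalComponent(logCond, new double[] { 0.9, 0.1 }));
        System.out.println(indexOfMaximalComponent(logCond, new double[] { 0.5, 0.5 }));
    }
}
```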

toString

public String toString(NumberFormat nf)
Description copied from interface: SequenceScore
This method returns a String representation of the instance.

Parameters:
nf - the NumberFormat for the String representation of parameters or probabilities
Returns:
a String representation of the instance

getNewWeights

protected double getNewWeights(double[] dataWeights,
                               double[] w,
                               double[][] seqweights)
                        throws Exception
Computes sequence weights and returns the score.

Specified by:
getNewWeights in class AbstractMixtureTrainSM
Parameters:
dataWeights - the weights for the internal data set (should not be changed)
w - the array for the statistic of the component parameters (shall be filled)
seqweights - an array containing for each component the weights for each sequence (shall be filled)
Returns:
the score
Throws:
Exception - if something went wrong
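
For orientation, a generic EM expectation step that fills arrays of the shapes described above could look as follows. This is a textbook sketch and not the implementation of getNewWeights: EStepSketch and computeWeights are hypothetical names, the joint log probabilities are passed in directly as logJoint[component][sequence], the score is assumed to be the weighted log-likelihood, and every sequence is assumed to have positive probability under at least one component.

```java
import java.util.Arrays;

final class EStepSketch {

    /**
     * @param logJoint    logJoint[c][n] = log P(s_n, c) (input)
     * @param dataWeights weight of each sequence (not modified)
     * @param w           filled with the weighted sum of responsibilities per component
     * @param seqweights  filled with seqweights[c][n] = dataWeights[n] * P(c | s_n)
     * @return the weighted log-likelihood sum_n dataWeights[n] * log P(s_n)
     */
    static double computeWeights(double[][] logJoint, double[] dataWeights,
                                 double[] w, double[][] seqweights) {
        int numComponents = logJoint.length;
        Arrays.fill(w, 0.0);
        double score = 0;
        for (int n = 0; n < dataWeights.length; n++) {
            // log-sum-exp over the components for numerical stability
            double max = Double.NEGATIVE_INFINITY;
            for (int c = 0; c < numComponents; c++) {
                max = Math.max(max, logJoint[c][n]);
            }
            double sum = 0;
            for (int c = 0; c < numComponents; c++) {
                sum += Math.exp(logJoint[c][n] - max);
            }
            double logMarginal = max + Math.log(sum); // log P(s_n)
            for (int c = 0; c < numComponents; c++) {
                seqweights[c][n] = dataWeights[n] * Math.exp(logJoint[c][n] - logMarginal);
                w[c] += seqweights[c][n];
            }
            score += dataWeights[n] * logMarginal;
        }
        return score;
    }
}
```

With one sequence of weight 2 and joint probabilities 0.18 and 0.02, the responsibilities are 0.9 and 0.1, so seqweights becomes 1.8 and 0.2 and the score is 2 · log(0.2).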

setTrainData

protected void setTrainData(DataSet data)
Description copied from class: AbstractMixtureTrainSM
This method is invoked by the train method and determines, for a given data set, the data set that is actually used for training.

Specified by:
setTrainData in class AbstractMixtureTrainSM
Parameters:
data - the given data set of sequences