de.jstacs.sequenceScores.statisticalModels.trainable.mixture
Class AbstractMixtureTrainSM

java.lang.Object
  extended by de.jstacs.sequenceScores.statisticalModels.trainable.AbstractTrainableStatisticalModel
      extended by de.jstacs.sequenceScores.statisticalModels.trainable.mixture.AbstractMixtureTrainSM
All Implemented Interfaces:
SequenceScore, StatisticalModel, TrainableStatisticalModel, Storable, Cloneable
Direct Known Subclasses:
HiddenMotifMixture, MixtureTrainSM, StrandTrainSM

public abstract class AbstractMixtureTrainSM
extends AbstractTrainableStatisticalModel

This is the abstract class for all kinds of mixture models. It enables the user to train the parameters using AbstractMixtureTrainSM.Algorithm.EM or AbstractMixtureTrainSM.Algorithm.GIBBS_SAMPLING. If this instance is trained using AbstractMixtureTrainSM.Algorithm.GIBBS_SAMPLING, the internal models that will be adjusted have to implement SamplingComponent. If you use Gibbs sampling, temporary files will be created in the Java temp folder. These files are deleted once no reference to the current instance exists and the garbage collector runs. It is therefore recommended to call the garbage collector explicitly at the end of any application.
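The recommendation about the garbage collector could be followed with a pattern like the one below. This is only a sketch: the class GcAtExitSketch and its method are hypothetical and not part of the library, and System.gc() is merely a request to the JVM, so the clean-up of temporary files is not guaranteed to happen at that point.

```java
final class GcAtExitSketch {
    /**
     * Runs the application logic and afterwards requests a garbage
     * collection, even if the logic threw an exception.
     */
    static void runThenRequestGc(Runnable work) {
        try {
            work.run();      // e.g. train mixture models with Gibbs sampling
        } finally {
            System.gc();     // only a hint; the JVM may ignore it
        }
    }
}
```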

The model stores a reference to the last data set used in train. This enables the user to estimate the parameters iteratively, beginning with the current set of parameters. To do so, use the method continueIterations(double[], double[][], int, int).

The method setOutputStream(OutputStream) enables the user to get comments from the train(DataSet, double[]) method or to suppress them.

The method getScoreForBestRun() enables the user to optimize different instances of the same model (created via clone()) using the EM algorithm on different CPUs, to compare the results, and to select the best trained model. This might be useful to get the results faster (measured in wall-clock time).
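The selection among independently trained clones can be sketched as follows. The class BestRunSketch is hypothetical; it assumes that each entry of the array is the value one clone reports via getScoreForBestRun() and that larger values are better.

```java
final class BestRunSketch {
    /**
     * Returns the index of the largest score, or -1 for an empty array.
     * Each entry stands for the value reported by one trained clone.
     */
    static int pickBestRun(double[] scores) {
        int best = -1;
        for (int i = 0; i < scores.length; i++) {
            if (best < 0 || scores[i] > scores[best]) {
                best = i;
            }
        }
        return best;
    }
}
```

The clone at the returned index would then be kept and the others discarded.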

The reference to the internal data set is not stored if the model is stored in a StringBuffer. After (re)creating a model from such a representation, you can therefore use these methods only once the parameters have been trained again.

Author:
Jens Keilwagen, Berit Haldemann
See Also:
SamplingComponent, System.gc()

Nested Class Summary
static class AbstractMixtureTrainSM.Algorithm
          This enum defines the different types of algorithms that can be used in an AbstractMixtureTrainSM.
static class AbstractMixtureTrainSM.Parameterization
          This enum defines the different types of parameterization for a probability that can be used in an AbstractMixtureTrainSM.
 
Field Summary
protected  AbstractMixtureTrainSM.Algorithm algorithm
          The type of algorithm.
protected  boolean algorithmHasBeenRun
          A switch which indicates that the algorithm for determining the parameters has been run.
protected  TrainableStatisticalModel[] alternativeModel
          The alternative models for the EM.
protected  double best
          This field contains the value of the objective function of the best start of the training.
protected  BurnInTest burnInTest
          The BurnInTest that is used to stop the sampling.
protected  double[] componentHyperParams
          The hyperparameters for estimating the probabilities of the components.
protected  double[] compProb
          This array is used while training to avoid creating many new objects.
protected  int[] counter
          The current index of the parameter set during adjustment (optimization).
protected  int dimension
          The number of dimensions.
protected  boolean estimateComponentProbs
          The switch for estimating the component probabilities or not.
protected  File[] file
          The file in which the component probabilities are stored.
protected  BufferedReader filereader
          Reading component probabilities from a file.
protected  BufferedWriter filewriter
          Saving component probabilities in a file.
protected  int initialIteration
          The number of initial iterations.
protected  double[] logWeights
          The log probabilities for each component.
protected  TrainableStatisticalModel[] model
          The model for the sequences.
protected  boolean[] optimizeModel
          A switch for each model whether to optimize/adjust or not.
protected  DataSet[] sample
          The data set that was used in the last training.
protected  int samplingIndex
          The current index of the sampling.
protected  double[][] seqWeights
          The weights of the (sub-)sequence used to train the components (internal models).
protected  SafeOutputStream sostream
          This is the stream for writing information while training.
protected  int starts
          The number of starts.
protected  int stationaryIteration
          The number of (stationary) iterations of the Gibbs Sampler.
protected  double[] weights
          The probabilities for each component.
 
Fields inherited from class de.jstacs.sequenceScores.statisticalModels.trainable.AbstractTrainableStatisticalModel
alphabets, length
 
Constructor Summary
protected AbstractMixtureTrainSM(int length, TrainableStatisticalModel[] models, boolean[] optimizeModel, int dimension, int starts, boolean estimateComponentProbs, double[] componentHyperParams, double[] weights, AbstractMixtureTrainSM.Algorithm algorithm, double alpha, TerminationCondition tc, AbstractMixtureTrainSM.Parameterization parametrization, int initialIteration, int stationaryIteration, BurnInTest burnInTest)
          Creates a new AbstractMixtureTrainSM.
protected AbstractMixtureTrainSM(StringBuffer xml)
          The standard constructor for the interface Storable.
 
Method Summary
 boolean algorithmHasBeenRun()
          This method indicates whether the parameters of the model have been determined by the internal algorithm.
protected  void checkLength(int index, int l)
          This method checks whether the length l of the model with index index is compatible with the current instance.
protected  void checkModelsForGibbsSampling()
          This method can be used to check whether the necessary models have implemented the SamplingComponent.
 AbstractMixtureTrainSM clone()
          Follows the conventions of Object's clone()-method.
protected  double continueIterations(double[] dataWeights, double[][] seqweights)
          This method will run the train algorithm for the current model on the internal data set.
protected  double continueIterations(double[] dataWeights, double[][] seqweights, int iterations, int start)
          This method will run the train algorithm for the current model on the internal sample.
protected  double[][] createSeqWeightsArray()
          Creates an array that can be used for weighting sequences in the algorithm.
protected  double[][] doFirstIteration(DataSet data, double[] dataWeights)
          This method will do the first step in the train algorithm for the current model.
protected  double[][] doFirstIteration(DataSet data, double[] dataWeights, MultivariateRandomGenerator m, MRGParams[] params)
          This method will do the first step in the train algorithm for the current model.
protected abstract  double[][] doFirstIteration(double[] dataWeights, MultivariateRandomGenerator m, MRGParams[] params)
          This method will do the first step in the train algorithm for the current model on the internal data set.
static int draw(double[] w, int start)
          This method draws an index of an array corresponding to the probabilities encoded in the entries of the array.
 DataSet emitDataSet(int n, int... lengths)
          This method returns a DataSet object containing artificial sequence(s).
protected abstract  Sequence[] emitDataSetUsingCurrentParameterSet(int n, int... lengths)
          The method returns an array of sequences using the current parameter set.
protected  void extendSampling(int sampling)
          This method prepares the model to extend an existing sampling.
protected  void extractFurtherInformation(StringBuffer xml)
          This method is used in the subclasses to extract further information from the XML representation and to set these as values of the instance.
protected  void finalize()
           
protected  void fromXML(StringBuffer representation)
          This method should only be used by the constructor that works on a StringBuffer.
 ResultSet getCharacteristics()
          Returns some information characterizing or describing the current instance.
protected  StringBuffer getFurtherInformation()
          This method is used in the subclasses to append further information to the XML representation.
 int getIndexOfMaximalComponentFor(Sequence s)
          Returns the index i of the component with P(i|s) maximal.
 String getInstanceName()
          Should return a short instance name such as iMM(0), BN(2), ...
 double getLogPriorTerm()
          Returns a value that is proportional to the log of the prior.
protected  double getLogPriorTermForComponentProbs()
          This method computes the part of the prior that comes from the component probabilities.
 double getLogProbFor(int component, Sequence s)
          Returns the logarithmic probability for the sequence and the given component.
 double getLogProbFor(Sequence sequence, int startpos, int endpos)
          Returns the logarithm of the probability of (a part of) the given sequence given the model.
protected abstract  double getLogProbUsingCurrentParameterSetFor(int component, Sequence s, int start, int end)
          Returns the logarithmic probability for the sequence and the given component using the current parameter set.
 double[] getLogScoreFor(DataSet data)
          This method computes the logarithm of the scores of all sequences in the given data set.
 TrainableStatisticalModel getModel(int i)
          Returns a deep copy of the i-th model.
 TrainableStatisticalModel[] getModels()
          Returns a deep copy of the models.
protected  MultivariateRandomGenerator getMRG()
          This method creates the multivariate random generator that will be used during initialization.
protected  MRGParams getMRGParams()
          This method creates the parameters used in a multivariate random generator during initialization.
 String getNameOfAlgorithm()
          Returns the name of the used algorithm.
protected  void getNewComponentProbs(double[] weights)
          Estimates the weights of each component.
protected  void getNewParameters(int iteration, double[][] seqWeights, double[] w)
          This method trains the internal models on the internal data set and the given weights.
protected  void getNewParametersForModel(int modelIndex, int iteration, int sampleIndex, double[] seqWeights)
          This method trains the internal model with index modelIndex on the internal data set and the given weights.
protected abstract  double getNewWeights(double[] dataWeights, double[] w, double[][] seqweights)
          Computes sequence weights and returns the score.
 int getNumberOfComponents()
          Returns the number of components that are modeled by this AbstractMixtureTrainSM.
 NumericalResultSet getNumericalCharacteristics()
          Returns the subset of numerical values that are also returned by SequenceScore.getCharacteristics().
 double getScoreForBestRun()
          Returns the value of the optimized function from the best run of the last training.
 double[] getWeights()
          This method returns a deep copy of the weights for each component.
protected  void initModelForSampling(int starts)
          This method initializes the model for the sampling.
protected  void initWithPrior(double[] w)
          This method sets the initial weights before counting the usage of each component.
 boolean isInitialized()
          This method can be used to determine whether the instance is initialized.
protected  boolean isInSamplingMode()
          This method returns true if the object is currently used in a sampling, otherwise false.
 double iterate(DataSet data, double[] dataWeights, MultivariateRandomGenerator m, MRGParams[] params)
          This method runs the train algorithm for the current model.
protected  double iterate(int start, double[] dataWeights, MultivariateRandomGenerator m, MRGParams[] params)
          This method runs the train algorithm for the current model and the internal data set.
static int max(double[] w, int start, int end)
          This method returns the index of a maximal entry in the array w between index start and end.
protected  double modifyWeights(double[] w)
          This method modifies the computed weights for one sequence and returns the score.
protected  boolean parseNextParameterSet()
          This method allows the user to parse the next set of parameters (from a file).
protected  boolean parseParameterSet(int sampling, int burnInIteration)
          This method allows the user to parse the set of parameters with index burnInIteration of a specific sampling (from a file).
protected  void samplingStopped()
          This method is the opposite of the method initModelForSampling(int).
 void setAlpha(double alpha)
          Sets the parameter of the Dirichlet distribution which is used when you invoke train to init the gammas.
 void setOutputStream(OutputStream o)
          Sets the OutputStream that is used e.g.
protected abstract  void setTrainData(DataSet data)
          This method is invoked by the train-method and sets for a given data set the data set that should be used for train.
protected  void setWeights(double... weights)
          Sets the weights of each component.
protected  void swap()
          This method swaps the current component models with the alternative model.
 StringBuffer toXML()
          This method returns an XML representation as StringBuffer of an instance of the implementing class.
 void train(DataSet data, double[] dataWeights)
          Trains the TrainableStatisticalModel object given the data as DataSet using the specified weights.
 
Methods inherited from class de.jstacs.sequenceScores.statisticalModels.trainable.AbstractTrainableStatisticalModel
check, getAlphabetContainer, getLength, getLogProbFor, getLogProbFor, getLogScoreFor, getLogScoreFor, getLogScoreFor, getLogScoreFor, getMaximalMarkovOrder, toString, train
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, wait, wait, wait
 
Methods inherited from interface de.jstacs.sequenceScores.SequenceScore
toString
 

Field Detail

weights

protected double[] weights
The probabilities for each component.


logWeights

protected double[] logWeights
The log probabilities for each component.


componentHyperParams

protected double[] componentHyperParams
The hyperparameters for estimating the probabilities of the components.


model

protected TrainableStatisticalModel[] model
The model for the sequences.


alternativeModel

protected TrainableStatisticalModel[] alternativeModel
The alternative models for the EM.


starts

protected int starts
The number of starts.


dimension

protected int dimension
The number of dimensions.


best

protected double best
This field contains the value of the objective function of the best start of the training.


sostream

protected SafeOutputStream sostream
This is the stream for writing information while training.


sample

protected DataSet[] sample
The data set that was used in the last training. Will not be stored in the StringBuffer when invoking toXML().


estimateComponentProbs

protected boolean estimateComponentProbs
The switch for estimating the component probabilities or not.


optimizeModel

protected boolean[] optimizeModel
A switch for each model whether to optimize/adjust or not.


algorithm

protected AbstractMixtureTrainSM.Algorithm algorithm
The type of algorithm.


algorithmHasBeenRun

protected boolean algorithmHasBeenRun
A switch which indicates that the algorithm for determining the parameters has been run.


initialIteration

protected int initialIteration
The number of initial iterations.


stationaryIteration

protected int stationaryIteration
The number of (stationary) iterations of the Gibbs Sampler.


burnInTest

protected BurnInTest burnInTest
The BurnInTest that is used to stop the sampling.


filewriter

protected BufferedWriter filewriter
Saving component probabilities in a file.


filereader

protected BufferedReader filereader
Reading component probabilities from a file.


file

protected File[] file
The file in which the component probabilities are stored.


counter

protected int[] counter
The current index of the parameter set during adjustment (optimization).


samplingIndex

protected int samplingIndex
The current index of the sampling.


compProb

protected double[] compProb
This array is used while training to avoid creating many new objects.


seqWeights

protected double[][] seqWeights
The weights of the (sub-)sequence used to train the components (internal models). The first dimension is used for the models, the second for the (sub-)sequences.

Constructor Detail

AbstractMixtureTrainSM

protected AbstractMixtureTrainSM(int length,
                                 TrainableStatisticalModel[] models,
                                 boolean[] optimizeModel,
                                 int dimension,
                                 int starts,
                                 boolean estimateComponentProbs,
                                 double[] componentHyperParams,
                                 double[] weights,
                                 AbstractMixtureTrainSM.Algorithm algorithm,
                                 double alpha,
                                 TerminationCondition tc,
                                 AbstractMixtureTrainSM.Parameterization parametrization,
                                 int initialIteration,
                                 int stationaryIteration,
                                 BurnInTest burnInTest)
                          throws CloneNotSupportedException,
                                 IllegalArgumentException,
                                 WrongAlphabetException
Creates a new AbstractMixtureTrainSM. This constructor can be used for any algorithm since it takes all necessary values as parameters.

Parameters:
length - the length used in this model
models - the single models building the AbstractMixtureTrainSM, if the model is trained using AbstractMixtureTrainSM.Algorithm.GIBBS_SAMPLING the models that will be adjusted have to implement SamplingComponent
optimizeModel - an array of switches to determine whether a model should be optimized or not
dimension - the number of components
starts - the number of times the algorithm will be started in the train-method, at least 1
estimateComponentProbs - the switch for estimating the component probabilities in the algorithm or to hold them fixed; if the component parameters are fixed, the values of weights will be used, otherwise the componentHyperParams will be incorporated in the adjustment
componentHyperParams - the hyperparameters for the component assignment prior
  • will only be used if estimateComponentProbs == true
  • the array has to be null or has to have length dimension
  • if null or an array with all values zero (0), ML is used
  • otherwise (all values positive) a prior is used (MAP, MP, ...)
  • depends on the parameterization
weights - null or the weights for the components (then weights.length == dimension)
algorithm - either AbstractMixtureTrainSM.Algorithm.EM or AbstractMixtureTrainSM.Algorithm.GIBBS_SAMPLING
alpha - only for AbstractMixtureTrainSM.Algorithm.EM
the positive parameter for the Dirichlet distribution which is used when you invoke train to initialize the gammas. It is recommended to use alpha = 1 (uniform distribution on a simplex).
tc - only for AbstractMixtureTrainSM.Algorithm.EM
the TerminationCondition for stopping the EM-algorithm, tc has to return true from TerminationCondition.isSimple()
parametrization - only for AbstractMixtureTrainSM.Algorithm.EM
the type of the component probability parameterization;
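The constraints stated for the parameters above can be summarized in a small validation sketch. The class MixtureArgsSketch and its methods are hypothetical and do not reproduce the checks the constructor actually performs; the requirement dimension >= 1 is an assumption, and alpha is only relevant for the EM algorithm.

```java
final class MixtureArgsSketch {
    /** Checks the documented constraints; throws IllegalArgumentException on violation. */
    static void check(int dimension, int starts, double[] componentHyperParams,
                      double[] weights, double alpha) {
        if (dimension < 1) { // assumption: at least one component
            throw new IllegalArgumentException("dimension has to be at least 1");
        }
        if (starts < 1) {
            throw new IllegalArgumentException("starts has to be at least 1");
        }
        if (componentHyperParams != null && componentHyperParams.length != dimension) {
            throw new IllegalArgumentException("componentHyperParams needs length dimension");
        }
        if (weights != null && weights.length != dimension) {
            throw new IllegalArgumentException("weights needs length dimension");
        }
        if (!(alpha > 0)) { // also rejects NaN
            throw new IllegalArgumentException("alpha has to be positive");
        }
    }

    /** True if the hyperparameters select plain ML (null or all values zero). */
    static boolean isMaximumLikelihood(double[] componentHyperParams) {
        if (componentHyperParams == null) {
            return true;
        }
        for (double h : componentHyperParams) {
            if (h != 0) {
                return false;
            }
        }
        return true;
    }
}
```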

AbstractMixtureTrainSM

protected AbstractMixtureTrainSM(StringBuffer xml)
                          throws NonParsableException
The standard constructor for the interface Storable. Creates a new AbstractMixtureTrainSM out of its XML representation.

Parameters:
xml - the XML representation of the model as StringBuffer
Throws:
NonParsableException - if the StringBuffer can not be parsed
Method Detail

clone

public AbstractMixtureTrainSM clone()
                             throws CloneNotSupportedException
Description copied from class: AbstractTrainableStatisticalModel
Follows the conventions of Object's clone()-method.

Specified by:
clone in interface SequenceScore
Specified by:
clone in interface TrainableStatisticalModel
Overrides:
clone in class AbstractTrainableStatisticalModel
Returns:
an object, that is a copy of the current AbstractTrainableStatisticalModel (the member-AlphabetContainer isn't deeply cloned since it is assumed to be immutable). The type of the returned object is defined by the class X directly inherited from AbstractTrainableStatisticalModel. Hence X's clone()-method should work as:
1. Object o = (X)super.clone();
2. all additional member variables of o defined by X that are not of simple data-types like int, double, ... have to be deeply copied
3. return o
Throws:
CloneNotSupportedException - if something went wrong while cloning
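The three steps listed under Returns can be illustrated with a stand-alone class. CloneSketch is hypothetical and takes the place of the class X mentioned above; it extends Object directly rather than AbstractTrainableStatisticalModel so that the example stays self-contained.

```java
class CloneSketch implements Cloneable {
    private int length = 3;                  // simple type: copied by super.clone()
    private double[] weights = {0.5, 0.5};   // array: has to be copied deeply

    @Override
    public CloneSketch clone() throws CloneNotSupportedException {
        CloneSketch o = (CloneSketch) super.clone();  // step 1: shallow copy
        o.weights = weights.clone();                  // step 2: deep copy of non-simple members
        return o;                                     // step 3
    }

    int getLength() { return length; }
    double getWeight(int i) { return weights[i]; }
    void setWeight(int i, double value) { weights[i] = value; }
}
```

Without step 2 the original and the copy would share one weights array, so changing one instance would silently change the other.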

getMRG

protected MultivariateRandomGenerator getMRG()
This method creates the multivariate random generator that will be used during initialization.

Returns:
a multivariate random generator
See Also:
getMRGParams()

getMRGParams

protected MRGParams getMRGParams()
This method creates the parameters used in a multivariate random generator during initialization.

Returns:
the parameters for the multivariate random generator
See Also:
getMRG()

train

public void train(DataSet data,
                  double[] dataWeights)
           throws Exception
Description copied from interface: TrainableStatisticalModel
Trains the TrainableStatisticalModel object given the data as DataSet using the specified weights. The weight at position i belongs to the element at position i, so the weight array should have one entry per sequence in the data set. (It is possible to pass null if all weights have the value one.)
This method should work non-incrementally. That means the result of the following series: train(data1); train(data2) should be a fully trained model over data2 and not over data1+data2. All parameters of the model were given by the call of the constructor.

Parameters:
data - the given sequences as DataSet
dataWeights - the weights of the elements, each weight should be non-negative
Throws:
Exception - if the training did not succeed (e.g. the dimension of weights and the number of sequences in the data set do not match)
See Also:
DataSet.getElementAt(int), DataSet.ElementEnumerator
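The non-incremental contract and the handling of null weights can be illustrated with a toy model. ToyTrainSketch is hypothetical and unrelated to the library's classes: it estimates the probability of the symbol 1 in an int array, which stands in for a DataSet.

```java
final class ToyTrainSketch {
    private double probOfOne = Double.NaN;   // estimated P(symbol == 1)

    /** Re-estimates the parameter from scratch; weights == null means weight 1 each. */
    void train(int[] data, double[] weights) {
        if (weights != null && weights.length != data.length) {
            throw new IllegalArgumentException("weights and data differ in length");
        }
        double ones = 0, total = 0;          // fresh statistics on every call
        for (int i = 0; i < data.length; i++) {
            double w = (weights == null) ? 1.0 : weights[i];
            if (w < 0) {
                throw new IllegalArgumentException("negative weight");
            }
            total += w;
            if (data[i] == 1) {
                ones += w;
            }
        }
        probOfOne = ones / total;
    }

    double getProbOfOne() { return probOfOne; }
}
```

Because the statistics are local to each call, a second call to train forgets the first data set entirely.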

swap

protected void swap()
This method swaps the current component models with the alternative model.

This method should NOT be made public and should ONLY be used in the train-method.


setTrainData

protected abstract void setTrainData(DataSet data)
                              throws Exception
This method is invoked by the train-method and sets, for a given data set, the data set that should be used for training.

Parameters:
data - the given data set of sequences
Throws:
Exception - if something went wrong

createSeqWeightsArray

protected double[][] createSeqWeightsArray()
Creates an array that can be used for weighting sequences in the algorithm.

Returns:
an array that can be used for weighting sequences in the algorithm
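A minimal sketch of such an array, following the layout documented for the field seqWeights (first dimension for the models, second for the sequences). SeqWeightsSketch is hypothetical, and the rectangular shape is an assumption: subclasses working on sub-sequences may need a different number of entries per component.

```java
final class SeqWeightsSketch {
    /** Allocates a zero-filled array indexed as [component][sequence]. */
    static double[][] create(int dimension, int numberOfSequences) {
        return new double[dimension][numberOfSequences];
    }
}
```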

iterate

public double iterate(DataSet data,
                      double[] dataWeights,
                      MultivariateRandomGenerator m,
                      MRGParams[] params)
               throws Exception
This method runs the train algorithm for the current model.

Parameters:
data - the data set of sequences
dataWeights - the weights for each sequence or null
m - the random generator for initiating the algorithm
params - the parameters for the sequences
Returns:
the score
Throws:
Exception - if something went wrong
See Also:
doFirstIteration(DataSet, double[], MultivariateRandomGenerator, MRGParams[]), continueIterations(double[], double[][]), continueIterations(double[], double[][], int, int)

iterate

protected double iterate(int start,
                         double[] dataWeights,
                         MultivariateRandomGenerator m,
                         MRGParams[] params)
                  throws Exception
This method runs the train algorithm for the current model and the internal data set.

Parameters:
start - the index of the training
dataWeights - the weights for each sequence or null
m - the random generator for initiating the algorithm
params - the parameters for the sequences
Returns:
the score
Throws:
Exception - if something went wrong
See Also:
doFirstIteration(DataSet, double[], MultivariateRandomGenerator, MRGParams[]), continueIterations(double[], double[][]), continueIterations(double[], double[][], int, int)

doFirstIteration

protected double[][] doFirstIteration(DataSet data,
                                      double[] dataWeights)
                               throws Exception
This method will do the first step in the train algorithm for the current model. The initialization will be done by randomly setting the component membership. This is useful when nothing is known about the problem.

Parameters:
data - the data set of sequences
dataWeights - null or the weights of each element of the data set
Returns:
the weighting array used to initialize, this array can be reused in the following iterations
Throws:
Exception - if something went wrong
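One way to set the component membership randomly is to draw, for every sequence, a point from a Dirichlet distribution. The sketch below assumes alpha = 1 (the uniform distribution on the simplex), where a sample can be obtained by normalizing exponentially distributed variates. RandomInitSketch is hypothetical and does not use the library's MultivariateRandomGenerator.

```java
final class RandomInitSketch {
    /**
     * Draws, for every sequence, component memberships from a flat
     * Dirichlet distribution (alpha = 1) by normalizing Exp(1) variates.
     * The result is indexed as [component][sequence].
     */
    static double[][] randomMemberships(int dimension, int numberOfSequences,
                                        java.util.Random rnd) {
        double[][] seqWeights = new double[dimension][numberOfSequences];
        for (int n = 0; n < numberOfSequences; n++) {
            double sum = 0;
            for (int c = 0; c < dimension; c++) {
                // 1 - nextDouble() lies in (0, 1], so the logarithm is finite
                double e = -Math.log(1.0 - rnd.nextDouble());
                seqWeights[c][n] = e;
                sum += e;
            }
            for (int c = 0; c < dimension; c++) {
                seqWeights[c][n] /= sum;   // memberships of sequence n sum to 1
            }
        }
        return seqWeights;
    }
}
```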

doFirstIteration

protected double[][] doFirstIteration(DataSet data,
                                      double[] dataWeights,
                                      MultivariateRandomGenerator m,
                                      MRGParams[] params)
                               throws Exception
This method will do the first step in the train algorithm for the current model. The initialization will be done by randomly setting the component membership. This is useful when nothing is known about the problem.

Parameters:
data - the data set of sequences
dataWeights - null or the weights of each element of the data set
m - the multivariate random generator
params - the parameters for the multivariate random generator
Returns:
the weighting array used to initialize, this array can be reused in the following iterations
Throws:
Exception - if something went wrong

doFirstIteration

protected abstract double[][] doFirstIteration(double[] dataWeights,
                                               MultivariateRandomGenerator m,
                                               MRGParams[] params)
                                        throws Exception
This method will do the first step in the train algorithm for the current model on the internal data set. The initialization will be done by randomly setting the component membership. This is useful when nothing is known about the problem.

Parameters:
dataWeights - null or the weights of each element of the data set
m - the multivariate random generator
params - the parameters for the multivariate random generator
Returns:
the weighting array used to initialize, this array can be reused in the following iterations
Throws:
Exception - if something went wrong

continueIterations

protected double continueIterations(double[] dataWeights,
                                    double[][] seqweights)
                             throws Exception
This method will run the train algorithm for the current model on the internal data set. The initialization will be done by using the models of the AbstractMixtureTrainSM, so in this case the models have to be trained already. This method is useful for restarting the train algorithm at a certain point. The algorithm will stop if the difference between the values of the optimized function in two successive iterations is smaller than the specified threshold.

If the difference becomes significantly negative, an exception is thrown.

Parameters:
dataWeights - null or the weights of each element of the internal data set (last data set the AbstractMixtureTrainSM was trained on)
seqweights - null or an array for weighting the sequences, see createSeqWeightsArray()
Returns:
a score for the model
Throws:
Exception - if something went wrong
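The stopping rule described above can be sketched as a plain loop. ConvergenceLoopSketch, its Step interface, and the parameters threshold and tolerance are hypothetical; the library itself uses a TerminationCondition, which this simplification does not reproduce.

```java
final class ConvergenceLoopSketch {
    /** One training iteration; returns the new value of the optimized function. */
    interface Step {
        double next();
    }

    /**
     * Iterates until the improvement falls below threshold.
     * Throws if the value decreases by more than tolerance.
     */
    static double run(Step step, double initialScore, double threshold, double tolerance) {
        double old = initialScore;
        while (true) {
            double current = step.next();
            double diff = current - old;
            if (diff < -tolerance) {
                throw new IllegalStateException("objective decreased by " + (-diff));
            }
            old = current;
            if (diff < threshold) {
                return current;   // converged
            }
        }
    }
}
```

A small tolerance is used instead of a strict comparison with zero because rounding errors can produce tiny negative differences near convergence.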

continueIterations

protected double continueIterations(double[] dataWeights,
                                    double[][] seqweights,
                                    int iterations,
                                    int start)
                             throws Exception
This method will run the train algorithm for the current model on the internal sample. The initialization will be done by using the models of the AbstractMixtureTrainSM. So in this case the models have to be trained already. This method is useful for restarting the algorithm at a certain point. The algorithm will stop after the number of iterations.

Parameters:
dataWeights - null or the weights of each element of the internal sample (last sample the AbstractMixtureTrainSM was trained on)
seqweights - null or an array for weighting the sequences, see createSeqWeightsArray()
iterations - the number of iterations that should be done
start - the index of the run in a TrainableStatisticalModel.train(DataSet)-call
Returns:
the current score (likelihood or posterior)
Throws:
Exception - if something went wrong

getNewParameters

protected void getNewParameters(int iteration,
                                double[][] seqWeights,
                                double[] w)
                         throws Exception
This method trains the internal models on the internal data set and the given weights.

Parameters:
iteration - the number of times this method has been invoked
seqWeights - the weights for each model and sequence
w - the weights for the components
Throws:
Exception - if the training of the internal models went wrong

getNewParametersForModel

protected void getNewParametersForModel(int modelIndex,
                                        int iteration,
                                        int sampleIndex,
                                        double[] seqWeights)
                                 throws Exception
This method trains the internal model with index modelIndex on the internal data set and the given weights.

Parameters:
modelIndex - the index of the model
iteration - the number of times this method has been invoked for this model
sampleIndex - the index of the internal data set that should be used
seqWeights - the weights for each sequence
Throws:
Exception - if the training of the internal model went wrong

getNewWeights

protected abstract double getNewWeights(double[] dataWeights,
                                        double[] w,
                                        double[][] seqweights)
                                 throws Exception
Computes sequence weights and returns the score.

Parameters:
dataWeights - the weights for the internal data set (should not be changed)
w - the array for the statistic of the component parameters (shall be filled)
seqweights - an array containing for each component the weights for each sequence (shall be filled)
Returns:
the score
Throws:
Exception - if something went wrong
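For the EM algorithm, an implementation of this method would typically perform the E-step. The sketch below shows one way to do this under stated assumptions: EStepSketch is hypothetical, the joint log probabilities log P(s, component) are passed in as an array instead of being computed by the internal models, and the usage counts are added to whatever values w already holds.

```java
final class EStepSketch {
    /**
     * @param logJoint    logJoint[c][n] = log P(sequence n, component c)
     * @param dataWeights weight per sequence, or null for weight 1 each
     * @param w           usage statistic per component; counts are added to it
     * @param seqWeights  filled with dataWeight * P(c | sequence n), indexed [c][n]
     * @return the weighted log-likelihood of the data
     */
    static double eStep(double[][] logJoint, double[] dataWeights,
                        double[] w, double[][] seqWeights) {
        int dimension = logJoint.length;
        int numberOfSequences = logJoint[0].length;
        double score = 0;
        for (int n = 0; n < numberOfSequences; n++) {
            // log-sum-exp over the components for numerical stability
            double max = Double.NEGATIVE_INFINITY;
            for (int c = 0; c < dimension; c++) {
                max = Math.max(max, logJoint[c][n]);
            }
            double sum = 0;
            for (int c = 0; c < dimension; c++) {
                sum += Math.exp(logJoint[c][n] - max);
            }
            double logSum = max + Math.log(sum);   // log P(sequence n)
            double dw = (dataWeights == null) ? 1.0 : dataWeights[n];
            for (int c = 0; c < dimension; c++) {
                double posterior = Math.exp(logJoint[c][n] - logSum);
                seqWeights[c][n] = dw * posterior;
                w[c] += dw * posterior;
            }
            score += dw * logSum;
        }
        return score;
    }
}
```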

modifyWeights

protected double modifyWeights(double[] w)
This method modifies the computed weights for one sequence and returns the score.

Parameters:
w - the weights
Returns:
the score

initWithPrior

protected void initWithPrior(double[] w)
This method sets the initial weights before counting the usage of each component. For ML the weights are set to 0 and for MAP they are set to the component hyperparameters.

Parameters:
w - the array of weights
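The rule stated above can be written down directly. InitWithPriorSketch is hypothetical; it takes the hyperparameters as an argument instead of reading the field componentHyperParams, and treats null as the ML case.

```java
final class InitWithPriorSketch {
    /** Overwrites w with the starting counts: zeros for ML, the hyperparameters for MAP. */
    static void initWithPrior(double[] w, double[] componentHyperParams) {
        for (int i = 0; i < w.length; i++) {
            w[i] = (componentHyperParams == null) ? 0.0 : componentHyperParams[i];
        }
    }
}
```

Hyperparameters that are all zero lead to the same starting counts as the ML case, which matches the constructor documentation.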

getLogProbFor

public double getLogProbFor(int component,
                            Sequence s)
                     throws Exception
Returns the logarithmic probability for the sequence and the given component.

Parameters:
component - the index of the component
s - the sequence
Returns:
log P(s,component) = log P(s|component) + log P(component)
Throws:
Exception - if the model was not trained yet or something else went wrong
See Also:
getNumberOfComponents()
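
The relation log P(s,component) = log P(s|component) + log P(component) can be illustrated with a small plain-Java sketch. The class JointLogProbSketch and its inputs (an array of conditional log probabilities and an array of component weights) are hypothetical; the sketch also shows how the joint values could be combined into the mixture probability log P(s), which is an assumption about how the values are used and not a statement about the library code.

```java
// Hypothetical sketch; not the Jstacs implementation.
final class JointLogProbSketch {

    /** log P(s, k) = log P(s | k) + log P(k). */
    static double logJoint(double logCond, double weight) {
        return logCond + Math.log(weight);
    }

    /** log P(s) = log sum_k P(s, k), computed in a numerically stable way. */
    static double logMarginal(double[] logCond, double[] weights) {
        double[] joint = new double[weights.length];
        double max = Double.NEGATIVE_INFINITY;
        for (int k = 0; k < weights.length; k++) {
            joint[k] = logJoint(logCond[k], weights[k]);
            max = Math.max(max, joint[k]);
        }
        if (max == Double.NEGATIVE_INFINITY) {
            return max; // the sequence has probability 0 under every component
        }
        double sum = 0;
        for (double j : joint) {
            sum += Math.exp(j - max);
        }
        return max + Math.log(sum);
    }

    public static void main(String[] args) {
        double[] logCond = { Math.log(0.2), Math.log(0.5) };
        double[] weights = { 0.5, 0.5 };
        System.out.println(logJoint(logCond[0], weights[0]));
        System.out.println(logMarginal(logCond, weights));
    }
}
```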

getLogProbUsingCurrentParameterSetFor

protected abstract double getLogProbUsingCurrentParameterSetFor(int component,
                                                                Sequence s,
                                                                int start,
                                                                int end)
                                                         throws Exception
Returns the logarithmic probability for the sequence and the given component using the current parameter set.

Parameters:
component - the index of the component
s - the sequence
start - the start position in the sequence
end - the end position in the sequence
Returns:
log P(s,component) = log P(s|component) + log P(component)
Throws:
Exception - if not trained yet or something else went wrong
See Also:
getNumberOfComponents()

getLogProbFor

public final double getLogProbFor(Sequence sequence,
                                  int startpos,
                                  int endpos)
                           throws Exception
Description copied from interface: StatisticalModel
Returns the logarithm of the probability of (a part of) the given sequence given the model. If at least one random variable is continuous, the value of the density function is returned.

This method extends StatisticalModel.getLogProbFor(Sequence, int): because the model may be, e.g., homogeneous, the length of the sequences whose probability should be returned is not fixed. In addition to the start position, the end position of the part of the given sequence is given, and the probability of the part from position startpos to endpos (inclusive) should be returned.
The length and the alphabets define the type of data that can be modeled, and therefore both have to be checked.

Parameters:
sequence - the given sequence
startpos - the start position within the given sequence
endpos - the last position to be taken into account
Returns:
the logarithm of the probability or the value of the density function of (the part of) the given sequence given the model
Throws:
Exception - if the sequence could not be handled (e.g. startpos > endpos, endpos > sequence.length, ...) by the model
NotTrainedException - if the model is not trained yet

getLogScoreFor

public final double[] getLogScoreFor(DataSet data)
                              throws Exception
Description copied from interface: SequenceScore
This method computes the logarithm of the scores of all sequences in the given data set. The values are stored in an array according to the index of the respective sequence in the data set.

The score for any sequence shall be computed independently of all other sequences in the data set. So the result should be exactly the same as for the method SequenceScore.getLogScoreFor(Sequence).

Specified by:
getLogScoreFor in interface SequenceScore
Overrides:
getLogScoreFor in class AbstractTrainableStatisticalModel
Parameters:
data - the data set of sequences
Returns:
an array containing the logarithm of the score of all sequences of the data set
Throws:
Exception - if something went wrong
See Also:
SequenceScore.getLogScoreFor(Sequence)

getLogPriorTerm

public double getLogPriorTerm()
                       throws Exception
Description copied from interface: StatisticalModel
Returns a value that is proportional to the log of the prior. For maximum likelihood (ML) 0 should be returned.

Returns:
a value that is proportional to the log of the prior
Throws:
Exception - if something went wrong

getLogPriorTermForComponentProbs

protected final double getLogPriorTermForComponentProbs()
This method computes the part of the prior that comes from the component probabilities.

Returns:
the part of the prior that comes from the component probabilities

getScoreForBestRun

public final double getScoreForBestRun()
                                throws NotTrainedException,
                                       OperationNotSupportedException
Returns the value of the optimized function from the best run of the last training.

Returns:
the value of the optimized function from the best run of the last training
Throws:
NotTrainedException - if the training algorithm has not been run
OperationNotSupportedException - if this method is used for an instance that does not use the EM algorithm
See Also:
train(DataSet, double[]), algorithmHasBeenRun()

getInstanceName

public String getInstanceName()
Description copied from interface: SequenceScore
Should return a short instance name such as iMM(0), BN(2), ...

Returns:
a short instance name

getIndexOfMaximalComponentFor

public int getIndexOfMaximalComponentFor(Sequence s)
                                  throws Exception
Returns the index i of the component with P(i|s) maximal. To this end, it computes
$\operatorname{argmax}_i P(i|s) = \operatorname{argmax}_i P(s,i)$, which holds because P(s) does not depend on i.
This method can be helpful for clustering.

Parameters:
s - the sequence
Returns:
the index of the component
Throws:
Exception - if the model was not trained yet or something else went wrong
See Also:
getLogProbFor(int, Sequence)
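
For clustering, the selection of the component with maximal P(i|s) can be sketched as below. The class MaxComponentSketch and its inputs are hypothetical; the sketch maximizes the joint log probability log P(s|i) + log P(i), which has the same maximizer as P(i|s), and it resolves ties in favor of the smallest index (an assumption, since the tie-breaking rule is not documented here).

```java
// Hypothetical sketch; not the Jstacs implementation.
final class MaxComponentSketch {

    /**
     * Returns the index k that maximizes log P(s | k) + log P(k).
     * logCond[k] = log P(s | k); weights[k] = P(k).
     */
    static int indexOfMaximalComponent(double[] logCond, double[] weights) {
        int best = 0;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int k = 0; k < weights.length; k++) {
            double score = logCond[k] + Math.log(weights[k]);
            if (score > bestScore) { // strict comparison: the first maximum wins
                bestScore = score;
                best = k;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] logCond = { Math.log(0.2), Math.log(0.1) };
        // The component weights can change the assignment of the same sequence.
        System.out.println(indexOfMaximalComponent(logCond, new double[] { 0.5, 0.5 }));
        System.out.println(indexOfMaximalComponent(logCond, new double[] { 0.3, 0.7 }));
    }
}
```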

getModels

public final TrainableStatisticalModel[] getModels()
                                            throws CloneNotSupportedException
Returns a deep copy of the models.

Returns:
an array of AbstractTrainableStatisticalModels
Throws:
CloneNotSupportedException - if at least one model can not be cloned
See Also:
getModel(int)

getModel

public final TrainableStatisticalModel getModel(int i)
                                         throws CloneNotSupportedException
Returns a deep copy of the i-th model.

Parameters:
i - the index
Returns:
a deep copy of the i-th model
Throws:
CloneNotSupportedException - if at least one model can not be cloned
See Also:
getModels()

getNameOfAlgorithm

public String getNameOfAlgorithm()
Returns the name of the used algorithm.

Returns:
the name of the used algorithm

getNumberOfComponents

public final int getNumberOfComponents()
Returns the number of components that are modeled by this AbstractMixtureTrainSM.

Returns:
the number of components

getCharacteristics

public ResultSet getCharacteristics()
                             throws Exception
Description copied from interface: SequenceScore
Returns some information characterizing or describing the current instance. This could be e.g. the number of edges for a Bayesian network or an image showing some representation of the instance. The set of characteristics should always include the XML-representation of the instance. The corresponding result type is StorableResult.

Specified by:
getCharacteristics in interface SequenceScore
Overrides:
getCharacteristics in class AbstractTrainableStatisticalModel
Returns:
the characteristics of the current instance
Throws:
Exception - if some of the characteristics could not be defined
See Also:
StorableResult

getNumericalCharacteristics

public NumericalResultSet getNumericalCharacteristics()
                                               throws Exception
Description copied from interface: SequenceScore
Returns the subset of numerical values that are also returned by SequenceScore.getCharacteristics().

Returns:
the numerical characteristics of the current instance
Throws:
Exception - if some of the characteristics could not be defined

getWeights

public final double[] getWeights()
This method returns a deep copy of the weights for each component.

Returns:
the weight for each component

algorithmHasBeenRun

public boolean algorithmHasBeenRun()
This method indicates whether the parameters of the model have been determined by the internal algorithm.

Returns:
true if the internal algorithm has been used to determine the parameters of the model

isInitialized

public boolean isInitialized()
Description copied from interface: SequenceScore
This method can be used to determine whether the instance is initialized. If the instance is initialized you should be able to invoke SequenceScore.getLogScoreFor(Sequence).

Returns:
true if the instance is initialized, false otherwise

setAlpha

public final void setAlpha(double alpha)
                    throws IllegalArgumentException
Sets the parameter of the Dirichlet distribution that is used to initialize the gammas when train is invoked. It is recommended to use alpha = 1 (uniform distribution on a simplex).

Parameters:
alpha - the parameter of the Dirichlet distribution with alpha > 0
Throws:
IllegalArgumentException - if alpha <= 0
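
To make the recommended setting concrete: for alpha = 1 the symmetric Dirichlet distribution is the uniform distribution on the simplex, and a draw can be obtained by normalizing independent exponential random variables. The sketch below covers only this special case; the class UniformSimplexSketch is hypothetical and does not show how the library draws the gammas.

```java
import java.util.Random;

// Hypothetical sketch for the special case alpha = 1; not the Jstacs implementation.
final class UniformSimplexSketch {

    /** Draws a vector uniformly from the simplex (symmetric Dirichlet with alpha = 1). */
    static double[] sampleUniformSimplex(int dimension, Random rnd) {
        double[] gamma = new double[dimension];
        double sum = 0;
        for (int k = 0; k < dimension; k++) {
            // 1 - nextDouble() lies in (0, 1], so the logarithm is finite;
            // -log(U) is exponentially distributed, i.e. Gamma(1, 1)
            gamma[k] = -Math.log(1.0 - rnd.nextDouble());
            sum += gamma[k];
        }
        for (int k = 0; k < dimension; k++) {
            gamma[k] /= sum;
        }
        return gamma;
    }

    public static void main(String[] args) {
        double[] gamma = sampleUniformSimplex(3, new Random(42));
        System.out.println(java.util.Arrays.toString(gamma));
    }
}
```

Values of alpha other than 1 would require a general gamma sampler, which is omitted here.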

setOutputStream

public final void setOutputStream(OutputStream o)
Sets the OutputStream that is used e.g. for writing information while training. It is possible to set o=null; then nothing will be written.

Parameters:
o - the OutputStream

getNewComponentProbs

protected void getNewComponentProbs(double[] weights)
                             throws Exception
Estimates the weights of each component.

Parameters:
weights - the array of weights; every element has to be non-negative and the length of the array has to be dimension
Throws:
Exception - if a weight is less than 0
See Also:
getNumberOfComponents()

setWeights

protected void setWeights(double... weights)
                   throws IllegalArgumentException
Sets the weights of each component.

Parameters:
weights - every element has to be non-negative, the sum of all weights has to be 1, and the length of weights has to be dimension
Throws:
IllegalArgumentException - if a weight is less than 0, the sum is not equal to 1, or the dimension is incorrect
See Also:
getNumberOfComponents()
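
The three documented conditions on the weights can be sketched as a standalone check. The class WeightCheckSketch, the explicit dimension argument, and the tolerance of 1e-9 for the sum are assumptions made for this example; the tolerance actually used by the library, if any, is not stated on this page.

```java
// Hypothetical sketch; not the Jstacs implementation.
final class WeightCheckSketch {

    /** Tolerance for the sum check; an assumption, not a documented value. */
    static final double EPS = 1e-9;

    static void checkWeights(int dimension, double... weights) {
        if (weights.length != dimension) {
            throw new IllegalArgumentException("expected " + dimension + " weights");
        }
        double sum = 0;
        for (double x : weights) {
            if (!(x >= 0)) { // written this way so that NaN is rejected as well
                throw new IllegalArgumentException("weights have to be non-negative");
            }
            sum += x;
        }
        if (Math.abs(sum - 1.0) > EPS) {
            throw new IllegalArgumentException("weights have to sum to 1");
        }
    }

    public static void main(String[] args) {
        checkWeights(2, 0.25, 0.75); // passes silently
        try {
            checkWeights(2, 0.5, 0.6);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```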

toXML

public StringBuffer toXML()
Description copied from interface: Storable
This method returns an XML representation as StringBuffer of an instance of the implementing class.

Returns:
the XML representation

getFurtherInformation

protected StringBuffer getFurtherInformation()
This method is used in the subclasses to append further information to the XML representation.

Returns:
a part of the XML representation
See Also:
extractFurtherInformation(StringBuffer)

fromXML

protected void fromXML(StringBuffer representation)
                throws NonParsableException
Description copied from class: AbstractTrainableStatisticalModel
This method should only be used by the constructor that works on a StringBuffer. It is the counterpart of Storable.toXML().

Specified by:
fromXML in class AbstractTrainableStatisticalModel
Parameters:
representation - the XML representation of the model
Throws:
NonParsableException - if the StringBuffer is not parsable or the representation is conflicting
See Also:
AbstractTrainableStatisticalModel.AbstractTrainableStatisticalModel(StringBuffer)

extractFurtherInformation

protected void extractFurtherInformation(StringBuffer xml)
                                  throws NonParsableException
This method is used in the subclasses to extract further information from the XML representation and to set these as values of the instance.

Parameters:
xml - the XML representation
Throws:
NonParsableException - if the XML representation is not parsable
See Also:
getFurtherInformation()

checkModelsForGibbsSampling

protected void checkModelsForGibbsSampling()
This method can be used to check whether the necessary models have implemented the SamplingComponent.


checkLength

protected void checkLength(int index,
                           int l)
This method checks whether the length l of the model with index index is suitable for the current instance. Otherwise an IllegalArgumentException is thrown.

Parameters:
index - the index of the model
l - the length of the model
Throws:
IllegalArgumentException - if the model instance can not be used

emitDataSet

public DataSet emitDataSet(int n,
                           int... lengths)
                    throws Exception
Description copied from interface: StatisticalModel
This method returns a DataSet object containing artificial sequence(s).

There are two different possibilities to create a data set for a model with length 0 (homogeneous models).
  1. emitDataSet( int n, int l ) should return a data set with n sequences of length l.
  2. emitDataSet( int n, int[] l ) should return a data set with n sequences which have a sequence length corresponding to the entry in the given array l.

There are two different possibilities to create a data set for a model with length greater than 0 (inhomogeneous models).
emitDataSet( int n ) and emitDataSet( int n, null ) should return a data set with n sequences of length of the model ( SequenceScore.getLength()).

The standard implementation throws an Exception.

Specified by:
emitDataSet in interface StatisticalModel
Overrides:
emitDataSet in class AbstractTrainableStatisticalModel
Parameters:
n - the number of sequences that should be contained in the returned data set
lengths - the length of the sequences for a homogeneous model; for an inhomogeneous model this parameter should be null or an array of size 0.
Returns:
a DataSet containing the artificial sequence(s)
Throws:
Exception - if the emission did not succeed
NotTrainedException - if the model is not trained yet
See Also:
DataSet

emitDataSetUsingCurrentParameterSet

protected abstract Sequence[] emitDataSetUsingCurrentParameterSet(int n,
                                                                  int... lengths)
                                                           throws Exception
The method returns an array of sequences using the current parameter set.

Parameters:
n - the number of sequences to be sampled
lengths - the corresponding lengths
Returns:
an array of sequences
Throws:
Exception - if it was impossible to sample the sequences
See Also:
StatisticalModel.emitDataSet(int, int...)

parseParameterSet

protected boolean parseParameterSet(int sampling,
                                    int burnInIteration)
                             throws Exception
This method allows the user to parse the set of parameters with index burnInIteration of a specific sampling (from a file).

Parameters:
sampling - the index of the sampling
burnInIteration - the number of iterations that should be skipped
Returns:
true if the parameter set could be parsed
Throws:
Exception - if something went wrong while reading or parsing the parameter set

parseNextParameterSet

protected boolean parseNextParameterSet()
                                 throws Exception
This method allows the user to parse the next set of parameters (from a file).

Returns:
true if the parameter set could be parsed
Throws:
Exception - if something went wrong while reading or parsing the parameter set

initModelForSampling

protected void initModelForSampling(int starts)
                             throws IOException
This method initializes the model for the sampling. For instance this method can be used to create new files where all parameter sets will be stored.

Parameters:
starts - the number of sampling starts
Throws:
IOException - if the files could not be handled properly

extendSampling

protected void extendSampling(int sampling)
                       throws Exception
This method prepares the model to extend an existing sampling.

Parameters:
sampling - the index of the sampling
Throws:
Exception - if the internal files could not be handled properly

samplingStopped

protected void samplingStopped()
                        throws IOException
This method is the opposite of the method initModelForSampling(int). It can be used for closing any streams or writers.

Throws:
IOException - if the FileWriter could not be closed properly

isInSamplingMode

protected boolean isInSamplingMode()
This method returns true if the object is currently used in a sampling, otherwise false.

Returns:
true if the object is currently used in a sampling

finalize

protected void finalize()
                 throws Throwable
Overrides:
finalize in class Object
Throws:
Throwable

draw

public static final int draw(double[] w,
                             int start)
This method draws an index of an array corresponding to the probabilities encoded in the entries of the array.

Parameters:
w - an array containing probabilities starting at position start
start - the start index
Returns:
the drawn index
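
Drawing an index according to probabilities stored in an array can be sketched by inverting the cumulative sum. The class DrawSketch is hypothetical. It assumes that the entries from start to the end of the array sum to 1, and it returns the absolute array index; whether the documented method returns an absolute index or one relative to start is not stated on this page.

```java
import java.util.Random;

// Hypothetical sketch; not the Jstacs implementation.
final class DrawSketch {

    /**
     * Returns the first index i >= start whose cumulative probability exceeds u,
     * where u is a uniform random number in [0, 1).
     */
    static int draw(double[] w, int start, double u) {
        double cumulative = 0;
        for (int i = start; i < w.length; i++) {
            cumulative += w[i];
            if (u < cumulative) {
                return i;
            }
        }
        return w.length - 1; // guards against rounding errors in the sum
    }

    /** Convenience variant that draws u from the given random number generator. */
    static int draw(double[] w, int start, Random rnd) {
        return draw(w, start, rnd.nextDouble());
    }

    public static void main(String[] args) {
        // The entry at index 0 is ignored because start = 1.
        double[] w = { 9.0, 0.2, 0.3, 0.5 };
        System.out.println(draw(w, 1, new Random()));
    }
}
```

Passing u explicitly keeps the selection logic deterministic and therefore easy to test.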

max

public static final int max(double[] w,
                            int start,
                            int end)
This method returns the index of a maximal entry in the array w between index start and end.

Parameters:
w - an array
start - the start index (inclusive)
end - the end index (exclusive)
Returns:
the index of the maximal entry
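
The documented index range (start inclusive, end exclusive) can be illustrated with a short sketch. The class MaxIndexSketch is hypothetical; returning the first index in case of ties and returning start for an empty range are assumptions made for this example.

```java
// Hypothetical sketch; not the Jstacs implementation.
final class MaxIndexSketch {

    /** Returns the index of a maximal entry of w in the range [start, end). */
    static int max(double[] w, int start, int end) {
        int best = start;
        for (int i = start + 1; i < end; i++) {
            if (w[i] > w[best]) { // strict comparison: the first maximum wins
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] w = { 5, 1, 4, 2, 9 };
        System.out.println(max(w, 1, 4)); // only the entries 1, 4, 2 are considered
        System.out.println(max(w, 0, w.length));
    }
}
```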