de.jstacs.sequenceScores.statisticalModels.trainable.mixture.motif
Class ZOOPSTrainSM

java.lang.Object
  extended by de.jstacs.sequenceScores.statisticalModels.trainable.AbstractTrainableStatisticalModel
      extended by de.jstacs.sequenceScores.statisticalModels.trainable.mixture.AbstractMixtureTrainSM
          extended by de.jstacs.sequenceScores.statisticalModels.trainable.mixture.motif.HiddenMotifMixture
              extended by de.jstacs.sequenceScores.statisticalModels.trainable.mixture.motif.ZOOPSTrainSM
All Implemented Interfaces:
MotifDiscoverer, SequenceScore, StatisticalModel, TrainableStatisticalModel, Storable, Cloneable

public class ZOOPSTrainSM
extends HiddenMotifMixture

This class enables the user to search for a single motif in a sequence. The model can be trained under either the "one occurrence per sequence" (OOPS) or the "zero or one occurrence per sequence" (ZOOPS) assumption.

If EM is used for training, the parameters are trained in a MEME-like manner.

Currently only EM is implemented.
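
To make the difference between the two settings concrete, the following minimal sketch computes the log-likelihood of one sequence under OOPS and under ZOOPS from precomputed values. It is not the Jstacs implementation: the class ZoopsLikelihoodSketch, its method names, the uniform prior over start positions, and the input arrays are assumptions chosen purely for illustration.

```java
final class ZoopsLikelihoodSketch {

    /** Numerically stable log(sum_i exp(values[i])). */
    static double logSumExp(double... values) {
        double max = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            max = Math.max(max, v);
        }
        if (max == Double.NEGATIVE_INFINITY) {
            return max; // all terms have probability zero
        }
        double sum = 0;
        for (double v : values) {
            sum += Math.exp(v - max);
        }
        return max + Math.log(sum);
    }

    /**
     * OOPS: exactly one occurrence. logSite[i] is the log-likelihood of the
     * whole sequence given that the motif starts at position i (non-empty array).
     * Assumption: uniform prior over the start positions.
     */
    static double oopsLogLikelihood(double[] logSite) {
        return logSumExp(logSite) - Math.log(logSite.length);
    }

    /**
     * ZOOPS: with probability motifProb the sequence contains one occurrence,
     * otherwise it is explained by the background alone (logBackground).
     */
    static double zoopsLogLikelihood(double[] logSite, double logBackground, double motifProb) {
        double withMotif = Math.log(motifProb) + oopsLogLikelihood(logSite);
        double withoutMotif = Math.log(1 - motifProb) + logBackground;
        return logSumExp(withMotif, withoutMotif);
    }

    public static void main(String[] args) {
        // Hypothetical per-position likelihoods and background likelihood.
        double[] logSite = { Math.log(0.2), Math.log(0.4) };
        double logBackground = Math.log(0.1);
        System.out.println(Math.exp(oopsLogLikelihood(logSite)));
        System.out.println(Math.exp(zoopsLogLikelihood(logSite, logBackground, 0.5)));
    }
}
```

In this sketch, ZOOPS with motifProb = 1 reduces to OOPS, which is one way to read the relationship between the two settings.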

Author:
Jens Keilwagen

Nested Class Summary
 
Nested classes/interfaces inherited from class de.jstacs.sequenceScores.statisticalModels.trainable.mixture.AbstractMixtureTrainSM
AbstractMixtureTrainSM.Algorithm, AbstractMixtureTrainSM.Parameterization
 
Nested classes/interfaces inherited from interface de.jstacs.motifDiscovery.MotifDiscoverer
MotifDiscoverer.KindOfProfile
 
Field Summary
protected  byte bgMaxMarkovOrder
          The order of the background model.
 
Fields inherited from class de.jstacs.sequenceScores.statisticalModels.trainable.mixture.motif.HiddenMotifMixture
posPrior
 
Fields inherited from class de.jstacs.sequenceScores.statisticalModels.trainable.mixture.AbstractMixtureTrainSM
algorithm, algorithmHasBeenRun, alternativeModel, best, burnInTest, componentHyperParams, compProb, counter, dimension, estimateComponentProbs, file, filereader, filewriter, initialIteration, logWeights, model, optimizeModel, sample, samplingIndex, seqWeights, sostream, starts, stationaryIteration, weights
 
Fields inherited from class de.jstacs.sequenceScores.statisticalModels.trainable.AbstractTrainableStatisticalModel
alphabets, length
 
Constructor Summary
  ZOOPSTrainSM(StringBuffer xml)
          The standard constructor for the interface Storable.
protected ZOOPSTrainSM(TrainableStatisticalModel motif, TrainableStatisticalModel bg, boolean trainOnlyMotifModel, int starts, double[] componentHyperParams, double[] weights, PositionPrior posPrior, AbstractMixtureTrainSM.Algorithm algorithm, double alpha, TerminationCondition tc, AbstractMixtureTrainSM.Parameterization parametrization, int initialIteration, int stationaryIteration, BurnInTest burnInTest)
          Creates a new ZOOPSTrainSM.
  ZOOPSTrainSM(TrainableStatisticalModel motif, TrainableStatisticalModel bg, boolean trainOnlyMotifModel, int starts, double[] componentHyperParams, PositionPrior posPrior, double alpha, TerminationCondition tc, AbstractMixtureTrainSM.Parameterization parametrization)
          Creates a new ZOOPSTrainSM using EM and estimating the probability for finding a motif.
  ZOOPSTrainSM(TrainableStatisticalModel motif, TrainableStatisticalModel bg, boolean trainOnlyMotifModel, int starts, double motifProb, PositionPrior posPrior, double alpha, TerminationCondition tc, AbstractMixtureTrainSM.Parameterization parametrization)
          Creates a new ZOOPSTrainSM using EM and fixed probability for finding a motif.
 
Method Summary
protected  double[][] createSeqWeightsArray()
          Creates an array that can be used for weighting sequences in the algorithm.
protected  double[][] doFirstIteration(double[] dataWeights, MultivariateRandomGenerator m, MRGParams[] params)
          This method will do the first step in the train algorithm for the current model on the internal data set.
 int getGlobalIndexOfMotifInComponent(int component, int motif)
          Returns the global index of the motif used in component.
protected  double getLogProbUsingCurrentParameterSetFor(int component, Sequence seq, int start, int end)
          Returns the logarithmic probability for the sequence and the given component using the current parameter set.
 int getMinimalSequenceLength()
          Returns the minimal length that a sequence or a data set must have.
 int getMotifLength(int motif)
          This method returns the length of the motif with index motif.
protected  double getNewWeights(double[] dataWeights, double[] w, double[][] seqweights)
          Computes sequence weights and returns the score.
 int getNumberOfMotifs()
          Returns the number of motifs for this MotifDiscoverer.
 int getNumberOfMotifsInComponent(int component)
          Returns the number of motifs that are used in the component component of this MotifDiscoverer.
 double[] getProfileOfScoresFor(int component, int motif, Sequence sequence, int startpos, MotifDiscoverer.KindOfProfile kind)
          Returns the profile of the scores for component component and motif motif at all possible start positions of the motif in the sequence sequence beginning at startpos.
 double[] getStrandProbabilitiesFor(int component, int motif, Sequence sequence, int startpos)
          This method returns the probabilities of the strand orientations for a given subsequence if it is considered as site of the motif model in a specific component.
protected  double iterate(int start, double[] dataWeights, MultivariateRandomGenerator m, MRGParams[] params)
          This method runs the train algorithm for the current model and the internal data set.
protected  double modify(double[] containsMotif, double[] startpos, int start, int end)
          This method modifies the computed weights for one sequence and returns the score.
 void setShiftCorrection(boolean correct)
          Enables or disables the phase shift correction.
protected  void setTrainData(DataSet data)
          This method is invoked by the train method and, for a given data set, sets the data set that should be used for training.
 void trainBgModel(DataSet data, double[] weights)
          This method trains the background model.
 
Methods inherited from class de.jstacs.sequenceScores.statisticalModels.trainable.mixture.motif.HiddenMotifMixture
checkLength, clone, emitDataSetUsingCurrentParameterSet, extractFurtherInformation, getFurtherInformation, getInstanceName, getNewParameters, toString, train
 
Methods inherited from class de.jstacs.sequenceScores.statisticalModels.trainable.mixture.AbstractMixtureTrainSM
algorithmHasBeenRun, checkModelsForGibbsSampling, continueIterations, continueIterations, doFirstIteration, doFirstIteration, draw, emitDataSet, extendSampling, finalize, fromXML, getCharacteristics, getIndexOfMaximalComponentFor, getLogPriorTerm, getLogPriorTermForComponentProbs, getLogProbFor, getLogProbFor, getLogScoreFor, getModel, getModels, getMRG, getMRGParams, getNameOfAlgorithm, getNewComponentProbs, getNewParametersForModel, getNumberOfComponents, getNumericalCharacteristics, getScoreForBestRun, getWeights, initModelForSampling, initWithPrior, isInitialized, isInSamplingMode, iterate, max, modifyWeights, parseNextParameterSet, parseParameterSet, samplingStopped, setAlpha, setOutputStream, setWeights, swap, toXML
 
Methods inherited from class de.jstacs.sequenceScores.statisticalModels.trainable.AbstractTrainableStatisticalModel
check, getAlphabetContainer, getLength, getLogProbFor, getLogProbFor, getLogScoreFor, getLogScoreFor, getLogScoreFor, getLogScoreFor, getMaximalMarkovOrder, toString, train
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, wait, wait, wait
 
Methods inherited from interface de.jstacs.motifDiscovery.MotifDiscoverer
getIndexOfMaximalComponentFor, getNumberOfComponents
 
Methods inherited from interface de.jstacs.Storable
toXML
 

Field Detail

bgMaxMarkovOrder

protected byte bgMaxMarkovOrder
The order of the background model.

Constructor Detail

ZOOPSTrainSM

protected ZOOPSTrainSM(TrainableStatisticalModel motif,
                       TrainableStatisticalModel bg,
                       boolean trainOnlyMotifModel,
                       int starts,
                       double[] componentHyperParams,
                       double[] weights,
                       PositionPrior posPrior,
                       AbstractMixtureTrainSM.Algorithm algorithm,
                       double alpha,
                       TerminationCondition tc,
                       AbstractMixtureTrainSM.Parameterization parametrization,
                       int initialIteration,
                       int stationaryIteration,
                       BurnInTest burnInTest)
                throws CloneNotSupportedException,
                       IllegalArgumentException,
                       WrongAlphabetException
Creates a new ZOOPSTrainSM. This constructor can be used for any algorithm since it takes all necessary values as parameters.

Parameters:
motif - the motif model; if the model is trained using AbstractMixtureTrainSM.Algorithm.GIBBS_SAMPLING, it has to implement SamplingComponent.
bg - the background model for the flanking sequences and for those sequences that do not contain a binding site; if trainOnlyMotifModel == false and algorithm == AbstractMixtureTrainSM.Algorithm.GIBBS_SAMPLING, the model has to implement SamplingComponent. The model has to be able to score sequences of arbitrary length.
trainOnlyMotifModel - a switch whether to train only the motif model
starts - the number of times the algorithm will be started in the train-method, at least 1
componentHyperParams - the hyperparameters for the component assignment prior
  • will only be used if estimateComponentProbs == true
  • the array has to be null or has to have length dimension
  • null or an array with all values zero (0) then ML
  • otherwise (all values positive) a prior is used (MAP, MP, ...)
  • depends on the parameterization
weights - null or the weights for the components (then weights.length == dimension)
posPrior - this object determines the positional distribution that shall be used
algorithm - either AbstractMixtureTrainSM.Algorithm.EM or AbstractMixtureTrainSM.Algorithm.GIBBS_SAMPLING
alpha - only for AbstractMixtureTrainSM.Algorithm.EM
the positive parameter for the Dirichlet distribution which is used when you invoke train to initialize the gammas. It is recommended to use alpha = 1 (uniform distribution on a simplex).
tc - only for AbstractMixtureTrainSM.Algorithm.EM
the TerminationCondition for stopping the EM-algorithm; tc has to return true from TerminationCondition.isSimple()
parametrization - only for AbstractMixtureTrainSM.Algorithm.EM
the type of the component probability parameterization
initialIteration - only for AbstractMixtureTrainSM.Algorithm.GIBBS_SAMPLING
the positive length of the initial sampling phase (at least 1, at most stationaryIteration/starts)
stationaryIteration - only for AbstractMixtureTrainSM.Algorithm.GIBBS_SAMPLING
the positive length of the stationary phase (at least 1) (summed over all starts), i.e. the number of parameter sets that are used in the approximation
burnInTest - only for AbstractMixtureTrainSM.Algorithm.GIBBS_SAMPLING
the test that will be used to determine the length of the burn-in phase
Throws:
CloneNotSupportedException - if the models can not be cloned
IllegalArgumentException - if
  • the models are not able to score the sequence of the corresponding length
  • weights != null && weights.length != 2
  • weights != null and there exists an i where weights[i] < 0
  • starts < 1
  • componentHyperParams are not correct
  • the algorithm specific parameters are not correct
WrongAlphabetException - if not all models work on the same simple alphabet
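
The role of alpha in the parameter list above can be illustrated with a small sketch that draws one initial weight vector from a Dirichlet distribution with alpha = 1, i.e. uniformly from the simplex. It is an assumption that the gammas are per-sequence component weights of this form; the class DirichletInitSketch and its method are hypothetical and cover only the recommended case alpha = 1, where normalized Exp(1) draws are sufficient.

```java
import java.util.Random;

final class DirichletInitSketch {

    /**
     * Draws a point uniformly from the (k-1)-simplex, i.e. from a
     * Dirichlet(1,...,1) distribution, by normalizing k independent
     * Exp(1) variables. Only covers alpha = 1 (assumption for brevity).
     */
    static double[] drawUniformOnSimplex(int k, Random rng) {
        double[] gamma = new double[k];
        double sum = 0;
        for (int i = 0; i < k; i++) {
            // 1 - nextDouble() lies in (0,1], so the logarithm is finite.
            gamma[i] = -Math.log(1.0 - rng.nextDouble());
            sum += gamma[i];
        }
        for (int i = 0; i < k; i++) {
            gamma[i] /= sum;
        }
        return gamma;
    }

    public static void main(String[] args) {
        // Two components (motif / no motif), as in weights.length == 2.
        double[] gamma = drawUniformOnSimplex(2, new Random(42));
        System.out.println(gamma[0] + " " + gamma[1]);
    }
}
```

Other values of alpha would require a Gamma sampler, which the Java standard library does not provide; that case is deliberately left out here.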

ZOOPSTrainSM

public ZOOPSTrainSM(TrainableStatisticalModel motif,
                    TrainableStatisticalModel bg,
                    boolean trainOnlyMotifModel,
                    int starts,
                    double[] componentHyperParams,
                    PositionPrior posPrior,
                    double alpha,
                    TerminationCondition tc,
                    AbstractMixtureTrainSM.Parameterization parametrization)
             throws CloneNotSupportedException,
                    IllegalArgumentException,
                    WrongAlphabetException
Creates a new ZOOPSTrainSM using EM and estimating the probability for finding a motif.

Parameters:
motif - the motif model
bg - the background model for the flanking sequences and for those sequences that do not contain a binding site. The model has to be able to score sequences of arbitrary length.
starts - the number of times the algorithm will be started in the train-method, at least 1
componentHyperParams - the hyperparameters for the component assignment prior
  • will only be used if estimateComponentProbs == true
  • the array has to be null or has to have length dimension
  • null or an array with all values zero (0) then ML
  • otherwise (all values positive) a prior is used (MAP, MP, ...)
  • depends on the parameterization
posPrior - this object determines the positional distribution that shall be used
trainOnlyMotifModel - a switch whether to train only the motif model
alpha - the positive parameter for the Dirichlet distribution which is used when you invoke train to initialize the gammas. It is recommended to use alpha = 1 (uniform distribution on a simplex).
tc - only for AbstractMixtureTrainSM.Algorithm.EM
the TerminationCondition for stopping the EM-algorithm; tc has to return true from TerminationCondition.isSimple()
parametrization - the type of the component probability parameterization
Throws:
IllegalArgumentException - if
  • the models are not able to score the sequence of the corresponding length
  • starts < 1
  • componentHyperParams are not correct
WrongAlphabetException - if not all models work on the same simple alphabet
CloneNotSupportedException - if the models can not be cloned
See Also:
ZOOPSTrainSM(de.jstacs.sequenceScores.statisticalModels.trainable.TrainableStatisticalModel, de.jstacs.sequenceScores.statisticalModels.trainable.TrainableStatisticalModel, boolean, int, double[], double[], de.jstacs.sequenceScores.statisticalModels.trainable.mixture.motif.positionprior.PositionPrior, de.jstacs.sequenceScores.statisticalModels.trainable.mixture.AbstractMixtureTrainSM.Algorithm, double, de.jstacs.algorithms.optimization.termination.TerminationCondition, de.jstacs.sequenceScores.statisticalModels.trainable.mixture.AbstractMixtureTrainSM.Parameterization, int, int, de.jstacs.sampling.BurnInTest), AbstractMixtureTrainSM.Algorithm.EM
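
The effect of componentHyperParams described above (null or all zeros gives ML, positive values act as a prior) can be sketched with a simple pseudocount estimator. This is an illustration only: the class ComponentProbSketch is hypothetical, and the estimator actually used depends on the chosen parameterization, which this sketch does not model.

```java
final class ComponentProbSketch {

    /**
     * Estimates component probabilities from expected counts.
     * hyperParams == null or all zeros: plain relative frequencies (ML).
     * Positive hyperParams: added as pseudocounts (assumed prior form).
     */
    static double[] estimate(double[] expectedCounts, double[] hyperParams) {
        if (hyperParams != null && hyperParams.length != expectedCounts.length) {
            throw new IllegalArgumentException("hyperParams must be null or match the dimension");
        }
        double[] p = new double[expectedCounts.length];
        double sum = 0;
        for (int i = 0; i < p.length; i++) {
            double h = hyperParams == null ? 0 : hyperParams[i];
            if (h < 0) {
                throw new IllegalArgumentException("hyperParams must not be negative");
            }
            p[i] = expectedCounts[i] + h;
            sum += p[i];
        }
        if (!(sum > 0)) {
            throw new IllegalArgumentException("no counts and no prior mass");
        }
        for (int i = 0; i < p.length; i++) {
            p[i] /= sum;
        }
        return p;
    }

    public static void main(String[] args) {
        double[] counts = { 30, 10 }; // hypothetical expected counts: motif, no motif
        System.out.println(java.util.Arrays.toString(estimate(counts, null)));
        System.out.println(java.util.Arrays.toString(estimate(counts, new double[] { 10, 10 })));
    }
}
```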

ZOOPSTrainSM

public ZOOPSTrainSM(TrainableStatisticalModel motif,
                    TrainableStatisticalModel bg,
                    boolean trainOnlyMotifModel,
                    int starts,
                    double motifProb,
                    PositionPrior posPrior,
                    double alpha,
                    TerminationCondition tc,
                    AbstractMixtureTrainSM.Parameterization parametrization)
             throws CloneNotSupportedException,
                    IllegalArgumentException,
                    WrongAlphabetException
Creates a new ZOOPSTrainSM using EM and fixed probability for finding a motif.

Parameters:
motif - the motif model
bg - the background model for the flanking sequences and for those sequences that do not contain a binding site. The model has to be able to score sequences of arbitrary length.
starts - the number of times the algorithm will be started in the train-method, at least 1
motifProb - the probability of finding a motif in a sequence (in [0,1])
posPrior - this object determines the positional distribution that shall be used
trainOnlyMotifModel - a switch whether to train only the motif model
alpha - the positive parameter for the Dirichlet distribution which is used when you invoke train to initialize the gammas. It is recommended to use alpha = 1 (uniform distribution on a simplex).
tc - only for AbstractMixtureTrainSM.Algorithm.EM
the TerminationCondition for stopping the EM-algorithm; tc has to return true from TerminationCondition.isSimple()
parametrization - the type of the component probability parameterization
Throws:
IllegalArgumentException - if
  • the models are not able to score the sequence of the corresponding length
  • motifProb < 0 or motifProb > 1
  • starts < 1
WrongAlphabetException - if not all models work on the same simple alphabet
CloneNotSupportedException - if the models can not be cloned
See Also:
ZOOPSTrainSM(de.jstacs.sequenceScores.statisticalModels.trainable.TrainableStatisticalModel, de.jstacs.sequenceScores.statisticalModels.trainable.TrainableStatisticalModel, boolean, int, double[], double[], de.jstacs.sequenceScores.statisticalModels.trainable.mixture.motif.positionprior.PositionPrior, de.jstacs.sequenceScores.statisticalModels.trainable.mixture.AbstractMixtureTrainSM.Algorithm, double, de.jstacs.algorithms.optimization.termination.TerminationCondition, de.jstacs.sequenceScores.statisticalModels.trainable.mixture.AbstractMixtureTrainSM.Parameterization, int, int, de.jstacs.sampling.BurnInTest), AbstractMixtureTrainSM.Algorithm.EM
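
The argument constraints listed above for motifProb and starts can be mirrored in a short validation sketch. The class FixedMotifProbSketch and its methods are hypothetical, and the ordering of the two component weights (motif first, no motif second) is an assumption rather than documented behavior.

```java
final class FixedMotifProbSketch {

    /** Turns a fixed motif probability into two component weights. */
    static double[] componentWeights(double motifProb) {
        // The negated comparison also rejects NaN.
        if (!(motifProb >= 0 && motifProb <= 1)) {
            throw new IllegalArgumentException("motifProb must be in [0,1], got " + motifProb);
        }
        // Assumed order: index 0 = motif, index 1 = no motif.
        return new double[] { motifProb, 1 - motifProb };
    }

    /** Rejects a non-positive number of starts. */
    static void checkStarts(int starts) {
        if (starts < 1) {
            throw new IllegalArgumentException("starts must be at least 1, got " + starts);
        }
    }

    public static void main(String[] args) {
        checkStarts(5);
        double[] w = componentWeights(0.8);
        System.out.println(w[0] + " " + w[1]);
    }
}
```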

ZOOPSTrainSM

public ZOOPSTrainSM(StringBuffer xml)
             throws NonParsableException
The standard constructor for the interface Storable. Creates a new ZOOPSTrainSM out of its XML representation.

Parameters:
xml - the XML representation of the model as a StringBuffer
Throws:
NonParsableException - if the StringBuffer can not be parsed
Method Detail

setTrainData

protected void setTrainData(DataSet data)
                     throws Exception
Description copied from class: AbstractMixtureTrainSM
This method is invoked by the train method and, for a given data set, sets the data set that should be used for training.

Specified by:
setTrainData in class AbstractMixtureTrainSM
Parameters:
data - the given data set of sequences
Throws:
Exception - if something went wrong

createSeqWeightsArray

protected double[][] createSeqWeightsArray()
Description copied from class: AbstractMixtureTrainSM
Creates an array that can be used for weighting sequences in the algorithm.

Overrides:
createSeqWeightsArray in class AbstractMixtureTrainSM
Returns:
an array that can be used for weighting sequences in the algorithm

doFirstIteration

protected double[][] doFirstIteration(double[] dataWeights,
                                      MultivariateRandomGenerator m,
                                      MRGParams[] params)
                               throws Exception
Description copied from class: AbstractMixtureTrainSM
This method will do the first step in the train algorithm for the current model on the internal data set. The initialization will be done by randomly setting the component membership. This is useful when nothing is known about the problem.

Specified by:
doFirstIteration in class AbstractMixtureTrainSM
Parameters:
dataWeights - null or the weights of each element of the data set
m - the multivariate random generator
params - the parameters for the multivariate random generator
Returns:
the weighting array used for initialization; this array can be reused in the following iterations
Throws:
Exception - if something went wrong

getNewWeights

protected double getNewWeights(double[] dataWeights,
                               double[] w,
                               double[][] seqweights)
                        throws Exception
Description copied from class: AbstractMixtureTrainSM
Computes sequence weights and returns the score.

Specified by:
getNewWeights in class AbstractMixtureTrainSM
Parameters:
dataWeights - the weights for the internal data set (should not be changed)
w - the array for the statistic of the component parameters (shall be filled)
seqweights - an array containing for each component the weights for each sequence (shall be filled)
Returns:
the score
Throws:
Exception - if something went wrong

modify

protected double modify(double[] containsMotif,
                        double[] startpos,
                        int start,
                        int end)
This method modifies the computed weights for one sequence and returns the score.

Parameters:
containsMotif - an array to return the weights for containing a motif (index 0) or containing no motif (index 1)
startpos - the array containing the scores for each start position (including no motif in the sequence)
start - the start index
end - the end index
Returns:
the score
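
As a rough illustration of what turning per-position scores into weights for one sequence can look like, the following sketch normalizes an array of log scores and fills a two-element array analogous to containsMotif. It is not the Jstacs implementation: the class PosteriorWeightsSketch, the assumption that the last array entry stands for "no motif", and the omission of the start and end indices are all simplifications made for this example.

```java
final class PosteriorWeightsSketch {

    /**
     * logScores[0..n-2]: log joint score for a motif start at each position;
     * logScores[n-1]: log joint score for "no motif" (assumed layout).
     * Overwrites logScores with normalized weights, fills containsMotif
     * (index 0: motif, index 1: no motif) and returns the log of the
     * normalizing constant, i.e. the log-likelihood of the sequence.
     */
    static double normalize(double[] logScores, double[] containsMotif) {
        double max = Double.NEGATIVE_INFINITY;
        for (double v : logScores) {
            max = Math.max(max, v);
        }
        if (max == Double.NEGATIVE_INFINITY) {
            throw new IllegalArgumentException("all scores have probability zero");
        }
        double sum = 0;
        for (int i = 0; i < logScores.length; i++) {
            logScores[i] = Math.exp(logScores[i] - max); // shift for stability
            sum += logScores[i];
        }
        int last = logScores.length - 1;
        double motifWeight = 0;
        for (int i = 0; i <= last; i++) {
            logScores[i] /= sum;
            if (i < last) {
                motifWeight += logScores[i];
            }
        }
        containsMotif[0] = motifWeight;
        containsMotif[1] = logScores[last];
        return max + Math.log(sum);
    }

    public static void main(String[] args) {
        // Hypothetical joint scores: two start positions plus "no motif".
        double[] scores = { Math.log(0.1), Math.log(0.3), Math.log(0.4) };
        double[] containsMotif = new double[2];
        double logLikelihood = normalize(scores, containsMotif);
        System.out.println(logLikelihood + " " + containsMotif[0] + " " + containsMotif[1]);
    }
}
```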

getLogProbUsingCurrentParameterSetFor

protected double getLogProbUsingCurrentParameterSetFor(int component,
                                                       Sequence seq,
                                                       int start,
                                                       int end)
                                                throws Exception
Description copied from class: AbstractMixtureTrainSM
Returns the logarithmic probability for the sequence and the given component using the current parameter set.

Specified by:
getLogProbUsingCurrentParameterSetFor in class AbstractMixtureTrainSM
Parameters:
component - the index of the component
seq - the sequence
start - the start position in the sequence
end - the end position in the sequence
Returns:
log P(s,component) = log P(s|component) + log P(component)
Throws:
Exception - if not trained yet or something else went wrong
See Also:
AbstractMixtureTrainSM.getNumberOfComponents()

getProfileOfScoresFor

public double[] getProfileOfScoresFor(int component,
                                      int motif,
                                      Sequence sequence,
                                      int startpos,
                                      MotifDiscoverer.KindOfProfile kind)
                               throws Exception
Description copied from interface: MotifDiscoverer
Returns the profile of the scores for component component and motif motif at all possible start positions of the motif in the sequence sequence beginning at startpos. This array should be of length
sequence.length() - startpos - motifs[motif].getLength() + 1.
A high score should encode for a probable start position.

Parameters:
component - the component index
motif - the index of the motif in the component
sequence - the given sequence
startpos - the start position in the sequence
kind - indicates the kind of profile
Returns:
the profile of scores
Throws:
Exception - if the score could not be computed for any reason
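
The length formula above can be checked with a small stand-alone sketch that slides a position weight matrix over a DNA string. Everything here is hypothetical (the class ScoreProfileSketch, the string-based sequence, and the log-PWM motif model); it only illustrates why the profile has sequence.length() - startpos - motifLength + 1 entries and does not reproduce any particular KindOfProfile.

```java
final class ScoreProfileSketch {

    /** Maps A, C, G, T to 0..3 and rejects any other symbol. */
    static int baseIndex(char base) {
        int i = "ACGT".indexOf(base);
        if (i < 0) {
            throw new IllegalArgumentException("unexpected symbol: " + base);
        }
        return i;
    }

    /**
     * logPwm[j][b] is the log-probability of base b at motif position j.
     * Returns one log score per possible start position at or after startpos.
     */
    static double[] profile(String sequence, int startpos, double[][] logPwm) {
        int motifLength = logPwm.length;
        int n = sequence.length() - startpos - motifLength + 1;
        if (startpos < 0 || n <= 0) {
            throw new IllegalArgumentException("sequence too short for the motif");
        }
        double[] scores = new double[n];
        for (int p = 0; p < n; p++) {
            for (int j = 0; j < motifLength; j++) {
                scores[p] += logPwm[j][baseIndex(sequence.charAt(startpos + p + j))];
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        double lo = Math.log(0.1), hi = Math.log(0.7);
        // Hypothetical motif of length 2 that prefers "GT".
        double[][] logPwm = { { lo, lo, hi, lo }, { lo, lo, lo, hi } };
        double[] scores = profile("ACGTAC", 1, logPwm);
        System.out.println(scores.length + " start positions");
    }
}
```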

getMinimalSequenceLength

public int getMinimalSequenceLength()
Description copied from class: HiddenMotifMixture
Returns the minimal length that a sequence or a data set must have.

Specified by:
getMinimalSequenceLength in class HiddenMotifMixture
Returns:
the minimal length that a sequence or a data set must have

getMotifLength

public int getMotifLength(int motif)
Description copied from interface: MotifDiscoverer
This method returns the length of the motif with index motif.

Parameters:
motif - the index of the motif
Returns:
the length of the motif with index motif

getNumberOfMotifs

public int getNumberOfMotifs()
Description copied from interface: MotifDiscoverer
Returns the number of motifs for this MotifDiscoverer.

Returns:
the number of motifs

getNumberOfMotifsInComponent

public int getNumberOfMotifsInComponent(int component)
Description copied from interface: MotifDiscoverer
Returns the number of motifs that are used in the component component of this MotifDiscoverer.

Parameters:
component - the component of the MotifDiscoverer
Returns:
the number of motifs

getStrandProbabilitiesFor

public double[] getStrandProbabilitiesFor(int component,
                                          int motif,
                                          Sequence sequence,
                                          int startpos)
                                   throws Exception
Description copied from interface: MotifDiscoverer
This method returns the probabilities of the strand orientations for a given subsequence if it is considered as site of the motif model in a specific component.

Parameters:
component - the component index
motif - the index of the motif in the component
sequence - the given sequence
startpos - the start position in the sequence
Returns:
the probabilities of the strand orientations
Throws:
Exception - if the strand probabilities could not be computed for any reason
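
One common way to obtain strand probabilities for a candidate site is to score the site and its reverse complement and normalize the two likelihoods. The sketch below does this for a hypothetical log-PWM; the class StrandProbSketch, the equal prior on both strands, and the order of the returned array (forward, then reverse) are assumptions and are not taken from the Jstacs implementation.

```java
final class StrandProbSketch {

    /** Returns the reverse complement of a DNA string over A, C, G, T. */
    static String reverseComplement(String site) {
        StringBuilder sb = new StringBuilder(site.length());
        for (int i = site.length() - 1; i >= 0; i--) {
            switch (site.charAt(i)) {
                case 'A': sb.append('T'); break;
                case 'C': sb.append('G'); break;
                case 'G': sb.append('C'); break;
                case 'T': sb.append('A'); break;
                default: throw new IllegalArgumentException("unexpected symbol: " + site.charAt(i));
            }
        }
        return sb.toString();
    }

    /** Sums logPwm[j][base] over all positions j of the site. */
    static double logScore(String site, double[][] logPwm) {
        if (site.length() != logPwm.length) {
            throw new IllegalArgumentException("site length must equal motif length");
        }
        double score = 0;
        for (int j = 0; j < logPwm.length; j++) {
            score += logPwm[j]["ACGT".indexOf(site.charAt(j))];
        }
        return score;
    }

    /** Returns {P(forward), P(reverse)} assuming equal strand priors. */
    static double[] strandProbabilities(String site, double[][] logPwm) {
        double rev = logScore(reverseComplement(site), logPwm); // also validates symbols
        double fwd = logScore(site, logPwm);
        double max = Math.max(fwd, rev);
        double f = Math.exp(fwd - max);
        double r = Math.exp(rev - max);
        return new double[] { f / (f + r), r / (f + r) };
    }

    public static void main(String[] args) {
        double lo = Math.log(0.1), hi = Math.log(0.7);
        double[][] logPwm = { { lo, lo, hi, lo }, { lo, lo, lo, hi } }; // prefers "GT"
        double[] p = strandProbabilities("GT", logPwm);
        System.out.println(p[0] + " " + p[1]);
    }
}
```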

getGlobalIndexOfMotifInComponent

public int getGlobalIndexOfMotifInComponent(int component,
                                            int motif)
Description copied from interface: MotifDiscoverer
Returns the global index of the motif used in component. The index returned must be at least 0 and less than MotifDiscoverer.getNumberOfMotifs().

Parameters:
component - the component index
motif - the motif index in the component
Returns:
the global index of the motif in component

trainBgModel

public void trainBgModel(DataSet data,
                         double[] weights)
                  throws Exception
Description copied from class: HiddenMotifMixture
This method trains the background model. This can be useful if the background model is not trained during the EM-algorithm.

Specified by:
trainBgModel in class HiddenMotifMixture
Parameters:
data - the data set
weights - the weights
Throws:
Exception - if something went wrong

iterate

protected double iterate(int start,
                         double[] dataWeights,
                         MultivariateRandomGenerator m,
                         MRGParams[] params)
                  throws Exception
Description copied from class: AbstractMixtureTrainSM
This method runs the train algorithm for the current model and the internal data set.

Overrides:
iterate in class AbstractMixtureTrainSM
Parameters:
start - the index of the training
dataWeights - the weights for each sequence or null
m - the random generator for initiating the algorithm
params - the parameters for the sequences
Returns:
the score
Throws:
Exception - if something went wrong
See Also:
AbstractMixtureTrainSM.doFirstIteration(DataSet, double[], MultivariateRandomGenerator, MRGParams[]), AbstractMixtureTrainSM.continueIterations(double[], double[][]), AbstractMixtureTrainSM.continueIterations(double[], double[][], int, int)

setShiftCorrection

public void setShiftCorrection(boolean correct)
Enables or disables the phase shift correction. By default, shift correction is enabled.

Parameters:
correct - switch that determines whether to correct shifts or not