de.jstacs.sequenceScores.statisticalModels.trainable
Class AbstractTrainableStatisticalModel

java.lang.Object
  extended by de.jstacs.sequenceScores.statisticalModels.trainable.AbstractTrainableStatisticalModel
All Implemented Interfaces:
SequenceScore, StatisticalModel, TrainableStatisticalModel, Storable, Cloneable
Direct Known Subclasses:
AbstractHMM, AbstractMixtureTrainSM, CompositeTrainSM, DifferentiableStatisticalModelWrapperTrainSM, DiscreteGraphicalTrainSM, UniformTrainSM, VariableLengthWrapperTrainSM

public abstract class AbstractTrainableStatisticalModel
extends Object
implements Cloneable, Storable, TrainableStatisticalModel

Abstract class for a model for pattern recognition.
To write a StringBuffer to a file or read one from a file (see fromXML(StringBuffer) and Storable.toXML()), you can use the class FileManager.

Author:
Andre Gohr, Jan Grau, Jens Keilwagen
See Also:
FileManager

Field Summary
protected  AlphabetContainer alphabets
          The underlying alphabets
protected  int length
          The length of the sequences the model can classify.
 
Constructor Summary
AbstractTrainableStatisticalModel(AlphabetContainer alphabets, int length)
          Constructor that sets the length of the model to length and the AlphabetContainer to alphabets.
AbstractTrainableStatisticalModel(StringBuffer stringBuff)
          The standard constructor for the interface Storable.
 
Method Summary
protected  void check(Sequence sequence, int startpos, int endpos)
          This method checks all parameters before a probability can be computed for a sequence.
 AbstractTrainableStatisticalModel clone()
          Follows the conventions of Object's clone()-method.
 DataSet emitDataSet(int numberOfSequences, int... seqLength)
          This method returns a DataSet object containing artificial sequence(s).
protected abstract  void fromXML(StringBuffer xml)
          This method should only be used by the constructor that works on a StringBuffer.
 AlphabetContainer getAlphabetContainer()
          Returns the container of alphabets that were used when constructing the instance.
 ResultSet getCharacteristics()
          Returns some information characterizing or describing the current instance.
 int getLength()
          Returns the length of sequences this instance can score.
 double getLogProbFor(Sequence sequence)
          Returns the logarithm of the probability of the given sequence given the model.
 double getLogProbFor(Sequence sequence, int startpos)
          Returns the logarithm of the probability of (a part of) the given sequence given the model.
 double[] getLogScoreFor(DataSet data)
          This method computes the logarithm of the scores of all sequences in the given data set.
 void getLogScoreFor(DataSet data, double[] res)
          This method computes and stores the logarithm of the scores for any sequence in the data set in the given double-array.
 double getLogScoreFor(Sequence sequence)
          Returns the logarithmic score for the given Sequence.
 double getLogScoreFor(Sequence sequence, int startpos)
          Returns the logarithmic score for the given Sequence, beginning at position startpos.
 double getLogScoreFor(Sequence sequence, int startpos, int endpos)
          Returns the logarithmic score for the part of the given Sequence from position startpos to position endpos (inclusive).
 byte getMaximalMarkovOrder()
          This method returns the maximal used Markov order, if possible.
 String toString()
          Should give a simple representation (text) of the model as String.
 void train(DataSet data)
          Trains the TrainableStatisticalModel object given the data as DataSet.
 
Methods inherited from class java.lang.Object
equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 
Methods inherited from interface de.jstacs.sequenceScores.statisticalModels.trainable.TrainableStatisticalModel
train
 
Methods inherited from interface de.jstacs.sequenceScores.statisticalModels.StatisticalModel
getLogPriorTerm, getLogProbFor
 
Methods inherited from interface de.jstacs.sequenceScores.SequenceScore
getInstanceName, getNumericalCharacteristics, isInitialized, toString
 
Methods inherited from interface de.jstacs.Storable
toXML
 

Field Detail

length

protected int length
The length of the sequences the model can classify. For models that can take sequences of arbitrary length, this value should be set to 0.


alphabets

protected AlphabetContainer alphabets
The underlying alphabets

Constructor Detail

AbstractTrainableStatisticalModel

public AbstractTrainableStatisticalModel(AlphabetContainer alphabets,
                                         int length)
Constructor that sets the length of the model to length and the AlphabetContainer to alphabets.
The parameter length gives the length of the sequences the model can classify. Models that can only classify sequences of defined length are e.g. PWM or inhomogeneous Markov models. If the model can classify sequences of arbitrary length, e.g. homogeneous Markov models, this parameter must be set to 0 (zero).
The length and the alphabets define the type of data that can be modeled, and therefore both have to be checked before any evaluation (e.g. getLogScoreFor(Sequence)).

Parameters:
alphabets - the alphabets in an AlphabetContainer
length - the length of the sequences a model can classify, 0 for arbitrary length

AbstractTrainableStatisticalModel

public AbstractTrainableStatisticalModel(StringBuffer stringBuff)
                                  throws NonParsableException
The standard constructor for the interface Storable. Creates a new AbstractTrainableStatisticalModel out of a StringBuffer.

Parameters:
stringBuff - the StringBuffer to be parsed
Throws:
NonParsableException - is thrown if the StringBuffer could not be parsed
Method Detail

clone

public AbstractTrainableStatisticalModel clone()
                                        throws CloneNotSupportedException
Follows the conventions of Object's clone()-method.

Specified by:
clone in interface SequenceScore
Specified by:
clone in interface TrainableStatisticalModel
Overrides:
clone in class Object
Returns:
an object that is a copy of the current AbstractTrainableStatisticalModel (the member AlphabetContainer is not deeply cloned since it is assumed to be immutable). The type of the returned object is defined by the class X that directly inherits from AbstractTrainableStatisticalModel. Hence X's clone() method should work as follows:
1. X o = (X) super.clone();
2. deeply copy all additional member variables of o defined by X that are not of simple data types like int, double, ...
3. return o;
Throws:
CloneNotSupportedException - if something went wrong while cloning
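
The three steps above can be sketched as follows. This is a minimal, self-contained illustration and not Jstacs code: BaseModelSketch stands in for AbstractTrainableStatisticalModel, and CountModelSketch with its counts member is a hypothetical subclass X.

```java
import java.util.Arrays;

// Stand-in for AbstractTrainableStatisticalModel (assumption, NOT the Jstacs class).
abstract class BaseModelSketch implements Cloneable {
    protected final int length; // 0 means sequences of arbitrary length

    protected BaseModelSketch(int length) {
        this.length = length;
    }

    @Override
    public BaseModelSketch clone() throws CloneNotSupportedException {
        // Object.clone() produces a shallow, field-by-field copy.
        return (BaseModelSketch) super.clone();
    }
}

// Hypothetical subclass X with one mutable member that needs a deep copy.
class CountModelSketch extends BaseModelSketch {
    double[] counts;

    CountModelSketch(int length, double[] counts) {
        super(length);
        this.counts = counts;
    }

    @Override
    public CountModelSketch clone() throws CloneNotSupportedException {
        // Step 1: obtain the shallow copy from the superclass.
        CountModelSketch copy = (CountModelSketch) super.clone();
        // Step 2: deep-copy every member that is not a simple data type.
        copy.counts = Arrays.copyOf(counts, counts.length);
        // Step 3: return the copy.
        return copy;
    }
}
```

After cloning, a change to the copy's counts array leaves the original untouched, which is the point of step 2.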

train

public void train(DataSet data)
           throws Exception
Description copied from interface: TrainableStatisticalModel
Trains the TrainableStatisticalModel object given the data as DataSet.
This method should work non-incrementally. That is, the sequence of calls train(data1); train(data2) should result in a model fully trained on data2 only, not on data1+data2. All parameters of the model are given by the call of the constructor.

Specified by:
train in interface TrainableStatisticalModel
Parameters:
data - the given sequences as DataSet
Throws:
Exception - if the training did not succeed
See Also:
DataSet.getElementAt(int), DataSet.ElementEnumerator
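
The non-incremental contract described for train can be illustrated with a toy model. This is a sketch under stated assumptions: FrequencyModelSketch is hypothetical, represents sequences as int arrays instead of a DataSet, and uses a pseudocount of 1, none of which comes from Jstacs.

```java
import java.util.Arrays;

// Hypothetical toy model over the symbols 0..alphabetSize-1 (not a Jstacs class).
class FrequencyModelSketch {
    private final int alphabetSize;
    private double[] logProbs; // null until train(...) has been called

    FrequencyModelSketch(int alphabetSize) {
        this.alphabetSize = alphabetSize;
    }

    // Non-incremental training: all statistics are rebuilt from scratch on each call.
    void train(int[][] data) {
        double[] counts = new double[alphabetSize];
        Arrays.fill(counts, 1.0); // pseudocount of 1 per symbol (assumption)
        double total = alphabetSize;
        for (int[] sequence : data) {
            for (int symbol : sequence) {
                counts[symbol]++;
                total++;
            }
        }
        double[] fresh = new double[alphabetSize];
        for (int i = 0; i < alphabetSize; i++) {
            fresh[i] = Math.log(counts[i] / total);
        }
        logProbs = fresh; // replaces any earlier training result
    }

    double getLogProbFor(int symbol) {
        if (logProbs == null) {
            throw new IllegalStateException("model is not trained yet");
        }
        return logProbs[symbol];
    }
}
```

Calling train(data1) and then train(data2) on one instance yields the same parameters as calling train(data2) on a fresh instance.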

getLogProbFor

public double getLogProbFor(Sequence sequence)
                     throws Exception
Description copied from interface: StatisticalModel
Returns the logarithm of the probability of the given sequence given the model. If at least one random variable is continuous, the value of the density function is returned.

The length and the alphabets define the type of data that can be modeled, and therefore both have to be checked.

Specified by:
getLogProbFor in interface StatisticalModel
Parameters:
sequence - the given sequence for which the logarithm of the probability/the value of the density function should be returned
Returns:
the logarithm of the probability or the value of the density function of the given sequence given the model
Throws:
Exception - if the sequence could not be handled by the model
NotTrainedException - if the model is not trained yet
See Also:
StatisticalModel.getLogProbFor(Sequence, int, int)

getLogProbFor

public double getLogProbFor(Sequence sequence,
                            int startpos)
                     throws Exception
Description copied from interface: StatisticalModel
Returns the logarithm of the probability of (a part of) the given sequence given the model. If at least one random variable is continuous, the value of the density function is returned.

If the length of the sequences whose probability should be returned is fixed (e.g. in an inhomogeneous model) and the given sequence is longer than this fixed length, startpos gives the start position within the given sequence. For example, if the fixed length is 12, the given sequence has length 30, and startpos=15, the logarithm of the probability of the part from position 15 to 26 (inclusive) given the model should be returned.
The length and the alphabets define the type of data that can be modeled, and therefore both have to be checked.

Specified by:
getLogProbFor in interface StatisticalModel
Parameters:
sequence - the given sequence
startpos - the start position within the given sequence
Returns:
the logarithm of the probability or the value of the density function of (the part of) the given sequence given the model
Throws:
Exception - if the sequence could not be handled by the model
NotTrainedException - if the model is not trained yet
See Also:
StatisticalModel.getLogProbFor(Sequence, int, int)
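
The window arithmetic in the example above (fixed length 12, startpos 15, last position 26) can be written down as a small helper. This is a sketch, not library code: WindowSketch and endPosition are hypothetical names, positions are assumed to be zero-based, and the rule that a model of length 0 scores up to the end of the sequence is an assumption rather than something this page states.

```java
// Hypothetical helper illustrating the start/end arithmetic (not a Jstacs class).
class WindowSketch {

    // Returns the inclusive end position of the part that would be scored.
    static int endPosition(int modelLength, int sequenceLength, int startpos) {
        if (startpos < 0 || startpos >= sequenceLength) {
            throw new IllegalArgumentException("startpos out of range: " + startpos);
        }
        if (modelLength == 0) {
            // Assumption: a model for arbitrary length scores up to the last position.
            return sequenceLength - 1;
        }
        int endpos = startpos + modelLength - 1;
        if (endpos >= sequenceLength) {
            throw new IllegalArgumentException("sequence too short for a window of length " + modelLength);
        }
        return endpos;
    }
}
```

With these assumptions, endPosition(12, 30, 15) evaluates to 26, matching the example in the description.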

check

protected void check(Sequence sequence,
                     int startpos,
                     int endpos)
              throws NotTrainedException,
                     IllegalArgumentException
This method checks all parameters before a probability can be computed for a sequence. Hence, it should be used in StatisticalModel.getLogProbFor(Sequence, int, int).

Parameters:
sequence - the given sequence
startpos - the start position within the given sequence
endpos - the last position to be taken into account
Throws:
IllegalArgumentException - if the sequence could not be handled (e.g. startpos > endpos, endpos > sequence.length, ...) by the model
NotTrainedException - if the model is not trained yet
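
A parameter check of this kind might look like the sketch below. It is not the Jstacs implementation: CheckSketch is a hypothetical class, positions are assumed to be zero-based with endpos inclusive, IllegalStateException stands in for NotTrainedException, and the alphabet comparison is omitted.

```java
// Hypothetical stand-in for the parameter check (not the Jstacs implementation).
class CheckSketch {

    static void check(boolean trained, int modelLength, int sequenceLength,
                      int startpos, int endpos) {
        if (!trained) {
            // Stand-in for NotTrainedException, which is a Jstacs type.
            throw new IllegalStateException("model is not trained yet");
        }
        if (startpos < 0 || startpos > endpos) {
            throw new IllegalArgumentException("invalid range: " + startpos + ".." + endpos);
        }
        if (endpos >= sequenceLength) {
            throw new IllegalArgumentException("endpos beyond the end of the sequence");
        }
        // A model length of 0 means arbitrary length, so any non-empty range is fine.
        if (modelLength != 0 && endpos - startpos + 1 != modelLength) {
            throw new IllegalArgumentException("range length differs from model length " + modelLength);
        }
    }
}
```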

getLogScoreFor

public double getLogScoreFor(Sequence sequence)
Description copied from interface: SequenceScore
Returns the logarithmic score for the given Sequence.

Specified by:
getLogScoreFor in interface SequenceScore
Parameters:
sequence - the sequence
Returns:
the logarithmic score for the sequence

getLogScoreFor

public double getLogScoreFor(Sequence sequence,
                             int startpos)
Description copied from interface: SequenceScore
Returns the logarithmic score for the given Sequence, beginning at position startpos.

Specified by:
getLogScoreFor in interface SequenceScore
Parameters:
sequence - the Sequence
startpos - the start position in the Sequence
Returns:
the logarithmic score for the Sequence

getLogScoreFor

public double getLogScoreFor(Sequence sequence,
                             int startpos,
                             int endpos)
Description copied from interface: SequenceScore
Returns the logarithmic score for the part of the given Sequence from position startpos to position endpos (inclusive).

Specified by:
getLogScoreFor in interface SequenceScore
Parameters:
sequence - the Sequence
startpos - the start position in the Sequence
endpos - the end position (inclusive) in the Sequence
Returns:
the logarithmic score for the Sequence

getLogScoreFor

public double[] getLogScoreFor(DataSet data)
                        throws Exception
Description copied from interface: SequenceScore
This method computes the logarithm of the scores of all sequences in the given data set. The values are stored in an array according to the index of the respective sequence in the data set.

The score for each sequence shall be computed independently of all other sequences in the data set, so the result should be exactly the same as that of the method SequenceScore.getLogScoreFor(Sequence).

Specified by:
getLogScoreFor in interface SequenceScore
Parameters:
data - the data set of sequences
Returns:
an array containing the logarithm of the score of all sequences of the data set
Throws:
Exception - if something went wrong
See Also:
SequenceScore.getLogScoreFor(Sequence)

getLogScoreFor

public void getLogScoreFor(DataSet data,
                           double[] res)
                    throws Exception
Description copied from interface: SequenceScore
This method computes the logarithm of the score of each sequence in the data set and stores the results in the given double array.

The score for each sequence shall be computed independently of all other sequences in the data set, so the result should be exactly the same as that of the method SequenceScore.getLogScoreFor(Sequence).

Specified by:
getLogScoreFor in interface SequenceScore
Parameters:
data - the data set of sequences
res - the array for the results, has to have length data.getNumberOfElements() (which returns the number of sequences in the data set)
Throws:
Exception - if something went wrong
See Also:
SequenceScore.getLogScoreFor(Sequence), SequenceScore.getLogScoreFor(DataSet)
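
The relationship between the two data set variants and the single-sequence method can be sketched as below. The sketch makes several assumptions: BatchScoreSketch is hypothetical, sequences are int arrays instead of a DataSet, and the single-sequence score is passed in as a function.

```java
import java.util.function.ToDoubleFunction;

// Hypothetical illustration of the batch-scoring contract (not a Jstacs class).
class BatchScoreSketch {

    // Fills res so that res[i] is the score of data[i], computed independently.
    static void getLogScoreFor(int[][] data, double[] res, ToDoubleFunction<int[]> scorer) {
        if (res.length != data.length) {
            throw new IllegalArgumentException("res must have one entry per sequence");
        }
        for (int i = 0; i < data.length; i++) {
            res[i] = scorer.applyAsDouble(data[i]);
        }
    }

    // Convenience variant that allocates the result array itself.
    static double[] getLogScoreFor(int[][] data, ToDoubleFunction<int[]> scorer) {
        double[] res = new double[data.length];
        getLogScoreFor(data, res, scorer);
        return res;
    }
}
```

Because each entry depends only on its own sequence, the array variant and repeated single-sequence calls give identical values.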

emitDataSet

public DataSet emitDataSet(int numberOfSequences,
                           int... seqLength)
                    throws NotTrainedException,
                           Exception
Description copied from interface: StatisticalModel
This method returns a DataSet object containing artificial sequence(s).

There are two different possibilities to create a data set for a model with length 0 (homogeneous models).
  1. emitDataSet( int n, int l ) should return a data set with n sequences of length l.
  2. emitDataSet( int n, int[] l ) should return a data set with n sequences which have a sequence length corresponding to the entry in the given array l.

There are two different possibilities to create a data set for a model with length greater than 0 (inhomogeneous models).
emitDataSet( int n ) and emitDataSet( int n, null ) should return a data set with n sequences of length of the model ( SequenceScore.getLength()).

The standard implementation throws an Exception.

Specified by:
emitDataSet in interface StatisticalModel
Parameters:
numberOfSequences - the number of sequences that should be contained in the returned data set
seqLength - the length of the sequences for a homogeneous model; for an inhomogeneous model this parameter should be null or an array of size 0.
Returns:
a DataSet containing the artificial sequence(s)
Throws:
NotTrainedException - if the model is not trained yet
Exception - if the emission did not succeed
See Also:
DataSet
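
The calling conventions for the varargs parameter seqLength can be summarized in a sketch that only resolves the length of each emitted sequence. EmitLengthsSketch and resolveLengths are hypothetical names, and rejecting invalid combinations with IllegalArgumentException is an assumption; the actual emission of sequences is left out.

```java
import java.util.Arrays;

// Hypothetical helper that resolves per-sequence lengths (not a Jstacs class).
class EmitLengthsSketch {

    static int[] resolveLengths(int modelLength, int numberOfSequences, int... seqLength) {
        boolean noLengthsGiven = seqLength == null || seqLength.length == 0;
        int[] lengths = new int[numberOfSequences];
        if (modelLength > 0) {
            // Inhomogeneous model: seqLength must be null or empty.
            if (!noLengthsGiven) {
                throw new IllegalArgumentException("a fixed-length model takes no seqLength");
            }
            Arrays.fill(lengths, modelLength);
            return lengths;
        }
        // Homogeneous model (length 0): one common length or one length per sequence.
        if (noLengthsGiven) {
            throw new IllegalArgumentException("seqLength is required for a model of length 0");
        }
        if (seqLength.length == 1) {
            Arrays.fill(lengths, seqLength[0]);
            return lengths;
        }
        if (seqLength.length != numberOfSequences) {
            throw new IllegalArgumentException("need one length per sequence");
        }
        return seqLength.clone();
    }
}
```

For example, resolveLengths(0, 3, 7) yields three sequences of length 7, while resolveLengths(5, 2) yields two sequences of the model length 5.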

getAlphabetContainer

public final AlphabetContainer getAlphabetContainer()
Description copied from interface: SequenceScore
Returns the container of alphabets that were used when constructing the instance.

Specified by:
getAlphabetContainer in interface SequenceScore
Returns:
the container of alphabets that were used when constructing the instance

getLength

public final int getLength()
Description copied from interface: SequenceScore
Returns the length of sequences this instance can score. Instances that can only score sequences of defined length are e.g. PWM or inhomogeneous Markov models. If the instance can score sequences of arbitrary length, e.g. homogeneous Markov models, this method returns 0 (zero).

Specified by:
getLength in interface SequenceScore
Returns:
the length of sequences the instance can score

getMaximalMarkovOrder

public byte getMaximalMarkovOrder()
                           throws UnsupportedOperationException
Description copied from interface: StatisticalModel
This method returns the maximal used Markov order, if possible.

Specified by:
getMaximalMarkovOrder in interface StatisticalModel
Returns:
maximal used Markov order
Throws:
UnsupportedOperationException - if the model can't give a proper answer

getCharacteristics

public ResultSet getCharacteristics()
                             throws Exception
Description copied from interface: SequenceScore
Returns some information characterizing or describing the current instance. This could be e.g. the number of edges for a Bayesian network or an image showing some representation of the instance. The set of characteristics should always include the XML-representation of the instance. The corresponding result type is StorableResult.

Specified by:
getCharacteristics in interface SequenceScore
Returns:
the characteristics of the current instance
Throws:
Exception - if some of the characteristics could not be defined
See Also:
StorableResult

fromXML

protected abstract void fromXML(StringBuffer xml)
                         throws NonParsableException
This method should only be used by the constructor that works on a StringBuffer. It is the counterpart of Storable.toXML().

Parameters:
xml - the XML representation of the model
Throws:
NonParsableException - if the StringBuffer is not parsable or the representation is conflicting
See Also:
AbstractTrainableStatisticalModel(StringBuffer)

toString

public final String toString()
Description copied from interface: TrainableStatisticalModel
Should give a simple representation (text) of the model as String.

Specified by:
toString in interface TrainableStatisticalModel
Overrides:
toString in class Object
Returns:
the representation as String