de.jstacs.motifDiscovery
Class KMereStatistic

java.lang.Object
  extended by de.jstacs.motifDiscovery.KMereStatistic

public final class KMereStatistic
extends Object

This class provides simple access to k-mer statistics of a DataSet.

Author:
Jens Keilwagen

Constructor Summary
KMereStatistic(DataSet data, int k)
          This constructor creates an internal statistic counting all k-mers in the data.
 
Method Summary
static DataSet.WeightedDataSetFactory getAbsoluteKMereFrequencies(DataSet data, int k, boolean bothStrands)
          This method enables the user to get a statistic over all k-mers in the data.
static DataSet.WeightedDataSetFactory getAbsoluteKMereFrequencies(DataSet data, int k, boolean bothStrands, DataSet.WeightedDataSetFactory.SortOperation sortOp)
          This method enables the user to get a statistic over all k-mers in the data.
static Sequence[] getCommonString(DataSet data, int motifLength, boolean bothStrands)
          This method returns an array of sequences of length motifLength such that each returned sequence is contained in every sequence of the data set, more precisely in each sequence or in its reverse complement.
static LinkedList<Sequence> getConservedPatterns(Hashtable<Sequence,BitSet[]> statistic, int dataSetIndex, int threshold)
          This method returns a list of Sequences.
static Pair<Sequence,BitSet[]>[] getKmereSequenceStatistic(boolean bothStrands, int maxMismatch, HashSet<Sequence> filter, DataSet... data)
          This method enables the user to get a statistic for a set of k-mers.
static Hashtable<Sequence,BitSet[]> getKmereSequenceStatistic(int k, boolean bothStrands, int addIndex, DataSet... data)
          This method enables the user to get a statistic over all k-mers in the sequences.
 double[][] getSmoothedProfile(int window, Sequence... seq)
          This method returns an array of smoothed profiles.
 double[][] getSmoothedProfile(int window, String... kmere)
          This method returns an array of smoothed profiles.
static Hashtable<Sequence,BitSet[]> merge(Hashtable<Sequence,BitSet[]> statistic, int maximalMissmatch, boolean bothStrands)
          This method merges the statistics of k-mers by allowing mismatches.
static Hashtable<Sequence,BitSet[]> removeBackground(Hashtable<Sequence,BitSet[]> statistic, int fgIndex, int bgIndex, double fgWeight, double bgWeight)
          This method removes those entries from the statistic whose weighted foreground cardinality is lower than their weighted background cardinality.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

KMereStatistic

public KMereStatistic(DataSet data,
                      int k)
This constructor creates an internal statistic counting all k-mers in the data.

Parameters:
data - the data
k - the number of symbols in each counted word

Method Detail

getSmoothedProfile

public double[][] getSmoothedProfile(int window,
                                     String... kmere)
This method returns an array of smoothed profiles, one for each k-mer. The profiles are returned in the same order as the k-mers.

Parameters:
window - the window length; use 1 for no smoothing
kmere - the k-mers
Returns:
an array of smoothed profiles
See Also:
getSmoothedProfile(int, Sequence...), Sequence.create(AlphabetContainer, String)

getSmoothedProfile

public double[][] getSmoothedProfile(int window,
                                     Sequence... seq)
This method returns an array of smoothed profiles, one for each k-mer. The profiles are returned in the same order as the k-mers.

Parameters:
window - the window length; use 1 for no smoothing
seq - the Sequence instances containing the k-mers
Returns:
an array of smoothed profiles
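
The effect of the window parameter can be illustrated with a standalone sketch that does not use any Jstacs classes. It assumes that a raw profile holds, for each start position, the number of sequences in which the k-mer begins there, and that smoothing is a moving average over window consecutive positions. Both are assumptions about the intended semantics; the class SmoothedProfileSketch and its methods are hypothetical names.

```java
import java.util.Arrays;

final class SmoothedProfileSketch {
    // Assumption: the raw profile counts, per start position, the sequences
    // in which the k-mer begins at that position.
    static double[] positionalCounts(String[] seqs, String kmer) {
        int maxLen = 0;
        for (String s : seqs) maxLen = Math.max(maxLen, s.length());
        double[] profile = new double[Math.max(0, maxLen - kmer.length() + 1)];
        for (String s : seqs) {
            for (int i = s.indexOf(kmer); i >= 0; i = s.indexOf(kmer, i + 1)) {
                profile[i]++;
            }
        }
        return profile;
    }

    // Moving average over `window` consecutive positions (assumed smoothing);
    // window == 1 returns an unsmoothed copy.
    static double[] smooth(double[] raw, int window) {
        if (window < 1) throw new IllegalArgumentException("window must be >= 1");
        if (raw.length < window) return new double[0];
        double[] out = new double[raw.length - window + 1];
        double sum = 0;
        for (int i = 0; i < raw.length; i++) {
            sum += raw[i];
            if (i >= window) sum -= raw[i - window];
            if (i >= window - 1) out[i - window + 1] = sum / window;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] raw = positionalCounts(new String[] {"ACGTAC", "TACGAC"}, "AC");
        System.out.println(Arrays.toString(smooth(raw, 1))); // [1.0, 1.0, 0.0, 0.0, 2.0]
        System.out.println(Arrays.toString(smooth(raw, 2))); // [1.0, 0.5, 0.0, 1.0]
    }
}
```

With a window of 1 the output equals the raw counts, which matches the documented "no smoothing" case.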

getCommonString

public static Sequence[] getCommonString(DataSet data,
                                         int motifLength,
                                         boolean bothStrands)
                                  throws Exception
This method returns an array of sequences of length motifLength such that each returned sequence is contained in every sequence of the data set, more precisely in each sequence or in its reverse complement.

Parameters:
data - the data set of sequences
motifLength - the motif length
bothStrands - whether to use both strands (true) or only the forward strand (false)
Returns:
an array of sequences of length motifLength so that each sequence is contained in data on either strand
Throws:
Exception - if something went wrong
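
The idea of a common string can be sketched without Jstacs by intersecting the k-mer sets of all sequences. The example below works on plain DNA strings; CommonKmerSketch and its methods are hypothetical names. Unlike the real method, it may report both a k-mer and its reverse complement when bothStrands is true, which is an assumption made for simplicity.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class CommonKmerSketch {
    static String reverseComplement(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = s.length() - 1; i >= 0; i--) {
            switch (s.charAt(i)) {
                case 'A': sb.append('T'); break;
                case 'C': sb.append('G'); break;
                case 'G': sb.append('C'); break;
                case 'T': sb.append('A'); break;
                default: throw new IllegalArgumentException("not a DNA symbol: " + s.charAt(i));
            }
        }
        return sb.toString();
    }

    // All k-mers of one sequence; with bothStrands, also those of the reverse strand.
    static Set<String> kmers(String seq, int k, boolean bothStrands) {
        Set<String> set = new TreeSet<>();
        for (int i = 0; i + k <= seq.length(); i++) {
            String w = seq.substring(i, i + k);
            set.add(w);
            if (bothStrands) set.add(reverseComplement(w));
        }
        return set;
    }

    // Intersection of the k-mer sets of all sequences.
    static Set<String> common(List<String> data, int k, boolean bothStrands) {
        Set<String> result = null;
        for (String seq : data) {
            Set<String> current = kmers(seq, k, bothStrands);
            if (result == null) result = current; else result.retainAll(current);
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList("ACGTT", "GGACG", "TTCGT");
        System.out.println(common(data, 3, false)); // []
        System.out.println(common(data, 3, true));  // [ACG, CGT]
    }
}
```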

getAbsoluteKMereFrequencies

public static DataSet.WeightedDataSetFactory getAbsoluteKMereFrequencies(DataSet data,
                                                                         int k,
                                                                         boolean bothStrands)
                                                                  throws Exception
This method enables the user to get a statistic over all k-mers in the data. That is, it counts the occurrences of each k-mer in the complete data.

Parameters:
data - the data set of sequences
k - the motif length
bothStrands - whether to use both strands (true) or only the forward strand (false). If true, for each k-mer either the k-mer itself or its reverse complement, but not both, is contained in the returned DataSet.WeightedDataSetFactory.
Returns:
a DataSet.WeightedDataSetFactory containing all k-mers and their absolute frequencies, counted on both strands of data or on the forward strand only, depending on bothStrands
Throws:
Exception - if something went wrong
See Also:
getAbsoluteKMereFrequencies(DataSet, int, boolean, DataSet.WeightedDataSetFactory.SortOperation), DataSet.WeightedDataSetFactory.SortOperation.NO_SORT

getAbsoluteKMereFrequencies

public static DataSet.WeightedDataSetFactory getAbsoluteKMereFrequencies(DataSet data,
                                                                         int k,
                                                                         boolean bothStrands,
                                                                         DataSet.WeightedDataSetFactory.SortOperation sortOp)
                                                                  throws Exception
This method enables the user to get a statistic over all k-mers in the data. That is, it counts the occurrences of each k-mer in the complete data.

Parameters:
data - the data set of sequences
k - the motif length
bothStrands - whether to use both strands (true) or only the forward strand (false). If true, for each k-mer either the k-mer itself or its reverse complement, but not both, is contained in the returned DataSet.WeightedDataSetFactory.
sortOp - how the result should be sorted
Returns:
a DataSet.WeightedDataSetFactory containing all k-mers and their absolute frequencies, counted on both strands of data or on the forward strand only, depending on bothStrands
Throws:
Exception - if something went wrong
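
The counting itself can be pictured with a standalone sketch on plain DNA strings. It is not the Jstacs implementation: KmerCountSketch is a hypothetical name, and choosing the lexicographically smaller of a k-mer and its reverse complement as the representative for bothStrands is an assumption, since the documentation only states that one of the two is kept.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class KmerCountSketch {
    static String reverseComplement(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = s.length() - 1; i >= 0; i--) {
            switch (s.charAt(i)) {
                case 'A': sb.append('T'); break;
                case 'C': sb.append('G'); break;
                case 'G': sb.append('C'); break;
                case 'T': sb.append('A'); break;
                default: throw new IllegalArgumentException("not a DNA symbol: " + s.charAt(i));
            }
        }
        return sb.toString();
    }

    // Absolute k-mer counts over all sequences, sorted by k-mer.
    static Map<String, Integer> count(List<String> data, int k, boolean bothStrands) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String seq : data) {
            for (int i = 0; i + k <= seq.length(); i++) {
                String w = seq.substring(i, i + k);
                if (bothStrands) {
                    // Assumed representative: the lexicographically smaller strand.
                    String rc = reverseComplement(w);
                    if (rc.compareTo(w) < 0) w = rc;
                }
                counts.merge(w, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList("ACGT", "AAC");
        System.out.println(count(data, 2, false)); // {AA=1, AC=2, CG=1, GT=1}
        System.out.println(count(data, 2, true));  // {AA=1, AC=3, CG=1}
    }
}
```

In the second call GT is folded into its reverse complement AC, so only one of the two appears in the result.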

getKmereSequenceStatistic

public static Hashtable<Sequence,BitSet[]> getKmereSequenceStatistic(int k,
                                                                     boolean bothStrands,
                                                                     int addIndex,
                                                                     DataSet... data)
                                                              throws WrongAlphabetException,
                                                                     OperationNotSupportedException
This method enables the user to get a statistic over all k-mers in the sequences. That is, it creates for each occurring k-mer an array of BitSets indicating for each data set and each sequence whether it contains the k-mer (or its reverse complement) or not.

Parameters:
k - the motif length
bothStrands - whether to use both strands (true) or only the forward strand (false). If true, for each k-mer either the k-mer itself or its reverse complement, but not both, is contained in the returned Hashtable.
addIndex - the maximal index for inserting new k-mers
data - the DataSets of Sequences
Returns:
a Hashtable mapping Sequences to arrays of BitSets; each entry encodes a k-mer and the occurrence of this k-mer in each data set and sequence; if a k-mer occurs in data set d in sequence n, the n-th bit of the d-th BitSet is true.
Throws:
WrongAlphabetException - if the AlphabetContainers of the DataSets do not match or if they are not simple and discrete
OperationNotSupportedException - if bothStrands==true but the reverse complement could not be computed
See Also:
Hashtable, merge(Hashtable, int, boolean)
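
The layout of the returned statistic (one BitSet per data set, one bit per sequence) can be reproduced with a small standalone sketch. It uses plain strings instead of Sequence and DataSet, considers only the forward strand, and ignores addIndex; KmerOccurrenceSketch and its methods are hypothetical names and not part of Jstacs.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class KmerOccurrenceSketch {
    private static BitSet[] emptyBitSets(int n) {
        BitSet[] b = new BitSet[n];
        for (int i = 0; i < n; i++) b[i] = new BitSet();
        return b;
    }

    // For each k-mer: bit n of BitSet d is set if sequence n of data set d contains it.
    static Map<String, BitSet[]> statistic(int k, List<List<String>> dataSets) {
        Map<String, BitSet[]> stat = new HashMap<>();
        for (int d = 0; d < dataSets.size(); d++) {
            List<String> seqs = dataSets.get(d);
            for (int n = 0; n < seqs.size(); n++) {
                String seq = seqs.get(n);
                for (int i = 0; i + k <= seq.length(); i++) {
                    BitSet[] bits = stat.computeIfAbsent(
                            seq.substring(i, i + k), w -> emptyBitSets(dataSets.size()));
                    bits[d].set(n);
                }
            }
        }
        return stat;
    }

    public static void main(String[] args) {
        List<List<String>> dataSets = Arrays.asList(
                Arrays.asList("ACGA", "TTAC"),   // data set 0
                Arrays.asList("GGAC"));          // data set 1
        Map<String, BitSet[]> stat = statistic(2, dataSets);
        System.out.println(stat.get("AC")[0]); // {0, 1}
        System.out.println(stat.get("AC")[1]); // {0}
    }
}
```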

getKmereSequenceStatistic

public static Pair<Sequence,BitSet[]>[] getKmereSequenceStatistic(boolean bothStrands,
                                                                  int maxMismatch,
                                                                  HashSet<Sequence> filter,
                                                                  DataSet... data)
                                                           throws WrongAlphabetException,
                                                                  OperationNotSupportedException
This method enables the user to get a statistic for a set of k-mers. That is, it creates for each k-mer from filter an array of BitSets indicating for each data set and each sequence whether it contains the k-mer (or its reverse complement) or not.

Parameters:
bothStrands - whether to use both strands (true) or only the forward strand (false). If true, an occurrence of the reverse complement of a k-mer is counted as an occurrence of the k-mer.
maxMismatch - the maximal number of mismatches
filter - a filter containing all interesting k-mers
data - the DataSets of Sequences
Returns:
an array of Pairs of Sequences and arrays of BitSets; each entry encodes a k-mer and the occurrence of this k-mer in each data set and sequence; if a k-mer occurs in data set d in sequence n, the n-th bit of the d-th BitSet is true.
Throws:
WrongAlphabetException - if the AlphabetContainers of the DataSets do not match or if they are not simple and discrete
OperationNotSupportedException - if bothStrands==true but the reverse complement could not be computed
See Also:
Hashtable, merge(Hashtable, int, boolean)

merge

public static Hashtable<Sequence,BitSet[]> merge(Hashtable<Sequence,BitSet[]> statistic,
                                                 int maximalMissmatch,
                                                 boolean bothStrands)
                                          throws OperationNotSupportedException,
                                                 CloneNotSupportedException,
                                                 WrongLengthException,
                                                 WrongAlphabetException
This method merges the statistics of k-mers by allowing mismatches.

Parameters:
statistic - a statistic as obtained from getKmereSequenceStatistic(int, boolean, int, DataSet...)
maximalMissmatch - the maximal number of allowed mismatches
bothStrands - whether to use both strands (true) or only the forward strand (false)
Returns:
a merged statistic
Throws:
OperationNotSupportedException - if bothStrands==true but the reverse complement could not be computed
CloneNotSupportedException - if an array of BitSets cannot be cloned
WrongAlphabetException - see Sequence.getHammingDistance(Sequence)
WrongLengthException - see Sequence.getHammingDistance(Sequence)
See Also:
Sequence.getHammingDistance(Sequence), getKmereSequenceStatistic(int, boolean, int, DataSet...)
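
One plausible reading of merging with mismatches is sketched below on plain strings: for every k-mer, the occurrence bits of all k-mers within the allowed Hamming distance are combined with a logical OR. This is an assumption about the behaviour, not the Jstacs algorithm; it ignores bothStrands, and MergeSketch, bits and hamming are hypothetical names.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

final class MergeSketch {
    // Hypothetical helper: a BitSet with the given bits set.
    static BitSet bits(int... idx) {
        BitSet b = new BitSet();
        for (int i : idx) b.set(i);
        return b;
    }

    static int hamming(String a, String b) {
        if (a.length() != b.length()) throw new IllegalArgumentException("length mismatch");
        int d = 0;
        for (int i = 0; i < a.length(); i++) if (a.charAt(i) != b.charAt(i)) d++;
        return d;
    }

    // Returns a new statistic; the input statistic is left unchanged.
    static Map<String, BitSet[]> merge(Map<String, BitSet[]> stat, int maxMismatch) {
        Map<String, BitSet[]> merged = new HashMap<>();
        for (Map.Entry<String, BitSet[]> e : stat.entrySet()) {
            BitSet[] acc = new BitSet[e.getValue().length];
            for (int d = 0; d < acc.length; d++) acc[d] = (BitSet) e.getValue()[d].clone();
            for (Map.Entry<String, BitSet[]> other : stat.entrySet()) {
                if (hamming(e.getKey(), other.getKey()) <= maxMismatch) {
                    for (int d = 0; d < acc.length; d++) acc[d].or(other.getValue()[d]);
                }
            }
            merged.put(e.getKey(), acc);
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, BitSet[]> stat = new HashMap<>();
        stat.put("ACG", new BitSet[] {bits(0)});
        stat.put("ACT", new BitSet[] {bits(1)});
        stat.put("TTT", new BitSet[] {bits(2)});
        System.out.println(merge(stat, 1).get("ACG")[0]); // {0, 1}
    }
}
```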

getConservedPatterns

public static LinkedList<Sequence> getConservedPatterns(Hashtable<Sequence,BitSet[]> statistic,
                                                        int dataSetIndex,
                                                        int threshold)
This method returns a list of Sequences. Each entry corresponds to a sequence or a set of sequences (depending on the input of the statistic) that occurs in more than threshold Sequences of the data set.

Parameters:
statistic - a statistic as obtained from getKmereSequenceStatistic(int, boolean, int, DataSet...) or merge(Hashtable, int, boolean)
dataSetIndex - the index of the BitSet to be used
threshold - a threshold that has to be exceeded by BitSet.cardinality() to be declared as a conserved pattern
Returns:
a list of conserved patterns
See Also:
getKmereSequenceStatistic(int, boolean, int, DataSet...), merge(Hashtable, int, boolean)
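
The threshold test can be sketched in a few lines on a plain Map. As documented, a pattern is kept only if the cardinality of the selected BitSet exceeds the threshold. ConservedPatternSketch and bits are hypothetical names, and sorting the result is an addition for reproducible output rather than documented behaviour.

```java
import java.util.BitSet;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

final class ConservedPatternSketch {
    // Hypothetical helper: a BitSet with the given bits set.
    static BitSet bits(int... idx) {
        BitSet b = new BitSet();
        for (int i : idx) b.set(i);
        return b;
    }

    // Keeps k-mers that occur in more than `threshold` sequences of the chosen data set.
    static List<String> conserved(Map<String, BitSet[]> stat, int dataSetIndex, int threshold) {
        List<String> out = new LinkedList<>();
        for (Map.Entry<String, BitSet[]> e : stat.entrySet()) {
            if (e.getValue()[dataSetIndex].cardinality() > threshold) out.add(e.getKey());
        }
        Collections.sort(out); // only for deterministic output in this sketch
        return out;
    }

    public static void main(String[] args) {
        Map<String, BitSet[]> stat = new HashMap<>();
        stat.put("AAA", new BitSet[] {bits(0, 1, 2)});
        stat.put("CCC", new BitSet[] {bits(0)});
        stat.put("GGG", new BitSet[] {bits(1, 2)});
        System.out.println(conserved(stat, 0, 1)); // [AAA, GGG]
    }
}
```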

removeBackground

public static Hashtable<Sequence,BitSet[]> removeBackground(Hashtable<Sequence,BitSet[]> statistic,
                                                            int fgIndex,
                                                            int bgIndex,
                                                            double fgWeight,
                                                            double bgWeight)
This method removes those entries from the statistic whose weighted foreground cardinality is lower than their weighted background cardinality.

Parameters:
statistic - a statistic as obtained from getKmereSequenceStatistic(int, boolean, int, DataSet...) or merge(Hashtable, int, boolean)
fgIndex - the foreground index of the BitSet to be used
bgIndex - the background index of the BitSet to be used
fgWeight - the weight used to weight the foreground cardinality
bgWeight - the weight used to weight the background cardinality
Returns:
a Hashtable containing only the positive entries
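
The filtering rule can be sketched on a plain Map as a comparison of weighted cardinalities. The sketch keeps an entry when fgWeight times the foreground cardinality is at least bgWeight times the background cardinality; how the real method treats ties is not stated, so keeping them is an assumption. BackgroundFilterSketch and bits are hypothetical names.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

final class BackgroundFilterSketch {
    // Hypothetical helper: a BitSet with the given bits set.
    static BitSet bits(int... idx) {
        BitSet b = new BitSet();
        for (int i : idx) b.set(i);
        return b;
    }

    // Drops entries whose weighted foreground count is lower than the weighted background count.
    static Map<String, BitSet[]> removeBackground(Map<String, BitSet[]> stat,
            int fgIndex, int bgIndex, double fgWeight, double bgWeight) {
        Map<String, BitSet[]> kept = new HashMap<>();
        for (Map.Entry<String, BitSet[]> e : stat.entrySet()) {
            double fg = fgWeight * e.getValue()[fgIndex].cardinality();
            double bg = bgWeight * e.getValue()[bgIndex].cardinality();
            if (fg >= bg) kept.put(e.getKey(), e.getValue()); // ties kept (assumption)
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, BitSet[]> stat = new HashMap<>();
        stat.put("AAA", new BitSet[] {bits(0, 1, 2), bits(0)});  // fg = 3, bg = 1
        stat.put("CCC", new BitSet[] {bits(0), bits(0, 1)});     // fg = 1, bg = 2
        System.out.println(removeBackground(stat, 0, 1, 1.0, 1.0).keySet());  // [AAA]
    }
}
```

Weights are useful when foreground and background data sets differ in size, for example by setting each weight to the reciprocal of the number of sequences in the corresponding data set.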