DataSet

java.lang.Object
- de.jstacs.data.DataSet

All Implemented Interfaces:

Iterable<Sequence>

Direct Known Subclasses:

DNADataSet
```
public class DataSet
extends Object
implements Iterable<Sequence>
```
This is the class for any data set of Sequences. All Sequences in a DataSet have to have the same AlphabetContainer. The Sequences may have different lengths.
For the internal representation the class Sequence is used, where the external alphabet is converted to integral numerical values. The class DataSet knows about this coding via instances of class AlphabetContainer and accordingly Alphabet.
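
As a rough illustration of this coding, the following plain-Java sketch maps an external DNA string to integer codes. It is not the Jstacs implementation: the class name EncodingSketch and the symbol order A, C, G, T are assumptions made only for this example, and a real Alphabet defines its own order.

```java
import java.util.Arrays;

// Hypothetical stand-in for the Alphabet-based coding; not part of Jstacs.
class EncodingSketch {
    // Assumed symbol order for this example only.
    static final String SYMBOLS = "ACGT";

    // Converts each external symbol to its integer code.
    static int[] encode(String external) {
        int[] codes = new int[external.length()];
        for (int i = 0; i < codes.length; i++) {
            int code = SYMBOLS.indexOf(external.charAt(i));
            if (code < 0) {
                throw new IllegalArgumentException("symbol not in alphabet: " + external.charAt(i));
            }
            codes[i] = code;
        }
        return codes;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(encode("GATTACA")));
    }
}
```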

There are different ways to access the elements of a DataSet. If one needs random access there is the method getElementAt(int). For fast sequential access it is recommended to use a DataSet.ElementEnumerator.

DataSet is immutable.

Author:

Jens Keilwagen, Andre Gohr, Jan Grau

See Also:

AlphabetContainer, Alphabet, Sequence

Nested Class Summary

Nested Classes
Modifier and Type	Class and Description
`static class`	`DataSet.ElementEnumerator` This class can be used for fast sequential access to a `DataSet`.
`static class`	`DataSet.PartitionMethod` This `enum` defines different partition methods for a `DataSet`.
`static class`	`DataSet.WeightedDataSetFactory` This class enables you to eliminate `Sequence`s that occur more than once in one or more `DataSet`s.

Constructor Summary

Constructors
Constructor and Description
`DataSet(AlphabetContainer abc, AbstractStringExtractor se)` Creates a new `DataSet` from a `StringExtractor` using the given `AlphabetContainer`.
`DataSet(AlphabetContainer abc, AbstractStringExtractor se, int subsequenceLength)` Creates a new `DataSet` from a `StringExtractor` using the given `AlphabetContainer` and all overlapping windows of length `subsequenceLength`.
`DataSet(AlphabetContainer abc, AbstractStringExtractor se, String delim)` Creates a new `DataSet` from a `StringExtractor` using the given `AlphabetContainer` and a delimiter `delim`.
`DataSet(AlphabetContainer abc, AbstractStringExtractor se, String delim, int subsequenceLength)` Creates a new `DataSet` from a `StringExtractor` using the given `AlphabetContainer`, the given delimiter `delim` and all overlapping windows of length `subsequenceLength`.
`DataSet(AlphabetContainer abc, AbstractStringExtractor se, String delim, int subsequenceLength, double percentage)` Creates a new `DataSet` from a `StringExtractor` using the given `AlphabetContainer`, the given delimiter `delim` and all overlapping windows of length `subsequenceLength`, allowing the given `percentage` of `Sequence`s to be discarded due to a `WrongAlphabetException`.
`DataSet(DataSet s, int subsequenceLength)` Creates a new `DataSet` from a given `DataSet` and a given length `subsequenceLength`. This constructor enables you to use subsequences of the elements of a `DataSet`.
`DataSet(String annotation, Collection<Sequence> seqs)` Creates a new `DataSet` from a `Collection` of `Sequence`s and a given annotation.
`DataSet(String annotation, Sequence... seqs)` Creates a new `DataSet` from an array of `Sequence`s and a given annotation. This constructor is specially designed for the method `StatisticalModel.emitDataSet(int, int...)`

Method Summary

All Methods Static Methods Instance Methods Concrete Methods
Modifier and Type	Method and Description
`static DataSet`	`diff(DataSet data, DataSet... samples)` This method computes the difference between the `DataSet` `data` and the `DataSet`s `samples`.
`Sequence[]`	`getAllElements()` Returns an array of `Sequence`s containing all elements of this `DataSet`.
`AlphabetContainer`	`getAlphabetContainer()` Returns the `AlphabetContainer` of this `DataSet`.
`String`	`getAnnotation()` Returns some annotation of the `DataSet`.
`static String`	`getAnnotation(DataSet... s)` Returns the annotation for an array of `DataSet`s.
`Hashtable<String,HashSet<String>>`	`getAnnotationTypesAndIdentifier()` This method returns all `SequenceAnnotation` types and the corresponding identifier which occur in this `DataSet`.
`double`	`getAverageElementLength()` Returns the average length of all `Sequence`s in this `DataSet`.
`DataSet`	`getCompositeDataSet(int[] starts, int[] lengths)` This method enables you to use only composite `Sequence`s of all elements in the current `DataSet`.
`Sequence`	`getElementAt(int i)` This method returns the element, i.e. the `Sequence`, with index `i`.
`int`	`getElementLength()` Returns the length of the elements, i.e. the `Sequence`s, in this `DataSet`.
`DataSet`	`getInfixDataSet(int start, int length)` This method enables you to use only an infix of all elements, i.e. the `Sequence`s, in the current `DataSet`.
`int`	`getMaximalElementLength()` Returns the maximal length of an element, i.e. a `Sequence`, in this `DataSet`.
`int`	`getMinimalElementLength()` Returns the minimal length of an element, i.e. a `Sequence`, in this `DataSet`.
`int`	`getNumberOfElements()` Returns the number of elements, i.e. the `Sequence`s, in this `DataSet`.
`int`	`getNumberOfElementsWithLength(int len)` Returns the number of overlapping elements that can be extracted.
`double`	`getNumberOfElementsWithLength(int len, double[] weights)` Returns the weighted number of overlapping elements that can be extracted.
`DataSet`	`getPartialDataSet(int[]... indexes)` Returns a new `DataSet` that contains all elements of this `DataSet` that are specified by the supplied pairs of start and end indexes in `indexes`.
`DataSet`	`getPartialDataSet(int start, int end)` Returns a new `DataSet` that contains all elements of this `DataSet` that are specified by the supplied `start` (inclusive) and `end` (exclusive) indexes.
`DataSet`	`getReverseComplementaryDataSet()` Returns a `DataSet` that contains the reverse complement of all `Sequence`s in this `DataSet`.
`int[][]`	`getSequenceAnnotationIndexMatrix(String rowType, Hashtable<String,Integer> rowHash, String columnType, Hashtable<String,Integer> columnHash)` This method creates a matrix which contains the index of the `Sequence` with specific `SequenceAnnotation` combination or -1 if the `DataSet` does not contain any `Sequence` with such a combination.
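`DataSet`	`getSuffixDataSet(int start)` This method enables you to use only a suffix of all elements, i.e. the `Sequence`s, in the current `DataSet`.
`static DataSet`	`intersection(DataSet... samples)` This method computes the intersection between all `DataSet`s of the array, i.e. it returns a `DataSet` containing only `Sequence`s that are contained in all `DataSet`s of the array.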
`boolean`	`isDiscreteDataSet()` This method indicates if all positions use discrete values.
`boolean`	`isSimpleDataSet()` This method indicates whether all random variables are defined over the same range, i.e. all positions use the same (fixed) alphabet.
`Iterator<Sequence>`	`iterator()`
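`DataSet[]`	`partition(DataSet.PartitionMethod method, double... percentage)` This method partitions the elements, i.e. the `Sequence`s, of the `DataSet` in distinct parts where each part holds the corresponding percentage given in the array `percentage`.
`DataSet[]`	`partition(DataSet.PartitionMethod method, int k)` This method partitions the elements, i.e. the `Sequence`s, of the `DataSet` in `k` distinct parts.
`Pair<DataSet[],double[][]>`	`partition(double[] sequenceWeights, DataSet.PartitionMethod method, double... percentage)` This method partitions the elements, i.e. the `Sequence`s, of the `DataSet` and the corresponding weights in distinct parts where each part holds the corresponding percentage given in the array `percentage`.
`Pair<DataSet[],double[][]>`	`partition(double[] sequenceWeights, DataSet.PartitionMethod method, int k)` This method partitions the elements, i.e. the `Sequence`s, of the `DataSet` and the corresponding weights in `k` distinct parts.
`Pair<DataSet,double[]>`	`resize(double[] weights, int subsequenceLength)` Returns a modified version of this data set with adjusted subsequence length.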
`void`	`save(File f)` This method writes the `DataSet` to a file `f`.
`void`	`save(OutputStream stream, char commentChar, SequenceAnnotationParser p)` This method writes all `Sequence`s including their `SequenceAnnotation`s to an `OutputStream`.
`Pair<DataSet,double[]>`	`subSampling(double number, double[] weights)` Sub-samples sequences and corresponding weights from this `DataSet`.
`DataSet`	`subSampling(int number)` Randomly samples elements, i.e. `Sequence`s, from the set of all elements contained in this `DataSet`.
`String`	`toString()`
`static DataSet`	`union(DataSet... s)` Unites all `DataSet`s of the array `s`.
`static DataSet`	`union(DataSet[] s, boolean[] in)` This method unites all `DataSet`s of the array `s` regarding the array `in`.
`static Pair<DataSet,double[]>`	`union(DataSet[] s, double[][] weights, boolean[] in)` This method unites all `DataSet`s of the array `s` regarding the array `in` and sets the element length in the united `DataSet` to `subsequenceLength`.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait

Methods inherited from interface java.lang.Iterable
forEach, spliterator

- Constructor Detail
  - DataSet
```
public DataSet(AlphabetContainer abc,
               AbstractStringExtractor se)
        throws WrongAlphabetException,
               EmptyDataSetException,
               WrongLengthException
```
    Creates a new DataSet from a StringExtractor using the given AlphabetContainer.
    
    Parameters:
    
    abc - the AlphabetContainer
    
    se - the StringExtractor
    
    Throws:
    
    WrongAlphabetException - if the AlphabetContainer is not suitable
    
    EmptyDataSetException - if the DataSet would be empty
    
    WrongLengthException - never happens (forwarded from DataSet(AlphabetContainer, AbstractStringExtractor, String, int))
    
    See Also:
    
    DataSet(AlphabetContainer, AbstractStringExtractor, String, int)
  - DataSet
```
public DataSet(AlphabetContainer abc,
               AbstractStringExtractor se,
               int subsequenceLength)
        throws WrongAlphabetException,
               WrongLengthException,
               EmptyDataSetException
```
    Creates a new DataSet from a StringExtractor using the given AlphabetContainer and all overlapping windows of length subsequenceLength.
    
    Parameters:
    
    abc - the AlphabetContainer
    
    se - the StringExtractor
    
    subsequenceLength - the length of the window sliding on the String of se; if subsequenceLength is 0 (zero), the Sequences are used as given by the StringExtractor
    
    Throws:
    
    WrongAlphabetException - if the AlphabetContainer is not suitable
    
    WrongLengthException - if the subsequence length is not supported
    
    EmptyDataSetException - if the DataSet would be empty
    
    See Also:
    
    DataSet(AlphabetContainer, AbstractStringExtractor, String, int)
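
    The sliding-window behaviour described for subsequenceLength can be pictured with the plain-Java sketch below. It works on ordinary Strings instead of Sequences, and the class WindowSketch is a hypothetical helper for illustration, not part of Jstacs. A length of 0 keeps the string unchanged, mirroring the parameter description above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of overlapping windows; not part of Jstacs.
class WindowSketch {
    // Returns all overlapping windows of the given length; 0 keeps the string as given.
    static List<String> windows(String s, int subsequenceLength) {
        List<String> result = new ArrayList<>();
        if (subsequenceLength == 0) {
            result.add(s);
            return result;
        }
        if (subsequenceLength < 0 || subsequenceLength > s.length()) {
            throw new IllegalArgumentException("unsupported window length: " + subsequenceLength);
        }
        // A string of length n yields n - subsequenceLength + 1 windows.
        for (int start = 0; start + subsequenceLength <= s.length(); start++) {
            result.add(s.substring(start, start + subsequenceLength));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(windows("ACGTA", 3)); // prints [ACG, CGT, GTA]
    }
}
```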
  - DataSet
```
public DataSet(AlphabetContainer abc,
               AbstractStringExtractor se,
               String delim)
        throws WrongAlphabetException,
               EmptyDataSetException,
               WrongLengthException
```
    Creates a new DataSet from a StringExtractor using the given AlphabetContainer and a delimiter delim.
    
    Parameters:
    
    abc - the AlphabetContainer
    
    se - the StringExtractor
    
    delim - the delimiter for parsing the Strings
    
    Throws:
    
    WrongAlphabetException - if the AlphabetContainer is not suitable
    
    EmptyDataSetException - if the DataSet would be empty
    
    WrongLengthException - never happens (forwarded from DataSet(AlphabetContainer, AbstractStringExtractor, String, int))
    
    See Also:
    
    DataSet(AlphabetContainer, AbstractStringExtractor, String, int)
  - DataSet
```
public DataSet(AlphabetContainer abc,
               AbstractStringExtractor se,
               String delim,
               int subsequenceLength)
        throws EmptyDataSetException,
               WrongAlphabetException,
               WrongLengthException
```
    Creates a new DataSet from a StringExtractor using the given AlphabetContainer, the given delimiter delim and all overlapping windows of length subsequenceLength.
    
    Parameters:
    
    abc - the AlphabetContainer
    
    se - the StringExtractor
    
    delim - the delimiter for parsing the Strings
    
    subsequenceLength - the length of the window sliding on the String of se; if subsequenceLength is 0 (zero), the Sequences are used as given by the StringExtractor
    
    Throws:
    
    WrongAlphabetException - if the AlphabetContainer is not suitable
    
    EmptyDataSetException - if the DataSet would be empty
    
    WrongLengthException - if the subsequence length is not supported
  - DataSet
```
public DataSet(AlphabetContainer abc,
               AbstractStringExtractor se,
               String delim,
               int subsequenceLength,
               double percentage)
        throws EmptyDataSetException,
               WrongAlphabetException,
               WrongLengthException
```
    Creates a new DataSet from a StringExtractor using the given AlphabetContainer, the given delimiter delim and all overlapping windows of length subsequenceLength.
    
    Parameters:
    
    abc - the AlphabetContainer
    
    se - the StringExtractor
    
    delim - the delimiter for parsing the Strings
    
    subsequenceLength - the length of the window sliding on the String of se; if subsequenceLength is 0 (zero), the Sequences are used as given by the StringExtractor
    
    percentage - the percentage of Sequences allowed to be discarded due to WrongAlphabetException, all other constructors set this value to 0.
    
    Throws:
    
    WrongAlphabetException - if the AlphabetContainer is not suitable
    
    EmptyDataSetException - if the DataSet would be empty
    
    WrongLengthException - if the subsequence length is not supported
  - DataSet
```
public DataSet(DataSet s,
               int subsequenceLength)
        throws WrongLengthException
```
    Creates a new DataSet from a given DataSet and a given length subsequenceLength.
    This constructor enables you to use subsequences of the elements of a DataSet.
    
    It can also be used to ensure that all sequences that can be accessed by getElementAt(int) are real objects and do not have to be created at the invocation of the method. (The same holds for the DataSet.ElementEnumerator. In those cases both ways to access the Sequence are approximately equally fast.)
    
    Parameters:
    
    s - the given DataSet
    
    subsequenceLength - the new element length
    
    Throws:
    
    WrongLengthException - if something is wrong with subsequenceLength
  - DataSet
```
public DataSet(String annotation,
               Sequence... seqs)
        throws EmptyDataSetException,
               WrongAlphabetException
```
    Creates a new DataSet from an array of Sequences and a given annotation.
    This constructor is specially designed for the method StatisticalModel.emitDataSet(int, int...)
    
    Parameters:
    
    annotation - the annotation of the DataSet
    
    seqs - the Sequence(s)
    
    Throws:
    
    EmptyDataSetException - if the array seqs is null or the length is 0
    
    WrongAlphabetException - if the AlphabetContainers do not match
  - DataSet
```
public DataSet(String annotation,
               Collection<Sequence> seqs)
        throws EmptyDataSetException,
               WrongAlphabetException
```
    Creates a new DataSet from a Collection of Sequences and a given annotation.
    
    Parameters:
    
    annotation - the annotation of the DataSet
    
    seqs - the Sequence(s)
    
    Throws:
    
    EmptyDataSetException - if the Collection seqs is null or empty
    
    WrongAlphabetException - if the AlphabetContainers do not match
- Method Detail
  - getAnnotation
```
public static final String getAnnotation(DataSet... s)
```
    Returns the annotation for an array of DataSets.
    
    Parameters:
    
    s - an array of DataSets
    
    Returns:
    
    the annotation
    
    See Also:
    
    getAnnotation()
  - diff
```
public static final DataSet diff(DataSet data,
                                 DataSet... samples)
                          throws EmptyDataSetException,
                                 WrongAlphabetException
```
    This method computes the difference between the DataSet data and the DataSets samples.
    
    Parameters:
    
    data - the minuend
    
    samples - the subtrahends
    
    Returns:
    
    the difference
    
    Throws:
    
    WrongAlphabetException - if the AlphabetContainers do not match, i.e., if the DataSets are from different domains
    
    EmptyDataSetException - if the difference is empty
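
    The minuend/subtrahend wording can be read as the following plain-Java sketch. It assumes simple set-membership semantics on Strings (an element of data is dropped if it occurs in any of the samples); how Jstacs treats repeated Sequences is not stated here. DiffSketch is a hypothetical class, and IllegalStateException merely stands in for EmptyDataSetException.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of a data set difference; not part of Jstacs.
class DiffSketch {
    // Keeps the elements of data that occur in none of the samples.
    static List<String> diff(List<String> data, List<List<String>> samples) {
        Set<String> excluded = new HashSet<>();
        for (List<String> sample : samples) {
            excluded.addAll(sample);
        }
        List<String> result = new ArrayList<>();
        for (String element : data) {
            if (!excluded.contains(element)) {
                result.add(element);
            }
        }
        if (result.isEmpty()) {
            // Stand-in for EmptyDataSetException.
            throw new IllegalStateException("the difference is empty");
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList("ACG", "TTT", "GGA");
        List<List<String>> samples = Arrays.asList(Arrays.asList("TTT"), Arrays.asList("CCC"));
        System.out.println(diff(data, samples)); // prints [ACG, GGA]
    }
}
```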
  - intersection
```
public static final DataSet intersection(DataSet... samples)
                                  throws IllegalArgumentException,
                                         EmptyDataSetException
```
    This method computes the intersection between all DataSets of the array, i.e. it returns a DataSet containing only Sequences that are contained in all DataSets of the array.
    
    Parameters:
    
    samples - the array of DataSets
    
    Returns:
    
    the intersection of the elements/DataSets in the array
    
    Throws:
    
    IllegalArgumentException - if the elements of the array are from different domains
    
    EmptyDataSetException - if the intersection is empty
  - union
```
public static final DataSet union(DataSet[] s,
                                  boolean[] in)
                           throws IllegalArgumentException,
                                  EmptyDataSetException
```
    This method unites all DataSets of the array s regarding the array in.
    
    Parameters:
    
    s - the array of DataSets
    
    in - an array indicating which DataSet is used in the union, if in[i]==true the DataSet s[i] is used
    
    Returns:
    
    the united DataSet
    
    Throws:
    
    IllegalArgumentException - if s.length != in.length or the Alphabets do not match
    
    EmptyDataSetException - if the union is empty
    
    See Also:
    
    union(DataSet[], double[][], boolean[])
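
    The role of the array in can be sketched in plain Java as a selection mask. The sketch assumes that the union simply concatenates the selected sets without removing duplicates, which this page does not confirm. UnionSketch is a hypothetical class working on Strings, and IllegalStateException stands in for EmptyDataSetException.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of a masked union; not part of Jstacs.
class UnionSketch {
    // Concatenates every sets.get(i) for which in[i] is true.
    static List<String> union(List<List<String>> sets, boolean[] in) {
        if (sets.size() != in.length) {
            throw new IllegalArgumentException("sets and mask differ in length");
        }
        List<String> result = new ArrayList<>();
        for (int i = 0; i < in.length; i++) {
            if (in[i]) {
                result.addAll(sets.get(i));
            }
        }
        if (result.isEmpty()) {
            // Stand-in for EmptyDataSetException.
            throw new IllegalStateException("the union is empty");
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> sets = Arrays.asList(
                Arrays.asList("ACG"), Arrays.asList("TTT"), Arrays.asList("GGA", "CCC"));
        System.out.println(union(sets, new boolean[] {true, false, true})); // prints [ACG, GGA, CCC]
    }
}
```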
  - union
```
public static final DataSet union(DataSet... s)
                           throws IllegalArgumentException
```
    Unites all DataSets of the array s.
    
    Parameters:
    
    s - the array of DataSets
    
    Returns:
    
    the united DataSet
    
    Throws:
    
    IllegalArgumentException - if the Alphabets do not match
    
    See Also:
    
    union(DataSet[], boolean[])
  - union
```
public static final Pair<DataSet,double[]> union(DataSet[] s,
                                                 double[][] weights,
                                                 boolean[] in)
                                          throws IllegalArgumentException,
                                                 EmptyDataSetException,
                                                 WrongLengthException
```
    This method unites all DataSets of the array s regarding the array in and sets the element length in the united DataSet to subsequenceLength.
    
    Parameters:
    
    s - the array of DataSets
    
    weights - the weights of the sequences in each data set, can be null
    
    in - an array indicating which DataSet is used in the union, if in[i]==true the DataSet s[i] is used
    
    Returns:
    
    the united DataSet
    
    Throws:
    
    IllegalArgumentException - if s.length != in.length or the Alphabets do not match
    
    EmptyDataSetException - if the union is empty
    
    WrongLengthException - if the united DataSet does not support this subsequenceLength
  - getAllElements
```
public Sequence[] getAllElements()
```
    Returns an array of Sequences containing all elements of this DataSet.
    
    Returns:
    
    all elements (Sequences) of this DataSet
    
    See Also:
    
    DataSet.ElementEnumerator
  - getAlphabetContainer
```
public final AlphabetContainer getAlphabetContainer()
```
    Returns the AlphabetContainer of this DataSet.
    
    Returns:
    
    the AlphabetContainer of this DataSet
  - getAnnotation
```
public final String getAnnotation()
```
    Returns some annotation of the DataSet.
    
    Returns:
    
    some annotation of the DataSet
  - getCompositeDataSet
```
public final DataSet getCompositeDataSet(int[] starts,
                                         int[] lengths)
                                  throws IllegalArgumentException
```
    This method enables you to use only composite Sequences of all elements in the current DataSet. Each composite Sequence will be built from one corresponding Sequence in this DataSet and all composite Sequences will be returned in a new DataSet.
    
    Parameters:
    
    starts - the start positions of the chunks
    
    lengths - the lengths of the chunks
    
    Returns:
    
    a composite DataSet
    
    Throws:
    
    IllegalArgumentException - if either starts or lengths or both in combination are not suitable
    
    See Also:
    
    Sequence.getCompositeSequence(AlphabetContainer, int[], int[])
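
    The way starts and lengths define the chunks of a composite can be sketched on a plain String as follows. CompositeSketch is a hypothetical class for illustration only; it assumes the chunks are concatenated in the given order, and it does not use the Jstacs Sequence type.

```java
// Hypothetical illustration of building a composite from chunks; not part of Jstacs.
class CompositeSketch {
    // Concatenates the chunks s[starts[i], starts[i] + lengths[i]) in the given order.
    static String composite(String s, int[] starts, int[] lengths) {
        if (starts.length != lengths.length) {
            throw new IllegalArgumentException("starts and lengths differ in length");
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < starts.length; i++) {
            int end = starts[i] + lengths[i];
            if (starts[i] < 0 || lengths[i] < 0 || end > s.length()) {
                throw new IllegalArgumentException("chunk " + i + " is out of range");
            }
            sb.append(s, starts[i], end);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Chunks: positions 0..1 ("AC") and positions 5..7 ("CGT").
        System.out.println(composite("ACGTACGT", new int[] {0, 5}, new int[] {2, 3})); // prints ACCGT
    }
}
```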
  - getElementAt
```
public Sequence getElementAt(int i)
```
    This method returns the element, i.e. the Sequence, with index i. See also the comment on the constructor DataSet(DataSet, int).
    
    Parameters:
    
    i - the index of the element, i.e. the Sequence
    
    Returns:
    
    the element, i.e. the Sequence, with index i
  - getElementLength
```
public int getElementLength()
```
    Returns the length of the elements, i.e. the Sequences, in this DataSet.
    
    Returns:
    
    the length of the elements, i.e. the Sequences, in this DataSet
  - getAverageElementLength
```
public double getAverageElementLength()
```
    Returns the average length of all Sequences in this DataSet.
    
    Returns:
    
    the average length
  - getPartialDataSet
```
public final DataSet getPartialDataSet(int start,
                                       int end)
                                throws EmptyDataSetException
```
    Returns a new DataSet that contains all elements of this DataSet that are specified by the supplied start (inclusive) and end (exclusive) indexes.
    
    Parameters:
    
    start - the index of the first Sequence
    
    end - the index after the last Sequence
    
    Returns:
    
    the partial DataSet
    
    Throws:
    
    EmptyDataSetException - if start is equal to end
  - getPartialDataSet
```
public final DataSet getPartialDataSet(int[]... indexes)
                                throws EmptyDataSetException
```
    Returns a new DataSet that contains all elements of this DataSet that are specified by the supplied pairs of start and end indexes in indexes. Each indexes array must be of length 2, where the first entry specifies the start index of the first sequence within this DataSet (inclusive) and the second entry specifies the end index (exclusive). If some of the indexes specified in different indexes arrays overlap, the returned DataSet may contain duplicates of Sequences and may even be larger than this DataSet.
    
    Parameters:
    
    indexes - the indexes
    
    Returns:
    
    the partial DataSet
    
    Throws:
    
    EmptyDataSetException - if no indexes array is supplied or all ends are equal to the corresponding starts
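
    The half-open (start inclusive, end exclusive) ranges and the effect of overlapping ranges can be illustrated with a plain-Java sketch on a list of Strings. PartialSketch is a hypothetical class, not part of Jstacs, and IllegalStateException stands in for EmptyDataSetException.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of selecting elements by index ranges; not part of Jstacs.
class PartialSketch {
    // Collects elements[range[0], range[1]) for every supplied range, in order.
    static List<String> partial(List<String> elements, int[]... indexes) {
        List<String> result = new ArrayList<>();
        for (int[] range : indexes) {
            if (range.length != 2) {
                throw new IllegalArgumentException("each range needs a start and an end");
            }
            result.addAll(elements.subList(range[0], range[1]));
        }
        if (result.isEmpty()) {
            // Stand-in for EmptyDataSetException.
            throw new IllegalStateException("no elements selected");
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> elements = Arrays.asList("a", "b", "c", "d");
        // Overlapping ranges produce duplicates, so the result can exceed the input size.
        System.out.println(partial(elements, new int[] {0, 3}, new int[] {1, 4})); // prints [a, b, c, b, c, d]
    }
}
```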
  - getInfixDataSet
```
public final DataSet getInfixDataSet(int start,
                                     int length)
                              throws IllegalArgumentException
```
    This method enables you to use only an infix of all elements, i.e. the Sequences, in the current DataSet. The subsequences will be returned in a new DataSet.
    
    This method can also be used to create a DataSet of prefixes if the element length is not zero.
    
    Parameters:
    
    start - the start position of the infix
    
    length - the length of the infix, has to be positive
    
    Returns:
    
    a DataSet of the specified infixes
    
    Throws:
    
    IllegalArgumentException - if either start or length or both in combination are not suitable
  - getReverseComplementaryDataSet
```
public DataSet getReverseComplementaryDataSet()
                                       throws OperationNotSupportedException
```
    Returns a DataSet that contains the reverse complement of all Sequences in this DataSet.
    
    Returns:
    
    the reverse complements
    
    Throws:
    
    OperationNotSupportedException - if the AlphabetContainer of any of the Sequences in this DataSet is not complementable
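
    For DNA, the reverse complement reads the sequence backwards and swaps A with T and C with G. The sketch below shows this on a plain String. ReverseComplementSketch is a hypothetical class, not part of Jstacs, and UnsupportedOperationException stands in for OperationNotSupportedException when a symbol has no complement.

```java
// Hypothetical illustration of a DNA reverse complement; not part of Jstacs.
class ReverseComplementSketch {
    // Returns the complement of a single DNA symbol.
    static char complement(char c) {
        switch (c) {
            case 'A': return 'T';
            case 'T': return 'A';
            case 'C': return 'G';
            case 'G': return 'C';
            default:
                // Stand-in for OperationNotSupportedException.
                throw new UnsupportedOperationException("not complementable: " + c);
        }
    }

    // Walks the string from the last symbol to the first and complements each symbol.
    static String reverseComplement(String dna) {
        StringBuilder sb = new StringBuilder(dna.length());
        for (int i = dna.length() - 1; i >= 0; i--) {
            sb.append(complement(dna.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(reverseComplement("AACG")); // prints CGTT
    }
}
```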
  - getMinimalElementLength
```
public int getMinimalElementLength()
```
    Returns the minimal length of an element, i.e. a Sequence, in this DataSet.
    
    Returns:
    
    the minimal length of an element, i.e. a Sequence, in this DataSet
  - getMaximalElementLength
```
public int getMaximalElementLength()
```
    Returns the maximal length of an element, i.e. a Sequence, in this DataSet.
    
    Returns:
    
    the maximal length of an element, i.e. a Sequence, in this DataSet
  - getNumberOfElements
```
public int getNumberOfElements()
```
    Returns the number of elements, i.e. the Sequences, in this DataSet.
    
    Returns:
    
    the number of elements, i.e. the Sequences, in this DataSet
  - iterator
```
public Iterator<Sequence> iterator()
```
    Specified by:
    
    iterator in interface Iterable<Sequence>
  - getNumberOfElementsWithLength
```
public int getNumberOfElementsWithLength(int len)
                                  throws WrongLengthException
```
    Returns the number of overlapping elements that can be extracted.
    
    Parameters:
    
    len - the length of the elements
    
    Returns:
    
    the number of elements with the specified length
    
    Throws:
    
    WrongLengthException - if the given length is bigger than the minimal element length
    
    See Also:
    
    getNumberOfElementsWithLength(int, double[])
  - getNumberOfElementsWithLength
```
public double getNumberOfElementsWithLength(int len,
                                            double[] weights)
                                     throws WrongLengthException,
                                            IllegalArgumentException
```
    Returns the weighted number of overlapping elements that can be extracted.
    
    Parameters:
    
    len - the length of the elements
    
    weights - the weights of each element of the data set (see getElementAt(int)), can be null
    
    Returns:
    
    the weighted number of elements with the specified length
    
    Throws:
    
    WrongLengthException - if the given length is bigger than the minimal element length
    
    IllegalArgumentException - if the weights have a wrong dimension
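
    One plausible reading of the (weighted) count is sketched below: a sequence of length n contributes n - len + 1 overlapping windows, and each contribution is multiplied by the weight of its sequence. This formula is an assumption made for illustration, not taken from the Jstacs source. WindowCountSketch is a hypothetical class, and IllegalArgumentException stands in for WrongLengthException.

```java
// Hypothetical illustration of counting overlapping windows; not part of Jstacs.
class WindowCountSketch {
    // Sums weight * (length - len + 1) over all sequences; null weights count as 1.
    static double count(int[] sequenceLengths, int len, double[] weights) {
        if (weights != null && weights.length != sequenceLengths.length) {
            throw new IllegalArgumentException("weights have a wrong dimension");
        }
        double total = 0;
        for (int i = 0; i < sequenceLengths.length; i++) {
            if (len > sequenceLengths[i]) {
                // Stand-in for WrongLengthException.
                throw new IllegalArgumentException("len exceeds the minimal element length");
            }
            double w = (weights == null) ? 1.0 : weights[i];
            total += w * (sequenceLengths[i] - len + 1);
        }
        return total;
    }

    public static void main(String[] args) {
        int[] lengths = {5, 7};
        System.out.println(count(lengths, 3, null));                    // 3 + 5 windows
        System.out.println(count(lengths, 3, new double[] {0.5, 2.0})); // 0.5 * 3 + 2.0 * 5
    }
}
```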
  - getSuffixDataSet
```
public final DataSet getSuffixDataSet(int start)
                               throws IllegalArgumentException
```
    This method enables you to use only a suffix of all elements, i.e. the Sequences, in the current DataSet. The subsequences will be returned in a new DataSet.
    
    Parameters:
    
    start - the start position of the suffix
    
    Returns:
    
    a DataSet of specified suffixes
    
    Throws:
    
    IllegalArgumentException - if start is not suitable
  - isSimpleDataSet
```
public final boolean isSimpleDataSet()
```
    This method indicates whether all random variables are defined over the same range, i.e. all positions use the same (fixed) alphabet.
    
    Returns:
    
    true if the DataSet is simple, false otherwise
    
    See Also:
    
    AlphabetContainer.isSimple()
  - isDiscreteDataSet
```
public final boolean isDiscreteDataSet()
```
    This method indicates if all positions use discrete values.
    
    Returns:
    
    true if the DataSet is discrete, false otherwise
    
    See Also:
    
    AlphabetContainer.isDiscrete()
  - partition
```
public DataSet[] partition(DataSet.PartitionMethod method,
                           double... percentage)
                    throws IllegalArgumentException,
                           EmptyDataSetException
```
    This method partitions the elements, i.e. the Sequences, of the DataSet in distinct parts where each part holds the corresponding percentage given in the array percentage.
    
    Parameters:
    
    method - the method used to partition the DataSet (partitioning criterion)
    
    percentage - the array of percentages for each "subsample"
    
    Returns:
    
    the array of partitioned DataSets
    
    Throws:
    
    IllegalArgumentException - if something with the percentages is not correct (sum != 1 or one value is not in [0,1])
    
    EmptyDataSetException - if at least one of the created partitions is empty
    
    See Also:
    
    DataSet.PartitionMethod, DataSet.PartitionMethod.PARTITION_BY_NUMBER_OF_ELEMENTS, DataSet.PartitionMethod.PARTITION_BY_NUMBER_OF_SYMBOLS
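
    Partitioning by the number of elements can be sketched in plain Java as below. The sketch splits a list of Strings in its given order and rounds the cumulative share to the nearest element boundary; whether Jstacs shuffles or rounds differently is not stated on this page. PartitionSketch is a hypothetical class, and IllegalStateException stands in for EmptyDataSetException.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of partitioning by percentages; not part of Jstacs.
class PartitionSketch {
    // Splits elements into consecutive parts whose sizes follow the given percentages.
    static List<List<String>> partition(List<String> elements, double... percentage) {
        double sum = 0;
        for (double p : percentage) {
            if (p < 0 || p > 1) {
                throw new IllegalArgumentException("percentage not in [0,1]: " + p);
            }
            sum += p;
        }
        if (Math.abs(sum - 1.0) > 1e-9) {
            throw new IllegalArgumentException("percentages do not sum to 1");
        }
        List<List<String>> parts = new ArrayList<>();
        int start = 0;
        double cumulative = 0;
        for (int i = 0; i < percentage.length; i++) {
            cumulative += percentage[i];
            // The last part takes all remaining elements to avoid rounding gaps.
            int end = (i == percentage.length - 1)
                    ? elements.size()
                    : (int) Math.round(cumulative * elements.size());
            if (end <= start) {
                // Stand-in for EmptyDataSetException.
                throw new IllegalStateException("partition " + i + " is empty");
            }
            parts.add(new ArrayList<>(elements.subList(start, end)));
            start = end;
        }
        return parts;
    }

    public static void main(String[] args) {
        List<String> elements = Arrays.asList("a", "b", "c", "d", "e", "f", "g", "h", "i", "j");
        System.out.println(partition(elements, 0.7, 0.3));
    }
}
```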
  - partition
```
public Pair<DataSet[],double[][]> partition(double[] sequenceWeights,
                                            DataSet.PartitionMethod method,
                                            double... percentage)
                                     throws IllegalArgumentException,
                                            EmptyDataSetException
```
    This method partitions the elements, i.e. the Sequences, of the DataSet and the corresponding weights in distinct parts where each part holds the corresponding percentage given in the array percentage.
    
    Parameters:
    
    sequenceWeights - the weights for the sequences (might be null)
    
    method - the method used to partition the DataSet (partitioning criterion)
    
    percentage - the array of percentages for each "subsample"
    
    Returns:
    
    a Pair containing an array of partitioned DataSets and an array of partitioned sequence weights
    
    Throws:
    
    IllegalArgumentException - if something with the percentages is not correct (sum != 1 or one value is not in [0,1])
    
    EmptyDataSetException - if at least one of the created partitions is empty
    
    See Also:
    
    DataSet.PartitionMethod, DataSet.PartitionMethod.PARTITION_BY_NUMBER_OF_ELEMENTS, DataSet.PartitionMethod.PARTITION_BY_NUMBER_OF_SYMBOLS
  - partition
```
public DataSet[] partition(DataSet.PartitionMethod method,
                           int k)
                    throws IllegalArgumentException,
                           EmptyDataSetException
```
    This method partitions the elements, i.e. the Sequences, of the DataSet in k distinct parts.
    
    Parameters:
    
    k - the number of distinct parts
    
    method - the method used to partition the DataSet (partitioning criterion)
    
    Returns:
    
    the array of partitioned DataSets
    
    Throws:
    
    IllegalArgumentException - if k is not correct
    
    EmptyDataSetException - if at least one of the created partitions is empty
    
    See Also:
    
    DataSet.PartitionMethod, DataSet.PartitionMethod.PARTITION_BY_NUMBER_OF_ELEMENTS, DataSet.PartitionMethod.PARTITION_BY_NUMBER_OF_SYMBOLS
  - partition
```
public Pair<DataSet[],double[][]> partition(double[] sequenceWeights,
                                            DataSet.PartitionMethod method,
                                            int k)
                                     throws IllegalArgumentException,
                                            EmptyDataSetException
```
    This method partitions the elements, i.e. the Sequences, of the DataSet and the corresponding weights in k distinct parts.
    
    Parameters:
    
    sequenceWeights - the weights for the sequences (might be null)
    
    k - the number of distinct parts
    
    method - the method used to partition the DataSet (partitioning criterion)
    
    Returns:
    
    a Pair containing an array of partitioned DataSets and an array of partitioned sequence weights
    
    Throws:
    
    IllegalArgumentException - if k is not correct
    
    EmptyDataSetException - if at least one of the created partitions is empty
    
    See Also:
    
    DataSet.PartitionMethod, DataSet.PartitionMethod.PARTITION_BY_NUMBER_OF_ELEMENTS, DataSet.PartitionMethod.PARTITION_BY_NUMBER_OF_SYMBOLS
  - subSampling
```
public DataSet subSampling(int number)
                    throws EmptyDataSetException
```
    Randomly samples elements, i.e. Sequences, from the set of all elements, i.e. the Sequences, contained in this DataSet.
    Depending on whether this DataSet is chosen to contain overlapping elements (windows of length subsequenceLength) or not, those elements (overlapping windows, whole sequences) are subsampled.
    
    Parameters:
    
    number - the number of Sequences that should be drawn from the contained set of Sequences (with replacement)
    
    Returns:
    
    a new DataSet containing the drawn Sequences
    
    Throws:
    
    EmptyDataSetException - if number is not positive
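    
    Drawing with replacement, as described for the number parameter, can be sketched as follows. This is a minimal illustration under stated assumptions: SubSamplingSketch is a hypothetical class, elements are plain Strings rather than Sequences, draws are uniform, and an IllegalArgumentException stands in for the EmptyDataSetException.
```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Illustrative stand-in, not Jstacs code: elements are plain Strings.
class SubSamplingSketch {

    static List<String> subSample(List<String> elements, int number, Random rnd) {
        if (number <= 0 || elements.isEmpty()) {
            // Stand-in for the EmptyDataSetException of the documented method.
            throw new IllegalArgumentException("number must be positive and elements non-empty");
        }
        List<String> drawn = new ArrayList<>(number);
        for (int i = 0; i < number; i++) {
            // Uniform draw (an assumption); the same element may be drawn repeatedly.
            drawn.add(elements.get(rnd.nextInt(elements.size())));
        }
        return drawn;
    }

    public static void main(String[] args) {
        List<String> elements = Arrays.asList("ACGT", "GGTA", "TTAC");
        System.out.println(subSample(elements, 5, new Random(1)));
    }
}
```
    Since elements are drawn with replacement, number may be larger than the number of elements in the original set.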
  - subSampling
```
public Pair<DataSet,double[]> subSampling(double number,
                                          double[] weights)
                                   throws EmptyDataSetException
```
    Sub-samples sequences and corresponding weights from this DataSet. If weights are supplied, sequences are sampled until the sum of their weights exceeds the given number. Otherwise, number sequences are sampled.
    
    Parameters:
    
    number - the number (or total weight) of sampled sequences
    
    weights - the weights for the sequences, in the same order as sequences in this DataSet
    
    Returns:
    
    a pair of a data set and the associated weights; the weights are null if the supplied weights were null
    
    Throws:
    
    EmptyDataSetException - if the number has been too small to sample a single sequence
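    
    The stopping rule for weighted sub-sampling can be sketched in plain Java. This is a hedged illustration, not the library's code: WeightedSubSamplingSketch and its Sample holder are hypothetical, draws are assumed to be uniform with replacement, the draw that pushes the total weight above number is assumed to be kept, and a fractional number is assumed to be truncated when no weights are given.
```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Illustrative stand-in, not Jstacs code: sequences are plain Strings.
class WeightedSubSamplingSketch {

    static final class Sample {
        final List<String> sequences = new ArrayList<>();
        final List<Double> weights = new ArrayList<>(); // stays empty if no weights were given
    }

    static Sample subSample(List<String> seqs, double[] weights, double number, Random rnd) {
        if (seqs.isEmpty() || number <= 0) {
            throw new IllegalArgumentException("nothing can be sampled");
        }
        Sample sample = new Sample();
        if (weights == null) {
            int count = (int) number; // assumption: fractional part is dropped
            if (count < 1) {
                throw new IllegalArgumentException("number too small to sample a single sequence");
            }
            for (int i = 0; i < count; i++) {
                sample.sequences.add(seqs.get(rnd.nextInt(seqs.size())));
            }
            return sample;
        }
        double total = 0.0;
        // Keep drawing until the accumulated weight exceeds the requested total.
        while (total <= number) {
            int idx = rnd.nextInt(seqs.size());
            if (!(weights[idx] > 0.0)) {
                throw new IllegalArgumentException("weights must be positive in this sketch");
            }
            sample.sequences.add(seqs.get(idx));
            sample.weights.add(weights[idx]);
            total += weights[idx];
        }
        return sample;
    }

    public static void main(String[] args) {
        List<String> seqs = Arrays.asList("ACGT", "GGTA", "TTAC");
        Sample s = subSample(seqs, new double[] {0.5, 1.5, 2.0}, 4.0, new Random(1));
        System.out.println(s.sequences + " " + s.weights);
    }
}
```
    The sketch requires positive weights so that the loop is guaranteed to terminate.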
  - resize
```
public final Pair<DataSet,double[]> resize(double[] weights,
                                           int subsequenceLength)
                                    throws WrongLengthException
```
    Returns a modified version of this data set with an adjusted subsequence length. Each sub-sequence receives a copy of the weight of the original sequence it was taken from.
    
    Parameters:
    
    weights - the original weights, may be null
    
    subsequenceLength - the new subsequence length
    
    Returns:
    
    the data set of the new subsequence length and the copied weights
    
    Throws:
    
    WrongLengthException - if the supplied subsequence length is not possible for this data set (e.g., shorter than the shortest sequence but not 0)
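    
    How weights are copied to the windows of a sequence can be sketched as follows. This is a minimal illustration under stated assumptions: ResizeSketch and its Resized holder are hypothetical, sequences are plain Strings, a length of 0 is taken to mean whole sequences, null weights are treated as 1.0, and the sketch's own validation (rejecting negative lengths and lengths longer than a sequence) stands in for the WrongLengthException.
```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative stand-in, not Jstacs code: sequences are plain Strings.
class ResizeSketch {

    static final class Resized {
        final List<String> windows = new ArrayList<>();
        final List<Double> weights = new ArrayList<>();
    }

    static Resized resize(List<String> seqs, double[] weights, int subsequenceLength) {
        Resized resized = new Resized();
        for (int i = 0; i < seqs.size(); i++) {
            String seq = seqs.get(i);
            // Assumption: missing weights are treated as 1.0.
            double weight = weights == null ? 1.0 : weights[i];
            if (subsequenceLength == 0) {
                // Assumption: 0 means "use the whole sequence".
                resized.windows.add(seq);
                resized.weights.add(weight);
                continue;
            }
            if (subsequenceLength < 0 || subsequenceLength > seq.length()) {
                throw new IllegalArgumentException("length " + subsequenceLength + " does not fit " + seq);
            }
            // Every overlapping window inherits the weight of its original sequence.
            for (int start = 0; start + subsequenceLength <= seq.length(); start++) {
                resized.windows.add(seq.substring(start, start + subsequenceLength));
                resized.weights.add(weight);
            }
        }
        return resized;
    }

    public static void main(String[] args) {
        Resized r = resize(Arrays.asList("ACGT", "GGA"), new double[] {2.0, 0.5}, 3);
        System.out.println(r.windows + " " + r.weights);
        // prints: [ACG, CGT, GGA] [2.0, 2.0, 0.5]
    }
}
```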
  - save
```
public final void save(File f)
                throws IOException
```
    This method writes the DataSet to a file f.
    
    Parameters:
    
    f - the File
    
    Throws:
    
    IOException - if something went wrong with the file
    
    See Also:
    
    save(OutputStream, char, SequenceAnnotationParser)
  - save
```
public final void save(OutputStream stream,
                       char commentChar,
                       SequenceAnnotationParser p)
                throws IOException
```
    This method writes all Sequences, including their SequenceAnnotations, to an OutputStream. The SequenceAnnotations are parsed using the SequenceAnnotationParser.
    
    Parameters:
    
    stream - the stream which is used to write the DataSet
    
    commentChar - the character that marks comment lines
    
    p - the parser for the SequenceAnnotations of the Sequences
    
    Throws:
    
    IOException - if something went wrong while writing into the stream.
    
    See Also:
    
    SequenceAnnotationParser.parseAnnotationToComment(char, SequenceAnnotation...)
  - toString
```
public String toString()
```
    Overrides:
    
    toString in class Object
  - getAnnotationTypesAndIdentifier
```
public Hashtable<String,HashSet<String>> getAnnotationTypesAndIdentifier()
```
    This method returns all SequenceAnnotation types and the corresponding identifiers that occur in this DataSet.
    
    Returns:
    
    a Hashtable with key = SequenceAnnotation type and value = the set of SequenceAnnotation identifiers of that type
    
    See Also:
    
    SequenceAnnotation
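    
    The shape of the returned Hashtable can be illustrated with a small plain-Java sketch. AnnotationTypesSketch is a hypothetical class, and each annotation is modelled as a two-element String array {type, identifier} instead of a SequenceAnnotation; both are assumptions made for the example.
```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Hashtable;
import java.util.List;

// Illustrative stand-in, not Jstacs code: an annotation is a {type, identifier} pair.
class AnnotationTypesSketch {

    static Hashtable<String, HashSet<String>> collect(List<String[]> annotations) {
        Hashtable<String, HashSet<String>> result = new Hashtable<>();
        for (String[] annotation : annotations) {
            // Group identifiers by type; the HashSet drops repeated identifiers.
            result.computeIfAbsent(annotation[0], type -> new HashSet<>()).add(annotation[1]);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String[]> annotations = Arrays.asList(
                new String[] {"species", "human"},
                new String[] {"species", "mouse"},
                new String[] {"tissue", "liver"},
                new String[] {"species", "human"});
        System.out.println(collect(annotations));
    }
}
```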
  - getSequenceAnnotationIndexMatrix
```
public int[][] getSequenceAnnotationIndexMatrix(String rowType,
                                                Hashtable<String,Integer> rowHash,
                                                String columnType,
                                                Hashtable<String,Integer> columnHash)
```
    This method creates a matrix which contains the index of the Sequence with a specific SequenceAnnotation combination, or -1 if the DataSet does not contain any Sequence with such a combination. The rows and columns are indexed according to the Hashtables.
    
    Here is a short example of how to interpret the returned matrix:
```
// s is the DataSet; i and j are a row index and a column index taken from rowHash and columnHash
int[][] matrix = s.getSequenceAnnotationIndexMatrix( rowType, rowHash, columnType, columnHash );

if( matrix[i][j] < 0 ) {
    System.out.println( "There is no Sequence in the DataSet with this SequenceAnnotation combination" );
} else {
    System.out.println( "This is the Sequence: " + s.getElementAt( matrix[i][j] ) );
}
```
    Parameters:
    
    rowType - the SequenceAnnotation type for the rows
    
    rowHash - a Hashtable mapping SequenceAnnotation identifiers to row indices
    
    columnType - the SequenceAnnotation type for the columns
    
    columnHash - a Hashtable mapping SequenceAnnotation identifiers to column indices
    
    Returns:
    
    a matrix holding, for each combination of SequenceAnnotation identifiers for rowType and columnType, the index of the Sequence with that combination, or -1 if the combination does not exist in the DataSet
    
    See Also:
    
    getAnnotationTypesAndIdentifier(), ToolBox.parseHashSet2IndexHashtable(HashSet)
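    
    How such an index matrix could be filled is sketched below in plain Java. This is an illustration under stated assumptions, not the Jstacs implementation: AnnotationIndexMatrixSketch is a hypothetical class, each sequence's annotations are modelled as a Map from type to identifier, the indices in rowHash and columnHash are assumed to run from 0 to size - 1, and if several sequences share a combination the last one is kept.
```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Hashtable;
import java.util.List;
import java.util.Map;

// Illustrative stand-in, not Jstacs code: annotations are type -> identifier maps.
class AnnotationIndexMatrixSketch {

    static int[][] indexMatrix(List<Map<String, String>> annotations,
                               String rowType, Hashtable<String, Integer> rowHash,
                               String columnType, Hashtable<String, Integer> columnHash) {
        int[][] matrix = new int[rowHash.size()][columnHash.size()];
        for (int[] row : matrix) {
            Arrays.fill(row, -1); // -1 marks "no sequence with this combination"
        }
        for (int index = 0; index < annotations.size(); index++) {
            String rowId = annotations.get(index).get(rowType);
            String columnId = annotations.get(index).get(columnType);
            if (rowId == null || columnId == null) {
                continue; // sequence lacks one of the two annotation types
            }
            Integer r = rowHash.get(rowId);
            Integer c = columnHash.get(columnId);
            if (r != null && c != null) {
                matrix[r][c] = index; // assumption: a later sequence overwrites an earlier one
            }
        }
        return matrix;
    }

    public static void main(String[] args) {
        Map<String, String> first = new HashMap<>();
        first.put("species", "human");
        first.put("tissue", "liver");
        Map<String, String> second = new HashMap<>();
        second.put("species", "mouse");
        second.put("tissue", "brain");

        Hashtable<String, Integer> rowHash = new Hashtable<>();
        rowHash.put("human", 0);
        rowHash.put("mouse", 1);
        Hashtable<String, Integer> columnHash = new Hashtable<>();
        columnHash.put("liver", 0);
        columnHash.put("brain", 1);

        int[][] matrix = indexMatrix(Arrays.asList(first, second), "species", rowHash, "tissue", columnHash);
        System.out.println(Arrays.deepToString(matrix));
    }
}
```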
