public class DataSet extends Object implements Iterable<Sequence>
This class stores a data set of Sequences. All Sequences
in a DataSet must share the same AlphabetContainer. The
Sequences may have different lengths.

For the internal representation, the class Sequence is used, where the
external alphabet is converted to integral numerical values. The class
DataSet knows about this coding via instances of the classes
AlphabetContainer and, accordingly, Alphabet.

There are different ways to access the elements of a
DataSet. For random access, use the method
getElementAt(int). For fast sequential access, it is recommended to
use a DataSet.ElementEnumerator.

DataSet is immutable.

See Also:
AlphabetContainer, Alphabet, Sequence

| Modifier and Type | Class and Description |
|---|---|
| static class | DataSet.ElementEnumerator - This class can be used to have a fast sequential access to a DataSet. |
| static class | DataSet.PartitionMethod - This enum defines different partition methods for a DataSet. |
| static class | DataSet.WeightedDataSetFactory |
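
The class overview states that the external alphabet is converted to integral numerical values. The following is a minimal sketch of that idea in plain Java, independent of the library. The class name SymbolCoder and the convention that a symbol's code is its position in the alphabet string are assumptions made for illustration; they are not the AlphabetContainer API.

```java
/** Hypothetical stand-in for an alphabet: maps external symbols to integer codes and back. */
final class SymbolCoder {
    // Assumption: the integer code of a symbol is its position in this string.
    private final String symbols;

    SymbolCoder(String symbols) {
        this.symbols = symbols;
    }

    /** Converts an external sequence such as "GATTACA" into integer codes. */
    int[] encode(String sequence) {
        int[] codes = new int[sequence.length()];
        for (int i = 0; i < sequence.length(); i++) {
            int code = symbols.indexOf(sequence.charAt(i));
            if (code < 0) {
                // Plays the role of a "wrong alphabet" error in this sketch.
                throw new IllegalArgumentException("symbol not in alphabet: " + sequence.charAt(i));
            }
            codes[i] = code;
        }
        return codes;
    }

    /** Converts integer codes back into the external representation. */
    String decode(int[] codes) {
        StringBuilder sb = new StringBuilder(codes.length);
        for (int code : codes) {
            sb.append(symbols.charAt(code));
        }
        return sb.toString();
    }
}
```

With the alphabet "ACGT", encoding and then decoding a sequence returns the original string, and a symbol outside the alphabet is rejected.
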
| Constructor and Description |
|---|
| DataSet(AlphabetContainer abc, AbstractStringExtractor se) |
| DataSet(AlphabetContainer abc, AbstractStringExtractor se, int subsequenceLength) - Creates a new DataSet from a StringExtractor using the given AlphabetContainer and all overlapping windows of length subsequenceLength. |
| DataSet(AlphabetContainer abc, AbstractStringExtractor se, String delim) - Creates a new DataSet from a StringExtractor using the given AlphabetContainer and a delimiter delim. |
| DataSet(AlphabetContainer abc, AbstractStringExtractor se, String delim, int subsequenceLength) - Creates a new DataSet from a StringExtractor using the given AlphabetContainer, the given delimiter delim and all overlapping windows of length subsequenceLength. |
| DataSet(AlphabetContainer abc, AbstractStringExtractor se, String delim, int subsequenceLength, double percentage) - Creates a new DataSet from a StringExtractor using the given AlphabetContainer, the given delimiter delim and all overlapping windows of length subsequenceLength; percentage is the percentage of Sequences allowed to be discarded due to WrongAlphabetException. |
| DataSet(DataSet s, int subsequenceLength) |
| DataSet(String annotation, Collection<Sequence> seqs) |
| DataSet(String annotation, Sequence... seqs) - Creates a new DataSet from an array of Sequences and a given annotation. This constructor is specially designed for the method StatisticalModel.emitDataSet(int, int...). |
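
Several constructors above take "all overlapping windows of length subsequenceLength". The following plain-Java sketch shows what such a window extraction looks like for a single string. The class and method names are hypothetical and not part of the library; the treatment of 0 follows the parameter description (the sequence is used as given), while rejecting negative or too large lengths with an IllegalArgumentException is an assumption standing in for WrongLengthException.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper illustrating overlapping windows; not the library's implementation. */
final class SlidingWindows {
    static List<String> windows(String sequence, int subsequenceLength) {
        List<String> result = new ArrayList<>();
        if (subsequenceLength == 0) {
            // Length 0 means: use the sequence as given.
            result.add(sequence);
            return result;
        }
        if (subsequenceLength < 0 || subsequenceLength > sequence.length()) {
            // Stand-in for WrongLengthException in this sketch.
            throw new IllegalArgumentException("unsupported window length: " + subsequenceLength);
        }
        // Slide the window one position at a time; consecutive windows overlap.
        for (int start = 0; start + subsequenceLength <= sequence.length(); start++) {
            result.add(sequence.substring(start, start + subsequenceLength));
        }
        return result;
    }
}
```

For the string "ACGTA" and a window length of 3, the sketch yields the three windows ACG, CGT and GTA.
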
| Modifier and Type | Method and Description |
|---|---|
| static DataSet | diff(DataSet data, DataSet... samples) |
| Sequence[] | getAllElements() |
| AlphabetContainer | getAlphabetContainer() - Returns the AlphabetContainer of this DataSet. |
| String | getAnnotation() - Returns some annotation of the DataSet. |
| static String | getAnnotation(DataSet... s) - Returns the annotation for an array of DataSets. |
| Hashtable<String,HashSet<String>> | getAnnotationTypesAndIdentifier() - This method returns all SequenceAnnotation types and the corresponding identifiers which occur in this DataSet. |
| double | getAverageElementLength() |
| DataSet | getCompositeDataSet(int[] starts, int[] lengths) |
| Sequence | getElementAt(int i) - This method returns the element, i.e. the Sequence, with index i. |
| int | getElementLength() - Returns the length of the elements, i.e. the Sequences, in this DataSet. |
| DataSet | getInfixDataSet(int start, int length) - This method enables you to use only an infix of all elements, i.e. the Sequences, in the current DataSet. |
| int | getMaximalElementLength() - Returns the maximal length of an element, i.e. a Sequence, in this DataSet. |
| int | getMinimalElementLength() - Returns the minimal length of an element, i.e. a Sequence, in this DataSet. |
| int | getNumberOfElements() - Returns the number of elements, i.e. the Sequences, in this DataSet. |
| int | getNumberOfElementsWithLength(int len) - Returns the number of overlapping elements that can be extracted. |
| double | getNumberOfElementsWithLength(int len, double[] weights) - Returns the weighted number of overlapping elements that can be extracted. |
| DataSet | getPartialDataSet(int[]... indexes) |
| DataSet | getPartialDataSet(int start, int end) |
| DataSet | getReverseComplementaryDataSet() |
| int[][] | getSequenceAnnotationIndexMatrix(String rowType, Hashtable<String,Integer> rowHash, String columnType, Hashtable<String,Integer> columnHash) - This method creates a matrix which contains the index of the Sequence with specific SequenceAnnotation combination or -1 if the DataSet does not contain any Sequence with such a combination. |
| DataSet | getSuffixDataSet(int start) - This method enables you to use only a suffix of all elements, i.e. the Sequences, in the current DataSet. |
| static DataSet | intersection(DataSet... samples) - This method computes the intersection between all elements/DataSets of the array, i.e. it returns a DataSet containing only Sequences that are contained in all DataSets of the array. |
| boolean | isDiscreteDataSet() - This method indicates if all positions use discrete values. |
| boolean | isSimpleDataSet() - This method indicates whether all random variables are defined over the same range. |
| Iterator<Sequence> | iterator() |
| DataSet[] | partition(DataSet.PartitionMethod method, double... percentage) - This method partitions the elements, i.e. the Sequences, of the DataSet in distinct parts where each part holds the corresponding percentage given in the array percentage. |
| DataSet[] | partition(DataSet.PartitionMethod method, int k) - This method partitions the elements, i.e. the Sequences, of the DataSet in k distinct parts. |
| Pair<DataSet[],double[][]> | partition(double[] sequenceWeights, DataSet.PartitionMethod method, double... percentage) - This method partitions the elements, i.e. the Sequences, of the DataSet and the corresponding weights in distinct parts where each part holds the corresponding percentage given in the array percentage. |
| Pair<DataSet[],double[][]> | partition(double[] sequenceWeights, DataSet.PartitionMethod method, int k) - This method partitions the elements, i.e. the Sequences, of the DataSet and the corresponding weights in k distinct parts. |
| Pair<DataSet,double[]> | resize(double[] weights, int subsequenceLength) - Returns a modified version of this data set with adjusted subsequence length. |
| void | save(File f) - This method writes the DataSet to a file f. |
| void | save(OutputStream stream, char commentChar, SequenceAnnotationParser p) |
| Pair<DataSet,double[]> | subSampling(double number, double[] weights) - Sub-samples sequences and corresponding weights from this DataSet. |
| DataSet | subSampling(int number) - Randomly samples elements, i.e. Sequences, from the set of all elements contained in this DataSet. |
| String | toString() |
| static DataSet | union(DataSet... s) - Unites all DataSets of the array s. |
| static DataSet | union(DataSet[] s, boolean[] in) |
| static Pair<DataSet,double[]> | union(DataSet[] s, double[][] weights, boolean[] in) |
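
The percentage-based partition methods in the summary above split the elements into distinct parts whose sizes follow an array of percentages. The sketch below illustrates the bookkeeping for a partition by number of elements on an ordinary list. It is an illustration under stated assumptions, not the library's implementation: the class name is hypothetical, the split is deterministic and keeps the input order, and IllegalStateException stands in for EmptyDataSetException.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a percentage-based split by number of elements. */
final class PercentagePartition {
    static <T> List<List<T>> partition(List<T> elements, double... percentage) {
        double sum = 0;
        for (double p : percentage) {
            if (p < 0 || p > 1) {
                throw new IllegalArgumentException("each percentage must lie in [0,1]: " + p);
            }
            sum += p;
        }
        // Small tolerance because the percentages are floating-point values.
        if (Math.abs(sum - 1.0) > 1e-9) {
            throw new IllegalArgumentException("percentages must sum to 1, got " + sum);
        }
        List<List<T>> parts = new ArrayList<>();
        double cumulative = 0;
        int start = 0;
        for (int i = 0; i < percentage.length; i++) {
            cumulative += percentage[i];
            // The last part takes whatever remains, so rounding never loses an element.
            int end = (i == percentage.length - 1)
                    ? elements.size()
                    : (int) Math.round(cumulative * elements.size());
            if (end <= start) {
                // Stand-in for EmptyDataSetException in this sketch.
                throw new IllegalStateException("partition " + i + " would be empty");
            }
            parts.add(new ArrayList<>(elements.subList(start, end)));
            start = end;
        }
        return parts;
    }
}
```

Splitting ten elements with the percentages 0.7 and 0.3 gives parts of seven and three elements.
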

Methods inherited from class java.lang.Object:
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait

Methods inherited from interface java.lang.Iterable:
forEach, spliterator

public DataSet(AlphabetContainer abc, AbstractStringExtractor se) throws WrongAlphabetException, EmptyDataSetException, WrongLengthException
Parameters:
abc - the AlphabetContainer
se - the StringExtractor
Throws:
WrongAlphabetException - if the AlphabetContainer is not suitable
EmptyDataSetException - if the DataSet would be empty
WrongLengthException - never happens (forwarded from DataSet(AlphabetContainer, AbstractStringExtractor, String, int))
See Also:
DataSet(AlphabetContainer, AbstractStringExtractor, String, int)

public DataSet(AlphabetContainer abc, AbstractStringExtractor se, int subsequenceLength) throws WrongAlphabetException, WrongLengthException, EmptyDataSetException
Creates a new DataSet from a StringExtractor using the given AlphabetContainer and all overlapping windows of length subsequenceLength.
Parameters:
abc - the AlphabetContainer
se - the StringExtractor
subsequenceLength - the length of the window sliding on the String of se; if subsequenceLength is 0 (zero), the Sequences are used as given by the StringExtractor
Throws:
WrongAlphabetException - if the AlphabetContainer is not suitable
WrongLengthException - if the subsequence length is not supported
EmptyDataSetException - if the DataSet would be empty
See Also:
DataSet(AlphabetContainer, AbstractStringExtractor, String, int)

public DataSet(AlphabetContainer abc, AbstractStringExtractor se, String delim) throws WrongAlphabetException, EmptyDataSetException, WrongLengthException
Creates a new DataSet from a StringExtractor using the given AlphabetContainer and a delimiter delim.
Parameters:
abc - the AlphabetContainer
se - the StringExtractor
delim - the delimiter for parsing the Strings
Throws:
WrongAlphabetException - if the AlphabetContainer is not suitable
EmptyDataSetException - if the DataSet would be empty
WrongLengthException - never happens (forwarded from DataSet(AlphabetContainer, AbstractStringExtractor, String, int))
See Also:
DataSet(AlphabetContainer, AbstractStringExtractor, String, int)

public DataSet(AlphabetContainer abc, AbstractStringExtractor se, String delim, int subsequenceLength) throws EmptyDataSetException, WrongAlphabetException, WrongLengthException
Creates a new DataSet from a StringExtractor using the given AlphabetContainer, the given delimiter delim and all overlapping windows of length subsequenceLength.
Parameters:
abc - the AlphabetContainer
se - the StringExtractor
delim - the delimiter for parsing the Strings
subsequenceLength - the length of the window sliding on the String of se; if subsequenceLength is 0 (zero), the Sequences are used as given by the StringExtractor
Throws:
WrongAlphabetException - if the AlphabetContainer is not suitable
EmptyDataSetException - if the DataSet would be empty
WrongLengthException - if the subsequence length is not supported

public DataSet(AlphabetContainer abc, AbstractStringExtractor se, String delim, int subsequenceLength, double percentage) throws EmptyDataSetException, WrongAlphabetException, WrongLengthException
Creates a new DataSet from a StringExtractor using the given AlphabetContainer, the given delimiter delim and all overlapping windows of length subsequenceLength.
Parameters:
abc - the AlphabetContainer
se - the StringExtractor
delim - the delimiter for parsing the Strings
subsequenceLength - the length of the window sliding on the String of se; if subsequenceLength is 0 (zero), the Sequences are used as given by the StringExtractor
percentage - the percentage of Sequences allowed to be discarded due to WrongAlphabetException; all other constructors set this value to 0
Throws:
WrongAlphabetException - if the AlphabetContainer is not suitable
EmptyDataSetException - if the DataSet would be empty
WrongLengthException - if the subsequence length is not supported

public DataSet(DataSet s, int subsequenceLength) throws WrongLengthException
Creates a new DataSet from a given DataSet and a given length subsequenceLength.
For a DataSet created with this constructor, the Sequences accessed via getElementAt(int) are real objects and do not have to be created at the invocation of the method. (The same holds for the DataSet.ElementEnumerator. In those cases both ways to access the Sequence are approximately equally fast.)
Parameters:
s - the given DataSet
subsequenceLength - the new element length
Throws:
WrongLengthException - if something is wrong with subsequenceLength

public DataSet(String annotation, Sequence... seqs) throws EmptyDataSetException, WrongAlphabetException
Creates a new DataSet from an array of Sequences and a given annotation. This constructor is specially designed for the method StatisticalModel.emitDataSet(int, int...).
Parameters:
annotation - the annotation of the DataSet
seqs - the Sequence(s)
Throws:
EmptyDataSetException - if the array seqs is null or the length is 0
WrongAlphabetException - if the AlphabetContainers do not match

public DataSet(String annotation, Collection<Sequence> seqs) throws EmptyDataSetException, WrongAlphabetException
Parameters:
annotation - the annotation of the DataSet
seqs - the Sequence(s)
Throws:
EmptyDataSetException - if the collection seqs is null or empty
WrongAlphabetException - if the AlphabetContainers do not match

public static final String getAnnotation(DataSet... s)
Returns the annotation for an array of DataSets.
Parameters:
s - an array of DataSets
See Also:
getAnnotation()

public static final DataSet diff(DataSet data, DataSet... samples) throws EmptyDataSetException, WrongAlphabetException
Parameters:
data - the minuend
samples - the subtrahends
Throws:
WrongAlphabetException - if the AlphabetContainers do not match, i.e., if the DataSets are from different domains
EmptyDataSetException - if the difference is empty

public static final DataSet intersection(DataSet... samples) throws IllegalArgumentException, EmptyDataSetException
This method computes the intersection between all elements/DataSets of the array, i.e. it returns a DataSet containing only Sequences that are contained in all DataSets of the array.
Parameters:
samples - the array of DataSets
Returns:
the intersection of the DataSets in the array
Throws:
IllegalArgumentException - if the elements of the array are from different domains
EmptyDataSetException - if the intersection is empty

public static final DataSet union(DataSet[] s, boolean[] in) throws IllegalArgumentException, EmptyDataSetException
Parameters:
s - the array of DataSets
in - an array indicating which DataSet is used in the union; if in[i]==true the DataSet s[i] is used
Returns:
the united DataSet
Throws:
IllegalArgumentException - if s.length != in.length or the Alphabets do not match
EmptyDataSetException - if the union is empty
See Also:
union(DataSet[], double[][], boolean[])

public static final DataSet union(DataSet... s) throws IllegalArgumentException
Unites all DataSets of the array s.
Parameters:
s - the array of DataSets
Returns:
the united DataSet
Throws:
IllegalArgumentException - if the Alphabets do not match
See Also:
union(DataSet[], boolean[])

public static final Pair<DataSet,double[]> union(DataSet[] s, double[][] weights, boolean[] in) throws IllegalArgumentException, EmptyDataSetException, WrongLengthException
Unites all DataSets of the array s regarding the array in and sets the element length in the united DataSet to subsequenceLength.
Parameters:
s - the array of DataSets
weights - the weights of the sequences in each data set, can be null
in - an array indicating which DataSet is used in the union; if in[i]==true the DataSet s[i] is used
Returns:
the united DataSet
Throws:
IllegalArgumentException - if s.length != in.length or the Alphabets do not match
EmptyDataSetException - if the union is empty
WrongLengthException - if the united DataSet does not support this subsequenceLength

public Sequence[] getAllElements()
Returns:
all elements (Sequences) of this DataSet
See Also:
DataSet.ElementEnumerator

public final AlphabetContainer getAlphabetContainer()
Returns the AlphabetContainer of this DataSet.
Returns:
the AlphabetContainer of this DataSet

public final String getAnnotation()
Returns some annotation of the DataSet.
Returns:
some annotation of the DataSet

public final DataSet getCompositeDataSet(int[] starts, int[] lengths) throws IllegalArgumentException
This method enables you to use only composite Sequences of all elements in the current DataSet. Each composite Sequence will be built from one corresponding Sequence in this DataSet, and all composite Sequences will be returned in a new DataSet.
Parameters:
starts - the start positions of the chunks
lengths - the lengths of the chunks
Returns:
the composite DataSet
Throws:
IllegalArgumentException - if either starts or lengths or both in combination are not suitable
See Also:
Sequence.getCompositeSequence(AlphabetContainer, int[], int[])

public Sequence getElementAt(int i)
public int getElementLength()
public double getAverageElementLength()
public final DataSet getPartialDataSet(int start, int end) throws EmptyDataSetException
Returns a DataSet that contains all elements of this DataSet that are specified by the supplied start (inclusive) and end (exclusive) indexes.
Parameters:
start - the index of the first Sequence
end - the index after the last Sequence
Returns:
the partial DataSet
Throws:
EmptyDataSetException - if start is equal to end

public final DataSet getPartialDataSet(int[]... indexes) throws EmptyDataSetException
Returns a DataSet that contains all elements of this DataSet that are specified by the supplied pairs of start and end indexes in indexes. Each indexes array must be of length 2, where the first entry specifies the start index of the first sequence within this DataSet (inclusive) and the second entry specifies the end index (exclusive). If some of the indexes specified in different indexes arrays overlap, the returned DataSet may contain duplicates of Sequences and may even be larger than this DataSet.
Parameters:
indexes - the indexes
Returns:
the partial DataSet
Throws:
EmptyDataSetException - if no indexes array is supplied or all ends are equal to the corresponding starts

public final DataSet getInfixDataSet(int start, int length) throws IllegalArgumentException
This method enables you to use only an infix of all elements, i.e. the Sequences, in the current DataSet. The subsequences will be returned in a new DataSet.
This method can also be used to create a DataSet of prefixes if the element length is not zero.
Parameters:
start - the start position of the infix
length - the length of the infix, has to be positive
Returns:
a DataSet of the specified infixes
Throws:
IllegalArgumentException - if either start or length or both in combination are not suitable

public DataSet getReverseComplementaryDataSet() throws OperationNotSupportedException
Throws:
OperationNotSupportedException - if the AlphabetContainer of any of the Sequences in this DataSet is not complementable

public int getMinimalElementLength()
public int getMaximalElementLength()
public int getNumberOfElements()
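
The length accessors above are simple aggregates over the element lengths, and the minimal length bounds the window length accepted when counting overlapping elements. The following sketch works on plain strings rather than Sequences. The class name is hypothetical, the array is assumed to be non-empty, and IllegalArgumentException stands in for WrongLengthException; the count assumes that an element of length n contributes n - len + 1 windows.

```java
/** Hypothetical sketch of length statistics over a non-empty array of elements. */
final class LengthSummary {
    static int minimalLength(String[] elements) {
        int min = Integer.MAX_VALUE;
        for (String e : elements) {
            min = Math.min(min, e.length());
        }
        return min;
    }

    static int maximalLength(String[] elements) {
        int max = 0;
        for (String e : elements) {
            max = Math.max(max, e.length());
        }
        return max;
    }

    static double averageLength(String[] elements) {
        long total = 0;
        for (String e : elements) {
            total += e.length();
        }
        return (double) total / elements.length;
    }

    /** Counts all overlapping windows of length len over all elements. */
    static int windowsOfLength(String[] elements, int len) {
        if (len <= 0 || len > minimalLength(elements)) {
            // Stand-in for WrongLengthException: len must not exceed the minimal element length.
            throw new IllegalArgumentException("unsupported length: " + len);
        }
        int count = 0;
        for (String e : elements) {
            count += e.length() - len + 1;
        }
        return count;
    }
}
```

For the elements ACGT, ACGTAC and ACG, the minimal length is 3, the maximal length is 6, and there are 2 + 4 + 1 = 7 windows of length 3.
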
public int getNumberOfElementsWithLength(int len)
throws WrongLengthException
Returns the number of overlapping elements that can be extracted.
Parameters:
len - the length of the elements
Throws:
WrongLengthException - if the given length is bigger than the minimal element length
See Also:
getNumberOfElementsWithLength(int, double[])

public double getNumberOfElementsWithLength(int len, double[] weights) throws WrongLengthException, IllegalArgumentException
Returns the weighted number of overlapping elements that can be extracted.
Parameters:
len - the length of the elements
weights - the weights of each element of the data set (see getElementAt(int)), can be null
Throws:
WrongLengthException - if the given length is bigger than the minimal element length
IllegalArgumentException - if the weights have a wrong dimension

public final DataSet getSuffixDataSet(int start) throws IllegalArgumentException
This method enables you to use only a suffix of all elements, i.e. the Sequences, in the current DataSet. The subsequences will be returned in a new DataSet.
Parameters:
start - the start position of the suffix
Returns:
a DataSet of the specified suffixes
Throws:
IllegalArgumentException - if start is not suitable

public final boolean isSimpleDataSet()
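This method indicates whether all random variables are defined over the same range.
Returns:
true if the DataSet is simple, false otherwise
See Also:
AlphabetContainer.isSimple()

public final boolean isDiscreteDataSet()
This method indicates if all positions use discrete values.
Returns:
true if the DataSet is discrete, false otherwise
See Also:
AlphabetContainer.isDiscrete()

public DataSet[] partition(DataSet.PartitionMethod method, double... percentage) throws IllegalArgumentException, EmptyDataSetException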
This method partitions the elements, i.e. the Sequences, of the DataSet in distinct parts where each part holds the corresponding percentage given in the array percentage.
Parameters:
method - the method how to partition the DataSet (partitioning criterion)
percentage - the array of percentages for each "subsample"
Returns:
the array of partitioned DataSets
Throws:
IllegalArgumentException - if something with the percentages is not correct (sum != 1 or one value is not in [0,1])
EmptyDataSetException - if at least one of the created partitions is empty
See Also:
DataSet.PartitionMethod, DataSet.PartitionMethod.PARTITION_BY_NUMBER_OF_ELEMENTS, DataSet.PartitionMethod.PARTITION_BY_NUMBER_OF_SYMBOLS

public Pair<DataSet[],double[][]> partition(double[] sequenceWeights, DataSet.PartitionMethod method, double... percentage) throws IllegalArgumentException, EmptyDataSetException
This method partitions the elements, i.e. the Sequences, of the DataSet and the corresponding weights in distinct parts where each part holds the corresponding percentage given in the array percentage.
Parameters:
sequenceWeights - the weights for the sequences (might be null)
method - the method how to partition the DataSet (partitioning criterion)
percentage - the array of percentages for each "subsample"
Returns:
a Pair containing an array of partitioned DataSets and an array of partitioned sequence weights
Throws:
IllegalArgumentException - if something with the percentages is not correct (sum != 1 or one value is not in [0,1])
EmptyDataSetException - if at least one of the created partitions is empty
See Also:
DataSet.PartitionMethod, DataSet.PartitionMethod.PARTITION_BY_NUMBER_OF_ELEMENTS, DataSet.PartitionMethod.PARTITION_BY_NUMBER_OF_SYMBOLS

public DataSet[] partition(DataSet.PartitionMethod method, int k) throws IllegalArgumentException, EmptyDataSetException
This method partitions the elements, i.e. the Sequences, of the DataSet in k distinct parts.
Parameters:
method - the method how to partition the DataSet (partitioning criterion)
k - the number of distinct parts
Returns:
the array of partitioned DataSets
Throws:
IllegalArgumentException - if k is not correct
EmptyDataSetException - if at least one of the created partitions is empty
See Also:
DataSet.PartitionMethod, DataSet.PartitionMethod.PARTITION_BY_NUMBER_OF_ELEMENTS, DataSet.PartitionMethod.PARTITION_BY_NUMBER_OF_SYMBOLS

public Pair<DataSet[],double[][]> partition(double[] sequenceWeights, DataSet.PartitionMethod method, int k) throws IllegalArgumentException, EmptyDataSetException
This method partitions the elements, i.e. the Sequences, of the DataSet and the corresponding weights in k distinct parts.
Parameters:
sequenceWeights - the weights for the sequences (might be null)
method - the method how to partition the DataSet (partitioning criterion)
k - the number of distinct parts
Returns:
a Pair containing an array of partitioned DataSets and an array of partitioned sequence weights
Throws:
IllegalArgumentException - if k is not correct
EmptyDataSetException - if at least one of the created partitions is empty
See Also:
DataSet.PartitionMethod, DataSet.PartitionMethod.PARTITION_BY_NUMBER_OF_ELEMENTS, DataSet.PartitionMethod.PARTITION_BY_NUMBER_OF_SYMBOLS

public DataSet subSampling(int number) throws EmptyDataSetException
Randomly samples elements, i.e. Sequences, from the set of all elements, i.e. the Sequences, contained in this DataSet. Depending on whether this DataSet is chosen to contain overlapping elements (windows of length subsequenceLength) or not, those elements (overlapping windows, whole sequences) are subsampled.
Parameters:
number - the number of Sequences that should be drawn from the contained set of Sequences (with replacement)
Returns:
a DataSet containing the drawn Sequences
Throws:
EmptyDataSetException - if number is not positive

public Pair<DataSet,double[]> subSampling(double number, double[] weights) throws EmptyDataSetException
Sub-samples sequences and corresponding weights from this DataSet. If weights are supplied, sequences are sampled until the sum of their weights exceeds the given number. Otherwise, number sequences are sampled.
Parameters:
number - the number (or total weight) of sampled sequences
weights - the weights for the sequences, in the same order as sequences in this DataSet
Returns:
a Pair of the sub-sampled DataSet and the corresponding weights; the weights are null if the supplied weights also have been null
Throws:
EmptyDataSetException - if the number has been too small to sample a single sequence

public final Pair<DataSet,double[]> resize(double[] weights, int subsequenceLength) throws WrongLengthException
Returns a modified version of this data set with adjusted subsequence length.
Parameters:
weights - the original weights, may be null
subsequenceLength - the new subsequence length
Throws:
WrongLengthException - if the supplied subsequence length is not possible for this data set (e.g., shorter than the shortest sequence but not 0)

public final void save(File f) throws IOException
This method writes the DataSet to a file f.
Parameters:
f - the File
Throws:
IOException - if something went wrong with the file
See Also:
save(OutputStream, char, SequenceAnnotationParser)

public final void save(OutputStream stream, char commentChar, SequenceAnnotationParser p) throws IOException
This method writes the Sequences including their SequenceAnnotations into an OutputStream. The SequenceAnnotations are parsed using the SequenceAnnotationParser.
Parameters:
stream - the stream which is used to write the DataSet
commentChar - the character that marks comment lines
p - the parser for the SequenceAnnotations of the Sequences
Throws:
IOException - if something went wrong while writing into the stream
See Also:
SequenceAnnotationParser.parseAnnotationToComment(char, SequenceAnnotation...)

public Hashtable<String,HashSet<String>> getAnnotationTypesAndIdentifier()
This method returns all SequenceAnnotation types and the corresponding identifiers which occur in this DataSet.
Returns:
a Hashtable with key = SequenceAnnotation type and value = the set of SequenceAnnotation identifiers
See Also:
SequenceAnnotation

public int[][] getSequenceAnnotationIndexMatrix(String rowType, Hashtable<String,Integer> rowHash, String columnType, Hashtable<String,Integer> columnHash)
This method creates a matrix which contains the index of the Sequence with specific SequenceAnnotation combination or -1 if the DataSet does not contain any Sequence with such a combination. The rows and columns are indexed according to the Hashtables.
int[][] matrix = s.getSequenceAnnotationIndexMatrix( rowType, rowHash, columnType, columnHash );
if( matrix[i][j] < 0 ) {
System.out.println( "There is no Sequence in the DataSet with this SequenceAnnotation combination");
} else {
System.out.println( "This is the Sequence: " + s.getElementAt( matrix[i][j] ) );
}
Parameters:
rowType - the SequenceAnnotation type for the rows
rowHash - a Hashtable of SequenceAnnotation identifiers and indices for the rows
columnType - the SequenceAnnotation type for the columns
columnHash - a Hashtable of SequenceAnnotation identifiers and indices for the columns
Returns:
a matrix containing the indices of the Sequences with each specific combination of SequenceAnnotation for rowType and columnType, and -1 if this combination does not exist in the DataSet
See Also:
getAnnotationTypesAndIdentifier(), ToolBox.parseHashSet2IndexHashtable(HashSet)
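
To make the layout of the index matrix concrete, the following plain-Java sketch fills such a matrix from two identifier arrays. It is an illustration only and not the library's implementation: the class name and the representation of the annotations as two parallel String arrays are hypothetical, the index maps are assumed to hold the values 0 to size-1, and when several elements share a combination the sketch simply keeps the last one.

```java
import java.util.Arrays;
import java.util.Map;

/** Hypothetical sketch of an index matrix over two annotation types. */
final class AnnotationIndexMatrix {
    /**
     * rowIds[n] and colIds[n] are the identifiers of element n for the row type
     * and the column type; rowHash and colHash map identifiers to matrix indices.
     */
    static int[][] build(String[] rowIds, String[] colIds,
                         Map<String, Integer> rowHash, Map<String, Integer> colHash) {
        int[][] matrix = new int[rowHash.size()][colHash.size()];
        for (int[] row : matrix) {
            Arrays.fill(row, -1); // -1 marks "no element with this combination"
        }
        for (int n = 0; n < rowIds.length; n++) {
            Integer r = rowHash.get(rowIds[n]);
            Integer c = colHash.get(colIds[n]);
            if (r != null && c != null) {
                matrix[r][c] = n; // assumption: the last element with a combination wins
            }
        }
        return matrix;
    }
}
```

A lookup then works as in the example above: a negative entry means that no element carries the combination, any other entry is the index of the element.
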