java.lang.Object
  de.jstacs.data.DataSet
public class DataSet
This is the class for any data set of Sequences. All Sequences
in a DataSet have to have the same AlphabetContainer. The
Sequences may have different lengths.
For the internal representation the class Sequence is used, where the
external alphabet is converted to integral numerical values. The class
DataSet knows about this coding via instances of class
AlphabetContainer and accordingly Alphabet.
There are different ways to access the elements of a
DataSet. If one needs random access there is the method
getElementAt(int). For fast sequential access it is recommended to
use a DataSet.ElementEnumerator.
DataSet is immutable.
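The conversion from an external alphabet to integral numerical values can be pictured with a small standalone sketch. It does not use the Jstacs classes: `DnaCoding`, `encode` and `decode` are hypothetical names, and the fixed DNA alphabet `ACGT` is an assumption made only for illustration.

```java
import java.util.Arrays;

// Standalone illustration; DnaCoding is a hypothetical helper, not a Jstacs class.
final class DnaCoding {
    // Assumed alphabet: the position of a symbol in this string is its integer code.
    private static final String SYMBOLS = "ACGT";

    // Converts external symbols to integer codes and rejects unknown symbols.
    static int[] encode(String s) {
        int[] codes = new int[s.length()];
        for (int i = 0; i < s.length(); i++) {
            int c = SYMBOLS.indexOf(Character.toUpperCase(s.charAt(i)));
            if (c < 0) {
                throw new IllegalArgumentException("symbol not in alphabet: " + s.charAt(i));
            }
            codes[i] = c;
        }
        return codes;
    }

    // Converts integer codes back to the external representation.
    static String decode(int[] codes) {
        StringBuilder sb = new StringBuilder(codes.length);
        for (int c : codes) {
            sb.append(SYMBOLS.charAt(c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] codes = encode("GATTACA");
        System.out.println(Arrays.toString(codes)); // prints [2, 0, 3, 3, 0, 1, 0]
        System.out.println(decode(codes));          // prints GATTACA
    }
}
```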
See Also: AlphabetContainer, Alphabet, Sequence

| Nested Class Summary | |
|---|---|
| static class | DataSet.ElementEnumerator: This class can be used for fast sequential access to a DataSet. |
| static class | DataSet.PartitionMethod: This enum defines different partition methods for a DataSet. |
| static class | DataSet.WeightedDataSetFactory: This class enables you to eliminate Sequences that occur more than once in one or more DataSets. |

| Constructor Summary | |
|---|---|
| DataSet(AlphabetContainer abc, AbstractStringExtractor se) | Creates a new DataSet from a StringExtractor using the given AlphabetContainer. |
| DataSet(AlphabetContainer abc, AbstractStringExtractor se, int subsequenceLength) | Creates a new DataSet from a StringExtractor using the given AlphabetContainer and all overlapping windows of length subsequenceLength. |
| DataSet(AlphabetContainer abc, AbstractStringExtractor se, String delim) | Creates a new DataSet from a StringExtractor using the given AlphabetContainer and a delimiter delim. |
| DataSet(AlphabetContainer abc, AbstractStringExtractor se, String delim, int subsequenceLength) | Creates a new DataSet from a StringExtractor using the given AlphabetContainer, the given delimiter delim and all overlapping windows of length subsequenceLength. |
| DataSet(DataSet s, int subsequenceLength) | Creates a new DataSet from a given DataSet and a given length subsequenceLength. |
| DataSet(String annotation, Sequence... seqs) | Creates a new DataSet from an array of Sequences and a given annotation. |

| Method Summary | |
|---|---|
| static DataSet | diff(DataSet data, DataSet... samples): This method computes the difference between the DataSet data and the DataSets samples. |
| Sequence[] | getAllElements(): Returns an array of Sequences containing all elements of this DataSet. |
| AlphabetContainer | getAlphabetContainer(): Returns the AlphabetContainer of this DataSet. |
| String | getAnnotation(): Returns some annotation of the DataSet. |
| static String | getAnnotation(DataSet... s): Returns the annotation for an array of DataSets. |
| Hashtable<String,HashSet<String>> | getAnnotationTypesAndIdentifier(): This method returns all SequenceAnnotation types and the corresponding identifiers which occur in this DataSet. |
| double | getAverageElementLength(): Returns the average length of all Sequences in this DataSet. |
| DataSet | getCompositeDataSet(int[] starts, int[] lengths): This method enables you to use only composite Sequences of all elements in the current DataSet. |
| Sequence | getElementAt(int i): This method returns the element, i.e. the Sequence, with index i. |
| int | getElementLength(): Returns the length of the elements, i.e. the Sequences, in this DataSet. |
| DataSet | getInfixDataSet(int start, int length): This method enables you to use only an infix of all elements, i.e. the Sequences, in the current DataSet. |
| int | getMaximalElementLength(): Returns the maximal length of an element, i.e. a Sequence, in this DataSet. |
| int | getMinimalElementLength(): Returns the minimal length of an element, i.e. a Sequence, in this DataSet. |
| int | getNumberOfElements(): Returns the number of elements, i.e. the Sequences, in this DataSet. |
| int | getNumberOfElementsWithLength(int len): Returns the number of overlapping elements that can be extracted. |
| double | getNumberOfElementsWithLength(int len, double[] weights): Returns the weighted number of overlapping elements that can be extracted. |
| DataSet | getReverseComplementaryDataSet(): Returns a DataSet that contains the reverse complement of all Sequences in this DataSet. |
| int[][] | getSequenceAnnotationIndexMatrix(String rowType, Hashtable<String,Integer> rowHash, String columnType, Hashtable<String,Integer> columnHash): This method creates a matrix which contains the index of the Sequence with a specific SequenceAnnotation combination, or -1 if the DataSet does not contain any Sequence with such a combination. |
| DataSet | getSuffixDataSet(int start): This method enables you to use only a suffix of all elements, i.e. the Sequences, in the current DataSet. |
| static DataSet | intersection(DataSet... samples): This method computes the intersection between all DataSets of the array, i.e. it returns a DataSet containing only Sequences that are contained in all DataSets of the array. |
| boolean | isDiscreteDataSet(): This method indicates if all positions use discrete values. |
| boolean | isSimpleDataSet(): This method indicates whether all random variables are defined over the same range, i.e. all positions use the same (fixed) alphabet. |
| Iterator<Sequence> | iterator() |
| DataSet[] | partition(DataSet.PartitionMethod method, double... percentage): This method partitions the elements, i.e. the Sequences, of the DataSet in distinct parts where each part holds the corresponding percentage given in the array percentage. |
| Pair<DataSet[],double[][]> | partition(double[] sequenceWeights, DataSet.PartitionMethod method, double... percentage): This method partitions the elements, i.e. the Sequences, of the DataSet and the corresponding weights in distinct parts where each part holds the corresponding percentage given in the array percentage. |
| Pair<DataSet[],double[][]> | partition(double[] sequenceWeights, int k, DataSet.PartitionMethod method): This method partitions the elements, i.e. the Sequences, of the DataSet and the corresponding weights in k distinct parts. |
| DataSet[] | partition(double p, DataSet.PartitionMethod method, int subsequenceLength): This method partitions the elements, i.e. the Sequences, of the DataSet in two distinct parts. |
| DataSet[] | partition(int k, DataSet.PartitionMethod method): This method partitions the elements, i.e. the Sequences, of the DataSet in k distinct parts. |
| void | save(File f): This method writes the DataSet to a file f. |
| void | save(OutputStream stream, char commentChar, SequenceAnnotationParser p): This method writes all Sequences including their SequenceAnnotations to an OutputStream. |
| DataSet | subSampling(int number): Randomly samples elements, i.e. Sequences, from the set of all elements, i.e. the Sequences, contained in this DataSet. |
| String | toString() |
| static DataSet | union(DataSet... s): Unites all DataSets of the array s. |
| static DataSet | union(DataSet[] s, boolean[] in): This method unites all DataSets of the array s regarding the array in. |
| static DataSet | union(DataSet[] s, boolean[] in, int subsequenceLength): This method unites all DataSets of the array s regarding the array in and sets the element length in the united DataSet to subsequenceLength. |
| static DataSet | union(DataSet[] s, int subsequenceLength): This method unites all DataSets of the array s and sets the element length in the united DataSet to subsequenceLength. |

| Methods inherited from class java.lang.Object |
|---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait |
| Constructor Detail |
|---|
public DataSet(AlphabetContainer abc,
AbstractStringExtractor se)
throws WrongAlphabetException,
EmptyDataSetException,
WrongLengthException
Creates a new DataSet from a StringExtractor
using the given AlphabetContainer.
Parameters:
abc - the AlphabetContainer
se - the StringExtractor
Throws:
WrongAlphabetException - if the AlphabetContainer is not suitable
EmptyDataSetException - if the DataSet would be empty
WrongLengthException - never happens (forwarded from DataSet(AlphabetContainer, AbstractStringExtractor, String, int))
See Also:
DataSet(AlphabetContainer, AbstractStringExtractor, String, int)
public DataSet(AlphabetContainer abc,
AbstractStringExtractor se,
int subsequenceLength)
throws WrongAlphabetException,
WrongLengthException,
EmptyDataSetException
Creates a new DataSet from a StringExtractor
using the given AlphabetContainer and all overlapping windows of
length subsequenceLength.
Parameters:
abc - the AlphabetContainer
se - the StringExtractor
subsequenceLength - the length of the window sliding on the String of se; if subsequenceLength is 0 (zero), the Sequences are used as given by the StringExtractor
Throws:
WrongAlphabetException - if the AlphabetContainer is not suitable
WrongLengthException - if the subsequence length is not supported
EmptyDataSetException - if the DataSet would be empty
See Also:
DataSet(AlphabetContainer, AbstractStringExtractor, String, int)
public DataSet(AlphabetContainer abc,
AbstractStringExtractor se,
String delim)
throws WrongAlphabetException,
EmptyDataSetException,
WrongLengthException
Creates a new DataSet from a StringExtractor
using the given AlphabetContainer and a delimiter
delim.
Parameters:
abc - the AlphabetContainer
se - the StringExtractor
delim - the delimiter for parsing the Strings
Throws:
WrongAlphabetException - if the AlphabetContainer is not suitable
EmptyDataSetException - if the DataSet would be empty
WrongLengthException - never happens (forwarded from DataSet(AlphabetContainer, AbstractStringExtractor, String, int))
See Also:
DataSet(AlphabetContainer, AbstractStringExtractor, String, int)
public DataSet(AlphabetContainer abc,
AbstractStringExtractor se,
String delim,
int subsequenceLength)
throws EmptyDataSetException,
WrongAlphabetException,
WrongLengthException
Creates a new DataSet from a StringExtractor
using the given AlphabetContainer, the given delimiter
delim and all overlapping windows of length
subsequenceLength.
Parameters:
abc - the AlphabetContainer
se - the StringExtractor
delim - the delimiter for parsing the Strings
subsequenceLength - the length of the window sliding on the String of se; if subsequenceLength is 0 (zero), the Sequences are used as given by the StringExtractor
Throws:
WrongAlphabetException - if the AlphabetContainer is not suitable
EmptyDataSetException - if the DataSet would be empty
WrongLengthException - if the subsequence length is not supported
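The "overlapping windows of length subsequenceLength" used by these constructors can be sketched in plain Java. This is only an illustration of the sliding-window idea on a `String`, not the Jstacs implementation: `WindowSketch` and `windows` are hypothetical names, and throwing `IllegalArgumentException` for an unsupported length is an assumption standing in for `WrongLengthException`.

```java
import java.util.ArrayList;
import java.util.List;

// Standalone illustration; WindowSketch is a hypothetical helper, not a Jstacs class.
final class WindowSketch {
    // Returns all overlapping windows of the given length.
    // A length of 0 keeps the sequence as given, mirroring the documented convention.
    static List<String> windows(String seq, int length) {
        if (length < 0 || length > seq.length()) {
            throw new IllegalArgumentException("unsupported window length: " + length);
        }
        List<String> result = new ArrayList<>();
        if (length == 0) {
            result.add(seq);
            return result;
        }
        // Slide the window one position at a time.
        for (int start = 0; start + length <= seq.length(); start++) {
            result.add(seq.substring(start, start + length));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(windows("ACGTA", 3)); // prints [ACG, CGT, GTA]
    }
}
```

A sequence of length n yields n - length + 1 windows, so a long input with a short window produces many heavily overlapping elements.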
public DataSet(DataSet s,
int subsequenceLength)
throws WrongLengthException
Creates a new DataSet from a given DataSet and a given
length subsequenceLength.
With this constructor, the elements returned by
getElementAt(int) are real objects and do not have to be created
at the invocation of the method. (The same holds for the
DataSet.ElementEnumerator. In those cases both ways to access the
Sequence are approximately equally fast.)
Parameters:
s - the given DataSet
subsequenceLength - the new element length
Throws:
WrongLengthException - if something is wrong with subsequenceLength
public DataSet(String annotation,
Sequence... seqs)
throws EmptyDataSetException,
WrongAlphabetException
Creates a new DataSet from an array of Sequences and a
given annotation.
Parameters:
annotation - the annotation of the DataSet
seqs - the Sequence(s)
Throws:
EmptyDataSetException - if the array seqs is null or the length is 0
WrongAlphabetException - if the AlphabetContainers do not match
See Also:
StatisticalModel.emitDataSet(int, int...)

| Method Detail |
|---|
public static final String getAnnotation(DataSet... s)
Returns the annotation for an array of DataSets.
Parameters:
s - an array of DataSets
See Also:
getAnnotation()
public static final DataSet diff(DataSet data,
DataSet... samples)
throws EmptyDataSetException,
WrongAlphabetException
This method computes the difference between the DataSet data and
the DataSets samples.
Parameters:
data - the minuend
samples - the subtrahends
Throws:
WrongAlphabetException - if the AlphabetContainers do not match, i.e., if the DataSets are from different domains
EmptyDataSetException - if the difference is empty
public static final DataSet intersection(DataSet... samples)
throws IllegalArgumentException,
EmptyDataSetException
This method computes the intersection between all DataSets
of the array, i.e. it returns a DataSet containing only
Sequences that are contained in all DataSets of the array.
Parameters:
samples - the array of DataSets
Returns:
the intersection of the DataSets in the array
Throws:
IllegalArgumentException - if the elements of the array are from different domains
EmptyDataSetException - if the intersection is empty
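The set semantics of diff and intersection can be illustrated on plain lists of strings. The sketch below is not the Jstacs implementation: `SetOpsSketch` is a hypothetical helper, sequences are compared by string equality, each shared sequence is reported once in the intersection, and `IllegalStateException` stands in for `EmptyDataSetException`. All of these are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Standalone illustration; SetOpsSketch is a hypothetical helper, not a Jstacs class.
final class SetOpsSketch {
    // Keeps only sequences that occur in every given list (order of the first list).
    static List<String> intersection(List<List<String>> samples) {
        if (samples.isEmpty()) {
            throw new IllegalArgumentException("no data sets given");
        }
        Set<String> common = new LinkedHashSet<>(samples.get(0));
        for (List<String> s : samples.subList(1, samples.size())) {
            common.retainAll(new HashSet<>(s));
        }
        if (common.isEmpty()) {
            throw new IllegalStateException("intersection is empty");
        }
        return new ArrayList<>(common);
    }

    // Removes from data (the minuend) every sequence that occurs in any subtrahend.
    static List<String> diff(List<String> data, List<List<String>> samples) {
        Set<String> remove = new HashSet<>();
        for (List<String> s : samples) {
            remove.addAll(s);
        }
        List<String> out = new ArrayList<>();
        for (String seq : data) {
            if (!remove.contains(seq)) {
                out.add(seq);
            }
        }
        if (out.isEmpty()) {
            throw new IllegalStateException("difference is empty");
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> a = List.of("ACG", "TTT", "GGA");
        List<String> b = List.of("TTT", "ACG");
        System.out.println(intersection(List.of(a, b))); // prints [ACG, TTT]
        System.out.println(diff(a, List.of(b)));         // prints [GGA]
    }
}
```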
public static final DataSet union(DataSet[] s,
boolean[] in)
throws IllegalArgumentException,
EmptyDataSetException
This method unites all DataSets of the array s
regarding the array in.
Parameters:
s - the array of DataSets
in - an array indicating which DataSet is used in the union; if in[i]==true the DataSet s[i] is used
Returns:
the united DataSet
Throws:
IllegalArgumentException - if s.length != in.length or the Alphabets do not match
EmptyDataSetException - if the union is empty
See Also:
union(DataSet[], boolean[], int)
public static final DataSet union(DataSet... s)
throws IllegalArgumentException
Unites all DataSets of the array s.
Parameters:
s - the array of DataSets
Returns:
the united DataSet
Throws:
IllegalArgumentException - if the Alphabets do not match
See Also:
union(DataSet[], boolean[])
public static final DataSet union(DataSet[] s,
boolean[] in,
int subsequenceLength)
throws IllegalArgumentException,
EmptyDataSetException,
WrongLengthException
This method unites all DataSets of the array s
regarding the array in and sets the element length in the
united DataSet to subsequenceLength.
Parameters:
s - the array of DataSets
in - an array indicating which DataSet is used in the union; if in[i]==true the DataSet s[i] is used
subsequenceLength - the length of the elements in the united DataSet
Returns:
the united DataSet
Throws:
IllegalArgumentException - if s.length != in.length or the Alphabets do not match
EmptyDataSetException - if the union is empty
WrongLengthException - if the united DataSet does not support this subsequenceLength
public static final DataSet union(DataSet[] s,
int subsequenceLength)
throws IllegalArgumentException,
WrongLengthException
This method unites all DataSets of the array s and
sets the element length in the united DataSet to
subsequenceLength.
Parameters:
s - the array of DataSets
subsequenceLength - the length of the elements in the united DataSet
Returns:
the united DataSet
Throws:
IllegalArgumentException - if the Alphabets do not match
WrongLengthException - if the united DataSet does not support this subsequenceLength
See Also:
union(DataSet[], boolean[], int)
public Sequence[] getAllElements()
Returns an array of Sequences containing all elements of this
DataSet.
Returns:
all elements (Sequences) of this DataSet
See Also:
DataSet.ElementEnumerator
public final AlphabetContainer getAlphabetContainer()
Returns the AlphabetContainer of this DataSet.
Returns:
the AlphabetContainer of this DataSet
public final String getAnnotation()
Returns some annotation of the DataSet.
Returns:
some annotation of the DataSet
public final DataSet getCompositeDataSet(int[] starts,
int[] lengths)
throws IllegalArgumentException
This method enables you to use only composite Sequences of all
elements in the current DataSet. Each composite Sequence
will be built from one corresponding Sequence in this
DataSet and all composite Sequences
will be returned in a new DataSet.
Parameters:
starts - the start positions of the chunks
lengths - the lengths of the chunks
Returns:
the composite DataSet
Throws:
IllegalArgumentException - if either starts or lengths or both in combination are not suitable
See Also:
Sequence.getCompositeSequence(AlphabetContainer, int[], int[])
public Sequence getElementAt(int i)
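This method returns the element, i.e. the Sequence, with index
i. See also the comment on the constructor DataSet(DataSet, int).
Parameters:
i - the index of the element, i.e. the Sequence
Returns:
the element, i.e. the Sequence, with index i
public int getElementLength()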
Returns the length of the elements, i.e. the Sequences, in this
DataSet.
Returns:
the length of the elements, i.e. the Sequences, in this DataSet
public double getAverageElementLength()
Returns the average length of all Sequences in this DataSet.
public final DataSet getInfixDataSet(int start,
int length)
throws IllegalArgumentException
This method enables you to use only an infix of all elements, i.e. the
Sequences, in the current DataSet. The subsequences will
be returned in a new DataSet.
This method can also be used to create a DataSet of prefixes if
the element length is not zero.
Parameters:
start - the start position of the infix
length - the length of the infix, has to be positive
Returns:
a DataSet of the specified infixes
Throws:
IllegalArgumentException - if either start or length or both in combination are not suitable
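As a rough picture of what taking an infix of every element means, the following standalone sketch cuts the same region out of each string in a list. `InfixSketch` and `infixes` are hypothetical names, and the exact validation rules are an assumption; the Jstacs method works on Sequence objects rather than strings.

```java
import java.util.ArrayList;
import java.util.List;

// Standalone illustration; InfixSketch is a hypothetical helper, not a Jstacs class.
final class InfixSketch {
    // Extracts the region [start, start + length) from every sequence.
    static List<String> infixes(List<String> seqs, int start, int length) {
        if (start < 0 || length <= 0) {
            throw new IllegalArgumentException("start must be >= 0 and length positive");
        }
        List<String> out = new ArrayList<>(seqs.size());
        for (String s : seqs) {
            if (start + length > s.length()) {
                throw new IllegalArgumentException("infix exceeds sequence of length " + s.length());
            }
            out.add(s.substring(start, start + length));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(infixes(List.of("ACGTAC", "TTGCA"), 1, 3)); // prints [CGT, TGC]
    }
}
```

With start set to 0 the same call yields prefixes of the given length.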
public DataSet getReverseComplementaryDataSet()
throws OperationNotSupportedException
Returns a DataSet that contains the reverse complement of all Sequences in
this DataSet.
Throws:
OperationNotSupportedException - if the AlphabetContainer of any of the Sequences in this DataSet is not complementable
public int getMinimalElementLength()
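Returns the minimal length of an element, i.e. a Sequence, in
this DataSet.
Returns:
the minimal length of an element, i.e. a Sequence, in this DataSet
public int getMaximalElementLength()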
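Returns the maximal length of an element, i.e. a Sequence, in
this DataSet.
Returns:
the maximal length of an element, i.e. a Sequence, in this DataSet
public int getNumberOfElements()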
Returns the number of elements, i.e. the Sequences, in this
DataSet.
Returns:
the number of elements, i.e. the Sequences, in this DataSet
public Iterator<Sequence> iterator()
Specified by:
iterator in interface Iterable<Sequence>
public int getNumberOfElementsWithLength(int len)
throws WrongLengthException
Returns the number of overlapping elements that can be extracted.
Parameters:
len - the length of the elements
Throws:
WrongLengthException - if the given length is bigger than the minimal element length
See Also:
getNumberOfElementsWithLength(int, double[])
public double getNumberOfElementsWithLength(int len,
double[] weights)
throws WrongLengthException,
IllegalArgumentException
Returns the weighted number of overlapping elements that can be extracted.
Parameters:
len - the length of the elements
weights - the weights of each element of the DataSet (see getElementAt(int)), can be null
Throws:
WrongLengthException - if the given length is bigger than the minimal element length
IllegalArgumentException - if the weights have a wrong dimension
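One plausible reading of these two counts is sketched below: a sequence of length n contributes n - len + 1 overlapping windows, and in the weighted variant each contribution is multiplied by the weight of its sequence. This formula is an assumption drawn from the descriptions above, not a statement about the Jstacs source; `WindowCountSketch` is a hypothetical helper that works on sequence lengths only.

```java
// Standalone illustration; WindowCountSketch is a hypothetical helper, not a Jstacs class.
final class WindowCountSketch {
    // Sums weight_i * (length_i - len + 1) over all sequences.
    // A null weights array is treated as weight 1 for every sequence.
    static double count(int[] seqLengths, int len, double[] weights) {
        if (weights != null && weights.length != seqLengths.length) {
            throw new IllegalArgumentException("weights have a wrong dimension");
        }
        double total = 0;
        for (int i = 0; i < seqLengths.length; i++) {
            if (len > seqLengths[i]) {
                throw new IllegalArgumentException("len exceeds the minimal element length");
            }
            double w = (weights == null) ? 1.0 : weights[i];
            total += w * (seqLengths[i] - len + 1);
        }
        return total;
    }

    public static void main(String[] args) {
        int[] lengths = {5, 7, 4};
        System.out.println(count(lengths, 3, null));                        // prints 10.0
        System.out.println(count(lengths, 3, new double[] {1.0, 0.5, 2.0})); // prints 9.5
    }
}
```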
public final DataSet getSuffixDataSet(int start)
throws IllegalArgumentException
This method enables you to use only a suffix of all elements, i.e. the
Sequences, in the current DataSet. The subsequences will be
returned in a new DataSet.
Parameters:
start - the start position of the suffix
Returns:
a DataSet of the specified suffixes
Throws:
IllegalArgumentException - if start is not suitable
public final boolean isSimpleDataSet()
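This method indicates whether all random variables are defined over the same range, i.e. all positions use the same (fixed) alphabet.
Returns:
true if the DataSet is simple, false otherwise
See Also:
AlphabetContainer.isSimple()
public final boolean isDiscreteDataSet()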
This method indicates if all positions use discrete values.
Returns:
true if the DataSet is discrete, false otherwise
See Also:
AlphabetContainer.isDiscrete()
public DataSet[] partition(double p,
DataSet.PartitionMethod method,
int subsequenceLength)
throws WrongLengthException,
UnsupportedOperationException,
EmptyDataSetException
This method partitions the elements, i.e. the Sequences, of the
DataSet in two distinct parts. The second part (test sample) holds
the percentage p, the first part holds the rest (train sample). The
first part has the same element length as the current DataSet, the second
has element length subsequenceLength, which might be
necessary for testing.
Parameters:
p - the percentage for the second part; the second part holds at least this percentage of the full DataSet
method - the method used to partition the DataSet (partitioning criterion)
subsequenceLength - the element length of the second part; if 0 (zero), the sequences are used as given in this DataSet
Returns:
the array of partitioned DataSets
Throws:
WrongLengthException - if something is wrong with subsequenceLength
UnsupportedOperationException - if the DataSet is not simple
EmptyDataSetException - if at least one of the created partitions is empty
See Also:
DataSet.PartitionMethod,
DataSet.PartitionMethod.PARTITION_BY_NUMBER_OF_ELEMENTS,
DataSet.PartitionMethod.PARTITION_BY_NUMBER_OF_SYMBOLS,
partition(PartitionMethod, double...)
public DataSet[] partition(DataSet.PartitionMethod method,
double... percentage)
throws IllegalArgumentException,
EmptyDataSetException
This method partitions the elements, i.e. the Sequences, of the
DataSet in distinct parts where each part holds the corresponding
percentage given in the array percentage.
Parameters:
method - the method used to partition the DataSet (partitioning criterion)
percentage - the array of percentages for each part
Returns:
the array of partitioned DataSets
Throws:
IllegalArgumentException - if something with the percentages is not correct (sum != 1 or one value is not in [0,1])
EmptyDataSetException - if at least one of the created partitions is empty
See Also:
DataSet.PartitionMethod,
DataSet.PartitionMethod.PARTITION_BY_NUMBER_OF_ELEMENTS,
DataSet.PartitionMethod.PARTITION_BY_NUMBER_OF_SYMBOLS
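A percentage-based split by number of elements might look like the following standalone sketch. It is not the Jstacs algorithm: `PercentagePartitionSketch` is a hypothetical helper, shuffling with a caller-supplied `Random` and rounding the cumulative boundaries are assumptions, and `IllegalStateException` stands in for `EmptyDataSetException`. The validation mirrors the documented constraints (each value in [0,1], sum equal to 1).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Standalone illustration; PercentagePartitionSketch is a hypothetical helper, not a Jstacs class.
final class PercentagePartitionSketch {
    static List<List<String>> partition(List<String> seqs, Random rnd, double... percentage) {
        double sum = 0;
        for (double p : percentage) {
            if (p < 0 || p > 1) {
                throw new IllegalArgumentException("percentage not in [0,1]: " + p);
            }
            sum += p;
        }
        if (Math.abs(sum - 1.0) > 1e-9) {
            throw new IllegalArgumentException("percentages do not sum to 1");
        }
        // Shuffle a copy so the input list stays untouched.
        List<String> shuffled = new ArrayList<>(seqs);
        Collections.shuffle(shuffled, rnd);
        List<List<String>> parts = new ArrayList<>();
        int begin = 0;
        double cumulative = 0;
        for (double p : percentage) {
            cumulative += p;
            // The cumulative boundary avoids losing elements to per-part rounding.
            int end = (int) Math.round(cumulative * shuffled.size());
            if (end <= begin) {
                throw new IllegalStateException("a partition would be empty");
            }
            parts.add(new ArrayList<>(shuffled.subList(begin, end)));
            begin = end;
        }
        return parts;
    }

    public static void main(String[] args) {
        List<String> seqs = List.of("AA", "AC", "AG", "AT", "CA", "CC", "CG", "CT", "GA", "GC");
        List<List<String>> parts = partition(seqs, new Random(42), 0.7, 0.3);
        System.out.println(parts.get(0).size() + " / " + parts.get(1).size()); // prints 7 / 3
    }
}
```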
public Pair<DataSet[],double[][]> partition(double[] sequenceWeights,
DataSet.PartitionMethod method,
double... percentage)
throws IllegalArgumentException,
EmptyDataSetException
This method partitions the elements, i.e. the Sequences, of the
DataSet and the corresponding weights in distinct parts where each part holds the corresponding
percentage given in the array percentage.
Parameters:
sequenceWeights - the weights for the sequences (might be null)
method - the method used to partition the DataSet (partitioning criterion)
percentage - the array of percentages for each part
Returns:
a Pair containing an array of partitioned DataSets and an array of partitioned sequence weights
Throws:
IllegalArgumentException - if something with the percentages is not correct (sum != 1 or one value is not in [0,1])
EmptyDataSetException - if at least one of the created partitions is empty
See Also:
DataSet.PartitionMethod,
DataSet.PartitionMethod.PARTITION_BY_NUMBER_OF_ELEMENTS,
DataSet.PartitionMethod.PARTITION_BY_NUMBER_OF_SYMBOLS
public DataSet[] partition(int k,
DataSet.PartitionMethod method)
throws IllegalArgumentException,
EmptyDataSetException
This method partitions the elements, i.e. the Sequences, of the
DataSet in k distinct parts.
Parameters:
k - the number of distinct parts
method - the method used to partition the DataSet (partitioning criterion)
Returns:
the array of partitioned DataSets
Throws:
IllegalArgumentException - if k is not correct
EmptyDataSetException - if at least one of the created partitions is empty
See Also:
DataSet.PartitionMethod,
DataSet.PartitionMethod.PARTITION_BY_NUMBER_OF_ELEMENTS,
DataSet.PartitionMethod.PARTITION_BY_NUMBER_OF_SYMBOLS
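To show how a split by number of symbols can differ from a split by number of elements, the sketch below distributes sequences over k parts so that each part receives roughly the same total length. This is only one possible interpretation of such a criterion and is not taken from the Jstacs source: `SymbolPartitionSketch` is a hypothetical helper, shuffling is omitted to keep the result reproducible, and `IllegalStateException` stands in for `EmptyDataSetException`.

```java
import java.util.ArrayList;
import java.util.List;

// Standalone illustration; SymbolPartitionSketch is a hypothetical helper, not a Jstacs class.
final class SymbolPartitionSketch {
    static List<List<String>> partition(List<String> seqs, int k) {
        if (k < 1 || k > seqs.size()) {
            throw new IllegalArgumentException("k is not correct: " + k);
        }
        long total = 0;
        for (String s : seqs) {
            total += s.length();
        }
        List<List<String>> parts = new ArrayList<>();
        for (int j = 0; j < k; j++) {
            parts.add(new ArrayList<>());
        }
        long seen = 0;
        int part = 0;
        for (String s : seqs) {
            parts.get(part).add(s);
            seen += s.length();
            // Move on once the running symbol count reaches this part's share of the total.
            while (part < k - 1 && seen * k >= total * (part + 1)) {
                part++;
            }
        }
        for (List<String> p : parts) {
            if (p.isEmpty()) {
                throw new IllegalStateException("a partition would be empty");
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        // 16 symbols in total: one long sequence balances three short ones.
        System.out.println(partition(List.of("AAAAAAAA", "CC", "GG", "TTTT"), 2));
        // prints [[AAAAAAAA], [CC, GG, TTTT]]
    }
}
```

Splitting the same input by number of elements would instead put two sequences in each part, giving 10 and 6 symbols.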
public Pair<DataSet[],double[][]> partition(double[] sequenceWeights,
int k,
DataSet.PartitionMethod method)
throws IllegalArgumentException,
EmptyDataSetException
This method partitions the elements, i.e. the Sequences, of the
DataSet and the corresponding weights in k distinct parts.
Parameters:
sequenceWeights - the weights for the sequences (might be null)
k - the number of distinct parts
method - the method used to partition the DataSet (partitioning criterion)
Returns:
a Pair containing an array of partitioned DataSets and an array of partitioned sequence weights
Throws:
IllegalArgumentException - if k is not correct
EmptyDataSetException - if at least one of the created partitions is empty
See Also:
DataSet.PartitionMethod,
DataSet.PartitionMethod.PARTITION_BY_NUMBER_OF_ELEMENTS,
DataSet.PartitionMethod.PARTITION_BY_NUMBER_OF_SYMBOLS
public DataSet subSampling(int number)
throws EmptyDataSetException
Randomly samples elements, i.e. Sequences, from the set of all
elements, i.e. the Sequences, contained in this DataSet.
Depending on whether this DataSet is chosen to contain overlapping
elements (windows of length subsequenceLength) or not, those
elements (overlapping windows, whole sequences) are subsampled.
Parameters:
number - the number of Sequences that should be drawn from the contained set of Sequences (with replacement)
Returns:
a DataSet containing the drawn Sequences
Throws:
EmptyDataSetException - if number is not positive
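Drawing with replacement means the same element can appear several times in the result, and the result may be larger than the source. The standalone sketch below shows the idea on a list of strings; `SubSamplingSketch` is a hypothetical helper, the `Random` parameter is added only to make runs reproducible, and `IllegalArgumentException` stands in for `EmptyDataSetException`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Standalone illustration; SubSamplingSketch is a hypothetical helper, not a Jstacs class.
final class SubSamplingSketch {
    // Draws 'number' elements uniformly at random, with replacement.
    static List<String> subSample(List<String> seqs, int number, Random rnd) {
        if (number <= 0) {
            throw new IllegalArgumentException("number must be positive");
        }
        if (seqs.isEmpty()) {
            throw new IllegalArgumentException("nothing to sample from");
        }
        List<String> drawn = new ArrayList<>(number);
        for (int i = 0; i < number; i++) {
            drawn.add(seqs.get(rnd.nextInt(seqs.size())));
        }
        return drawn;
    }

    public static void main(String[] args) {
        List<String> seqs = List.of("ACG", "TTT", "GGA");
        // Ten draws from three elements necessarily repeat some of them.
        System.out.println(subSample(seqs, 10, new Random(1)));
    }
}
```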
public final void save(File f)
throws IOException
This method writes the DataSet to a file f.
Parameters:
f - the File
Throws:
IOException - if something went wrong with the file
See Also:
save(OutputStream, char, SequenceAnnotationParser)
public final void save(OutputStream stream,
char commentChar,
SequenceAnnotationParser p)
throws IOException
This method writes all Sequences including their
SequenceAnnotations to an OutputStream. The
SequenceAnnotations are parsed using the
SequenceAnnotationParser.
Parameters:
stream - the stream which is used to write the DataSet
commentChar - the character that marks comment lines
p - the parser for the SequenceAnnotations of the Sequences
Throws:
IOException - if something went wrong while writing into the stream
See Also:
SequenceAnnotationParser.parseAnnotationToComment(char, SequenceAnnotation...)
public String toString()
Overrides:
toString in class Object
public Hashtable<String,HashSet<String>> getAnnotationTypesAndIdentifier()
This method returns all SequenceAnnotation types and the corresponding
identifiers which occur in this DataSet.
Returns:
a Hashtable with key = SequenceAnnotation type and value = the SequenceAnnotation identifiers of that type
See Also:
SequenceAnnotation
public int[][] getSequenceAnnotationIndexMatrix(String rowType,
Hashtable<String,Integer> rowHash,
String columnType,
Hashtable<String,Integer> columnHash)
This method creates a matrix which contains the index of the Sequence with a specific SequenceAnnotation
combination, or -1 if the DataSet does not contain any Sequence with such a combination. The rows and
columns are indexed according to the Hashtables.
int[][] matrix = s.getSequenceAnnotationIndexMatrix( rowType, rowHash, columnType, columnHash );
if( matrix[i][j] < 0 ) {
System.out.println( "There is no Sequence in the DataSet with this SequenceAnnotation combination");
} else {
System.out.println( "This is the Sequence: " + s.getElementAt( matrix[i][j] ) );
}
Parameters:
rowType - the SequenceAnnotation type for the rows
rowHash - a Hashtable of SequenceAnnotation identifiers and indices for the rows
columnType - the SequenceAnnotation type for the columns
columnHash - a Hashtable of SequenceAnnotation identifiers and indices for the columns
Returns:
a matrix with the indices of the Sequences with each specific combination of SequenceAnnotation for rowType and columnType, and -1 if this combination does not exist in the DataSet
See Also:
getAnnotationTypesAndIdentifier(),
ToolBox.parseHashSet2IndexHashtable(HashSet)
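How such an index matrix could be assembled is sketched below with plain Java collections. This is an illustration under stated assumptions, not the Jstacs implementation: `AnnotationMatrixSketch` is a hypothetical helper, each sequence is represented only by a map from annotation type to identifier, and the indices stored in the two hashes are assumed to run from 0 to size - 1.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Standalone illustration; AnnotationMatrixSketch is a hypothetical helper, not a Jstacs class.
final class AnnotationMatrixSketch {
    // annotations.get(n) maps annotation type -> identifier for sequence n.
    static int[][] indexMatrix(List<Map<String, String>> annotations,
                               String rowType, Map<String, Integer> rowHash,
                               String columnType, Map<String, Integer> columnHash) {
        int[][] matrix = new int[rowHash.size()][columnHash.size()];
        for (int[] row : matrix) {
            Arrays.fill(row, -1); // -1 marks "no sequence with this combination"
        }
        for (int n = 0; n < annotations.size(); n++) {
            String rowId = annotations.get(n).get(rowType);
            String colId = annotations.get(n).get(columnType);
            if (rowId == null || colId == null) {
                continue; // sequence lacks one of the two annotation types
            }
            Integer r = rowHash.get(rowId);
            Integer c = columnHash.get(colId);
            if (r != null && c != null) {
                matrix[r][c] = n;
            }
        }
        return matrix;
    }

    public static void main(String[] args) {
        List<Map<String, String>> annotations = List.of(
            Map.of("species", "human", "tissue", "liver"),
            Map.of("species", "mouse", "tissue", "brain"));
        int[][] m = indexMatrix(annotations,
            "species", Map.of("human", 0, "mouse", 1),
            "tissue", Map.of("liver", 0, "brain", 1));
        System.out.println(Arrays.deepToString(m)); // prints [[0, -1], [-1, 1]]
    }
}
```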