de.jstacs.data
Class Sample

java.lang.Object
  extended by de.jstacs.data.Sample

public class Sample
extends Object

This class represents a sample of Sequences. All Sequences in a Sample must share the same AlphabetContainer, but they may have different lengths.
Internally, each element is stored as a Sequence, in which the symbols of the external alphabet are converted to integer values. The class Sample knows about this coding through its AlphabetContainer and the Alphabets it contains.
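
To make the idea of the internal coding concrete, here is a minimal sketch in plain Java. It is not the Jstacs implementation: it assumes a DNA alphabet A, C, G, T coded as 0 to 3, and the names AlphabetCodingSketch, encode and decode are hypothetical.

```java
import java.util.Arrays;

// Hypothetical stand-in for the coding step; not part of de.jstacs.
class AlphabetCodingSketch {
    // Assumed alphabet: the position of a symbol in this string is its integer code.
    static final String SYMBOLS = "ACGT";

    // Converts external symbols to integer codes.
    static int[] encode(String sequence) {
        int[] codes = new int[sequence.length()];
        for (int i = 0; i < sequence.length(); i++) {
            int code = SYMBOLS.indexOf(sequence.charAt(i));
            if (code < 0) {
                throw new IllegalArgumentException("symbol not in alphabet: " + sequence.charAt(i));
            }
            codes[i] = code;
        }
        return codes;
    }

    // Converts integer codes back to external symbols.
    static String decode(int[] codes) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            sb.append(SYMBOLS.charAt(code));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] codes = encode("GATTACA");
        System.out.println(Arrays.toString(codes)); // [2, 0, 3, 3, 0, 1, 0]
        System.out.println(decode(codes));          // GATTACA
    }
}
```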

There are different ways to access the elements of a Sample. For random access, use the method getElementAt(int). For fast sequential access, a Sample.ElementEnumerator is recommended.

Sample is immutable.

Author:
Jens Keilwagen
See Also:
AlphabetContainer, Alphabet, Sequence

Nested Class Summary
static class Sample.ElementEnumerator
          This class provides fast sequential access to a Sample.
static class Sample.PartitionMethod
          This enum defines different partition methods for a Sample.
static class Sample.WeightedSampleFactory
          This class enables you to eliminate Sequences that occur more than once in one or more Samples.
 
Constructor Summary
Sample(AlphabetContainer abc, AbstractStringExtractor se)
          Creates a new Sample from a StringExtractor using the given AlphabetContainer.
Sample(AlphabetContainer abc, AbstractStringExtractor se, int subsequenceLength)
          Creates a new Sample from a StringExtractor using the given AlphabetContainer and all overlapping windows of length subsequenceLength.
Sample(AlphabetContainer abc, AbstractStringExtractor se, String delim)
          Creates a new Sample from a StringExtractor using the given AlphabetContainer and a delimiter delim.
Sample(AlphabetContainer abc, AbstractStringExtractor se, String delim, int subsequenceLength)
          Creates a new Sample from a StringExtractor using the given AlphabetContainer, the given delimiter delim and all overlapping windows of length subsequenceLength.
Sample(Sample s, int subsequenceLength)
          Creates a new Sample from a given Sample and a given length subsequenceLength.
Sample(String annotation, Sequence... seqs)
          Creates a new Sample from an array of Sequences and a given annotation.
 
Method Summary
 Sequence[] getAllElements()
          Returns an array of Sequences containing all elements of this Sample.
 AlphabetContainer getAlphabetContainer()
          Returns the AlphabetContainer of this Sample.
 String getAnnotation()
          Returns some annotation of the Sample.
static String getAnnotation(Sample... s)
          Returns the annotation for an array of Samples.
 Sample getCompositeSample(int[] starts, int[] lengths)
          This method enables you to use only composite Sequences of all elements in the current Sample.
 Sequence getElementAt(int i)
          This method returns the element, i.e. the Sequence, with index i.
 int getElementLength()
          Returns the length of the elements, i.e. the Sequences, in this Sample.
 Sample getInfixSample(int start, int length)
          This method enables you to use only an infix of all elements, i.e. the Sequences, in the current Sample.
 int getMaximalElementLength()
          Returns the maximal length of an element, i.e. a Sequence, in this Sample.
 int getMinimalElementLength()
          Returns the minimal length of an element, i.e. a Sequence, in this Sample.
 int getNumberOfElements()
          Returns the number of elements, i.e. the Sequences, in this Sample.
 int getNumberOfElementsWithLength(int len)
          Returns the number of overlapping elements that can be extracted.
 Sample getSuffixSample(int start)
          This method enables you to use only a suffix of all elements, i.e. the Sequences, in the current Sample.
static Sample intersection(Sample... samples)
          This method computes the intersection between all elements/Samples of the array, i.e. it returns a Sample containing only Sequences that are contained in all Samples of the array.
 boolean isDiscreteSample()
          This method indicates if all positions use discrete values.
 boolean isSimpleSample()
          This method indicates whether all random variables are defined over the same range, i.e. all positions use the same (fixed) alphabet.
 Sample[] partition(double p, Sample.PartitionMethod method, int subsequenceLength)
          This method partitions the elements, i.e. the Sequences, of the Sample in two distinct parts.
 Sample[] partition(int k, Sample.PartitionMethod method)
          This method partitions the elements, i.e. the Sequences, of the Sample in k distinct parts.
 Sample[] partition(Sample.PartitionMethod method, double... percentage)
          This method partitions the elements, i.e. the Sequences, of the Sample in distinct parts where each part holds the corresponding percentage given in the array percentage.
 void save(String msg, File f)
          This method writes a message msg and the Sample to a file f.
 Sample subSampling(int number)
          Randomly samples elements, i.e. Sequences, from the set of all elements contained in this Sample.
 String toString()
           
static Sample union(Sample... s)
          Unites all Samples of the array s.
static Sample union(Sample[] s, boolean[] in)
          This method unites all Samples of the array s regarding the array in.
static Sample union(Sample[] s, boolean[] in, int subsequenceLength)
          This method unites all Samples of the array s regarding the array in and sets the element length in the united Sample to subsequenceLength.
static Sample union(Sample[] s, int subsequenceLength)
          This method unites all Samples of the array s and sets the element length in the united sample to subsequenceLength.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

Sample

public Sample(AlphabetContainer abc,
              AbstractStringExtractor se)
       throws WrongAlphabetException,
              EmptySampleException,
              WrongLengthException
Creates a new Sample from a StringExtractor using the given AlphabetContainer.

Parameters:
abc - the AlphabetContainer
se - the StringExtractor
Throws:
WrongAlphabetException - if the AlphabetContainer is not suitable
EmptySampleException - if the Sample would be empty
WrongLengthException - never happens (forwarded from Sample(AlphabetContainer, AbstractStringExtractor, String, int))
See Also:
Sample(AlphabetContainer, AbstractStringExtractor, String, int)

Sample

public Sample(AlphabetContainer abc,
              AbstractStringExtractor se,
              int subsequenceLength)
       throws WrongAlphabetException,
              WrongLengthException,
              EmptySampleException
Creates a new Sample from a StringExtractor using the given AlphabetContainer and all overlapping windows of length subsequenceLength.

Parameters:
abc - the AlphabetContainer
se - the StringExtractor
subsequenceLength - the length of the window sliding over the Strings of se; if subsequenceLength is 0 (zero), the Sequences are used as given by the StringExtractor
Throws:
WrongAlphabetException - if the AlphabetContainer is not suitable
WrongLengthException - if the subsequence length is not supported
EmptySampleException - if the Sample would be empty
See Also:
Sample(AlphabetContainer, AbstractStringExtractor, String, int)
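
The effect of subsequenceLength can be illustrated with a minimal sketch in plain Java. It only models the behaviour described above; it does not use or reproduce the Jstacs classes. Plain Strings stand in for Sequences, IllegalArgumentException stands in for WrongLengthException, and the names SlidingWindowSketch and windows are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of overlapping windows; not part of de.jstacs.
class SlidingWindowSketch {
    // Returns all overlapping windows of the given length; 0 keeps the string whole.
    static List<String> windows(String s, int subsequenceLength) {
        List<String> result = new ArrayList<String>();
        if (subsequenceLength == 0) {
            result.add(s);
            return result;
        }
        if (subsequenceLength < 0 || subsequenceLength > s.length()) {
            throw new IllegalArgumentException("unsupported subsequence length: " + subsequenceLength);
        }
        // The window start moves one position at a time, so neighbours overlap.
        for (int start = 0; start + subsequenceLength <= s.length(); start++) {
            result.add(s.substring(start, start + subsequenceLength));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(windows("ACGTA", 3)); // [ACG, CGT, GTA]
        System.out.println(windows("ACGTA", 0)); // [ACGTA]
    }
}
```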

Sample

public Sample(AlphabetContainer abc,
              AbstractStringExtractor se,
              String delim)
       throws WrongAlphabetException,
              EmptySampleException,
              WrongLengthException
Creates a new Sample from a StringExtractor using the given AlphabetContainer and a delimiter delim.

Parameters:
abc - the AlphabetContainer
se - the StringExtractor
delim - the delimiter for parsing the Strings
Throws:
WrongAlphabetException - if the AlphabetContainer is not suitable
EmptySampleException - if the Sample would be empty
WrongLengthException - never happens (forwarded from Sample(AlphabetContainer, AbstractStringExtractor, String, int))
See Also:
Sample(AlphabetContainer, AbstractStringExtractor, String, int)

Sample

public Sample(AlphabetContainer abc,
              AbstractStringExtractor se,
              String delim,
              int subsequenceLength)
       throws EmptySampleException,
              WrongAlphabetException,
              WrongLengthException
Creates a new Sample from a StringExtractor using the given AlphabetContainer, the given delimiter delim and all overlapping windows of length subsequenceLength.

Parameters:
abc - the AlphabetContainer
se - the StringExtractor
delim - the delimiter for parsing the Strings
subsequenceLength - the length of the window sliding over the Strings of se; if subsequenceLength is 0 (zero), the Sequences are used as given by the StringExtractor
Throws:
WrongAlphabetException - if the AlphabetContainer is not suitable
EmptySampleException - if the Sample would be empty
WrongLengthException - if the subsequence length is not supported
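
One plausible reading of the delimiter is that it separates the symbols within each String before they are coded. The sketch below shows that reading in plain Java. It is an assumption, not the Jstacs parsing code: in particular, treating an empty delimiter as "one character per symbol" is a guess, and the names DelimiterSketch and tokenize are hypothetical.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical illustration of delimiter-based parsing; not part of de.jstacs.
class DelimiterSketch {
    // Splits one line into symbols.
    // Assumption: an empty delimiter means that every character is one symbol.
    static String[] tokenize(String line, String delim) {
        if (delim.isEmpty()) {
            String[] symbols = new String[line.length()];
            for (int i = 0; i < line.length(); i++) {
                symbols[i] = String.valueOf(line.charAt(i));
            }
            return symbols;
        }
        // Pattern.quote makes split treat the delimiter literally, not as a regular expression.
        return line.split(Pattern.quote(delim));
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokenize("0.5;1.25;3.0", ";"))); // [0.5, 1.25, 3.0]
        System.out.println(Arrays.toString(tokenize("ACG", "")));           // [A, C, G]
    }
}
```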

Sample

public Sample(Sample s,
              int subsequenceLength)
       throws WrongLengthException
Creates a new Sample from a given Sample and a given length subsequenceLength.
This constructor enables you to use subsequences of the elements of a Sample.

It can also be used to ensure that all sequences that can be accessed by getElementAt(int) are real objects and do not have to be created at the invocation of the method. (The same holds for the Sample.ElementEnumerator. In those cases both ways to access the Sequence are approximately equally fast.)

Parameters:
s - the given Sample
subsequenceLength - the new element length
Throws:
WrongLengthException - if something is wrong with subsequenceLength
See Also:
Sample(Sample, int, boolean)

Sample

public Sample(String annotation,
              Sequence... seqs)
       throws EmptySampleException,
              IllegalArgumentException
Creates a new Sample from an array of Sequences and a given annotation.
This constructor is specially designed for the method Model.emitSample(int, int...).

Parameters:
annotation - the annotation of the Sample
seqs - the Sequence(s)
Throws:
EmptySampleException - if the array seqs is null or the length is 0
IllegalArgumentException - if the Alphabets do not match

Method Detail

getAnnotation

public static final String getAnnotation(Sample... s)
Returns the annotation for an array of Samples.

Parameters:
s - an array of Samples
Returns:
the annotation
See Also:
getAnnotation()

intersection

public static final Sample intersection(Sample... samples)
                                 throws IllegalArgumentException,
                                        EmptySampleException
This method computes the intersection between all elements/Samples of the array, i.e. it returns a Sample containing only Sequences that are contained in all Samples of the array.

Parameters:
samples - the array of Samples
Returns:
the intersection of the elements/Samples in the array
Throws:
IllegalArgumentException - if the elements of the array are from different domains
EmptySampleException - if the intersection is empty
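
The intersection semantics can be sketched in plain Java as follows. This is an illustration under stated assumptions, not the Jstacs implementation: lists of Strings stand in for Samples, duplicates are collapsed, IllegalStateException stands in for EmptySampleException, and the name IntersectionSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of the intersection of several samples; not part of de.jstacs.
class IntersectionSketch {
    // Keeps only the sequences that occur in every list, in the order of the first list.
    static List<String> intersection(List<List<String>> samples) {
        if (samples.isEmpty()) {
            throw new IllegalArgumentException("no samples given");
        }
        Set<String> common = new LinkedHashSet<String>(samples.get(0));
        for (int i = 1; i < samples.size(); i++) {
            // retainAll drops every sequence that is missing from the i-th sample.
            common.retainAll(samples.get(i));
        }
        if (common.isEmpty()) {
            // Stand-in for the EmptySampleException of the documented method.
            throw new IllegalStateException("the intersection is empty");
        }
        return new ArrayList<String>(common);
    }
}
```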

union

public static final Sample union(Sample[] s,
                                 boolean[] in)
                          throws IllegalArgumentException,
                                 EmptySampleException
This method unites all Samples of the array s regarding the array in.

Parameters:
s - the array of Samples
in - an array indicating which Samples are used in the union; if in[i]==true, the Sample s[i] is used
Returns:
the united Sample
Throws:
IllegalArgumentException - if s.length != in.length or the Alphabets do not match
EmptySampleException - if the union is empty
See Also:
union(Sample[], boolean[], int)
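
The role of the array in can be sketched in plain Java as follows. This is an illustration only: lists of Strings stand in for Samples, the union is assumed to be a simple concatenation that keeps duplicates, IllegalStateException stands in for EmptySampleException, and the name UnionSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of a masked union; not part of de.jstacs.
class UnionSketch {
    // Concatenates every s.get(i) for which in[i] is true.
    static List<String> union(List<List<String>> s, boolean[] in) {
        if (s.size() != in.length) {
            throw new IllegalArgumentException("s and in must have the same length");
        }
        List<String> united = new ArrayList<String>();
        for (int i = 0; i < in.length; i++) {
            if (in[i]) {
                united.addAll(s.get(i));
            }
        }
        if (united.isEmpty()) {
            // Stand-in for the EmptySampleException of the documented method.
            throw new IllegalStateException("the union is empty");
        }
        return united;
    }
}
```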

union

public static final Sample union(Sample... s)
                          throws IllegalArgumentException
Unites all Samples of the array s.

Parameters:
s - the array of Samples
Returns:
the united Sample
Throws:
IllegalArgumentException - if the Alphabets do not match
See Also:
union(Sample[], boolean[])

union

public static final Sample union(Sample[] s,
                                 boolean[] in,
                                 int subsequenceLength)
                          throws IllegalArgumentException,
                                 EmptySampleException,
                                 WrongLengthException
This method unites all Samples of the array s regarding the array in and sets the element length in the united Sample to subsequenceLength.

Parameters:
s - the array of Samples
in - an array indicating which Samples are used in the union; if in[i]==true, the Sample s[i] is used
subsequenceLength - the length of the elements in the united Sample
Returns:
the united Sample
Throws:
IllegalArgumentException - if s.length != in.length or the Alphabets do not match
EmptySampleException - if the union is empty
WrongLengthException - if the united Sample does not support this subsequenceLength

union

public static final Sample union(Sample[] s,
                                 int subsequenceLength)
                          throws IllegalArgumentException,
                                 WrongLengthException
This method unites all Samples of the array s and sets the element length in the united sample to subsequenceLength.

Parameters:
s - the array of Samples
subsequenceLength - the length of the elements in the united Sample
Returns:
the united Sample
Throws:
IllegalArgumentException - if the Alphabets do not match
WrongLengthException - if the united Sample does not support this subsequenceLength
See Also:
union(Sample[], boolean[], int)

getAllElements

public Sequence[] getAllElements()
Returns an array of Sequences containing all elements of this Sample.

Returns:
all elements (Sequences) of this Sample
See Also:
Sample.ElementEnumerator

getAlphabetContainer

public final AlphabetContainer getAlphabetContainer()
Returns the AlphabetContainer of this Sample.

Returns:
the AlphabetContainer of this Sample

getAnnotation

public final String getAnnotation()
Returns some annotation of the Sample.

Returns:
some annotation of the Sample

getCompositeSample

public final Sample getCompositeSample(int[] starts,
                                       int[] lengths)
                                throws IllegalArgumentException
This method enables you to use only composite Sequences of all elements in the current Sample. Each composite Sequence will be built from one corresponding Sequence in this Sample, and all composite Sequences will be returned in a new Sample.

Parameters:
starts - the start positions of the chunks
lengths - the lengths of the chunks
Returns:
a composite Sample
Throws:
IllegalArgumentException - if either starts or lengths or both in combination are not suitable
See Also:
Sequence.getCompositeSequence(AlphabetContainer, int[], int[])
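
How starts and lengths select the chunks of one element can be sketched in plain Java. This is an illustration, not the Jstacs code: a String stands in for a Sequence, positions are assumed to be 0-based, and the names CompositeSketch and composite are hypothetical.

```java
// Hypothetical illustration of building a composite sequence; not part of de.jstacs.
class CompositeSketch {
    // Concatenates the chunks s[starts[i], starts[i] + lengths[i]) in the given order.
    static String composite(String s, int[] starts, int[] lengths) {
        if (starts.length != lengths.length) {
            throw new IllegalArgumentException("starts and lengths must have the same length");
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < starts.length; i++) {
            int end = starts[i] + lengths[i];
            if (starts[i] < 0 || lengths[i] <= 0 || end > s.length()) {
                throw new IllegalArgumentException("chunk " + i + " does not fit the sequence");
            }
            sb.append(s, starts[i], end);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Chunks: positions 0..1 ("AC") and positions 5..7 ("CGT").
        System.out.println(composite("ACGTACGT", new int[] {0, 5}, new int[] {2, 3})); // ACCGT
    }
}
```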

getElementAt

public Sequence getElementAt(int i)
This method returns the element, i.e. the Sequence, with index i. See also the comment on the constructor Sample(Sample, int).

Parameters:
i - the index of the element, i.e. the Sequence
Returns:
the element, i.e. the Sequence, with index i

getElementLength

public int getElementLength()
Returns the length of the elements, i.e. the Sequences, in this Sample.

Returns:
the length of the elements, i.e. the Sequences, in this Sample

getInfixSample

public final Sample getInfixSample(int start,
                                   int length)
                            throws IllegalArgumentException
This method enables you to use only an infix of all elements, i.e. the Sequences, in the current Sample. The subsequences will be returned in a new Sample.

This method can also be used to create a Sample of prefixes if the element length is not zero.

Parameters:
start - the start position of the infix
length - the length of the infix, has to be positive
Returns:
a Sample of the specified infixes
Throws:
IllegalArgumentException - if either start or length or both in combination are not suitable
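
The infix extraction can be sketched in plain Java as follows. This is an illustration only: Strings stand in for Sequences, positions are assumed to be 0-based, and the names InfixSketch and infixes are hypothetical. Calling it with start 0 yields prefixes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of taking the same infix of every element; not part of de.jstacs.
class InfixSketch {
    // Returns, for every sequence, the part [start, start + length).
    static List<String> infixes(List<String> sequences, int start, int length) {
        if (start < 0 || length <= 0) {
            throw new IllegalArgumentException("start must be non-negative and length positive");
        }
        List<String> result = new ArrayList<String>();
        for (String s : sequences) {
            if (start + length > s.length()) {
                throw new IllegalArgumentException("infix does not fit a sequence of length " + s.length());
            }
            result.add(s.substring(start, start + length));
        }
        return result;
    }
}
```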

getMinimalElementLength

public int getMinimalElementLength()
Returns the minimal length of an element, i.e. a Sequence, in this Sample.

Returns:
the minimal length of an element, i.e. a Sequence, in this Sample

getMaximalElementLength

public int getMaximalElementLength()
Returns the maximal length of an element, i.e. a Sequence, in this Sample.

Returns:
the maximal length of an element, i.e. a Sequence, in this Sample

getNumberOfElements

public int getNumberOfElements()
Returns the number of elements, i.e. the Sequences, in this Sample.

Returns:
the number of elements, i.e. the Sequences, in this Sample

getNumberOfElementsWithLength

public int getNumberOfElementsWithLength(int len)
                                  throws WrongLengthException
Returns the number of overlapping elements that can be extracted.

Parameters:
len - the length of the elements
Returns:
the number of elements with the specified length
Throws:
WrongLengthException - if the given length is bigger than the minimal element length
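
As a worked illustration of the count: a sequence of length n contains n - len + 1 overlapping windows of length len, so the total is the sum of this value over all elements. The plain-Java sketch below assumes exactly this rule. It is not the Jstacs code, it uses IllegalArgumentException in place of WrongLengthException, and the names WindowCountSketch and numberOfWindows are hypothetical.

```java
// Hypothetical illustration of counting overlapping windows; not part of de.jstacs.
class WindowCountSketch {
    // Sums (length - len + 1) over all sequence lengths.
    static int numberOfWindows(int[] sequenceLengths, int len) {
        if (len < 1) {
            throw new IllegalArgumentException("len must be positive in this sketch");
        }
        int count = 0;
        for (int length : sequenceLengths) {
            if (len > length) {
                // Corresponds to len being bigger than the minimal element length.
                throw new IllegalArgumentException("len exceeds a sequence of length " + length);
            }
            count += length - len + 1;
        }
        return count;
    }

    public static void main(String[] args) {
        // Lengths 5, 7 and 4 with len 3 give 3 + 5 + 2 windows.
        System.out.println(numberOfWindows(new int[] {5, 7, 4}, 3)); // 10
    }
}
```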

getSuffixSample

public final Sample getSuffixSample(int start)
                             throws IllegalArgumentException
This method enables you to use only a suffix of all elements, i.e. the Sequences, in the current Sample. The subsequences will be returned in a new Sample.

Parameters:
start - the start position of the suffix
Returns:
a Sample of specified suffixes
Throws:
IllegalArgumentException - if start is not suitable

isSimpleSample

public final boolean isSimpleSample()
This method indicates whether all random variables are defined over the same range, i.e. all positions use the same (fixed) alphabet.

Returns:
true if the Sample is simple, false otherwise
See Also:
AlphabetContainer.isSimple()

isDiscreteSample

public final boolean isDiscreteSample()
This method indicates if all positions use discrete values.

Returns:
true if the Sample is discrete, false otherwise
See Also:
AlphabetContainer.isDiscrete()

partition

public Sample[] partition(double p,
                          Sample.PartitionMethod method,
                          int subsequenceLength)
                   throws WrongLengthException,
                          UnsupportedOperationException,
                          EmptySampleException
This method partitions the elements, i.e. the Sequences, of the Sample in two distinct parts. The second part (test sample) holds the percentage p of the Sample, the first part (train sample) holds the rest. The first part has the same element length as the current Sample, the second has element length subsequenceLength, which might be necessary for testing.

Parameters:
p - the percentage for the second part; the second part holds at least this percentage of the full Sample
method - the method used to partition the Sample (partitioning criterion)
subsequenceLength - the element length of the second part; if 0 (zero), the sequences are used as given in this Sample
Returns:
the array of partitioned Samples
Throws:
WrongLengthException - if something is wrong with subsequenceLength
UnsupportedOperationException - if the Sample is not simple
EmptySampleException - if at least one of the created partitions is empty
See Also:
Sample.PartitionMethod, Sample.PartitionMethod.PARTITION_BY_NUMBER_OF_ELEMENTS, Sample.PartitionMethod.PARTITION_BY_NUMBER_OF_SYMBOLS, partition(PartitionMethod, double...), setSubsequenceLength(int)
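
A two-way split by number of elements can be sketched in plain Java as follows. This is an illustration under assumptions, not the Jstacs implementation: Strings stand in for Sequences, "at least this percentage" is modelled by rounding the test size up, the partition method and subsequenceLength are ignored, IllegalStateException stands in for EmptySampleException, and the name TwoWayPartitionSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical illustration of a train/test split; not part of de.jstacs.
class TwoWayPartitionSketch {
    // Returns [train, test]; the test part holds at least the fraction p of all elements.
    static List<List<String>> partition(List<String> elements, double p, Random rng) {
        if (p < 0 || p > 1) {
            throw new IllegalArgumentException("p must be in [0,1]");
        }
        List<String> shuffled = new ArrayList<String>(elements);
        Collections.shuffle(shuffled, rng);
        // Rounding up guarantees "at least" the requested share.
        int testSize = (int) Math.ceil(p * shuffled.size());
        int trainSize = shuffled.size() - testSize;
        if (testSize == 0 || trainSize == 0) {
            throw new IllegalStateException("a partition would be empty");
        }
        List<List<String>> parts = new ArrayList<List<String>>();
        parts.add(new ArrayList<String>(shuffled.subList(0, trainSize)));
        parts.add(new ArrayList<String>(shuffled.subList(trainSize, shuffled.size())));
        return parts;
    }
}
```

With five elements and p = 0.3, the sketch puts two elements into the test part and three into the train part.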

partition

public Sample[] partition(Sample.PartitionMethod method,
                          double... percentage)
                   throws IllegalArgumentException,
                          EmptySampleException
This method partitions the elements, i.e. the Sequences, of the Sample in distinct parts where each part holds the corresponding percentage given in the array percentage.

Parameters:
method - the method used to partition the Sample (partitioning criterion)
percentage - the array of percentages for each "subsample"
Returns:
the array of partitioned Samples
Throws:
IllegalArgumentException - if the percentages are not valid (sum != 1 or a value is not in [0,1])
EmptySampleException - if at least one of the created partitions is empty
See Also:
Sample.PartitionMethod, Sample.PartitionMethod.PARTITION_BY_NUMBER_OF_ELEMENTS, Sample.PartitionMethod.PARTITION_BY_NUMBER_OF_SYMBOLS
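
The validation of the percentages and the resulting part sizes can be sketched in plain Java. This is an illustration under assumptions, not the Jstacs code: the sizes are computed by number of elements with cumulative rounding, the sum is compared to 1 with a small tolerance, IllegalStateException stands in for EmptySampleException, and the names PercentagePartitionSketch and partSizes are hypothetical.

```java
// Hypothetical illustration of percentage-based part sizes; not part of de.jstacs.
class PercentagePartitionSketch {
    // Returns how many of n elements each part receives.
    static int[] partSizes(int n, double... percentage) {
        double sum = 0;
        for (double p : percentage) {
            if (p < 0 || p > 1) {
                throw new IllegalArgumentException("percentage not in [0,1]: " + p);
            }
            sum += p;
        }
        // A tolerance absorbs floating-point rounding in the sum.
        if (Math.abs(sum - 1.0) > 1e-9) {
            throw new IllegalArgumentException("percentages must sum to 1");
        }
        int[] sizes = new int[percentage.length];
        double cumulative = 0;
        int assigned = 0;
        for (int i = 0; i < percentage.length; i++) {
            cumulative += percentage[i];
            // The last part takes all remaining elements, so nothing is lost to rounding.
            int end = (i == percentage.length - 1) ? n : (int) Math.round(cumulative * n);
            sizes[i] = end - assigned;
            assigned = end;
            if (sizes[i] == 0) {
                throw new IllegalStateException("part " + i + " would be empty");
            }
        }
        return sizes;
    }
}
```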

partition

public Sample[] partition(int k,
                          Sample.PartitionMethod method)
                   throws IllegalArgumentException,
                          EmptySampleException
This method partitions the elements, i.e. the Sequences, of the Sample in k distinct parts.

Parameters:
k - the number of distinct parts
method - the method used to partition the Sample (partitioning criterion)
Returns:
the array of partitioned Samples
Throws:
IllegalArgumentException - if k is not correct
EmptySampleException - if at least one of the created partitions is empty
See Also:
Sample.PartitionMethod, Sample.PartitionMethod.PARTITION_BY_NUMBER_OF_ELEMENTS, Sample.PartitionMethod.PARTITION_BY_NUMBER_OF_SYMBOLS
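
A split into k parts of nearly equal size can be sketched in plain Java as follows. This is an illustration under assumptions, not the Jstacs implementation: Strings stand in for Sequences, the elements are shuffled and then dealt out in turn, the partition method is ignored, IllegalStateException stands in for EmptySampleException, and the name KPartitionSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical illustration of a k-way partition; not part of de.jstacs.
class KPartitionSketch {
    // Returns k disjoint parts whose sizes differ by at most one.
    static List<List<String>> partition(List<String> elements, int k, Random rng) {
        if (k < 1) {
            throw new IllegalArgumentException("k must be at least 1");
        }
        if (k > elements.size()) {
            throw new IllegalStateException("at least one part would be empty");
        }
        List<String> shuffled = new ArrayList<String>(elements);
        Collections.shuffle(shuffled, rng);
        List<List<String>> parts = new ArrayList<List<String>>();
        for (int i = 0; i < k; i++) {
            parts.add(new ArrayList<String>());
        }
        // Dealing the shuffled elements out in turn keeps the parts balanced.
        for (int i = 0; i < shuffled.size(); i++) {
            parts.get(i % k).add(shuffled.get(i));
        }
        return parts;
    }
}
```

Parts of this kind are what a k-fold cross-validation would typically iterate over, using one part for testing and the union of the others for training.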

subSampling

public Sample subSampling(int number)
                   throws EmptySampleException
Randomly samples elements, i.e. Sequences, from the set of all elements, i.e. the Sequences, contained in this Sample.
Depending on whether this Sample is chosen to contain overlapping elements (windows of length subsequenceLength) or not, those elements (overlapping windows, whole sequences) are subsampled.

Parameters:
number - the number of Sequences that should be drawn from the contained set of Sequences (with replacement)
Returns:
a new Sample containing the drawn Sequences
Throws:
EmptySampleException - if number is not positive
See Also:
Sample(AlphabetContainer, Sequence[], int, String)
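
Drawing with replacement can be sketched in plain Java as follows. This is an illustration only, not the Jstacs code: Strings stand in for Sequences (or windows), IllegalArgumentException stands in for EmptySampleException, and the names SubSamplingSketch and subSample are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical illustration of subsampling with replacement; not part of de.jstacs.
class SubSamplingSketch {
    // Draws `number` elements uniformly at random; the same element may be drawn repeatedly.
    static List<String> subSample(List<String> elements, int number, Random rng) {
        if (number <= 0) {
            throw new IllegalArgumentException("number must be positive");
        }
        if (elements.isEmpty()) {
            throw new IllegalArgumentException("nothing to draw from");
        }
        List<String> drawn = new ArrayList<String>(number);
        for (int i = 0; i < number; i++) {
            drawn.add(elements.get(rng.nextInt(elements.size())));
        }
        return drawn;
    }
}
```

Because the draw is with replacement, number may be larger than the number of available elements.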

save

public final void save(String msg,
                       File f)
                throws IOException
This method writes a message msg and the Sample to a file f.

Parameters:
msg - the message, any information
f - the File
Throws:
IOException - if something went wrong with the file

toString

public String toString()
Overrides:
toString in class Object