by Jan Grau, Stefan Posch, Ivo Grosse, and Jens Keilwagen.
Protein binding microarrays (PBMs) are valuable for elucidating the binding affinity of transcription factors to short DNA sequences in vitro. However, learning accurate models of transcription factor binding from these data is still a challenging problem in bioinformatics. Here, we present a novel approach for analyzing PBM data based on a combination of discriminative learning of a ZOOPS model and an appropriate soft-labeling of the probe sequences.
We implement this approach as an extension of Dispom and evaluate it on the data of Dream5 challenge 2, a rigorous benchmark for the analysis of PBM data. We find an improved overall performance compared to the participants of the challenge. Besides discriminative learning of model parameters, one reason for the superior performance of Dispom is that it considers dependencies between adjacent positions of the binding sites. This suggests exploiting dependencies between adjacent base pairs in transcription factor binding sites whenever enough training data are available. Another important property is that this novel approach is robust to typical artifacts of PBM experiments, which facilitates its application to PBM data without the need for prior normalization.
The paper "Accurate prediction of protein binding microarray data by discriminative de-novo motif discovery" has been submitted to ISMB 2011.
We provide two binaries: one for training a model from PBM data, comprising sequences and associated intensities, and one for predicting intensities for given probe sequences using a trained model. This way, training can be run on, for example, a compute server, while the computationally less expensive predictions can be made on an ordinary workstation.
- Binaries for training and prediction as ZIP-archive
- Sources of the binaries (building them requires the Jstacs 1.4 sources)
Trains a model from PBM data in Dream5-format (see Dream5 homepage) and stores the trained model to a file.
- Run by calling
java -jar Dream5.jar
- home ... home directory (The home directory where the data reside, default = .)
- file ... input file (The input file in Dream5 format (column 1: TF, column 2: array type, column 3: sequences, column 4: signal, last column: flag), one file per TF and array type, path relative to home directory)
- mo ... motif order (The order of the inhomogeneous Markov model for the motif, default = 1)
- fo ... flanking order (The order of the homogeneous Markov model for flanking sequence and background, default = 3)
- starts ... starts (The number of starts of the optimization, default = 5)
- threads ... threads (The number of threads, i.e. cores, that are used for optimization, default = 1)
- q ... q (A-priori fraction of data points with weight greater than 0.5, default = 0.025)
- model ... model (File where the trained model is stored as XML, path relative to home directory, default = model.xml)
java -jar Dream5.jar file=TF1_HK.txt starts=1 threads=2 model=mymodel.xml
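Before starting a long training run, it can help to check that an input file matches the column layout given for the `file` parameter above. The following is a minimal sketch of reading one data line, not code taken from Dispom. The class `Dream5LineParser`, the `Probe` type and its field names are hypothetical, and tab-separated columns are an assumption, since this page does not state the delimiter or whether the file has a header line.

```java
public class Dream5LineParser {

    /** One probe measurement. Field names are illustrative, not taken from Dispom. */
    public static final class Probe {
        public final String tf;         // column 1: transcription factor
        public final String arrayType;  // column 2: array type
        public final String sequence;   // column 3: probe sequence
        public final double signal;     // column 4: measured intensity
        public final String flag;       // last column: flag

        Probe(String tf, String arrayType, String sequence, double signal, String flag) {
            this.tf = tf;
            this.arrayType = arrayType;
            this.sequence = sequence;
            this.signal = signal;
            this.flag = flag;
        }
    }

    /**
     * Parses one data line. ASSUMPTION: columns are tab-separated.
     * The flag is read from the last column, so extra columns between
     * the signal and the flag are tolerated and ignored.
     */
    public static Probe parseLine(String line) {
        String[] cols = line.split("\t", -1);
        if (cols.length < 5) {
            throw new IllegalArgumentException(
                "Expected at least 5 columns, got " + cols.length);
        }
        double signal = Double.parseDouble(cols[3].trim());
        String flag = cols[cols.length - 1].trim();
        return new Probe(cols[0].trim(), cols[1].trim(), cols[2].trim(), signal, flag);
    }

    public static void main(String[] args) {
        // Made-up example line, not real Dream5 data.
        Probe p = parseLine("TF1\tHK\tACGTACGTAC\t1234.5\t0");
        System.out.println(p.tf + " " + p.arrayType + " " + p.sequence
            + " " + p.signal + " " + p.flag);
    }
}
```

A line with fewer than five columns or a non-numeric signal raises an exception, which makes malformed input visible before training starts. A header line, if present, would have to be skipped by the caller.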
Loads a trained model from a file and predicts intensities for a given set of probe sequences in Dream5 submission format (see Dream5 homepage).
- Run by calling
java -jar Dream5Predict.jar
- home ... home directory (The home directory where the data reside, default = .)
- file ... input file (The input file in Dream5 format for predictions (column 1: TF, column 2: array type, column 3: sequence), one file per TF and array type, path relative to home directory)
- model ... model (File with the trained model stored as XML, path relative to home directory, default = model.xml)
- outfile ... outfile (File where the predictions are stored, default = predictions.txt)
java -jar Dream5Predict.jar file=TF1_ME.txt model=mymodel.xml outfile=predictions_TF1_ME.txt
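When many transcription factors have to be processed, the training and prediction calls can be scripted. The sketch below only assembles the two command lines from the `key=value` parameters documented on this page and, optionally, launches them with `ProcessBuilder`. The class `Dream5Workflow` and its method names are hypothetical. That the tools report failure through a non-zero exit code is also an assumption, since this page does not document exit codes.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public class Dream5Workflow {

    /** Builds the training call using the documented parameters file, starts, threads, model. */
    public static List<String> trainCommand(String trainFile, String modelFile,
                                            int starts, int threads) {
        return Arrays.asList("java", "-jar", "Dream5.jar",
            "file=" + trainFile, "starts=" + starts,
            "threads=" + threads, "model=" + modelFile);
    }

    /** Builds the prediction call using the documented parameters file, model, outfile. */
    public static List<String> predictCommand(String probeFile, String modelFile,
                                              String outFile) {
        return Arrays.asList("java", "-jar", "Dream5Predict.jar",
            "file=" + probeFile, "model=" + modelFile, "outfile=" + outFile);
    }

    /** Runs a command in the current directory and returns its exit code. */
    public static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        List<String> train = trainCommand("TF1_HK.txt", "mymodel.xml", 1, 2);
        List<String> predict =
            predictCommand("TF1_ME.txt", "mymodel.xml", "predictions_TF1_ME.txt");

        System.out.println(String.join(" ", train));
        System.out.println(String.join(" ", predict));

        // Uncomment to execute; requires both JARs and the input files in the
        // working directory. ASSUMPTION: exit code 0 means success.
        // if (run(train) == 0) {
        //     run(predict);
        // }
    }
}
```

Keeping the two commands as separate lists mirrors the split between the two binaries: the training command can be run on one machine and the prediction command on another, as long as the model file is copied across.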