This page contains supplemental information on our submission to DREAM6 challenge 4, "Classification of AML". For information about this challenge, please visit the DREAM6 homepage. In the following, we describe our approach and provide a download of the program used to make the predictions for the challenge, together with its source code.
The paper "Critical assessment of automated flow cytometry data analysis techniques" has been published in Nature Methods.
We base our classifier on the following assumptions:
- each experiment (tube) provides an independent indication of whether the patient suffers from AML
- for each cell in a tube, we may independently decide whether this cell is infected
- the fraction of cells classified as infected differs substantially between patients suffering from AML and healthy patients
Following these assumptions we take a step-wise approach using Jstacs:
1. We build a classifier that returns the probability that a specific cell from a specific tube is infected or not. Such a classifier is learned for each kind of tube (that means each selection of markers) independently, where the measurements of all cells of AML patients are used as foreground (positive) and the measurements of all cells of healthy patients are used as background (negative). The log-values of the measurements are modeled by normal distributions and the parameters are learned by the maximum conditional likelihood principle.
2. For each patient, we compute the fraction of cells in each tube classified as infected, obtaining a series of 8 such fractions per patient, one for each tube.
3. We create another classifier working on these 8 values using a logistic regression and learn its parameters by the maximum supervised posterior principle based on the labeling of patients. The output of this classifier is the final prediction.
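Steps 1 and 2 can be sketched outside of Jstacs as follows: each class is modeled by independent normal distributions over the log-measurements, and P(AML | cell) follows from the ratio of the two class densities (here with equal class priors). The parameter values below are purely illustrative; in the actual submission they are learned with Jstacs by the maximum conditional likelihood principle.

```java
public class CellClassifier {

    // Log-density of a product of independent normal distributions
    // over the log-measurements of one cell.
    static double logDensity(double[] x, double[] mean, double[] sd) {
        double ld = 0;
        for (int i = 0; i < x.length; i++) {
            double z = (x[i] - mean[i]) / sd[i];
            ld += -0.5 * z * z - Math.log(sd[i]) - 0.5 * Math.log(2 * Math.PI);
        }
        return ld;
    }

    // Posterior P(AML | cell), assuming equal class priors.
    static double posterior(double[] logX, double[] mFg, double[] sFg,
                            double[] mBg, double[] sBg) {
        double lf = logDensity(logX, mFg, sFg); // foreground (AML) class
        double lb = logDensity(logX, mBg, sBg); // background (healthy) class
        return 1.0 / (1.0 + Math.exp(lb - lf));
    }

    public static void main(String[] args) {
        // Illustrative 2-dimensional parameters; the real model uses
        // 7 dimensions (two scatter and five antibody measurements).
        double[] mFg = {1.0, 2.0}, sFg = {0.5, 0.5};
        double[] mBg = {0.0, 0.0}, sBg = {0.5, 0.5};
        double[] cell = {0.9, 1.8}; // log-measurements of one cell
        double p = posterior(cell, mFg, sFg, mBg, sBg);
        // Cells with p > 0.5 count toward the patient's "infected" fraction.
        System.out.println(p > 0.5);
    }
}
```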
For predictions on the unlabeled data, we use the trained classifiers and follow the same protocol as before: classify each cell in each tube, compute the fraction of cells classified as infected per tube, and apply the logistic regression classifier to obtain the final prediction.
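The final step, applied identically during training and on unlabeled data, is a logistic regression over the 8 per-tube fractions. The weights below are made up for illustration; the submitted classifier learns them by the maximum supervised posterior principle with a normal prior (standard deviation 1).

```java
public class PatientClassifier {

    // Logistic regression over the 8 patient posteriors
    // (fractions of cells classified as infected, one per tube).
    static double predict(double[] fractions, double[] weights, double bias) {
        double s = bias;
        for (int i = 0; i < fractions.length; i++) {
            s += weights[i] * fractions[i];
        }
        return 1.0 / (1.0 + Math.exp(-s)); // probability that the patient has AML
    }

    public static void main(String[] args) {
        // Hypothetical weights and bias, for illustration only.
        double[] w = {2, 2, 2, 2, 2, 2, 2, 2};
        double bias = -4;
        // One fraction per tube for one patient.
        double[] fractions = {0.6, 0.7, 0.5, 0.8, 0.6, 0.7, 0.9, 0.6};
        System.out.println(predict(fractions, w, bias) > 0.5);
    }
}
```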
This approach can be summarized by the following pseudo code, first for the training phase:
For each tube do
  Load the measurements for each cell in this tube;
  Compute log-values for all measurements;
  Create a sample based on the log-values of the individual cells and label all cells stemming from AML patients as foreground class and all cells from healthy patients as background class;
  Create a classifier based on 7 independent normal distributions, corresponding to the two scatter and five antibody measurements, for the foreground and the background class each;
  Estimate the parameters of the classifier (i.e., means and standard deviations) from this sample using the maximum conditional likelihood (MCL) learning principle;
  Classify the log-measurements of all cells of a patient and compute the fraction of cells (later denoted as patient posterior) with a probability P(AML | cell) > 0.5;
Done;
For each patient do
  Create a sequence of the 8 patient posteriors;
Done;
Create a sample from these sequences with labels according to the patients' state of health;
Create a classifier based on logistic regression;
Estimate the parameters of the classifier from this sample using the maximum supervised posterior (MSP) learning principle and a product normal prior with standard deviation 1;
Prediction phase:
For each tube do
  Load the measurements for each cell in this tube;
  Compute log-values for all measurements;
  Create a sample based on the log-values of the individual cells;
  Classify the log-measurements of all cells of a patient using the previously trained classifier for this tube and compute the fraction of cells (later denoted as patient posterior) with a probability P(AML | cell) > 0.5;
Done;
For each patient do
  Create a sequence of the 8 patient posteriors;
Done;
Finally, use the classifier based on logistic regression to obtain the final prediction based on this sequence of patient posteriors.
- Binaries, including the XML of the classifier used in the challenge
- Sources, which additionally require the Jstacs 2.0 sources to compile
- XML of the classifier used in the challenge
You can start the JAR contained in the binary on the command line by calling
java -jar Dream6C4.jar
Optional parameters are
- home ... home directory (the path to the data directory, default = ./)
- desc ... description file (file containing the description of patients for the training data, default = DREAM6_AML_TrainingSet.csv)
- key ... unhealthy key (key in the description file that indicates an unhealthy patient; all other keys are treated as healthy, default = AML)
- classifier ... classifier (path to the classifier; if load=true, this classifier is loaded from the file, otherwise it is stored to that file, default = final-classifiers.xml)
- load ... load (load the classifier instead of storing it, default = false)
- threads ... compute threads (the number of threads that are used to evaluate the objective function and its gradient, valid range = [1, 128], default = 2)
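A call that overrides some of these defaults might look as follows; the parameter names are taken from the list above, but the name=value syntax is an assumption (Jstacs-style command-line parsing) and has not been verified against this tool.

```shell
# Hypothetical invocation: load a previously stored classifier and
# use 4 compute threads. Parameter syntax is an assumption.
java -jar Dream6C4.jar home=./ load=true classifier=final-classifiers.xml threads=4
```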
If Dream6C4.jar is started from a directory containing
- the description file 'DREAM6_AML_TrainingSet.csv' of the challenge,
- a directory 'csv' with the CSVs of all patients and filenames of the format 'XXX.CSV', where XXX is equal to the file number (with additional leading 0s) in 'DREAM6_AML_FilesTubesAndSubjects.CSV', and
- the classifier description 'final-classifiers.xml', which is part of the binary download and can also be downloaded separately,
you obtain the predictions we submitted to the DREAM challenge.
If you want to use this method for other data sets, feel free to adapt the sources to your specific problem. To compile the sources, you also need Jstacs 2.0 (sources or binary .jar) in your class path.