https://www.jstacs.de/api.php?action=feedcontributions&user=Grau&feedformat=atomJstacs - User contributions [en]2024-03-28T10:28:01ZUser contributionsMediaWiki 1.38.2https://www.jstacs.de/index.php?title=AnnoTALE&diff=1164AnnoTALE2022-09-14T09:48:18Z<p>Grau: /* Class Builders */</p>
<hr />
<div>[[File:AnnoTALE.png|130px|left]]<br />
Transcription activator-like effectors (TALEs) are virulence factors of plant-pathogenic Xanthomonas spp. that function as gene activators inside plant host cells.<br />
<br />
AnnoTALE is a suite of applications for identifying and analysing TALEs in Xanthomonas genomes, for clustering TALEs into classes by their RVD sequences, for assigning novel TALEs to existing classes, for proposing TALE names using a unified nomenclature, and for predicting targets of individual TALEs and TALE classes.<br />
<br />
AnnoTALE is available as a JavaFX-based stand-alone application with graphical user interface for interactive analysis sessions. <br />
In addition, we provide a command line application that may be integrated into other pipelines. <br />
Both use identical code for the actual analysis, ensuring consistent results between the two versions.<br />
<br />
<br />
<br />
If you use AnnoTALE, please cite:<br />
<br />
Jan Grau, Maik Reschke, Annett Erkes, Jana Streubel, Richard D. Morgan, Geoffrey G. Wilson, Ralf Koebnik and Jens Boch. [http://www.nature.com/articles/srep21077 AnnoTALE: bioinformatics tools for identification, annotation, and nomenclature of TALEs from ''Xanthomonas'' genomic sequences]. Scientific Reports 6:21077, DOI: 10.1038/srep21077, 2016.<br />
<br />
<br />
For evolution-related studies using the comparative features of AnnoTALE, please also cite:<br />
<br />
Annett Erkes, Maik Reschke, Jens Boch, and Jan Grau. [https://doi.org/10.1093/gbe/evx108 Evolution of transcription activator-like effectors in Xanthomonas oryzae]. Genome Biology and Evolution, 9(6):1599–1615, 2017.<br />
<br />
<br />
If you use PrediTALE for predicting TALE targets, please also cite:<br />
<br />
Annett Erkes, Stefanie Mücke, Maik Reschke, Jens Boch, and Jan Grau. [https://doi.org/10.1371/journal.pcbi.1007206 PrediTALE: A novel model learned from quantitative data allows for new perspectives on TALE targeting]. PLOS Computational Biology, 15(7):1–31, 2019.<br />
<br />
<br />
'''Important:''' If you would like to use the unified nomenclature of AnnoTALE in one of your publications including new TALEs or sequenced genomes, please contact us (grau@informatik.uni-halle.de) to organize the inclusion of your TALEs into the official class definition of AnnoTALE and to create stable TALE names that are unique to your TALEs.<br />
<br />
<br />
== AnnoTALE with GUI ==<br />
<br />
[[File:AnnoTALEscreenshot.jpg]]<br />
<br />
AnnoTALE is based on the implementation of JavaFX in Java >=8.<br />
<br />
We provide AnnoTALE as a runnable JAR file for those with a current version of Java 8 (at least update 45) on their machine.<br />
<br />
For users' convenience, we also provide pre-packaged versions of AnnoTALE for Mac OS X and Windows, which include Java in the required version. Each of these is available in two versions with different memory requirements (2GB and 6GB). As long as the main memory (RAM) of your machine is sufficient, we recommend using the 6GB version of AnnoTALE.<br />
<br />
<br />
=== Download ===<br />
<br />
''AnnoTALE is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.''<br />
<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.5.jar Runnable Jar] (requires installed Java >= 8, update 45), may be run under Linux, macOS and Windows<br />
* macOS app: [http://www.jstacs.de/downloads/AnnoTALE-1.5.app-2GB.zip 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.5.app-6GB.zip 6GB version], ZIP archive containing a macOS app including AnnoTALE and all required Java modules. For running this app, it might be required to explicitly give it running permissions in "System Preferences" -> "Security & Privacy" -> "General", which should list AnnoTALE after the first (possibly unsuccessful) starting attempt. Approve opening AnnoTALE by clicking on the button "Open Anyway" next to it.<br />
* Windows installer of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.5-2GB.exe 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.5-6GB.exe 6GB version, 64bit Java]<br />
* Windows version without installer: [http://www.jstacs.de/downloads/AnnoTALE-1.5-win.zip 6GB version, 64bit Java], ZIP archive containing AnnoTALE, all required Java modules, and a Windows batch file. For starting AnnoTALE, double-click AnnoTALE.bat.<br />
<br />
=== Source code ===<br />
<br />
The AnnoTALE source code is available from [https://github.com/Jstacs/Jstacs/tree/master/projects/xanthogenomes github].<br />
<br />
<br />
=== User Guide ===<br />
<br />
We provide an [http://www.jstacs.de/downloads/AnnoTALE-UserGuide-1.0.pdf AnnoTALE User Guide] in PDF format, including a detailed description of all AnnoTALE tools and installation instructions.<br />
<br />
== AnnoTALE command line application ==<br />
<br />
The AnnoTALE command line application is available as a [http://www.jstacs.de/downloads/AnnoTALEcli-1.5.jar runnable Jar]. For running the program and a quick help, type<br />
<br />
java -jar AnnoTALEcli-1.5.jar<br />
<br />
For larger analyses, it might be necessary to increase the memory allocated by the Java VM using the <code>-Xms</code> and <code>-Xmx</code> parameters, for instance<br />
java -Xms512M -Xmx6G -jar AnnoTALEcli-1.5.jar<br />
<br />
There is no separate User Guide for the AnnoTALE command line application. However, the [http://www.jstacs.de/downloads/AnnoTALE-UserGuide-1.0.pdf User Guide for the GUI version] describes all AnnoTALE tools, their parameters, and their outputs, which are identical in the CLI version.<br />
<br />
You obtain a list of all AnnoTALE tools by calling<br />
<br />
java -jar AnnoTALEcli-1.5.jar<br />
<br />
Output:<br />
<br />
Available tools:<br />
<br />
predict - TALE Prediction<br />
analyze - TALE Analysis<br />
build - TALE Class Builder<br />
loadAndView - Load and View TALE Classes<br />
assign - TALE Class Assignment<br />
rename - Rename TALEs in File<br />
targets - Predict and Intersect Targets<br />
presence - TALE Class Presence<br />
repdiff - TALE Repeat Differences<br />
preditale - PrediTALE<br />
dertale - DerTALE<br />
<br />
Syntax: java -jar AnnoTALEcli-1.5.jar <toolname> [<parameter=value> ...]<br />
<br />
Further info about the tools is given with<br />
java -jar AnnoTALEcli-1.5.jar <toolname> info<br />
<br />
Tool parameters are listed with<br />
java -jar AnnoTALEcli-1.5.jar <toolname><br />
<br />
You get a list of input parameters by calling AnnoTALEcli-1.5.jar with the corresponding tool name, e.g.,<br />
<br />
java -jar AnnoTALEcli-1.5.jar predict<br />
<br />
Output:<br />
<br />
At least one parameter has not been set (correctly):<br />
<br />
Parameters of tool "TALE Prediction" (predict):<br />
g - Genome (The input Xanthomonas genome in FastA or Genbank format) = null<br />
s - Strain (The name of the strain, will be used for annotated TALEs, OPTIONAL) = null<br />
outdir - The output directory, defaults to the current working directory (.) = .<br />
<br />
You get a description of each tool by calling AnnoTALEcli-1.5.jar with the corresponding tool name and keyword "info", e.g.,<br />
<br />
java -jar AnnoTALEcli-1.5.jar predict info<br />
<br />
Output:<br />
A detailed description of all tools is available in the AnnoTALE User Guide (http://www.jstacs.de/downloads/AnnoTALE-UserGuide-1.0.pdf).<br />
<br />
*TALE Prediction* predicts transcription activator-like effector (TALE) genes in an input sequence, typically a 'Xanthomonas' genome.<br />
<br />
'TALE Prediction' is based in HMMer nucleotide HMM models that describe N-terminus, repeat region, and C-terminus of TALEs.<br />
<br />
The input 'Genome' may be provided in FastA or Genbank format. <br />
Optionally, you may provide a strain name that will be used in the temporary TALE names and names of output files.<br />
<br />
Regardless of the input format, 'TALE Prediction' generates output in Genbank format containing the annotations of TALE genes. If the original input has already been a Genbank file, TALE annotations are added to the existing ones.<br />
In addition, 'TALE Prediction' generates annotations in GFF format, and also outputs the DNA and AS sequences of the predicted TALEs in FastA format.<br />
<br />
'TALE Prediction' tries hard to make the CDS annotation a proper gene model, starting from a start codon and ending with a Stop. If either start or stop codon are located within the originally predicted region that is homologous to TALE genes, this original hit region is still reported as mRNA.<br />
Putative pseudo genes, e.g., with premature stop codons, are marked accordingly.<br />
<br />
The TALE DNA sequences output of 'TALE Prediction' may serve as input of the 'TALE Analysis', 'TALE Class Builder', and 'TALE Class Assignment' tools.<br />
<br />
If you experience problems using 'TALE Prediction', please contact us.<br />
<br />
=== Standard pipeline ===<br />
<br />
Assuming that your current working directory contains the AnnoTALEcli Jar file, a genome of interest (of a hypothetical 'Xoo' strain PXO999 with accession CP1234567) in a FastA file "genome.fa", all rice promoters in a FastA file "Rice-promoters.fa", and a directory "out" designated to hold all output files, a typical AnnoTALE pipeline could look like<br />
<br />
java -jar AnnoTALEcli-1.5.jar predict g=genome.fa outdir=out<br />
<br />
java -jar AnnoTALEcli-1.5.jar analyze t=out/TALE_DNA_sequences.fasta outdir=out<br />
<br />
java -jar AnnoTALEcli-1.5.jar loadAndView outdir=out<br />
<br />
java -jar AnnoTALEcli-1.5.jar assign c=out/Class_builder_download.xml t=out/TALE_DNA_parts.fasta s="Xoo PXO999" a="CP1234567" outdir=out<br />
<br />
java -jar AnnoTALEcli-1.5.jar rename r=out/TALE_names_\(Xoo_PXO999\).tsv i=out/Genbank__TALE_predictions.gb outdir=out<br />
<br />
java -jar AnnoTALEcli-1.5.jar targets i=Rice-promoters.fa p="TALEs in class builder" c=out/Augmented_class_builder_\(Xoo_PXO999\).xml outdir=out<br />
<br />
Afterwards, you find the output files of all those tools in the directory "out". The output files and directories are named in analogy to the names in the AnnoTALE GUI version (see [http://www.jstacs.de/downloads/AnnoTALE-UserGuide-1.0.pdf User Guide for the GUI version]).<br />
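<br />
For integration into other pipelines, the six steps above can also be driven from a script. The following Python sketch only rebuilds the command lines shown above and, by default, prints them without executing anything. The function names <code>build_pipeline</code> and <code>run_pipeline</code> are hypothetical helpers and not part of AnnoTALE; the sketch also assumes that output file names are derived from the strain name by replacing spaces with underscores, as in the example above.<br />

```python
import shlex
import subprocess

JAR = "AnnoTALEcli-1.5.jar"  # assumed to be located in the current working directory


def build_pipeline(genome, promoters, strain, accession, outdir="out"):
    """Return the six pipeline steps from above as argument lists."""
    # Assumption: file names contain the strain name with spaces replaced
    # by underscores, e.g. "Xoo PXO999" -> "Xoo_PXO999".
    tag = strain.replace(" ", "_")
    base = ["java", "-jar", JAR]
    return [
        base + ["predict", f"g={genome}", f"outdir={outdir}"],
        base + ["analyze", f"t={outdir}/TALE_DNA_sequences.fasta", f"outdir={outdir}"],
        base + ["loadAndView", f"outdir={outdir}"],
        base + ["assign", f"c={outdir}/Class_builder_download.xml",
                f"t={outdir}/TALE_DNA_parts.fasta", f"s={strain}",
                f"a={accession}", f"outdir={outdir}"],
        base + ["rename", f"r={outdir}/TALE_names_({tag}).tsv",
                f"i={outdir}/Genbank__TALE_predictions.gb", f"outdir={outdir}"],
        base + ["targets", f"i={promoters}", "p=TALEs in class builder",
                f"c={outdir}/Augmented_class_builder_({tag}).xml", f"outdir={outdir}"],
    ]


def run_pipeline(steps, dry_run=True):
    """Print each command; execute it only if dry_run is False."""
    for args in steps:
        print(shlex.join(args))
        if not dry_run:
            # check=True stops the pipeline at the first failing step
            subprocess.run(args, check=True)


if __name__ == "__main__":
    steps = build_pipeline("genome.fa", "Rice-promoters.fa", "Xoo PXO999", "CP1234567")
    run_pipeline(steps, dry_run=True)
```

Passing the arguments as a list avoids manual escaping of the parentheses and spaces in the file names. Set <code>dry_run=False</code> only once Java and the Jar file are available.<br />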
<br />
==Version history==<br />
<br />
===AnnoTALE===<br />
'''Version 1.5'''<br />
* new "sensitive" mode of TALE Prediction tool, which may annotate TALEs in a wider range of Xanthomonas strains at the expense of an increased runtime; turned off by default<br />
* significantly improved speed of TALE Class Assignment tool<br />
* citation information for individual AnnoTALE tools available under a dedicated button in the GUI version and from the "info" command issued for individual tools in the command line version<br />
* bugfix for TALE Prediction in rather fragmented genome assemblies, where TALE predictions may extend to the ends of contigs/sequences<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.5.jar Runnable Jar] (requires installed Java >= 8, update 45), may be run under Linux, macOS and Windows<br />
* macOS app: [http://www.jstacs.de/downloads/AnnoTALE-1.5.app-2GB.zip 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.5.app-6GB.zip 6GB version], ZIP archive containing a macOS app including AnnoTALE and all required Java modules. For running this app, it might be required to explicitly give it running permissions in "System Preferences" -> "Security & Privacy" -> "General", which should list AnnoTALE after the first (possibly unsuccessful) starting attempt.<br />
* Windows installer of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.5-2GB.exe 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.5-6GB.exe 6GB version, 64bit Java]<br />
* Windows version without installer: [http://www.jstacs.de/downloads/AnnoTALE-1.5-win.zip 6GB version, 64bit Java], ZIP archive containing AnnoTALE, all required Java modules, and a Windows batch file. For starting AnnoTALE, double-click AnnoTALE.bat.<br />
<br />
<br />
'''Version 1.4.1'''<br />
* first version to use the updated Class Builder including a large number of recently sequenced strains<br />
* minor changes to the output of the 'Load and View TALE Classes' tool, now including the accessions in the TALE sequence output<br />
* changes to the Class Builder format to account for the increased size of class hierarchy, which previously resulted in unnecessarily large files<br />
* 32bit/1GB Windows version no longer included<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.4.1.jar Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-2GB.dmg 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-6GB.dmg 6GB version]<br />
* Windows installer of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-2GB.exe 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-6GB.exe 6GB version, 64bit Java]<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.4.1.jar AnnoTALE 1.4.1 command line application]<br />
<br />
<br />
'''Version 1.4:'''<br />
* first version containing [[PrediTALE]] and DerTALE tools for target site prediction<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.4.jar Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.4-2GB.dmg 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.4-6GB.dmg 6GB version]<br />
* Windows installer of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.4-2GB.exe 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.4-6GB.exe 6GB version, 64bit Java]; in addition, we provide a [http://www.jstacs.de/downloads/AnnoTALE-1.4-1GB.exe 1GB version with 32bit Java] for earlier and 32bit versions of Windows. Please use this version only if absolutely necessary, as some tools may not work due to memory restrictions.<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.4.jar AnnoTALE 1.4 command line application]<br />
<br />
<br />
'''Version 1.3:'''<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.3.jar AnnoTALE 1.3 Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE 1.3 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.3-2GB.dmg AnnoTALE 1.3 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.3-6GB.dmg AnnoTALE 1.3 6GB version]<br />
* Windows installer of AnnoTALE 1.3 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.3-2GB.exe AnnoTALE 1.3 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.3-6GB.exe AnnoTALE 1.3 6GB version, 64bit Java]; [http://www.jstacs.de/downloads/AnnoTALE-1.3-1GB.exe AnnoTALE 1.3 1GB version with 32bit Java]<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.3.jar AnnoTALE 1.3 command line application]<br />
<br />
Changes:<br />
* modified format of Class Builder files allowing for faster download using the "Load and View TALE Classes" tool; old Class Builder files can still be loaded<br />
* "TALE Class Presence" now also outputs a phylogenetic tree of strains based on TALEome similarities<br />
<br />
<br />
'''Version 1.2:'''<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.2.jar AnnoTALE 1.2 Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE 1.2 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.2-2GB.dmg AnnoTALE 1.2 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.2-6GB.dmg AnnoTALE 1.2 6GB version]<br />
* Windows installer of AnnoTALE 1.2 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.2-2GB.exe AnnoTALE 1.2 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.2-6GB.exe AnnoTALE 1.2 6GB version, 64bit Java]; [http://www.jstacs.de/downloads/AnnoTALE-1.2-1GB.exe AnnoTALE 1.2 1GB version with 32bit Java]<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.2.jar AnnoTALE 1.2 command line application]<br />
<br />
Changes:<br />
* Results and loaded files may now be renamed in the GUI by clicking on the corresponding name in the "Data" panel<br />
* Minor bugfixes and improvements of the GUI (Protocol may be erased, columns in "Data" panel renamed for clarity, consistency of paths in the open/save dialogs under Linux)<br />
* Two new tools: "TALE Class Presence" and "TALE Repeat differences"<br />
<br />
'''Version 1.1:'''<br />
<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.1.jar AnnoTALE 1.1 Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE 1.1 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.1-2GB.dmg AnnoTALE 1.1 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.1-6GB.dmg AnnoTALE 1.1 6GB version]<br />
* Windows installer of AnnoTALE 1.1 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.1-2GB.exe AnnoTALE 1.1 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.1-6GB.exe AnnoTALE 1.1 6GB version, 64bit Java]; [http://www.jstacs.de/downloads/AnnoTALE-1.1-1GB.exe AnnoTALE 1.1 1GB version with 32bit Java]<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.1.jar AnnoTALE 1.1 command line application]<br />
<br />
Changes:<br />
* Additional output for the "Load and View TALE Classes" tool<br />
* "TALE Class Builder" and "TALE Class Assignment" now also accept RVD sequences (separated by dashes) as input. However, this is not recommended and some features (e.g., highlighting of aberrant repeats) will not be available. Only complete TALE DNA sequences will be accepted for inclusion into the official Class Builder.<br />
* The internal help pages now link to the PDF User Guide<br />
<br />
'''Version 1.0:'''<br />
<br />
''Initial AnnoTALE release''<br />
<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.0.jar AnnoTALE 1.0 Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.0-2GB.dmg AnnoTALE 1.0 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.0-6GB.dmg AnnoTALE 1.0 6GB version]<br />
* Windows installer of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.0-2GB.exe AnnoTALE 1.0 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.0-6GB.exe AnnoTALE 1.0 6GB version, 64bit Java]; [http://www.jstacs.de/downloads/AnnoTALE-1.0-1GB.exe AnnoTALE 1.0 1GB version with 32bit Java]<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.0.jar AnnoTALE 1.0 command line application]<br />
<br />
=== Class Builders ===<br />
<br />
* [http://www.jstacs.de/downloads/class_definitions_29_07_2022.xml.gz Version 29/07/2022]: used for "Download current definition" in "Load and View TALE Classes" within AnnoTALE version 1.4.1 and later<br />
* [http://www.jstacs.de/downloads/class_definitions_17_08_2021.xml.gz Version 17/08/2021]: compatible with AnnoTALE version 1.4.1 and later<br />
* [http://www.jstacs.de/downloads/class_definitions_09_05_2021.xml.gz Version 09/05/2021]: compatible with AnnoTALE version 1.4.1 and later<br />
* [http://www.jstacs.de/downloads/class_definitions_10_10_2020.xml.gz Version 10/10/2020]: compatible with AnnoTALE version 1.4.1 and later<br />
* [http://www.jstacs.de/downloads/class_definitions_20_06_2019.xml.gz Version 20/06/2019]: compatible with AnnoTALE version 1.4.1 and later<br />
* [http://www.jstacs.de/downloads/class_definitions_29_09_2018.xml.gz Version 29/09/2018]: used for "Download current definition" in "Load and View TALE Classes" within AnnoTALE version 1.3 and later<br />
* [http://www.jstacs.de/downloads/class_definitions_09_03_2017.xml Version 09/03/2017]: used for "Download current definition" in "Load and View TALE Classes" within AnnoTALE version 1.2 and earlier<br />
* [http://www.jstacs.de/downloads/class_definitions_11_03_2016.xml Version 11/03/2016]<br />
* [http://www.jstacs.de/downloads/class_definitions_29_01_2016.xml Version 29/01/2016]<br />
* [http://www.jstacs.de/downloads/class_definitions_19_10.xml Version 19/10/2015]: used in the AnnoTALE publication (Grau ''et al.'', Sci Rep, 2016)</div>Grauhttps://www.jstacs.de/index.php?title=Catchitt&diff=1158Catchitt2021-10-07T20:51:46Z<p>Grau: /* Version history */</p>
<hr />
<div>Catchitt is a collection of tools for predicting cell type-specific binding regions of transcription factors (TFs) based on binding motifs and chromatin accessibility assays.<br />
The initial implementation of this methodology has been one of the winning approaches of the ENCODE-DREAM challenge ([https://www.synapse.org/#!Synapse:syn6131484/wiki/402026]) and is described in a preprint (https://www.biorxiv.org/content/early/2017/12/06/230011 doi: 10.1101/230011) and a recent [https://doi.org/10.1186/s13059-018-1614-y paper].<br />
The implementation in Catchitt has been streamlined and slightly simplified to make its application more straightforward. Specifically, we reduced the set of chromatin accessibility features to the most important ones, we simplified the sampling strategy of initial negative examples in the training step, and we omitted quantile normalization of chromatin accessibility features.<br />
<br />
== Catchitt tools ==<br />
<br />
Catchitt comprises five tools for the individual steps of the pipeline (see below); the tool list printed by the program additionally contains the tool "methyl" for methylation levels. The tool "labels" computes labels for genomic regions from "conservative" (i.e., IDR-thresholded) and "relaxed" ChIP-seq peaks.<br />
The tool "access" computes chromatin accessibility features from DNase-seq or ATAC-seq data, either based on fold-enrichment tracks in Bigwig format (e.g., MACS output) or based on SAM/BAM files of mapped reads.<br />
The tool "motif" computes motif-based features from genomic sequence and PWMs in Jaspar or HOCOMOCO format, or motif models from [[Dimont]], including [[Slim]] models.<br />
The tool "itrain" performs iterative training of a series of classifiers based on labels, chromatin accessibility features, and motif features.<br />
The tool "predict" predicts binding probabilities of genomic regions based on trained classifiers and feature files. The feature files may either be measured on the training cell type (e.g., other chromosomes, "within cell type" case) or on a different cell type.<br />
<br />
== Downloads ==<br />
<br />
We provide Catchitt as a pre-compiled JAR file and also publish its source code under GPL 3. For compiling Catchitt from source files, Jstacs (v. 2.3 and later) and the corresponding external libraries are required.<br />
<br />
''Catchitt is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.''<br />
<br />
* [http://www.jstacs.de/downloads/Catchitt-0.1.4.jar JAR download]<br />
* the source code of Catchitt is available from [https://github.com/Jstacs/Jstacs github] in package projects.encodedream.<br />
* [http://www.jstacs.de/downloads/motifs.tgz motifs] used in the ENCODE-DREAM challenge<br />
<br />
== Citation ==<br />
<br />
If you use Catchitt in your research, please cite<br />
<br />
J. Keilwagen, S. Posch, and J. Grau. [https://doi.org/10.1186/s13059-018-1614-y Accurate prediction of cell type-specific transcription factor binding]. ''Genome Biology'', 20(1):9, 2019.<br />
<br />
== Usage ==<br />
<br />
In the following, <code>Catchitt.jar</code> stands for the Catchitt binary in its current version, which is currently 0.1.4. Every occurrence of <code>Catchitt.jar</code> therefore needs to be replaced by <code>Catchitt-0.1.4.jar</code> when running the code examples with the current Catchitt binary.<br />
<br />
<br />
Catchitt can be started by calling<br />
<br />
java -jar Catchitt.jar<br />
<br />
on the command line. This lists the names of the available tools with a short description:<br />
<br />
Available tools:<br />
<br />
access - Chromatin accessibility<br />
methyl - Methylation levels<br />
motif - Motif scores<br />
labels - Derive labels<br />
itrain - Iterative Training<br />
predict - Prediction<br />
<br />
Syntax: java -jar Catchitt.jar <toolname> [<parameter=value> ...]<br />
<br />
Further info about the tools is given with<br />
java -jar Catchitt.jar <toolname> info<br />
<br />
Tool parameters are listed with<br />
java -jar Catchitt.jar <toolname><br />
<br />
== Tools ==<br />
<br />
=== Derive labels ===<br />
<br />
''Derive labels'' computes labels for genomic regions based on ChIP-seq peak files. The input ChIP-seq peak files must be provided in narrowPeak format and may come in 'conservative', i.e., IDR-thresholded, and 'relaxed' flavors. If only a single peak file is available, both of the corresponding parameters may be set to this one peak file. The parameter for the bin width defines the resolution at which genomic regions are assigned a label, while the parameter for the region width defines the size of the regions considered. If, for instance, the bin width is set to 50 and the region width to 100, regions of 100 bp shifted by 50 bp along the genome are labeled. The labels assigned may be 'S' (summit) if the current bin contains the annotated summit of a conservative peak, 'B' (bound) if the current region overlaps a conservative peak by at least half the region width, 'A' (ambiguous) if the current region overlaps a relaxed peak by at least 1 bp, or 'U' (unbound) if it overlaps with none of the peaks. The output is provided as a gzipped file 'Labels.tsv.gz' with columns chromosome, start position, and label. This output file together with a protocol of the tool run is saved to the specified output directory.<br />
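<br />
The labeling rule described above can be illustrated with a small Python sketch. It is not the Catchitt implementation, and it makes several assumptions that the description leaves open: the region is centered on the bin, the rules are tested in the order S, B, A, U, summits are given as absolute positions (narrowPeak files store them as offsets relative to the peak start), and intervals are half-open. The function names are hypothetical.<br />

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length of the overlap of two half-open intervals [start, end)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def label_bin(bin_start, bin_width, region_width, conservative, relaxed):
    """Label a single bin.

    conservative: list of (start, end, summit) with absolute summit positions
    relaxed:      list of (start, end)
    """
    bin_end = bin_start + bin_width
    # Assumption: the region of width region_width is centered on the bin.
    center = bin_start + bin_width // 2
    reg_start = center - region_width // 2
    reg_end = reg_start + region_width

    # 'S': the bin contains the summit of a conservative peak
    if any(bin_start <= summit < bin_end for _, _, summit in conservative):
        return "S"
    # 'B': the region overlaps a conservative peak by at least half its width
    if any(overlap(reg_start, reg_end, s, e) >= region_width / 2
           for s, e, _ in conservative):
        return "B"
    # 'A': the region overlaps a relaxed peak by at least 1 bp
    if any(overlap(reg_start, reg_end, s, e) >= 1 for s, e in relaxed):
        return "A"
    # 'U': no overlap with any peak
    return "U"


if __name__ == "__main__":
    cons = [(1000, 1200, 1100)]
    rel = [(900, 1300)]
    for start in (0, 900, 1000, 1100):
        print(start, label_bin(start, 50, 100, cons, rel))
```

In this sketch, a region that overlaps a conservative peak by less than half the region width is labeled 'A' only if it also overlaps a relaxed peak, which is the usual case when the conservative peaks are a subset of the relaxed ones.<br />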
<br />
''Derive labels'' may be called with<br />
<br />
java -jar Catchitt.jar labels<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>Conservative peaks (NarrowPeak file containing the conservative peaks)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>Relaxed peaks (NarrowPeak file containing the relaxed peaks)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>FAI of genome (FastA index file of the genome)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bin width (The width of the genomic bins considered, valid range = [1, 10000], default = 50)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rw</font></td><br />
<td>Region width (The width of the genomic regions considered for overlaps, valid range = [1, 10000], default = 50)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar Catchitt.jar labels c=conservative.narrowPeak r=relaxed.narrowPeak f=hg19.fa.fai b=50 rw=200 outdir=labels<br />
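<br />
As a quick sanity check of the result, the label distribution can be counted from the output file. The following Python sketch assumes that 'Labels.tsv.gz' is tab-separated with the three columns named above (chromosome, start position, label); rows that do not match this layout, such as a possible header line, are skipped. The function name <code>count_labels</code> is a hypothetical helper.<br />

```python
import gzip
from collections import Counter

VALID_LABELS = {"S", "B", "A", "U"}


def count_labels(path):
    """Count labels in a gzipped TSV with columns chromosome, start, label."""
    counts = Counter()
    with gzip.open(path, "rt") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            # Skip header lines or malformed rows, if any
            if len(fields) < 3 or fields[2] not in VALID_LABELS:
                continue
            counts[fields[2]] += 1
    return counts
```

With the output directory of the example above, this could be called as <code>count_labels("labels/Labels.tsv.gz")</code>. Unbound regions ('U') are typically expected to dominate by a wide margin.<br />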
<br />
<br />
=== Chromatin accessibility ===<br />
<br />
''Chromatin accessibility'' computes several chromatin accessibility features from DNase-seq or ATAC-seq data provided as fold-enrichment tracks or SAM/BAM files of mapped reads. Features are computed with a certain resolution defined by the bin width parameter. If this parameter is set to 50, for instance, features are computed for non-overlapping 50 bp bins along the genome. If input data are provided as a SAM/BAM file, coverage information is extracted and normalized locally in a similar fashion as proposed for the MACS peak caller. Output is provided as a gzipped file 'Chromatin_accessibility.tsv.gz' with columns chromosome, start position of the bin, minimum coverage and median coverage in the current bin, minimum coverage in 1000 bp regions before and after the current bin, maximum coverage in 1000 bp regions before and after the current bin, the number of steps in the coverage profile, and the number of monotonically increasing and decreasing steps in the coverage profile of the current bin. This output file together with a protocol of the tool run is saved to the specified output directory.<br />
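<br />
To make the bin-wise features more concrete, the following Python sketch computes a subset of them (minimum and median coverage of the bin, and minimum and maximum coverage in the flanking regions) from a per-base coverage list. It is an illustration only and not the Catchitt implementation: the step-based features are omitted, the flanks before and after the bin are assumed to be evaluated separately, and at the sequence ends, where a flank is empty, the sketch falls back to the bin itself. The function name <code>bin_features</code> is hypothetical.<br />

```python
from statistics import median


def bin_features(coverage, bin_width=50, flank=1000):
    """Return one tuple per non-overlapping bin:
    (start, min, median, min_before, min_after, max_before, max_after)."""
    rows = []
    for start in range(0, len(coverage) - bin_width + 1, bin_width):
        end = start + bin_width
        current = coverage[start:end]
        # Assumption: an empty flank at a sequence end is replaced by the bin itself
        before = coverage[max(0, start - flank):start] or current
        after = coverage[end:end + flank] or current
        rows.append((start, min(current), median(current),
                     min(before), min(after), max(before), max(after)))
    return rows


if __name__ == "__main__":
    toy = [0] * 50 + [2] * 50 + [4] * 50
    for row in bin_features(toy, bin_width=50, flank=50):
        print(row)
```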
<br />
''Chromatin accessibility'' may be called with<br />
<br />
java -jar Catchitt.jar access<br />
<br />
and has the following parameters<br />
<br />
<br />
<table border=0 cellpadding=10 align="center"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>Data source (The format of the input file containing the coverage information, range={BAM/SAM, Bigwig}, default = BAM/SAM)<table border=0 cellpadding=10 align="center"><br />
<tr><td colspan=3>Parameters for selection &quot;BAM/SAM&quot;:</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>Input SAM/BAM (The input file containing the mapped DNase-seq/ATAC-seq reads)</td><br />
<td>FILE</td><br />
</tr><br />
<tr><td colspan=3>Parameters for selection &quot;Bigwig&quot;:</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>Input Bigwig (The input file containing the mapped DNase-seq/ATAC-seq reads)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>FastA index (The genome index)</td><br />
<td>FILE</td><br />
</tr><br />
</table></td><td></td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bin width (The width of the genomic bins considered)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar Catchitt.jar access d="Bigwig" i=fold_enrich.bw f=hg19.fa.fai b=50 outdir=dnase<br />
<br />
<br />
=== Methylation levels ===<br />
''Methylation levels'' may be called with<br />
<br />
java -jar Catchitt.jar methyl<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>Input Bed.gz (The bedMethyl file (gzipped) containing the methylation levels, mime = bed.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>FastA index (The genome index, mime = fai)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bin width (The width of the genomic bins considered)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar Catchitt.jar methyl i=Input_Bed.gz f=hg19.fa.fai b=50<br />
<br />
<br />
=== Motif scores ===<br />
<br />
''Motif scores'' computes features based on motif scores of a given motif model scanning sub-sequences along the genome. Motif scores are aggregated in bins of the specified width as maximum score and log of the average exponential score (i.e., average log-likelihood in case of statistical models). The motif model may be provided as PWMs in HOCOMOCO or PFMs in Jaspar format, or as [[Dimont]] motif models in XML format. For more complex motif models like Slim models, the current implementation uses several indexes to speed up the scanning process. However, computation of these indexes is rather memory-consuming and often not reasonable for simple PWM models. Hence, a low-memory variant of the tool is available, which is typically only slightly slower for PWM models but substantially slower for Slim models. Output is provided as a gzipped file 'Motif_scores.tsv.gz' containing columns chromosome, start position, maximum and average score. This output file together with a protocol of the tool run is saved to the specified output directory.<br />
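<br />
The two aggregates described above can be illustrated with a small shell sketch. It is not the Catchitt implementation; the function name <code>aggregate_scores</code> is hypothetical, and we assume that the per-position scores of one bin are given one per line.<br />
<br />
```shell
# Hypothetical helper: aggregate the motif scores of one bin (one score per
# line on stdin) into the maximum score and the log of the average
# exponential score.
aggregate_scores() {
    awk '
        NR == 1 { max = $1 }
        { if ($1 > max) max = $1; s[NR] = $1 }
        END {
            if (NR == 0) exit 1
            # subtract the maximum before exponentiating for numerical stability
            for (i = 1; i <= NR; i++) sum += exp(s[i] - max)
            printf "%.6f\t%.6f\n", max, max + log(sum / NR)
        }'
}

# Three made-up scores of one bin
printf '%s\n' -5.0 -1.0 -3.0 | aggregate_scores
```
<br />
The second value is never larger than the first, and both coincide if all scores in the bin are identical.<br />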
<br />
<br />
''Motif scores'' may be called with<br />
<br />
java -jar Catchitt.jar motif<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Motif model (The motif model in Dimont, HOCOMOCO, or Jaspar format, range={Dimont, HOCOMOCO, Jaspar}, default = Dimont)<table border=0 cellpadding=10 align="center"><br />
<tr><td colspan=3>Parameters for selection &quot;Dimont&quot;:</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>Dimont motif (Dimont motif model description)</td><br />
<td>FILE</td><br />
</tr><br />
<tr><td colspan=3>Parameters for selection &quot;HOCOMOCO&quot;:</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">h</font></td><br />
<td>HOCOMOCO PWM (PWM from the HOCOMOCO database)</td><br />
<td>FILE</td><br />
</tr><br />
<tr><td colspan=3>Parameters for selection &quot;Jaspar&quot;:</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">j</font></td><br />
<td>Jaspar PFM (PFM in Jaspar format)</td><br />
<td>FILE</td><br />
</tr><br />
</table></td><td></td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>Genome (Genome as FastA file)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>FAI of genome (FastA index file of the genome)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bin width (The width of the genomic bins considered)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">l</font></td><br />
<td>Low-memory mode (Use slower mode with a smaller memory footprint, default = true)</td><br />
<td>BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example'''<br />
<br />
java -jar Catchitt.jar motif m=HOCOMOCO h=motif.pwm g=hg19.fa f=hg19.fa.fai b=50 outdir=motifs<br />
<br />
=== Iterative Training ===<br />
<br />
''Iterative Training'' performs an iterative training with the specified number of iterations to obtain a series of classifiers that may be used for predictions in the same cell type or in other cell types based on a corresponding set of feature files. The tool requires as input labels for the training chromosomes, a chromatin accessibility feature file and a set of motif feature files. From the labels, an initial set of training regions is extracted containing all positive examples labeled as 'S' (summit) and a sub-sample of negative examples of regions labeled as 'U' (unbound). During the iterations, the initial negative examples are complemented with additional negatives obtaining large binding probabilities, i.e., putative false positive predictions. As these additional negative examples are derived from predictions of the current set of classifiers, the number of bins used for aggregation needs to be specified and should be identical to those used for predictions later. Training chromosomes and chromosomes used for predictions in the iterative training may be specified, as well as the percentile of the scores of positive examples (i.e., summit or bound regions) that should be used to identify putative false positives. The specified bin width must be identical to the bin width specified when computing the corresponding feature files. Feature vectors for training regions may span several adjacent bins as specified by the number of bins parameter. Output is an XML file Classifiers.xml containing the set of trained classifiers. This output file together with a protocol of the tool run is saved to the specified output directory.<br />
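<br />
The role of the percentile parameter may be illustrated by the following sketch. It only shows the general idea of deriving a score threshold from the positives; the function name <code>percentile_threshold</code> and the percentile convention (the smallest score such that at least the fraction p of positives is not larger) are assumptions and need not match the internals of Catchitt.<br />
<br />
```shell
# Hypothetical helper: read prediction scores of positive examples (one per
# line on stdin) and print the score at percentile p (a fraction in [0, 1]).
percentile_threshold() {
    LC_ALL=C sort -n | awk -v p="$1" '
        { v[NR] = $1 }
        END {
            if (NR == 0) exit 1
            k = int(p * NR)
            if (k < p * NR) k++      # round up to the next rank
            if (k < 1) k = 1
            print v[k]
        }'
}

# Made-up scores of four positive examples
printf '%s\n' 0.9 0.2 0.7 0.6 | percentile_threshold 0.01
```
<br />
Under this convention, negative regions scoring above the printed threshold would be candidates for additional negative examples in the next iteration.<br />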
<br />
''Iterative Training'' may be called with<br />
<br />
java -jar Catchitt.jar itrain<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>Accessibility (File containing accessibility features)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Motif (File containing motif features), MAY BE USED MULTIPLE TIMES</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">l</font></td><br />
<td>Labels (File containing the labels)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>FAI of genome (FastA index file of the genome)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bin width (The width of the genomic bins, valid range = [1, 1000], default = 50)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Number of bins (The number of adjacent bins, valid range = [1, 20], default = 5)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">abb</font></td><br />
<td>Aggregation: bins before (The number of bins before the current one considered in the aggregation, valid range = [1, 20], default = 1)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">aba</font></td><br />
<td>Aggregation: bins after (The number of bins after the current one considered in the aggregation, valid range = [1, 20], default = 4)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>Iterations (The number of iterations of the iterative training, valid range = [1, 20], default = 5)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>Training chromosomes (Training chromosomes, separated by commas, OPTIONAL)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">itc</font></td><br />
<td>Iterative training chromosomes (Chromosomes with predictions in iterative training, separated by commas, OPTIONAL)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Percentile (Percentile of the prediction scores of positives used as threshold in iterative training, valid range = [0.0, 1.0], default = 0.01)</td><br />
<td>DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example'''<br />
<br />
java -jar Catchitt.jar itrain a=dnase/Chromatin_accessibility.tsv.gz m=motif1/Motif_scores.tsv.gz m=motif2/Motif_scores.tsv.gz l=labels/Labels.tsv.gz f=hg19.fa.fai b=50 n=5 abb=1 aba=4 i=5 t="chr1,chr2,chr3" itc="chr1,chr2" p=0.01 outdir=cls<br />
<br />
=== Prediction ===<br />
<br />
''Prediction'' predicts binding probabilities of genomic regions as specified during training of the set of classifiers in iterative training. As input, Prediction requires a set of trained classifiers in XML format and the same (type of) feature files as used in training (motif files must be specified in the same order!). In addition, the chromosomes for which predictions are made may be specified, and the number of bins used for aggregation may be specified to deviate from those used during training. If these bin numbers are not specified, those from the training run are used. Finally, it is possible to restrict the number of classifiers considered to the first n ones. Output is provided as a gzipped file 'Predictions.tsv.gz' with columns chromosome, start position, binding probability. This output file together with a protocol of the tool run is saved to the specified output directory.<br />
<br />
''Prediction'' may be called with<br />
<br />
java -jar Catchitt.jar predict<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>Classifiers (The classifiers trained by iterative training)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>Accessibility (File containing accessibility features)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Motif (File containing motif features) MAY BE USED MULTIPLE TIMES</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>FAI of genome (FastA index file of the genome)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Prediction chromosomes (Prediction chromosomes, separated by commas, OPTIONAL)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">abb</font></td><br />
<td>Aggregation: bins before (Number of bins before the current one considered for aggregation., OPTIONAL)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">aba</font></td><br />
<td>Aggregation: bins after (Number of bins after the current one considered for aggregation., OPTIONAL)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Number of classifiers (Use only the first k classifiers for predictions., OPTIONAL)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example'''<br />
<br />
java -jar Catchitt.jar predict c=cls/Classifiers.xml a=dnase/Chromatin_accessibility.tsv.gz m=motif1/Motif_scores.tsv.gz m=motif2/Motif_scores.tsv.gz f=hg19.fa.fai p="chr8,chr21" abb=1 aba=4 n=3 outdir=predict<br />
<br />
== Standard pipeline ==<br />
<br />
The standard Catchitt pipeline would comprise the following steps<br />
<br />
* for a training cell type, collect ChIP-seq peak files (preferably ''conservative'' and ''relaxed'' peaks) in narrowPeak format and derive labels for genomic regions (''Derive labels'')<br />
* for the same cell type, collect chromatin accessibility data (DNase-seq or ATAC-seq) as fold-enrichment tracks or mapping files, and derive chromatin accessibility features from those data (''Chromatin accessibility'')<br />
* collect or learn (e.g., using [[Dimont]]) a set of motif models for the transcription factor of interest, and scan the genome using these motif models (''Motif scores'')<br />
* perform iterative training given the labels and feature files (''Iterative Training'')<br />
* predict binding probabilities of genomic regions in the same cell type or in other cell types. In the latter case, additional chromatin accessibility data for these target cell types need to be collected and features need to be derived as in step 2. (''Prediction'')<br />
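<br />
The five steps may be sketched as a dry-run script that only prints the commands instead of executing them. All file and directory names are hypothetical placeholders, and only parameters documented on this page are used.<br />
<br />
```shell
# Dry-run helper: print each command instead of executing it.
run() { printf '%s\n' "$*"; }

pipeline() {
    fai=genome.fa.fai    # hypothetical file names throughout
    run java -jar Catchitt.jar labels c=conservative.narrowPeak r=relaxed.narrowPeak f="$fai" b=50 rw=200 outdir=labels
    run java -jar Catchitt.jar access d=Bigwig i=train.bigWig f="$fai" b=50 outdir=access
    run java -jar Catchitt.jar motif m=HOCOMOCO h=motif.pwm g=genome.fa f="$fai" b=50 outdir=motif
    run java -jar Catchitt.jar itrain a=access/Chromatin_accessibility.tsv.gz m=motif/Motif_scores.tsv.gz l=labels/Labels.tsv.gz f="$fai" b=50 outdir=cls
    run java -jar Catchitt.jar predict c=cls/Classifiers.xml a=access/Chromatin_accessibility.tsv.gz m=motif/Motif_scores.tsv.gz f="$fai" outdir=predict
}

pipeline
```
<br />
Replacing the body of <code>run</code> by <code>"$@"</code> would execute the commands; note that the same bin width is passed to all feature-generating steps and to the training.<br />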
<br />
<br />
== Tutorial using ENCODE data ==<br />
<br />
We describe a typical Catchitt pipeline using public ENCODE data for the transcription factor CTCF in two cell lines.<br />
This tutorial uses real-world data on the whole ENCODE GRCh38 human genome version, illustrating different DNase-seq input formats and different motif sources. Please note that this realistic scenario also comes at the expense of real-world runtimes of the individual Catchitt steps.<br />
<br />
For best performance, we would further recommend<br />
* to use multiple motifs from different sources, including motifs derived from DNase-seq (available in our [http://www.jstacs.de/downloads/motifs.tgz motif collection] of the ENCODE-DREAM challenge in directory de-novo/DNase-peaks)<br />
* to use replicate information for DNase data, for instance using the [https://github.com/kundajelab/atac_dnase_pipelines pipeline of the Kundaje lab]<br />
<br />
In this tutorial, we concentrate on the Catchitt pipeline and illustrate its usage based on readily available data.<br />
<br />
=== Obtaining training and test data ===<br />
<br />
First, we need the GRCh38 genome version used by ENCODE. This genome is available as a gzipped FastA file from [https://www.encodeproject.org ENCODE] at<br />
https://www.encodeproject.org/files/GRCh38_no_alt_analysis_set_GCA_000001405.15/@@download/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.gz<br />
<br />
After download, the genome needs to be gunzipped and indexed using the [http://www.htslib.org samtools] faidx command:<br />
<br />
gunzip GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.gz<br />
samtools faidx GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta<br />
<br />
In the following, we assume that genome FastA and index are in the base directory.<br />
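<br />
The index written by samtools faidx is a tab-separated text file whose first two columns are the sequence name and its length. As a quick plausibility check, the number of bins per chromosome for a given bin width may be estimated from it. The function name <code>bins_per_chromosome</code> is hypothetical, and counting a partial bin at the chromosome end as a full bin is an assumption about the binning.<br />
<br />
```shell
# Hypothetical helper: $1 is the bin width, stdin is the content of a .fai file.
# Prints sequence name and number of bins (rounded up).
bins_per_chromosome() {
    awk -F '\t' -v w="$1" '{ printf "%s\t%d\n", $1, int(($2 + w - 1) / w) }'
}

# Two made-up index lines (name, length, offset, line bases, line width)
printf 'chrA\t1000\t6\t60\t61\nchrB\t1025\t1030\t60\t61\n' | bins_per_chromosome 50
```
<br />
For the real index, the input would be the file GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai redirected to the function.<br />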
<br />
In addition, we need the DNase-seq data. We consider two cell lines ("astrocyte of the spinal cord" and "fibroblast of villous mesenchyme"). The corresponding DNase-seq data are available from [https://www.encodeproject.org ENCODE] under accessions ENCSR000ENB and ENCSR000EOR, respectively.<br />
Here, we first consider the Bigwig files of the first replicate for each cell line, which can be downloaded from the following URLs:<br />
<br />
https://www.encodeproject.org/files/ENCFF901UBX/@@download/ENCFF901UBX.bigWig<br />
https://www.encodeproject.org/files/ENCFF652HJH/@@download/ENCFF652HJH.bigWig<br />
<br />
For obtaining labels for CTCF binding, we further need ChIP-seq peaks. Here, we consider the ChIP-seq experiment with accession ENCSR000DSU for the astrocytes, which will become our training data in the following.<br />
The corresponding "conservative" and "relaxed" peak files for astrocytes are available from<br />
https://www.encodeproject.org/files/ENCFF183YLB/@@download/ENCFF183YLB.bed.gz<br />
https://www.encodeproject.org/files/ENCFF600CYD/@@download/ENCFF600CYD.bed.gz<br />
<br />
Again, the peak files need to be gunzipped for the following steps.<br />
<br />
Finally, we need a motif model for CTCF, which we download from [http://hocomoco11.autosome.ru HOCOMOCO] in this case<br />
http://hocomoco11.autosome.ru/final_bundle/hocomoco11/full/HUMAN/mono/pwm/CTCF_HUMAN.H11MO.0.A.pwm<br />
<br />
We organize all these files (and the Catchitt JAR) in the following directory structure<br />
<br />
.:<br />
Catchitt.jar<br />
GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta<br />
GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai<br />
<br />
./astrocytes:<br />
ENCFF183YLB.bed<br />
ENCFF600CYD.bed<br />
ENCFF901UBX.bigWig<br />
<br />
./fibroblasts:<br />
ENCFF652HJH.bigWig<br />
<br />
./motifs/CTCF/:<br />
CTCF_HUMAN.H11MO.0.A.pwm<br />
<br />
=== Deriving labels ===<br />
<br />
As we use supervised training of model parameters, we need labels for the genomic regions, qualifying these as bound (B) or unbound (U). Besides, we have additional labels for bound regions at the peak summit (S) and ambiguous regions (A) that are (partly) covered by relaxed but not by conservative peaks.<br />
<br />
For training purposes, we need to derive labels from the astrocyte ChIP-seq peaks by calling<br />
java -jar Catchitt.jar labels c=astrocytes/ENCFF183YLB.bed\<br />
r=astrocytes/ENCFF600CYD.bed\<br />
f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\<br />
b=50 rw=200 outdir=astrocytes/labels<br />
Here, we use a bin width of 50 bp (i.e., we resolve any feature or binding event with 50 bp resolution) and a region width of 200 bp as used in ENCODE-DREAM. A detailed description of the partitioning of the genome into non-overlapping bins and the logic behind the regions for which predictions are made may be found in the [https://doi.org/10.1186/s13059-018-1614-y Catchitt paper].<br />
The result is a file astrocytes/labels/Labels.tsv.gz with the following format<br />
chr1 0 U<br />
chr1 50 U<br />
chr1 100 U<br />
chr1 150 U<br />
chr1 200 U<br />
chr1 250 U<br />
where the columns contain chromosome, bin starting position, and corresponding label, and are separated by tabs.<br />
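<br />
To get an impression of the class balance, the labels may be tabulated with standard tools. The following sketch assumes the three-column, tab-separated format shown above; the function name <code>count_labels</code> is hypothetical.<br />
<br />
```shell
# Hypothetical helper: count how often each label occurs in the third column
# of tab-separated label lines read from stdin.
count_labels() {
    awk -F '\t' '{ n[$3]++ } END { for (l in n) printf "%s\t%d\n", l, n[l] }' | LC_ALL=C sort
}

# Made-up excerpt in the format of Labels.tsv.gz
printf 'chr1\t0\tU\nchr1\t50\tU\nchr1\t100\tS\nchr1\t150\tB\n' | count_labels
```
<br />
For the actual file, the input would be the output of <code>gzip -dc astrocytes/labels/Labels.tsv.gz</code>.<br />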
<br />
=== Preparing DNase data from bigwig format ===<br />
<br />
We further derive DNase-seq features from the bigwig file that we downloaded in the first step. Again, we specify a bin width of 50 bp.<br />
<br />
java -jar Catchitt.jar access d="Bigwig" i=astrocytes/ENCFF901UBX.bigWig f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\<br />
b=50 outdir=astrocytes/access<br />
The result is a file astrocytes/access/Chromatin_accessibility.tsv.gz with the following format<br />
<br />
chr1 1033400 0.03954650089144707 0.05627769976854324 0.009126120246946812 0.030420400202274323 0.06692489981651306 1.03125 3.0 1.0 0.0<br />
chr1 1033450 0.030420400202274323 0.03650449961423874 0.009126120246946812 0.030420400202274323 0.045630600303411484 1.03125 2.0 0.0 0.0<br />
chr1 1033500 0.024336300790309906 0.03346240147948265 0.009126120246946812 0.030420400202274323 0.045630600303411484 1.03125 2.0 1.0 0.0<br />
chr1 1033550 0.01825219951570034 0.024336300790309906 0.009126120246946812 0.024336300790309906 0.060840800404548645 1.03125 2.0 0.0 1.0<br />
<br />
where the first two columns, again, correspond to chromosome and starting position, and the remaining columns are<br />
* minimum DNase value in bin,<br />
* median DNase value in bin,<br />
* minimum in 1000 bp after bin start,<br />
* minimum in 1000 bp before bin start,<br />
* maximum in 1000 bp after bin start,<br />
* maximum in 1000 bp before bin start,<br />
* the number of steps in the bin profile,<br />
* the length of the longest monotonically increasing range in the bin,<br />
* the length of the longest monotonically decreasing range in the bin.<br />
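<br />
Given this layout of two coordinate columns plus nine feature columns, a simple consistency check of an accessibility file could look as follows. This is only a sketch with the hypothetical function name <code>check_access_columns</code>; it assumes tab-separated input on stdin.<br />
<br />
```shell
# Hypothetical helper: succeed if every line read from stdin has exactly
# 11 tab-separated columns; otherwise report the first offending line.
check_access_columns() {
    awk -F '\t' '
        NF != 11 { printf "line %d has %d columns\n", NR, NF; bad = 1; exit }
        END { exit bad }'
}

# One made-up line with chromosome, start, and nine feature values
printf 'chr1\t100\t0.1\t0.2\t0.0\t0.1\t0.3\t1.0\t3.0\t1.0\t0.0\n' | check_access_columns && echo "ok"
```
<br />
For the actual file, the input would be the output of <code>gzip -dc astrocytes/access/Chromatin_accessibility.tsv.gz</code>.<br />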
<br />
=== Preparing motif scores ===<br />
<br />
We also compute motif scores along the genome for the PWM we downloaded from HOCOMOCO:<br />
<br />
java -jar Catchitt.jar motif m="HOCOMOCO" h=motifs/CTCF/CTCF_HUMAN.H11MO.0.A.pwm g=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta\<br />
f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai b=50 outdir=motifs/CTCF threads=3<br />
The result is a file motifs/CTCF/Motif_scores.tsv.gz with the following format<br />
<br />
chr1 46950 -4.996643 -4.9543528358429105<br />
chr1 47000 -5.984124 -5.451674735652041<br />
chr1 47050 -0.8633305 -0.4596223585537509<br />
chr1 47100 -4.9379983 -4.813470561120627<br />
<br />
where the first two columns, again, correspond to chromosome and starting position, and the remaining two columns are<br />
* the maximum motif score within the bin,<br />
* the logarithm of the average of the exponentials of the individual scores within the bin; for scores that are log-likelihoods, this is proportional to the log-likelihood of the complete sequence.<br />
<br />
=== Iterative training ===<br />
<br />
With all the feature files prepared, we may now run the iterative training procedure. Here, we use the main autosomes except chr1, chr8, and chr21 for training, use five of those chromosomes also for generating new negative examples in each of the iterations, and use 8 computation threads for the numeric optimization of model parameters.<br />
''At this stage, it is critical that all feature files have been generated from the same reference. This way, we may sweep in parallel over all feature files that, at each line, represent the identical genomic location. Otherwise, the iterative training will throw an error stating that the chromosomes do not match at a certain line of the input files.''<br />
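<br />
Whether two gzipped feature files describe the same genomic locations line by line may be checked before training. The following is a minimal sketch with the hypothetical function name <code>check_aligned</code>; it assumes that chromosome and start position are the first two tab-separated columns, as in the formats shown above.<br />
<br />
```shell
# Hypothetical helper: compare the first two columns (chromosome, start) of
# two gzipped, tab-separated feature files; succeed only if they are identical.
check_aligned() {
    tmp=$(mktemp -d) || return 2
    gzip -dc "$1" | cut -f1,2 > "$tmp/a"
    gzip -dc "$2" | cut -f1,2 > "$tmp/b"
    # cmp reports the first differing byte and line, if any
    if cmp "$tmp/a" "$tmp/b"; then rc=0; else rc=1; fi
    rm -rf "$tmp"
    return $rc
}
```
<br />
A call such as <code>check_aligned astrocytes/access/Chromatin_accessibility.tsv.gz motifs/CTCF/Motif_scores.tsv.gz</code> would then return a non-zero exit status if the coordinates differ.<br />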
<br />
We start iterative training by calling<br />
java -jar Catchitt.jar itrain a=astrocytes/access/Chromatin_accessibility.tsv.gz m=motifs/CTCF/Motif_scores.tsv.gz\<br />
l=astrocytes/labels/Labels.tsv.gz f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\<br />
b=50 t='chr2,chr3,chr4,chr5,chr6,chr7,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr22'\<br />
itc='chr10,chr11,chr12,chr13,chr14' outdir=astrocytes/itrain threads=8<br />
which results in a file astrocytes/itrain/Classifiers.xml containing the trained classifiers.<br />
<br />
=== Predicting binding in new cell types ===<br />
Using the trained classifier from the previous step and the DNase data for fibroblasts prepared before, we may now predict binding in the fibroblast cell type. In the example, we generate predictions only for chromosome 8, which could be extended to other chromosomes using parameter "p":<br />
java -jar Catchitt.jar predict c=astrocytes/itrain/Classifiers.xml a=fibroblasts/access/Chromatin_accessibility.tsv.gz\<br />
m=motifs/CTCF/Motif_scores.tsv.gz f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\<br />
p="chr8" outdir=fibroblasts/predict<br />
This finally results in a file fibroblasts/predict/Predictions.tsv.gz containing the predicted binding probabilities per region.<br />
This file has three columns, corresponding to chromosome, starting position, and binding probability:<br />
<br />
chr8 265850 0.9866555574053496<br />
chr8 265900 0.9865107771922306<br />
chr8 265950 0.9864837006927715<br />
chr8 266000 0.8041139249973046<br />
chr8 266050 0.19870629729482686<br />
chr8 266100 0.1302269536110939<br />
chr8 266150 0.09693322015563202<br />
<br />
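Bins with high predicted binding probability may be extracted from this format with standard tools. The following sketch uses the hypothetical function name <code>filter_predictions</code>; the threshold of 0.5 is an arbitrary choice for illustration and not a recommendation.<br />
<br />
```shell
# Hypothetical helper: keep only lines whose third column (binding
# probability) is at least the threshold given as $1.
filter_predictions() {
    awk -F '\t' -v t="$1" '$3 + 0 >= t + 0'
}

# Excerpt in the three-column, tab-separated format of Predictions.tsv.gz
printf 'chr8\t265950\t0.9864837006927715\nchr8\t266000\t0.8041139249973046\nchr8\t266050\t0.19870629729482686\n' | filter_predictions 0.5
```
<br />
For the actual file, the input would be the output of <code>gzip -dc fibroblasts/predict/Predictions.tsv.gz</code>.<br />
<br />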
<br />
=== Using DNase-seq BAM files and multiple motifs ===<br />
<br />
Instead of bigwig files, the "access" tool of Catchitt also accepts BAM files of mapped DNase-seq (or ATAC-seq) data. Internally, this tool counts 5' ends of reads, and performs local normalization of read depth and average smoothing.<br />
Here, we download the BAM files corresponding to the previous bigwig files from ENCODE<br />
https://www.encodeproject.org/files/ENCFF384CCQ/@@download/ENCFF384CCQ.bam<br />
https://www.encodeproject.org/files/ENCFF368XNE/@@download/ENCFF368XNE.bam<br />
<br />
and sort them into the directory structure.<br />
<br />
In addition, we use four motifs from the ''used-for-all-TFs'' directory of our [http://www.jstacs.de/downloads/motifs.tgz motif collection].<br />
<br />
Afterwards, the directory structure should look like<br />
<br />
.:<br />
Catchitt.jar<br />
GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta<br />
GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai<br />
<br />
./astrocytes:<br />
ENCFF183YLB.bed<br />
ENCFF600CYD.bed<br />
ENCFF901UBX.bigWig<br />
ENCFF384CCQ.bam<br />
<br />
./fibroblasts:<br />
ENCFF652HJH.bigWig<br />
ENCFF368XNE.bam<br />
<br />
./motifs/CTCF/:<br />
CTCF_HUMAN.H11MO.0.A.pwm<br />
<br />
./motifs/CTCF_Slim:<br />
Ctcf_H1hesc_shift20_bdeu_order-20_comp1-model-1.xml<br />
<br />
./motifs/JUND_Slim:<br />
Jund_K562_shift20_bdeu_order-20_comp1-model-1.xml<br />
<br />
./motifs/MAX_Slim:<br />
Max_K562_shift20_bdeu_order-20_comp1-model-1.xml<br />
<br />
./motifs/SP1:<br />
ENCSR000BHK_SP1-human_1_hg19-model-2.xml<br />
<br />
<br />
Now, we first compute the DNase-seq features from the BAM files using the "access" tool:<br />
<br />
java -jar Catchitt.jar access i=astrocytes/ENCFF384CCQ.bam b=50 outdir=astrocytes/access_bam/<br />
java -jar Catchitt.jar access i=fibroblasts/ENCFF368XNE.bam b=50 outdir=fibroblasts/access_bam/<br />
<br />
We also compute the motif-based features from the additional motif files. For the PWM model of SP1, we switch the input format to Dimont XMLs but still use the low-memory version of "motif" that we also used for the HOCOMOCO PWM:<br />
<br />
java -jar Catchitt.jar motif d=motifs/SP1/ENCSR000BHK_SP1-human_1_hg19-model-2.xml\<br />
g=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\<br />
b=50 outdir=motifs/SP1 threads=3<br />
<br />
The remaining motif models are [[Slim]] models, which are substantially more complex than PWMs. While scans for these models could be accomplished by the low-memory version of "motif" as well, this would require substantial runtime. Hence, we switch off the low-memory option in this case, which, in turn, requires increasing the memory reserved by Java:<br />
<br />
java -jar -Xms512M -Xmx64G Catchitt.jar motif d=motifs/CTCF_Slim/Ctcf_H1hesc_shift20_bdeu_order-20_comp1-model-1.xml\<br />
g=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\<br />
b=50 outdir=motifs/CTCF_Slim l=false threads=3<br />
java -jar -Xms512M -Xmx64G Catchitt.jar motif d=motifs/JUND_Slim/Jund_K562_shift20_bdeu_order-20_comp1-model-1.xml\<br />
g=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\<br />
b=50 outdir=motifs/JUND_Slim l=false threads=3<br />
java -jar -Xms512M -Xmx64G Catchitt.jar motif d=motifs/MAX_Slim/Max_K562_shift20_bdeu_order-20_comp1-model-1.xml\<br />
g=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\<br />
b=50 outdir=motifs/MAX_Slim l=false threads=3<br />
<br />
Finally, we start the iterative training using the new feature files:<br />
java -jar Catchitt.jar itrain a=astrocytes/access_bam/Chromatin_accessibility.tsv.gz\<br />
m=motifs/CTCF/Motif_scores.tsv.gz m=motifs/CTCF_Slim/Motif_scores.tsv.gz m=motifs/JUND_Slim/Motif_scores.tsv.gz\<br />
m=motifs/MAX_Slim/Motif_scores.tsv.gz m=motifs/SP1/Motif_scores.tsv.gz l=astrocytes/labels/Labels.tsv.gz\<br />
f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai b=50\<br />
t='chr2,chr3,chr4,chr5,chr6,chr7,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr22'\<br />
itc='chr10,chr11,chr12,chr13,chr14' outdir=astrocytes/itrain_bam_5motifs threads=8<br />
Please note that we used the parameter "m" multiple times to specify the different motif-based feature files.<br />
<br />
It is important to specify these motifs in the same order when calling "predict" afterwards, i.e.<br />
java -jar Catchitt.jar predict c=astrocytes/itrain_bam_5motifs/Classifiers.xml a=fibroblasts/access_bam/Chromatin_accessibility.tsv.gz\<br />
m=motifs/CTCF/Motif_scores.tsv.gz m=motifs/CTCF_Slim/Motif_scores.tsv.gz m=motifs/JUND_Slim/Motif_scores.tsv.gz\<br />
m=motifs/MAX_Slim/Motif_scores.tsv.gz m=motifs/SP1/Motif_scores.tsv.gz\<br />
f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai p="chr8" outdir=fibroblasts/predict_bam_5motifs<br />
<br />
The predictions based on the BAM files and the five motifs are then available from the file fibroblasts/predict_bam_5motifs/Predictions.tsv.gz in the format explained previously.<br />
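<br />
One way to keep the motif order identical between "itrain" and "predict" is to generate the "m" parameters from a single list. The following sketch uses the directory names of this tutorial; the variable <code>MOTIF_DIRS</code> and the function <code>motif_args</code> are hypothetical names.<br />
<br />
```shell
# Single source of truth for the motif order (directories from the tutorial)
MOTIF_DIRS="motifs/CTCF motifs/CTCF_Slim motifs/JUND_Slim motifs/MAX_Slim motifs/SP1"

# Hypothetical helper: print one m=... parameter per motif directory, in order.
motif_args() {
    for d in $MOTIF_DIRS; do
        printf 'm=%s/Motif_scores.tsv.gz ' "$d"
    done
}

motif_args
```
<br />
The output of <code>motif_args</code> could then be inserted into both calls, e.g. via <code>$(motif_args)</code>, so that the order cannot diverge by accident.<br />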
<br />
== Version history ==<br />
* Catchitt v0.1.4: Fixing a bug that led to an Exception if chromosomes did not appear in lexicographical order in the input files<br />
<br />
* [http://www.jstacs.de/downloads/Catchitt-0.1.3.jar Catchitt v0.1.3]: Bugfix to load Catchitt classifiers learned with older Catchitt versions<br />
<br />
* [http://www.jstacs.de/downloads/Catchitt-0.1.2.jar Catchitt v0.1.2]: Bugfixes, new experimental tools for handling methylation levels<br />
<br />
* [http://www.jstacs.de/downloads/Catchitt_0.1.1.jar Catchitt v0.1.1]: Bugfixes for border cases; reduced debugging output<br />
<br />
* Catchitt v0.1: [http://www.jstacs.de/downloads/Catchitt_0.1.jar Initial release]</div>Grauhttps://www.jstacs.de/index.php?title=Catchitt&diff=1157Catchitt2021-10-07T20:49:03Z<p>Grau: /* Downloads */</p>
<hr />
<div>Catchitt is a collection of tools for predicting cell type-specific binding regions of transcription factors (TFs) based on binding motifs and chromatin accessibility assays.<br />
The initial implementation of this methodology has been one of the winning approaches of the ENCODE-DREAM challenge ([https://www.synapse.org/#!Synapse:syn6131484/wiki/402026]) and is described in a preprint ([https://www.biorxiv.org/content/early/2017/12/06/230011 doi: 10.1101/230011]) and a recent [https://doi.org/10.1186/s13059-018-1614-y paper].<br />
The implementation in Catchitt has been streamlined and slightly simplified to make its application more straightforward. Specifically, we reduced the set of chromatin accessibility features to the most important ones, we simplified the sampling strategy of initial negative examples in the training step, and we omitted quantile normalization of chromatin accessibility features.<br />
<br />
== Catchitt tools ==<br />
<br />
Catchitt comprises five tools for the individual steps of the pipeline (see below). The tool "labels" computes labels for genomic regions from "conservative" (i.e., IDR-thresholded) and "relaxed" ChIP-seq peaks.<br />
The tool "access" computes chromatin accessibility features from DNase-seq or ATAC-seq data, either based on fold-enrichment tracks in Bigwig format (e.g., MACS output) or based on SAM/BAM files of mapped reads.<br />
The tool "motif" computes motif-based features from genomic sequence and PWMs in Jaspar or HOCOMOCO format, or motif models from [[Dimont]], including [[Slim]] models.<br />
The tool "itrain" performs iterative training of a series of classifiers based on labels, chromatin accessibility features, and motif features.<br />
The tool "predict" predicts binding probabilities of genomic regions based on trained classifiers and feature files. The feature files may either be measured on the training cell type (e.g., other chromosomes, "within cell type" case) or on a different cell type.<br />
<br />
== Downloads ==<br />
<br />
We provide Catchitt as a pre-compiled JAR file and also publish its source code under GPL 3. For compiling Catchitt from source files, Jstacs (v. 2.3 or later) and the corresponding external libraries are required.<br />
<br />
''Catchitt is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.''<br />
<br />
* [http://www.jstacs.de/downloads/Catchitt-0.1.4.jar JAR download]<br />
* The source code of Catchitt is available from [https://github.com/Jstacs/Jstacs GitHub] in package projects.encodedream.<br />
* [http://www.jstacs.de/downloads/motifs.tgz motifs] used in the ENCODE-DREAM challenge<br />
<br />
== Citation ==<br />
<br />
If you use Catchitt in your research, please cite<br />
<br />
J. Keilwagen, S. Posch, and J. Grau. [https://doi.org/10.1186/s13059-018-1614-y Accurate prediction of cell type-specific transcription factor binding]. ''Genome Biology'', 20(1):9, 2019.<br />
<br />
== Usage ==<br />
<br />
In the following, <code>Catchitt.jar</code> stands for the Catchitt binary in its current version, which is 0.1.4 at the time of writing. Hence, every occurrence of <code>Catchitt.jar</code> needs to be replaced by <code>Catchitt-0.1.4.jar</code> when running the code examples with the current Catchitt binary.<br />
<br />
<br />
Catchitt can be started by calling<br />
<br />
java -jar Catchitt.jar<br />
<br />
on the command line. This lists the names of the available tools with a short description:<br />
<br />
Available tools:<br />
<br />
access - Chromatin accessibility<br />
methyl - Methylation levels<br />
motif - Motif scores<br />
labels - Derive labels<br />
itrain - Iterative Training<br />
predict - Prediction<br />
<br />
Syntax: java -jar Catchitt.jar <toolname> [<parameter=value> ...]<br />
<br />
Further info about the tools is given with<br />
java -jar Catchitt.jar <toolname> info<br />
<br />
Tool parameters are listed with<br />
java -jar Catchitt.jar <toolname><br />
<br />
== Tools ==<br />
<br />
=== Derive labels ===<br />
<br />
''Derive labels'' computes labels for genomic regions based on ChIP-seq peak files. The input ChIP-seq peak files must be provided in narrowPeak format and may come in 'conservative', i.e., IDR-thresholded, and 'relaxed' flavors. If only a single peak file is available, both of the corresponding parameters may be set to this one peak file. The parameter for the bin width defines the resolution at which genomic regions are assigned a label, while the parameter for the region width defines the size of the regions considered. If, for instance, the bin width is set to 50 and the region width to 100, regions of 100 bp shifted by 50 bp along the genome are labeled. The labels assigned may be 'S' (summit) if the current bin contains the annotated summit of a conservative peak, 'B' (bound) if the current region overlaps a conservative peak by at least half the region width, 'A' (ambiguous) if the current region overlaps a relaxed peak by at least 1 bp, or 'U' (unbound) if it overlaps with none of the peaks. The output is provided as a gzipped file 'Labels.tsv.gz' with columns chromosome, start position, and label. This output file together with a protocol of the tool run is saved to the specified output directory.<br />
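<br />
The labeling rules above can be sketched in a few lines of Python. This is a minimal illustration and not the Catchitt implementation: the function name <code>label_bin</code>, the peak tuples, the placement of the region (assumed here to be centered on the bin), and the precedence S, B, A, U are assumptions made for this example.<br />

```python
from typing import List, Tuple

# A peak is (start, end, summit) with a half-open interval [start, end)
# and an absolute summit coordinate. In narrowPeak files the summit is
# stored as an offset from the peak start, i.e. summit = start + offset.
Peak = Tuple[int, int, int]


def overlap(a_start: int, a_end: int, b_start: int, b_end: int) -> int:
    """Number of base pairs shared by two half-open intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def label_bin(bin_start: int, bin_width: int, region_width: int,
              conservative: List[Peak], relaxed: List[Peak]) -> str:
    """Return 'S', 'B', 'A' or 'U' for the bin starting at bin_start."""
    bin_end = bin_start + bin_width
    # Assumption of this sketch: the region is centered on the bin.
    region_start = bin_start + bin_width // 2 - region_width // 2
    region_end = region_start + region_width

    # S: the bin itself contains the summit of a conservative peak.
    if any(bin_start <= summit < bin_end for _, _, summit in conservative):
        return "S"
    # B: the region overlaps a conservative peak by at least half its width.
    if any(2 * overlap(region_start, region_end, s, e) >= region_width
           for s, e, _ in conservative):
        return "B"
    # A: the region overlaps a relaxed peak by at least 1 bp.
    if any(overlap(region_start, region_end, s, e) >= 1
           for s, e, _ in relaxed):
        return "A"
    # U: no overlap with any peak.
    return "U"
```

Calling <code>label_bin</code> for every multiple of the bin width along a chromosome would yield one label per bin, analogous to the rows of 'Labels.tsv.gz'.<br />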
<br />
''Derive labels'' may be called with<br />
<br />
java -jar Catchitt.jar labels<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>Conservative peaks (NarrowPeak file containing the conservative peaks)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>Relaxed peaks (NarrowPeak file containing the relaxed peaks)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>FAI of genome (FastA index file of the genome)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bin width (The width of the genomic bins considered, valid range = [1, 10000], default = 50)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rw</font></td><br />
<td>Region width (The width of the genomic regions considered for overlaps, valid range = [1, 10000], default = 50)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar Catchitt.jar labels c=conservative.narrowPeak r=relaxed.narrowPeak f=hg19.fa.fai b=50 rw=200 outdir=labels<br />
<br />
<br />
=== Chromatin accessibility ===<br />
<br />
''Chromatin accessibility'' computes several chromatin accessibility features from DNase-seq or ATAC-seq data provided as fold-enrichment tracks or SAM/BAM files of mapped reads. Features are computed with a resolution defined by the bin width parameter. If this parameter is set to 50, for instance, features are computed for non-overlapping 50 bp bins along the genome. If input data are provided as a SAM/BAM file, coverage information is extracted and normalized locally in a similar fashion as proposed for the MACS peak caller. Output is provided as a gzipped file 'Chromatin_accessibility.tsv.gz' with columns chromosome, start position of the bin, minimum coverage and median coverage in the current bin, minimum coverage in 1000 bp regions before and after the current bin, maximum coverage in 1000 bp regions before and after the current bin, the number of steps in the coverage profile, and the number of monotonically increasing and decreasing steps in the coverage profile of the current bin. This output file together with a protocol of the tool run is saved to the specified output directory.<br />
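<br />
To make the per-bin features more concrete, the following Python sketch derives a subset of them from a per-base coverage profile. It is only an illustration under assumptions: the function <code>bin_features</code>, the dictionary keys, the handling of chromosome ends, and the definition of a "step" as a change between neighbouring positions are not taken from the Catchitt source code.<br />

```python
from statistics import median
from typing import Dict, List


def bin_features(coverage: List[float], bin_start: int,
                 bin_width: int = 50, flank: int = 1000) -> Dict[str, float]:
    """Summarize the coverage within one bin and its flanking regions."""
    bin_end = bin_start + bin_width
    inside = coverage[bin_start:bin_end]
    # Flanks are truncated at the sequence ends; if a flank is empty,
    # fall back to the bin itself (an assumption of this sketch).
    before = coverage[max(0, bin_start - flank):bin_start] or inside
    after = coverage[bin_end:bin_end + flank] or inside
    # Count a "step" wherever the coverage changes between neighbours.
    steps = sum(1 for a, b in zip(inside, inside[1:]) if a != b)
    return {
        "min_in_bin": min(inside),
        "median_in_bin": median(inside),
        "min_before": min(before),
        "min_after": min(after),
        "max_before": max(before),
        "max_after": max(after),
        "n_steps": float(steps),
    }
```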
<br />
''Chromatin accessibility'' may be called with<br />
<br />
java -jar Catchitt.jar access<br />
<br />
and has the following parameters<br />
<br />
<br />
<table border=0 cellpadding=10 align="center"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>Data source (The format of the input file containing the coverage information, range={BAM/SAM, Bigwig}, default = BAM/SAM)<table border=0 cellpadding=10 align="center"><br />
<tr><td colspan=3>Parameters for selection &quot;BAM/SAM&quot;:</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>Input SAM/BAM (The input file containing the mapped DNase-seq/ATAC-seq reads)</td><br />
<td>FILE</td><br />
</tr><br />
<tr><td colspan=3>Parameters for selection &quot;Bigwig&quot;:</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>Input Bigwig (The input file containing the mapped DNase-seq/ATAC-seq reads)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>FastA index (The genome index)</td><br />
<td>FILE</td><br />
</tr><br />
</table></td><td></td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bin width (The width of the genomic bins considered)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar Catchitt.jar access d="Bigwig" i=fold_enrich.bw f=hg19.fa.fai b=50 outdir=dnase<br />
<br />
<br />
=== Methylation levels ===<br />
''Methylation levels'' may be called with<br />
<br />
java -jar Catchitt.jar methyl<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>Input Bed.gz (The bedMethyl file (gzipped) containing the methylation levels, mime = bed.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>FastA index (The genome index, mime = fai)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bin width (The width of the genomic bins considered)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar Catchitt.jar methyl i=Input_Bed.gz f=hg19.fa.fai b=50<br />
<br />
<br />
=== Motif scores ===<br />
<br />
''Motif scores'' computes features based on motif scores of a given motif model scanning sub-sequences along the genome. Motif scores are aggregated in bins of the specified width as maximum score and log of the average exponential score (i.e., average log-likelihood in case of statistical models). The motif model may be provided as PWMs in HOCOMOCO or PFMs in Jaspar format, or as [[Dimont]] motif models in XML format. For more complex motif models like Slim models, the current implementation uses several indexes to speed up the scanning process. However, computation of these indexes is rather memory-consuming and often not reasonable for simple PWM models. Hence, a low-memory variant of the tool is available, which is typically only slightly slower for PWM models but substantially slower for Slim models. Output is provided as a gzipped file 'Motif_scores.tsv.gz' containing columns chromosome, start position, maximum and average score. This output file together with a protocol of the tool run is saved to the specified output directory.<br />
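<br />
The two aggregates named above, the maximum score and the log of the average exponential score, can be illustrated as follows. This is a sketch under assumptions: the function names are hypothetical, scores are taken to be one value per position, and the exact normalization applied by the tool may differ.<br />

```python
import math
from typing import List, Tuple


def aggregate_bin(scores: List[float]) -> Tuple[float, float]:
    """Return (maximum score, log of the average exponential score)."""
    top = max(scores)
    # Subtract the maximum before exponentiating for numerical stability.
    mean_exp = sum(math.exp(s - top) for s in scores) / len(scores)
    return top, top + math.log(mean_exp)


def aggregate(scores: List[float], bin_width: int) -> List[Tuple[float, float]]:
    """Aggregate per-position scores in non-overlapping bins."""
    return [aggregate_bin(scores[i:i + bin_width])
            for i in range(0, len(scores), bin_width)]
```

By construction, the second value never exceeds the first, and both coincide if all scores within a bin are identical.<br />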
<br />
<br />
''Motif scores'' may be called with<br />
<br />
java -jar Catchitt.jar motif<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Motif model (The motif model in Dimont, HOCOMOCO, or Jaspar format, range={Dimont, HOCOMOCO, Jaspar}, default = Dimont)<table border=0 cellpadding=10 align="center"><br />
<tr><td colspan=3>Parameters for selection &quot;Dimont&quot;:</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>Dimont motif (Dimont motif model description)</td><br />
<td>FILE</td><br />
</tr><br />
<tr><td colspan=3>Parameters for selection &quot;HOCOMOCO&quot;:</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">h</font></td><br />
<td>HOCOMOCO PWM (PWM from the HOCOMOCO database)</td><br />
<td>FILE</td><br />
</tr><br />
<tr><td colspan=3>Parameters for selection &quot;Jaspar&quot;:</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">j</font></td><br />
<td>Jaspar PFM (PFM in Jaspar format)</td><br />
<td>FILE</td><br />
</tr><br />
</table></td><td></td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>Genome (Genome as FastA file)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>FAI of genome (FastA index file of the genome)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bin width (The width of the genomic bins considered)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">l</font></td><br />
<td>Low-memory mode (Use slower mode with a smaller memory footprint, default = true)</td><br />
<td>BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example'''<br />
<br />
java -jar Catchitt.jar motif m=HOCOMOCO h=motif.pwm g=hg19.fa f=hg19.fa.fai b=50 outdir=motifs<br />
<br />
=== Iterative Training ===<br />
<br />
''Iterative Training'' performs an iterative training with the specified number of iterations to obtain a series of classifiers that may be used for predictions in the same cell type or in other cell types based on a corresponding set of feature files. The tool requires as input labels for the training chromosomes, a chromatin accessibility feature file and a set of motif feature files. From the labels, an initial set of training regions is extracted containing all positive examples labeled as 'S' (summit) and a sub-sample of negative examples of regions labeled as 'U' (unbound). During the iterations, the initial negative examples are complemented with additional negatives that obtain large binding probabilities, i.e., putative false positive predictions. As these additional negative examples are derived from predictions of the current set of classifiers, the number of bins used for aggregation needs to be specified and should be identical to those used for predictions later. Training chromosomes and chromosomes used for predictions in the iterative training may be specified, as well as the percentile of the scores of positive regions (i.e., summit or bound regions) that should be used to identify putative false positives. The specified bin width must be identical to the bin width specified when computing the corresponding feature files. Feature vectors for training regions may span several adjacent bins as specified by the number of bins parameter. Output is an XML file Classifiers.xml containing the set of trained classifiers. This output file together with a protocol of the tool run is saved to the specified output directory.<br />
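<br />
The selection of additional negative examples described above can be sketched as follows. The sketch is hypothetical and does not reproduce the Catchitt code: the function names, the nearest-rank style percentile, and the use of "greater than or equal" for the comparison are assumptions.<br />

```python
from typing import Hashable, Iterable, List, Sequence, Tuple


def percentile_threshold(positive_scores: Sequence[float], p: float) -> float:
    """Score at percentile p (between 0.0 and 1.0) of the positive regions."""
    ordered = sorted(positive_scores)
    # Simple nearest-rank style index; the tool may interpolate differently.
    index = min(len(ordered) - 1, int(p * len(ordered)))
    return ordered[index]


def select_new_negatives(unbound: Iterable[Tuple[Hashable, float]],
                         threshold: float) -> List[Hashable]:
    """Unbound regions that score at least as high as the threshold.

    These are the putative false positives that would be added to the
    negative examples for the next iteration.
    """
    return [region for region, score in unbound if score >= threshold]
```

A small percentile such as the default of 0.01 corresponds to a low threshold, so that more unbound regions qualify as putative false positives than with a large percentile.<br />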
<br />
''Iterative Training'' may be called with<br />
<br />
java -jar Catchitt.jar itrain<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>Accessibility (File containing accessibility features)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Motif (File containing motif features), MAY BE USED MULTIPLE TIMES</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">l</font></td><br />
<td>Labels (File containing the labels)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>FAI of genome (FastA index file of the genome)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bin width (The width of the genomic bins, valid range = [1, 1000], default = 50)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Number of bins (The number of adjacent bins, valid range = [1, 20], default = 5)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">abb</font></td><br />
<td>Aggregation: bins before (The number of bins before the current one considered in the aggregation, valid range = [1, 20], default = 1)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">aba</font></td><br />
<td>Aggregation: bins after (The number of bins after the current one considered in the aggregation, valid range = [1, 20], default = 4)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>Iterations (The number of iterations of the iterative training, valid range = [1, 20], default = 5)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>Training chromosomes (Training chromosomes, separated by commas, OPTIONAL)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">itc</font></td><br />
<td>Iterative training chromosomes (Chromosomes with predictions in iterative training, separated by commas, OPTIONAL)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Percentile (Percentile of the prediction scores of positives used as threshold in iterative training, valid range = [0.0, 1.0], default = 0.01)</td><br />
<td>DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example'''<br />
<br />
java -jar Catchitt.jar itrain a=dnase/Chromatin_accessibility.tsv.gz m=motif1/Motif_scores.tsv.gz m=motif2/Motif_scores.tsv.gz l=labels/Labels.tsv.gz f=hg19.fa.fai b=50 n=5 abb=1 aba=4 i=5 t="chr1,chr2,chr3" itc="chr1,chr2" p=0.01 outdir=cls<br />
<br />
=== Prediction ===<br />
<br />
''Prediction'' predicts binding probabilities of genomic regions as specified during training of the set of classifiers in iterative training. As input, ''Prediction'' requires a set of trained classifiers in XML format and the same (type of) feature files as used in training (motif files must be specified in the same order!). In addition, the chromosomes for which predictions are made may be specified, and the number of bins used for aggregation may be specified to deviate from those used during training. If these bin numbers are not specified, those from the training run are used. Finally, it is possible to restrict the number of classifiers considered to the first n ones. Output is provided as a gzipped file 'Predictions.tsv.gz' with columns chromosome, start position, binding probability. This output file together with a protocol of the tool run is saved to the specified output directory.<br />
<br />
''Prediction'' may be called with<br />
<br />
java -jar Catchitt.jar predict<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>Classifiers (The classifiers trained by iterative training)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>Accessibility (File containing accessibility features)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Motif (File containing motif features) MAY BE USED MULTIPLE TIMES</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>FAI of genome (FastA index file of the genome)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Prediction chromosomes (Prediction chromosomes, separated by commas, OPTIONAL)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">abb</font></td><br />
<td>Aggregation: bins before (Number of bins before the current one considered for aggregation., OPTIONAL)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">aba</font></td><br />
<td>Aggregation: bins after (Number of bins after the current one considered for aggregation., OPTIONAL)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Number of classifiers (Use only the first k classifiers for predictions., OPTIONAL)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example'''<br />
<br />
java -jar Catchitt.jar predict c=cls/Classifiers.xml a=dnase/Chromatin_accessibility.tsv.gz m=motif1/Motif_scores.tsv.gz m=motif2/Motif_scores.tsv.gz f=hg19.fa.fai p="chr8,chr21" abb=1 aba=4 n=3 outdir=predict<br />
<br />
== Standard pipeline ==<br />
<br />
The standard Catchitt pipeline would comprise the following steps<br />
<br />
* for a training cell type, collect ChIP-seq peak files (preferably ''conservative'' and ''relaxed'' peaks) in narrowPeak format and derive labels for genomic regions (''Derive labels'')<br />
* for the same cell type, collect chromatin accessibility data (DNase-seq or ATAC-seq) as fold-enrichment tracks or mapping files, and derive chromatin accessibility features from those data (''Chromatin accessibility'')<br />
* collect or learn (e.g., using [[Dimont]]) a set of motif models for the transcription factor of interest, and scan the genome using these motif models (''Motif scores'')<br />
* perform iterative training given the labels and feature files (''Iterative Training'')<br />
* predict binding probabilities of genomic regions in the same cell type or in other cell types. In the latter case, additional chromatin accessibility data for these target cell types need to be collected and features need to be derived as in step 2. (''Prediction'')<br />
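<br />
The steps above may also be scripted. The following Python sketch assembles the five command lines using only parameters documented in the tool sections; the helper names <code>catchitt</code>, <code>pipeline_commands</code> and <code>run_pipeline</code>, as well as the output directory names, are choices made for this example.<br />

```python
import subprocess
from typing import List

JAR = "Catchitt.jar"  # replace by the versioned JAR file name


def catchitt(tool: str, *params: str) -> List[str]:
    """Build the argument list for one Catchitt tool call."""
    return ["java", "-jar", JAR, tool, *params]


def pipeline_commands(fai: str, genome: str, conservative: str,
                      relaxed: str, bigwig: str, pwm: str) -> List[List[str]]:
    """Commands for labels, access, motif, itrain and predict, in order."""
    access = "dnase/Chromatin_accessibility.tsv.gz"
    motif = "motifs/Motif_scores.tsv.gz"
    return [
        catchitt("labels", f"c={conservative}", f"r={relaxed}", f"f={fai}",
                 "b=50", "rw=200", "outdir=labels"),
        catchitt("access", "d=Bigwig", f"i={bigwig}", f"f={fai}",
                 "b=50", "outdir=dnase"),
        catchitt("motif", "m=HOCOMOCO", f"h={pwm}", f"g={genome}",
                 f"f={fai}", "b=50", "outdir=motifs"),
        catchitt("itrain", f"a={access}", f"m={motif}",
                 "l=labels/Labels.tsv.gz", f"f={fai}", "b=50", "outdir=cls"),
        catchitt("predict", "c=cls/Classifiers.xml", f"a={access}",
                 f"m={motif}", f"f={fai}", "outdir=predict"),
    ]


def run_pipeline(commands: List[List[str]]) -> None:
    """Run the commands one after another, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)
```

For predictions in a different cell type, the accessibility file passed to "predict" would be replaced by the one computed for the target cell type.<br />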
<br />
<br />
== Tutorial using ENCODE data ==<br />
<br />
We describe a typical Catchitt pipeline using public ENCODE data for the transcription factor CTCF in two cell lines.<br />
This tutorial uses real-world data on the whole ENCODE GRCh38 human genome version, illustrating different DNase-seq input formats and different motif sources. Please note that this realistic scenario also comes at the expense of real-world runtimes of the individual Catchitt steps.<br />
<br />
For best performance, we would further recommend<br />
* to use multiple motifs from different sources, including motifs derived from DNase-seq (available in our [http://www.jstacs.de/downloads/motifs.tgz motif collection] of the ENCODE-DREAM challenge in directory de-novo/DNase-peaks)<br />
* to use replicate information for DNase data, for instance using the [https://github.com/kundajelab/atac_dnase_pipelines pipeline of the Kundaje lab]<br />
<br />
In this tutorial, we concentrate on the Catchitt pipeline and illustrate its usage based on readily available data.<br />
<br />
=== Obtaining training and test data ===<br />
<br />
First, we need the GRCh38 genome version used by ENCODE. This genome is available as a gzipped FastA file from [https://www.encodeproject.org ENCODE] at<br />
https://www.encodeproject.org/files/GRCh38_no_alt_analysis_set_GCA_000001405.15/@@download/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.gz<br />
<br />
After download, the genome needs to be gunzipped and indexed using the [http://www.htslib.org samtools] faidx command:<br />
<br />
gunzip GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.gz<br />
samtools faidx GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta<br />
<br />
In the following, we assume that genome FastA and index are in the base directory.<br />
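<br />
The index written by <code>samtools faidx</code> is a tab-separated text file whose first two columns hold the sequence name and the sequence length. As a quick sanity check, the following Python sketch reads these two columns and derives the number of bins per chromosome for a given bin width. The function names are hypothetical, and counting a final partial bin is an assumption that may not match the behaviour of Catchitt.<br />

```python
from typing import Dict, Iterable


def read_fai(lines: Iterable[str]) -> Dict[str, int]:
    """Map sequence names to sequence lengths from the lines of a .fai file."""
    lengths = {}
    for line in lines:
        if not line.strip():
            continue
        name, length = line.rstrip("\n").split("\t")[:2]
        lengths[name] = int(length)
    return lengths


def bins_per_chromosome(lengths: Dict[str, int],
                        bin_width: int = 50) -> Dict[str, int]:
    """Number of bins per chromosome, counting a final partial bin."""
    return {name: -(-length // bin_width) for name, length in lengths.items()}
```

For the tutorial data, <code>read_fai</code> could be applied to the opened file GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai.<br />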
<br />
In addition, we need the DNase-seq data. We consider two cell lines ("astrocyte of the spinal cord" and "fibroblast of villous mesenchyme"). The corresponding DNase-seq data are available from [https://www.encodeproject.org ENCODE] under accessions ENCSR000ENB and ENCSR000EOR, respectively.<br />
Here, we first consider the Bigwig files of the first replicate for each cell line, which can be downloaded from the following URLs:<br />
<br />
https://www.encodeproject.org/files/ENCFF901UBX/@@download/ENCFF901UBX.bigWig<br />
https://www.encodeproject.org/files/ENCFF652HJH/@@download/ENCFF652HJH.bigWig<br />
<br />
For obtaining labels for CTCF binding, we further need ChIP-seq peaks. Here, we consider the ChIP-seq experiment with accession ENCSR000DSU for the astrocytes, which will serve as our training data in the following.<br />
The corresponding "conservative" and "relaxed" peak files for astrocytes are available from<br />
https://www.encodeproject.org/files/ENCFF183YLB/@@download/ENCFF183YLB.bed.gz<br />
https://www.encodeproject.org/files/ENCFF600CYD/@@download/ENCFF600CYD.bed.gz<br />
<br />
Again, the peak files need to be gunzipped for the following steps.<br />
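<br />
If no <code>gunzip</code> command is at hand, the peak files may also be decompressed with a few lines of Python. The helper name <code>gunzip</code> below is chosen for this example; unlike the command line tool, this sketch keeps the compressed file.<br />

```python
import gzip
import shutil
from pathlib import Path


def gunzip(path: str) -> Path:
    """Decompress path (ending in .gz) next to the original file."""
    source = Path(path)
    target = source.with_suffix("")  # drops the trailing ".gz"
    with gzip.open(source, "rb") as fin, open(target, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return target
```

For instance, <code>gunzip("astrocytes/ENCFF183YLB.bed.gz")</code> would write astrocytes/ENCFF183YLB.bed.<br />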
<br />
Finally, we need a motif model for CTCF, which we download from [http://hocomoco11.autosome.ru HOCOMOCO] in this case:<br />
http://hocomoco11.autosome.ru/final_bundle/hocomoco11/full/HUMAN/mono/pwm/CTCF_HUMAN.H11MO.0.A.pwm<br />
<br />
We organize all these files (and the Catchitt JAR) in the following directory structure<br />
<br />
.:<br />
Catchitt.jar<br />
GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta<br />
GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai<br />
<br />
./astrocytes:<br />
ENCFF183YLB.bed<br />
ENCFF600CYD.bed<br />
ENCFF901UBX.bigWig<br />
<br />
./fibroblasts:<br />
ENCFF652HJH.bigWig<br />
<br />
./motifs/CTCF/:<br />
CTCF_HUMAN.H11MO.0.A.pwm<br />
<br />
=== Deriving labels ===<br />
<br />
As we use supervised training of model parameters, we need labels for the genomic regions, qualifying these as bound (B) or unbound (U). Besides, we have additional labels for bound regions at the peak summit (S) and ambiguous regions (A) that are (partly) covered by relaxed but not by conservative peaks.<br />
<br />
For training purposes, we need to derive labels from the astrocyte ChIP-seq peaks by calling<br />
java -jar Catchitt.jar labels c=astrocytes/ENCFF183YLB.bed\<br />
r=astrocytes/ENCFF600CYD.bed\<br />
f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\<br />
b=50 rw=200 outdir=astrocytes/labels<br />
Here, we use a bin width of 50 bp (i.e., we resolve any feature or binding event with 50 bp resolution) and a region width of 200 bp as used in ENCODE-DREAM. A detailed description of the partitioning of the genome into non-overlapping bins and the logic behind the regions for which predictions are made may be found in the [https://doi.org/10.1186/s13059-018-1614-y Catchitt paper].<br />
The result is a file astrocytes/labels/Labels.tsv.gz with the following format<br />
chr1 0 U<br />
chr1 50 U<br />
chr1 100 U<br />
chr1 150 U<br />
chr1 200 U<br />
chr1 250 U<br />
where the columns contain chromosome, bin starting position, and corresponding label, and are separated by tabs.<br />
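<br />
A quick way to inspect the label file is to count how often each label occurs. The following Python sketch assumes the three tab-separated columns described above; the function name <code>count_labels</code> is chosen for this example.<br />

```python
import gzip
from collections import Counter


def count_labels(path: str) -> Counter:
    """Count the labels in the third column of a gzipped label file."""
    counts = Counter()
    with gzip.open(path, "rt") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 3:
                counts[fields[2]] += 1
    return counts
```

For the tutorial data, this would be called as <code>count_labels("astrocytes/labels/Labels.tsv.gz")</code>.<br />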
<br />
=== Preparing DNase data from bigwig format ===<br />
<br />
We further derive DNase-seq features from the bigwig file that we downloaded in the first step. Again, we specify a bin width of 50 bp.<br />
<br />
java -jar Catchitt.jar access d="Bigwig" i=astrocytes/ENCFF901UBX.bigWig f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\<br />
b=50 outdir=astrocytes/access<br />
The result is a file astrocytes/access/Chromatin_accessibility.tsv.gz with the following format<br />
<br />
chr1 1033400 0.03954650089144707 0.05627769976854324 0.009126120246946812 0.030420400202274323 0.06692489981651306 1.03125 3.0 1.0 0.0<br />
chr1 1033450 0.030420400202274323 0.03650449961423874 0.009126120246946812 0.030420400202274323 0.045630600303411484 1.03125 2.0 0.0 0.0<br />
chr1 1033500 0.024336300790309906 0.03346240147948265 0.009126120246946812 0.030420400202274323 0.045630600303411484 1.03125 2.0 1.0 0.0<br />
chr1 1033550 0.01825219951570034 0.024336300790309906 0.009126120246946812 0.024336300790309906 0.060840800404548645 1.03125 2.0 0.0 1.0<br />
<br />
where the first two columns, again, correspond to chromosome and starting position, and the remaining columns are<br />
* minimum DNase value in bin,<br />
* median DNase value in bin,<br />
* minimum in 1000 bp after bin start,<br />
* minimum in 1000 bp before bin start,<br />
* maximum in 1000 bp after bin start,<br />
* maximum in 1000 bp before bin start,<br />
* the number of steps in the bin profile,<br />
* the length of the longest monotonically increasing range in the bin,<br />
* the length of the longest monotonically decreasing range in the bin.<br />
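<br />
When working with this file in custom scripts, it may help to give the columns names. The following Python sketch maps one line to a dictionary in the column order listed above; the key names are invented for this example and are not part of the file format.<br />

```python
from typing import Dict, Union

# Column order as listed above; the names themselves are hypothetical.
COLUMNS = [
    "chromosome", "start",
    "min_in_bin", "median_in_bin",
    "min_after", "min_before",
    "max_after", "max_before",
    "n_steps", "longest_increasing", "longest_decreasing",
]


def parse_access_line(line: str) -> Dict[str, Union[str, int, float]]:
    """Convert one line of Chromatin_accessibility.tsv.gz to a dictionary."""
    fields = line.split()
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} columns, got {len(fields)}")
    record: Dict[str, Union[str, int, float]] = {
        "chromosome": fields[0],
        "start": int(fields[1]),
    }
    for name, value in zip(COLUMNS[2:], fields[2:]):
        record[name] = float(value)
    return record
```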
<br />
=== Preparing motif scores ===<br />
<br />
We also compute motif scores along the genome for the PWM we downloaded from HOCOMOCO:<br />
<br />
java -jar Catchitt.jar motif m="HOCOMOCO" h=motifs/CTCF/CTCF_HUMAN.H11MO.0.A.pwm g=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta\<br />
f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai b=50 outdir=motifs/CTCF threads=3<br />
The result is a file motifs/CTCF/Motif_scores.tsv.gz with the following format<br />
<br />
chr1 46950 -4.996643 -4.9543528358429105<br />
chr1 47000 -5.984124 -5.451674735652041<br />
chr1 47050 -0.8633305 -0.4596223585537509<br />
chr1 47100 -4.9379983 -4.813470561120627<br />
<br />
where the first two columns, again, correspond to chromosome and starting position, and the remaining two columns are<br />
* the maximum motif score within the bin,<br />
* the logarithm of the exponentials of the individual scores within the bin; for scores that are log-likelihoods, this is proportional to the log-likelihood of the complete sequence.<br />
<br />
=== Iterative training ===<br />
<br />
With all the feature files prepared, we may now run the iterative training procedure. Here, we use all autosomes except chr1, chr8, and chr21 for training, use five of those chromosomes also for generating new negative examples in each of the iterations, and use 8 computation threads for the numeric optimization of model parameters.<br />
''At this stage, it is critical that all feature files have been generated from the same reference. This way, we may sweep in parallel over all feature files that, at each line, represent the identical genomic location. Otherwise, the iterative training will throw an error stating that the chromosomes do not match at a certain line of the input files.''<br />
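<br />
Before starting a long training run, the line-by-line correspondence of two files may be checked with a small script. The following Python sketch compares the first two columns (chromosome and start position) of two line iterables; the function name <code>first_mismatch</code> is chosen for this example and the check is not part of Catchitt.<br />

```python
from itertools import zip_longest
from typing import Iterable, Optional


def first_mismatch(lines_a: Iterable[str],
                   lines_b: Iterable[str]) -> Optional[int]:
    """Return the 1-based number of the first line at which chromosome or
    start position differ (or one input ends early), or None if aligned."""
    pairs = zip_longest(lines_a, lines_b)
    for line_no, (a, b) in enumerate(pairs, start=1):
        if a is None or b is None:
            return line_no
        if a.split()[:2] != b.split()[:2]:
            return line_no
    return None
```

For the gzipped files of this tutorial, both arguments could be file handles obtained from <code>gzip.open(path, "rt")</code>.<br />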
<br />
We start iterative training by calling<br />
java -jar Catchitt.jar itrain a=astrocytes/access/Chromatin_accessibility.tsv.gz m=motifs/CTCF/Motif_scores.tsv.gz\<br />
l=astrocytes/labels/Labels.tsv.gz f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\<br />
 	b=50 t='chr2,chr3,chr4,chr5,chr6,chr7,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr22'\<br />
itc='chr10,chr11,chr12,chr13,chr14' outdir=astrocytes/itrain threads=8<br />
which results in a file astrocytes/itrain/Classifiers.xml containing the trained classifiers.<br />
<br />
=== Predicting binding in new cell types ===<br />
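Using the trained classifiers from the previous step and chromatin accessibility features for fibroblasts, we may now predict binding in the fibroblast cell type. The accessibility features are assumed to have been computed from fibroblasts/ENCFF652HJH.bigWig with the "access" tool in the same way as for astrocytes, using outdir=fibroblasts/access. In the example, we generate predictions only for chromosome 8, which could be extended to other chromosomes using parameter "p":<br />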
java -jar Catchitt.jar predict c=astrocytes/itrain/Classifiers.xml a=fibroblasts/access/Chromatin_accessibility.tsv.gz\<br />
m=motifs/CTCF/Motif_scores.tsv.gz f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\<br />
p="chr8" outdir=fibroblasts/predict<br />
This finally results in a file fibroblasts/predict/Predictions.tsv.gz containing the predicted binding probabilities per region.<br />
This file has three columns, corresponding to chromosome, starting position, and binding probability:<br />
<br />
chr8 265850 0.9866555574053496<br />
chr8 265900 0.9865107771922306<br />
chr8 265950 0.9864837006927715<br />
chr8 266000 0.8041139249973046<br />
chr8 266050 0.19870629729482686<br />
chr8 266100 0.1302269536110939<br />
chr8 266150 0.09693322015563202<br />
<br />
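The prediction file can be post-processed with a few lines of code. As a minimal sketch, the following Python function reads such a file and keeps only the bins whose binding probability reaches a threshold. It assumes whitespace- or tab-separated columns without a header line; the function name and the threshold of 0.5 are illustrative choices, not Catchitt defaults.<br />
<br />
```python
import gzip

def read_predictions(path, threshold=0.5):
    """Yield (chromosome, start, probability) for bins with probability >= threshold."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if not line.strip():
                continue  # skip empty lines
            chrom, start, prob = line.split()
            if float(prob) >= threshold:
                yield chrom, int(start), float(prob)

# Example call (the file must exist):
# for chrom, start, prob in read_predictions("fibroblasts/predict/Predictions.tsv.gz"):
#     print(chrom, start, prob)
```
<br />
Because the function is a generator, the complete file never needs to be held in memory.<br />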
<br />
=== Using DNase-seq BAM files and multiple motifs ===<br />
<br />
Instead of bigwig files, the "access" tool of Catchitt also accepts BAM files of mapped DNase-seq (or ATAC-seq) data. Internally, this tool counts 5' ends of reads, and performs local normalization of read depth and average smoothing.<br />
Here, we download the BAM files corresponding to the previous bigwig files from ENCODE<br />
https://www.encodeproject.org/files/ENCFF384CCQ/@@download/ENCFF384CCQ.bam<br />
https://www.encodeproject.org/files/ENCFF368XNE/@@download/ENCFF368XNE.bam<br />
<br />
and sort them into the directory structure.<br />
<br />
In addition, we use four motifs from the ''used-for-all-TFs'' directory of our [http://www.jstacs.de/downloads/motifs.tgz motif collection].<br />
<br />
Afterwards, the directory structure should look like<br />
<br />
.:<br />
Catchitt.jar<br />
GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta<br />
GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai<br />
<br />
./astrocytes:<br />
ENCFF183YLB.bed<br />
ENCFF600CYD.bed<br />
ENCFF901UBX.bigWig<br />
ENCFF384CCQ.bam<br />
<br />
./fibroblasts:<br />
ENCFF652HJH.bigWig<br />
ENCFF368XNE.bam<br />
<br />
./motifs/CTCF/:<br />
CTCF_HUMAN.H11MO.0.A.pwm<br />
<br />
./motifs/CTCF_Slim:<br />
Ctcf_H1hesc_shift20_bdeu_order-20_comp1-model-1.xml<br />
<br />
./motifs/JUND_Slim:<br />
Jund_K562_shift20_bdeu_order-20_comp1-model-1.xml<br />
<br />
./motifs/MAX_Slim:<br />
Max_K562_shift20_bdeu_order-20_comp1-model-1.xml<br />
<br />
./motifs/SP1:<br />
ENCSR000BHK_SP1-human_1_hg19-model-2.xml<br />
<br />
<br />
Now, we first compute the DNase-seq features from the BAM files using the "access" tool:<br />
<br />
java -jar Catchitt.jar access i=astrocytes/ENCFF384CCQ.bam b=50 outdir=astrocytes/access_bam/<br />
java -jar Catchitt.jar access i=fibroblasts/ENCFF368XNE.bam b=50 outdir=fibroblasts/access_bam/<br />
<br />
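To illustrate what counting 5' ends of reads means, the following Python sketch bins the 5' positions of a few reads. It is not the code of the "access" tool: the read representation as (start, end, strand) tuples with 0-based, end-exclusive coordinates is an assumption, and the local normalization and smoothing steps are left out.<br />
<br />
```python
from collections import Counter

def five_prime_counts(reads, bin_width=50):
    """Count read 5' ends per bin; returns {bin start: count}."""
    counts = Counter()
    for start, end, strand in reads:
        # On the reverse strand, the 5' end is the last aligned base.
        five_prime = start if strand == "+" else end - 1
        counts[(five_prime // bin_width) * bin_width] += 1
    return counts

# Three hypothetical reads of length 36.
print(five_prime_counts([(10, 46, "+"), (60, 96, "-"), (40, 76, "-")]))
```
<br />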
We also compute the motif-based features from the additional motif files. For the PWM model of SP1, we switch the input format to Dimont XMLs but still use the low-memory version of "motif" that we also used for the HOCOMOCO PWM:<br />
<br />
java -jar Catchitt.jar motif d=motifs/SP1/ENCSR000BHK_SP1-human_1_hg19-model-2.xml\<br />
g=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\<br />
b=50 outdir=motifs/SP1 threads=3<br />
<br />
The remaining motif models are [[Slim]] models, which are substantially more complex than PWMs. While scans for these models could be accomplished by the low-memory version of "motif" as well, this would require substantial runtime. Hence, we switch off the low-memory option in this case, which, in turn, requires increasing the memory reserved by Java:<br />
<br />
java -jar -Xms512M -Xmx64G Catchitt.jar motif d=motifs/CTCF_Slim/Ctcf_H1hesc_shift20_bdeu_order-20_comp1-model-1.xml\<br />
g=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\<br />
b=50 outdir=motifs/CTCF_Slim l=false threads=3<br />
java -jar -Xms512M -Xmx64G Catchitt.jar motif d=motifs/JUND_Slim/Jund_K562_shift20_bdeu_order-20_comp1-model-1.xml\<br />
g=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\<br />
b=50 outdir=motifs/JUND_Slim l=false threads=3<br />
java -jar -Xms512M -Xmx64G Catchitt.jar motif d=motifs/MAX_Slim/Max_K562_shift20_bdeu_order-20_comp1-model-1.xml\<br />
g=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\<br />
b=50 outdir=motifs/MAX_Slim l=false threads=3<br />
<br />
Finally, we start the iterative training using the new feature files:<br />
java -jar Catchitt.jar itrain a=astrocytes/access_bam/Chromatin_accessibility.tsv.gz\<br />
m=motifs/CTCF/Motif_scores.tsv.gz m=motifs/CTCF_Slim/Motif_scores.tsv.gz m=motifs/JUND_Slim/Motif_scores.tsv.gz\<br />
m=motifs/MAX_Slim/Motif_scores.tsv.gz m=motifs/SP1/Motif_scores.tsv.gz l=astrocytes/labels/Labels.tsv.gz\<br />
f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai b=50\<br />
t='chr2,chr3,chr4,chr5,chr6,chr7,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr22'\<br />
itc='chr10,chr11,chr12,chr13,chr14' outdir=astrocytes/itrain_bam_5motifs threads=8<br />
Please note that we used the parameter "m" multiple times to specify the different motif-based feature files.<br />
<br />
It is important to specify these motifs in the same order when calling "predict" afterwards, i.e.<br />
java -jar Catchitt.jar predict c=astrocytes/itrain_bam_5motifs/Classifiers.xml a=fibroblasts/access_bam/Chromatin_accessibility.tsv.gz\<br />
m=motifs/CTCF/Motif_scores.tsv.gz m=motifs/CTCF_Slim/Motif_scores.tsv.gz m=motifs/JUND_Slim/Motif_scores.tsv.gz\<br />
m=motifs/MAX_Slim/Motif_scores.tsv.gz m=motifs/SP1/Motif_scores.tsv.gz\<br />
f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai p="chr8" outdir=fibroblasts/predict_bam_5motifs<br />
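<br />
One way to make sure that the motif order cannot diverge between the two calls is to generate both command lines from a single list. The following Python sketch does this for the tutorial's file names. The helper names <code>motif_args</code> and <code>command</code> are hypothetical, and the optional parameters "t" and "itc" are omitted for brevity.<br />
<br />
```python
FAI = "GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai"
MOTIFS = ["CTCF", "CTCF_Slim", "JUND_Slim", "MAX_Slim", "SP1"]

def motif_args(names):
    """Build the repeated m=... arguments in one fixed order."""
    return [f"m=motifs/{name}/Motif_scores.tsv.gz" for name in names]

def command(tool, *args):
    """Return the argument list of one Catchitt call."""
    return ["java", "-jar", "Catchitt.jar", tool, *args]

train = command(
    "itrain", "a=astrocytes/access_bam/Chromatin_accessibility.tsv.gz",
    *motif_args(MOTIFS), "l=astrocytes/labels/Labels.tsv.gz", f"f={FAI}",
    "b=50", "outdir=astrocytes/itrain_bam_5motifs", "threads=8")
predict = command(
    "predict", "c=astrocytes/itrain_bam_5motifs/Classifiers.xml",
    "a=fibroblasts/access_bam/Chromatin_accessibility.tsv.gz",
    *motif_args(MOTIFS), f"f={FAI}", "p=chr8",
    "outdir=fibroblasts/predict_bam_5motifs")

print(" ".join(train))
print(" ".join(predict))
```
<br />
Both argument lists could be passed to <code>subprocess.run</code> instead of being printed.<br />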
<br />
The predictions based on the BAM files and the five motifs are then available from the file fibroblasts/predict_bam_5motifs/Predictions.tsv.gz in the format explained previously.<br />
<br />
== Version history ==<br />
<br />
* Catchitt v0.1.3: Bugfix to load Catchitt classifiers learned with older Catchitt versions<br />
<br />
* [http://www.jstacs.de/downloads/Catchitt_0.1.2.jar Catchitt v0.1.2]: Bugfixes, new experimental tools for handling methylation levels<br />
<br />
* [http://www.jstacs.de/downloads/Catchitt_0.1.1.jar Catchitt v0.1.1]: Bugfixes for border cases; reduced debugging output<br />
<br />
* Catchitt v0.1: [http://www.jstacs.de/downloads/Catchitt_0.1.jar Initial release]</div>Grauhttps://www.jstacs.de/index.php?title=Catchitt&diff=1156Catchitt2021-10-07T20:48:49Z<p>Grau: /* Usage */</p>
<hr />
<div>Catchitt is a collection of tools for predicting cell type-specific binding regions of transcription factors (TFs) based on binding motifs and chromatin accessibility assays.<br />
The initial implementation of this methodology has been one of the winning approaches of the ENCODE-DREAM challenge ([https://www.synapse.org/#!Synapse:syn6131484/wiki/402026]) and is described in a preprint (https://www.biorxiv.org/content/early/2017/12/06/230011 doi: 10.1101/230011) and a recent [https://doi.org/10.1186/s13059-018-1614-y paper].<br />
The implementation in Catchitt has been streamlined and slightly simplified to make its application more straightforward. Specifically, we reduced the set of chromatin accessibility features to the most important ones, we simplified the sampling strategy of initial negative examples in the training step, and we omitted quantile normalization of chromatin accessibility features.<br />
<br />
== Catchitt tools ==<br />
<br />
Catchitt comprises five tools for the individual steps of the pipeline (see below). The tool "labels" computes labels for genomic regions from "conservative" (i.e., IDR-thresholded) and "relaxed" ChIP-seq peaks.<br />
The tool "access" computes chromatin accessibility features from DNase-seq or ATAC-seq data, either based on fold-enrichment tracks in Bigwig format (e.g., MACS output) or based on SAM/BAM files of mapped reads.<br />
The tool "motif" computes motif-based features from genomic sequence and PWMs in Jaspar or HOCOMOCO format, or motif models from [[Dimont]], including [[Slim]] models.<br />
The tool "itrain" performs iterative training of a series of classifiers based on labels, chromatin accessibility features, and motif features.<br />
The tool "predict" predicts binding probabilities of genomic regions based on trained classifiers and feature files. The feature files may either be measured on the training cell type (e.g., other chromosomes, "within cell type" case) or on a different cell type.<br />
<br />
== Downloads ==<br />
<br />
We provide Catchitt as a pre-compiled JAR file and also publish its source code under GPL 3. For compiling Catchitt from source files, Jstacs (v. 2.3 and later) and the corresponding external libraries are required.<br />
<br />
''Catchitt is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.''<br />
<br />
* [http://www.jstacs.de/downloads/Catchitt-0.1.3.jar JAR download]<br />
* the source code of Catchitt is available from [https://github.com/Jstacs/Jstacs github] in package projects.encodedream.<br />
* [http://www.jstacs.de/downloads/motifs.tgz motifs] used in the ENCODE-DREAM challenge<br />
<br />
== Citation ==<br />
<br />
If you use Catchitt in your research, please cite<br />
<br />
J. Keilwagen, S. Posch, and J. Grau. [https://doi.org/10.1186/s13059-018-1614-y Accurate prediction of cell type-specific transcription factor binding]. ''Genome Biology'', 20(1):9, 2019.<br />
<br />
== Usage ==<br />
<br />
In the following, <code>Catchitt.jar</code> stands for the Catchitt binary in its current version, which is currently 0.1.4. When running the code examples with the current binary, replace every occurrence of <code>Catchitt.jar</code> by <code>Catchitt-0.1.4.jar</code>.<br />
<br />
<br />
Catchitt can be started by calling<br />
<br />
java -jar Catchitt.jar<br />
<br />
on the command line. This lists the names of the available tools with a short description:<br />
<br />
Available tools:<br />
<br />
access - Chromatin accessibility<br />
methyl - Methylation levels<br />
motif - Motif scores<br />
labels - Derive labels<br />
itrain - Iterative Training<br />
predict - Prediction<br />
<br />
Syntax: java -jar Catchitt.jar <toolname> [<parameter=value> ...]<br />
<br />
Further info about the tools is given with<br />
java -jar Catchitt.jar <toolname> info<br />
<br />
Tool parameters are listed with<br />
java -jar Catchitt.jar <toolname><br />
<br />
== Tools ==<br />
<br />
=== Derive labels ===<br />
<br />
''Derive labels'' computes labels for genomic regions based on ChIP-seq peak files. The input ChIP-seq peak files must be provided in narrowPeak format and may come in 'conservative', i.e., IDR-thresholded, and 'relaxed' flavors. If only a single peak file is available, both of the corresponding parameters may be set to this one peak file. The parameter for the bin width defines the resolution of the genomic regions that are assigned a label, while the parameter for the region width defines the size of the regions considered. If, for instance, the bin width is set to 50 and the region width to 100, regions of 100 bp shifted by 50 bp along the genome are labeled. The labels assigned may be 'S' (summit) if the current bin contains the annotated summit of a conservative peak, 'B' (bound) if the current region overlaps a conservative peak by at least half the region width, 'A' (ambiguous) if the current region overlaps a relaxed peak by at least 1 bp, or 'U' (unbound) if it overlaps with none of the peaks. The output is provided as a gzipped file 'Labels.tsv.gz' with columns chromosome, start position, and label. This output file together with a protocol of the tool run is saved to the specified output directory.<br />
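<br />
The labeling rules can be sketched in a few lines of Python. This is an illustration, not the code of the tool. It assumes half-open intervals, summits given as absolute positions (narrowPeak files store them as offsets from the peak start), a region that is centered on its bin, and the precedence S, B, A, U; the function names are hypothetical.<br />
<br />
```python
def overlap(a_start, a_end, b_start, b_end):
    """Number of bases shared by two half-open intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))

def label_bin(bin_start, bin_width, region_width, conservative, relaxed):
    """conservative: list of (start, end, summit); relaxed: list of (start, end)."""
    bin_end = bin_start + bin_width
    # Assumption: the region is centered on the bin.
    region_start = bin_start + bin_width // 2 - region_width // 2
    region_end = region_start + region_width
    if any(bin_start <= summit < bin_end for _, _, summit in conservative):
        return "S"
    if any(2 * overlap(region_start, region_end, s, e) >= region_width
           for s, e, _ in conservative):
        return "B"
    if any(overlap(region_start, region_end, s, e) >= 1 for s, e in relaxed):
        return "A"
    return "U"

# One hypothetical conservative peak with its summit and one relaxed peak.
conservative = [(1000, 1300, 1120)]
relaxed = [(900, 1400)]
for start in (1100, 1200, 1350, 2000):
    print(start, label_bin(start, 50, 200, conservative, relaxed))
```
<br />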
<br />
''Derive labels'' may be called with<br />
<br />
java -jar Catchitt.jar labels<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>Conservative peaks (NarrowPeak file containing the conservative peaks)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>Relaxed peaks (NarrowPeak file containing the relaxed peaks)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>FAI of genome (FastA index file of the genome)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bin width (The width of the genomic bins considered, valid range = [1, 10000], default = 50)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rw</font></td><br />
<td>Region width (The width of the genomic regions considered for overlaps, valid range = [1, 10000], default = 50)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar Catchitt.jar labels c=conservative.narrowPeak r=relaxed.narrowPeak f=hg19.fa.fai b=50 rw=200 outdir=labels<br />
<br />
<br />
=== Chromatin accessibility ===<br />
<br />
''Chromatin accessibility'' computes several chromatin accessibility features from DNase-seq or ATAC-seq data provided as fold-enrichment tracks or SAM/BAM files of mapped reads. Features are computed with a certain resolution defined by the bin width parameter. If this parameter is set to 50, for instance, features are computed for non-overlapping 50 bp bins along the genome. If input data are provided as a SAM/BAM file, coverage information is extracted and normalized locally in a similar fashion as proposed for the MACS peak caller. Output is provided as a gzipped file 'Chromatin_accessibility.tsv.gz' with columns chromosome, start position of the bin, minimum coverage and median coverage in the current bin, minimum coverage in 1000 bp regions before and after the current bin, maximum coverage in 1000 bp regions before and after the current bin, the number of steps in the coverage profile, and the number of monotonically increasing and decreasing steps in the coverage profile of the current bin. This output file together with a protocol of the tool run is saved to the specified output directory.<br />
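<br />
A few of these features can be sketched as follows. The Python snippet computes the minimum and median coverage within a bin and the minimum and maximum coverage in the flanking regions from a per-base coverage list. It is not the tool's implementation: pooling the regions before and after the bin into one minimum and one maximum is an assumption, the step-based features are left out, and all names are hypothetical.<br />
<br />
```python
from statistics import median

def bin_features(coverage, bin_start, bin_width=50, flank=1000):
    """coverage: per-base coverage values of one chromosome."""
    bin_end = bin_start + bin_width
    in_bin = coverage[bin_start:bin_end]
    # Flanks are truncated at the chromosome ends.
    before = coverage[max(0, bin_start - flank):bin_start]
    after = coverage[bin_end:bin_end + flank]
    around = before + after
    return {
        "min": min(in_bin),
        "median": median(in_bin),
        "flank_min": min(around) if around else None,
        "flank_max": max(around) if around else None,
    }

# Toy coverage profile with one small peak.
profile = [1] * 100 + [5, 7, 9, 3] + [2] * 100
print(bin_features(profile, bin_start=100, bin_width=4, flank=10))
```
<br />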
<br />
''Chromatin accessibility'' may be called with<br />
<br />
java -jar Catchitt.jar access<br />
<br />
and has the following parameters<br />
<br />
<br />
<table border=0 cellpadding=10 align="center"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>Data source (The format of the input file containing the coverage information, range={BAM/SAM, Bigwig}, default = BAM/SAM)<table border=0 cellpadding=10 align="center"><br />
<tr><td colspan=3>Parameters for selection &quot;BAM/SAM&quot;:</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>Input SAM/BAM (The input file containing the mapped DNase-seq/ATAC-seq reads)</td><br />
<td>FILE</td><br />
</tr><br />
<tr><td colspan=3>Parameters for selection &quot;Bigwig&quot;:</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>Input Bigwig (The input file containing the mapped DNase-seq/ATAC-seq reads)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>FastA index (The genome index)</td><br />
<td>FILE</td><br />
</tr><br />
</table></td><td></td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bin width (The width of the genomic bins considered)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar Catchitt.jar access d="Bigwig" i=fold_enrich.bw f=hg19.fa.fai b=50 outdir=dnase<br />
<br />
<br />
=== Methylation levels ===<br />
''Methylation levels'' may be called with<br />
<br />
java -jar Catchitt.jar methyl<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>Input Bed.gz (The bedMethyl file (gzipped) containing the methylation levels, mime = bed.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>FastA index (The genome index, mime = fai)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bin width (The width of the genomic bins considered)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar Catchitt.jar methyl i=Input_Bed.gz f=hg19.fa.fai b=50<br />
<br />
<br />
=== Motif scores ===<br />
<br />
''Motif scores'' computes features based on motif scores of a given motif model scanning sub-sequences along the genome. Motif scores are aggregated in bins of the specified width as maximum score and log of the average exponential score (i.e., average log-likelihood in case of statistical models). The motif model may be provided as PWMs in HOCOMOCO or PFMs in Jaspar format, or as [[Dimont]] motif models in XML format. For more complex motif models like Slim models, the current implementation uses several indexes to speed up the scanning process. However, computation of these indexes is rather memory-consuming and often not reasonable for simple PWM models. Hence, a low-memory variant of the tool is available, which is typically only slightly slower for PWM models but substantially slower for Slim models. Output is provided as a gzipped file 'Motif_scores.tsv.gz' containing the columns chromosome, start position, maximum score, and average score. This output file together with a protocol of the tool run is saved to the specified output directory.<br />
<br />
<br />
''Motif scores'' may be called with<br />
<br />
java -jar Catchitt.jar motif<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Motif model (The motif model in Dimont, HOCOMOCO, or Jaspar format, range={Dimont, HOCOMOCO, Jaspar}, default = Dimont)<table border=0 cellpadding=10 align="center"><br />
<tr><td colspan=3>Parameters for selection &quot;Dimont&quot;:</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>Dimont motif (Dimont motif model description)</td><br />
<td>FILE</td><br />
</tr><br />
<tr><td colspan=3>Parameters for selection &quot;HOCOMOCO&quot;:</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">h</font></td><br />
<td>HOCOMOCO PWM (PWM from the HOCOMOCO database)</td><br />
<td>FILE</td><br />
</tr><br />
<tr><td colspan=3>Parameters for selection &quot;Jaspar&quot;:</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">j</font></td><br />
<td>Jaspar PFM (PFM in Jaspar format)</td><br />
<td>FILE</td><br />
</tr><br />
</table></td><td></td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>Genome (Genome as FastA file)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>FAI of genome (FastA index file of the genome)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bin width (The width of the genomic bins considered)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">l</font></td><br />
<td>Low-memory mode (Use slower mode with a smaller memory footprint, default = true)</td><br />
<td>BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example'''<br />
<br />
java -jar Catchitt.jar motif m=HOCOMOCO h=motif.pwm g=hg19.fa f=hg19.fa.fai b=50 outdir=motifs<br />
<br />
=== Iterative Training ===<br />
<br />
''Iterative Training'' performs an iterative training with the specified number of iterations to obtain a series of classifiers that may be used for predictions in the same cell type or in other cell types based on a corresponding set of feature files. The tool requires as input labels for the training chromosomes, a chromatin accessibility feature file, and a set of motif feature files. From the labels, an initial set of training regions is extracted containing all positive examples labeled as 'S' (summit) and a sub-sample of negative examples of regions labeled as 'U' (unbound). During the iterations, the initial negative examples are complemented with additional negatives obtaining large binding probabilities, i.e., putative false positive predictions. As these additional negative examples are derived from predictions of the current set of classifiers, the number of bins used for aggregation needs to be specified and should be identical to those used for predictions later. Training chromosomes and chromosomes used for predictions in the iterative training may be specified, as well as the percentile of the scores of positives (i.e., summit or bound regions) that should be used to identify putative false positives. The specified bin width must be identical to the bin width specified when computing the corresponding feature files. Feature vectors for training regions may span several adjacent bins as specified by the number of bins parameter. Output is an XML file Classifiers.xml containing the set of trained classifiers. This output file together with a protocol of the tool run is saved to the specified output directory.<br />
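<br />
The selection of additional negatives in one iteration can be sketched as follows. The Python snippet derives a threshold from a percentile of the prediction scores of the positives and returns the candidate negatives that reach it. How Catchitt computes the percentile and whether it compares with "greater than" or "greater or equal" is not stated here, so the quantile rule and the function name are assumptions.<br />
<br />
```python
def select_additional_negatives(positive_scores, negative_scores, percentile=0.01):
    """Return indices of negatives scoring at least the percentile of the positives."""
    ranked = sorted(positive_scores)
    # Simple empirical quantile: roughly `percentile` of the positives score below it.
    index = min(len(ranked) - 1, int(percentile * len(ranked)))
    threshold = ranked[index]
    return [i for i, score in enumerate(negative_scores) if score >= threshold]

# Hypothetical binding probabilities predicted by the current classifiers.
positives = [0.2, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.97, 0.98, 0.99]
negatives = [0.1, 0.55, 0.4, 0.9]
print(select_additional_negatives(positives, negatives))
```
<br />
With a small percentile, the threshold sits near the lowest-scoring positives, so every unbound region that looks at least as bound as the weakest positives becomes a new negative example.<br />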
<br />
''Iterative Training'' may be called with<br />
<br />
java -jar Catchitt.jar itrain<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>Accessibility (File containing accessibility features)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Motif (File containing motif features), MAY BE USED MULTIPLE TIMES</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">l</font></td><br />
<td>Labels (File containing the labels)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>FAI of genome (FastA index file of the genome)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bin width (The width of the genomic bins, valid range = [1, 1000], default = 50)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Number of bins (The number of adjacent bins, valid range = [1, 20], default = 5)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">abb</font></td><br />
<td>Aggregation: bins before (The number of bins before the current one considered in the aggregation, valid range = [1, 20], default = 1)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">aba</font></td><br />
<td>Aggregation: bins after (The number of bins after the current one considered in the aggregation, valid range = [1, 20], default = 4)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>Iterations (The number of iterations of the iterative training, valid range = [1, 20], default = 5)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>Training chromosomes (Training chromosomes, separated by commas, OPTIONAL)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">itc</font></td><br />
<td>Iterative training chromosomes (Chromosomes with predictions in iterative training, separated by commas, OPTIONAL)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Percentile (Percentile of the prediction scores of positives used as threshold in iterative training, valid range = [0.0, 1.0], default = 0.01)</td><br />
<td>DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example'''<br />
<br />
java -jar Catchitt.jar itrain a=dnase/Chromatin_accessibility.tsv.gz m=motif1/Motif_scores.tsv.gz m=motif2/Motif_scores.tsv.gz l=labels/Labels.tsv.gz f=hg19.fa.fai b=50 n=5 abb=1 aba=4 i=5 t="chr1,chr2,chr3" itc="chr1,chr2" p=0.01 outdir=cls<br />
<br />
=== Prediction ===<br />
<br />
''Prediction'' predicts binding probabilities of genomic regions as specified during training of the set of classifiers in iterative training. As input, ''Prediction'' requires a set of trained classifiers in XML format and the same (type of) feature files as used in training (motif files must be specified in the same order!). In addition, the chromosomes for which predictions are made may be specified, and the number of bins used for aggregation may be specified to deviate from those used during training. If these bin numbers are not specified, those from the training run are used. Finally, it is possible to restrict the number of classifiers considered to the first n ones. Output is provided as a gzipped file 'Predictions.tsv.gz' with columns chromosome, start position, and binding probability. This output file together with a protocol of the tool run is saved to the specified output directory.<br />
<br />
''Prediction'' may be called with<br />
<br />
java -jar Catchitt.jar predict<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>Classifiers (The classifiers trained by iterative training)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>Accessibility (File containing accessibility features)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Motif (File containing motif features) MAY BE USED MULTIPLE TIMES</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>FAI of genome (FastA index file of the genome)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Prediction chromosomes (Prediction chromosomes, separated by commas, OPTIONAL)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">abb</font></td><br />
<td>Aggregation: bins before (Number of bins before the current one considered for aggregation., OPTIONAL)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">aba</font></td><br />
<td>Aggregation: bins after (Number of bins after the current one considered for aggregation., OPTIONAL)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Number of classifiers (Use only the first k classifiers for predictions., OPTIONAL)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example'''<br />
<br />
java -jar Catchitt.jar predict c=cls/Classifiers.xml a=dnase/Chromatin_accessibility.tsv.gz m=motif1/Motif_scores.tsv.gz m=motif2/Motif_scores.tsv.gz f=hg19.fa.fai p="chr8,chr21" abb=1 aba=4 n=3 outdir=predict<br />
<br />
== Standard pipeline ==<br />
<br />
The standard Catchitt pipeline would comprise the following steps<br />
<br />
* for a training cell type, collect ChIP-seq peak files (preferably ''conservative'' and ''relaxed'' peaks) in narrowPeak format and derive labels for genomic regions (''Derive labels'')<br />
* for the same cell type, collect chromatin accessibility data (DNase-seq or ATAC-seq) as fold-enrichment tracks or mapping files, and derive chromatin accessibility features from those data (''Chromatin accessibility'')<br />
* collect or learn (e.g., using [[Dimont]]) a set of motif models for the transcription factor of interest, and scan the genome using these motif models (''Motif scores'')<br />
* perform iterative training given the labels and feature files (''Iterative Training'')<br />
* predict binding probabilities of genomic regions in the same cell type or in other cell types. In the latter case, additional chromatin accessibility data for these target cell types need to be collected and features need to be derived as in step 2. (''Prediction'')<br />
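<br />
The steps listed above can be chained in a small driver script. The following Python sketch only assembles and prints the command lines, using the file names from the tool examples on this page; <code>motif1.pwm</code> and <code>motif2.pwm</code> are hypothetical input files, and the helper <code>catchitt</code> is not part of the distribution.<br />
<br />
```python
import shlex

def catchitt(tool, *params):
    """Return the argument list of one Catchitt call; params are (name, value) pairs."""
    return ["java", "-jar", "Catchitt.jar", tool] + [f"{k}={v}" for k, v in params]

FAI = ("f", "hg19.fa.fai")
BIN = ("b", 50)
motif_dirs = ["motif1", "motif2"]
# One list of motif feature files, reused for training and prediction.
motif_files = [("m", f"{d}/Motif_scores.tsv.gz") for d in motif_dirs]
access = ("a", "dnase/Chromatin_accessibility.tsv.gz")

steps = [
    catchitt("labels", ("c", "conservative.narrowPeak"), ("r", "relaxed.narrowPeak"),
             FAI, BIN, ("rw", 200), ("outdir", "labels")),
    catchitt("access", ("d", "Bigwig"), ("i", "fold_enrich.bw"), FAI, BIN,
             ("outdir", "dnase")),
]
steps += [
    catchitt("motif", ("m", "HOCOMOCO"), ("h", f"{d}.pwm"), ("g", "hg19.fa"),
             FAI, BIN, ("outdir", d))
    for d in motif_dirs
]
steps += [
    catchitt("itrain", access, *motif_files, ("l", "labels/Labels.tsv.gz"),
             FAI, BIN, ("outdir", "cls")),
    catchitt("predict", ("c", "cls/Classifiers.xml"), access, *motif_files,
             FAI, ("outdir", "predict")),
]

for step in steps:
    print(shlex.join(step))
```
<br />
Replacing the <code>print</code> call by <code>subprocess.run(step, check=True)</code> would execute the steps in order and stop at the first failure. For predictions in a different cell type, the "a" parameter of the last step would point to the accessibility features of that cell type.<br />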
<br />
<br />
== Tutorial using ENCODE data ==<br />
<br />
We describe a typical Catchitt pipeline using public ENCODE data for the transcription factor CTCF in two cell lines.<br />
This tutorial uses real-world data on the complete human genome in the GRCh38 version used by ENCODE, illustrating different DNase-seq input formats and different motif sources. Please note that this realistic scenario also comes with real-world runtimes of the individual Catchitt steps.<br />
<br />
For best performance, we would further recommend<br />
* to use multiple motifs from different sources, including motifs derived from DNase-seq (available in our [http://www.jstacs.de/downloads/motifs.tgz motif collection] of the ENCODE-DREAM challenge in directory de-novo/DNase-peaks)<br />
* to use replicate information for DNase data, for instance using the [https://github.com/kundajelab/atac_dnase_pipelines pipeline of the Kundaje lab]<br />
<br />
In this tutorial, we concentrate on the Catchitt pipeline and illustrate its usage based on readily available data.<br />
<br />
=== Obtaining training and test data ===<br />
<br />
First, we need the GRCh38 genome version used by ENCODE. This genome is available as a gzipped FastA file from [https://www.encodeproject.org ENCODE] at<br />
https://www.encodeproject.org/files/GRCh38_no_alt_analysis_set_GCA_000001405.15/@@download/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.gz<br />
<br />
After download, the genome needs to be gunzipped and indexed using the [http://www.htslib.org samtools] faidx command:<br />
<br />
gunzip GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.gz<br />
samtools faidx GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta<br />
<br />
In the following, we assume that genome FastA and index are in the base directory.<br />
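<br />
The index file written by samtools faidx is a tab-separated table whose first two columns are the sequence name and its length. The following Python sketch reads such an index and enumerates bin start positions for a given bin width. The function names and the two-sequence toy index are made up for illustration, and how Catchitt treats a trailing partial bin is not stated here, so the sketch only lists start coordinates.<br />

```python
import io


def read_fai(handle):
    """Map sequence name -> length from a samtools faidx index (.fai)."""
    lengths = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        lengths[fields[0]] = int(fields[1])
    return lengths


def bin_starts(length, bin_width=50):
    """Start coordinates of consecutive, non-overlapping bins."""
    return range(0, length, bin_width)


# Made-up two-sequence index, not taken from the GRCh38 file.
toy_fai = io.StringIO("chrA\t1000\t6\t60\t61\nchrB\t130\t1030\t60\t61\n")
lengths = read_fai(toy_fai)
for name, length in lengths.items():
    print(name, length, len(bin_starts(length)))
```

For the real index, the toy handle would be replaced by open("GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai").<br />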
<br />
In addition, we need the DNase-seq data. We consider two cell lines ("astrocyte of the spinal cord" and "fibroblast of villous mesenchyme"). The corresponding DNase-seq data are available from [https://www.encodeproject.org ENCODE] under accessions ENCSR000ENB and ENCSR000EOR, respectively.<br />
Here, we first consider the Bigwig files of the first replicate for each cell line, which can be downloaded from the following URLs:<br />
<br />
https://www.encodeproject.org/files/ENCFF901UBX/@@download/ENCFF901UBX.bigWig<br />
https://www.encodeproject.org/files/ENCFF652HJH/@@download/ENCFF652HJH.bigWig<br />
<br />
For obtaining labels for CTCF binding, we further need ChIP-seq peaks. Here, we consider the ChIP-seq experiment with accession ENCSR000DSU for the astrocytes, which will become our training data in the following.<br />
The corresponding "conservative" and "relaxed" peak files for astrocytes are available from<br />
https://www.encodeproject.org/files/ENCFF183YLB/@@download/ENCFF183YLB.bed.gz<br />
https://www.encodeproject.org/files/ENCFF600CYD/@@download/ENCFF600CYD.bed.gz<br />
<br />
Again, the peak files need to be gunzipped for the following steps.<br />
<br />
Finally, we need a motif model for CTCF, which we download from [http://hocomoco11.autosome.ru HOCOMOCO] in this case<br />
http://hocomoco11.autosome.ru/final_bundle/hocomoco11/full/HUMAN/mono/pwm/CTCF_HUMAN.H11MO.0.A.pwm<br />
<br />
We organize all these files (and the Catchitt JAR) in the following directory structure<br />
<br />
.:<br />
Catchitt.jar<br />
GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta<br />
GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai<br />
<br />
./astrocytes:<br />
ENCFF183YLB.bed<br />
ENCFF600CYD.bed<br />
ENCFF901UBX.bigWig<br />
<br />
./fibroblasts:<br />
ENCFF652HJH.bigWig<br />
<br />
./motifs/CTCF/:<br />
CTCF_HUMAN.H11MO.0.A.pwm<br />
<br />
=== Deriving labels ===<br />
<br />
As we use supervised training of model parameters, we need labels for the genomic regions, qualifying these as bound (B) or unbound (U). Besides, we have additional labels for bound regions at the peak summit (S) and ambiguous regions (A) that are (partly) covered by relaxed but not by conservative peaks.<br />
<br />
For training purposes, we need to derive labels from the astrocyte ChIP-seq peaks by calling<br />
java -jar Catchitt.jar labels c=astrocytes/ENCFF183YLB.bed\<br />
r=astrocytes/ENCFF600CYD.bed\<br />
f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\<br />
b=50 rw=200 outdir=astrocytes/labels<br />
Here, we use a bin width of 50 bp (i.e., we resolve any feature or binding event with 50 bp resolution) and a region width of 200 bp as used in ENCODE-DREAM. A detailed description of the partitioning of the genome into non-overlapping bins and the logic behind the regions for which predictions are made may be found in the [https://doi.org/10.1186/s13059-018-1614-y Catchitt paper].<br />
The result is a file astrocytes/labels/Labels.tsv.gz with the following format<br />
chr1 0 U<br />
chr1 50 U<br />
chr1 100 U<br />
chr1 150 U<br />
chr1 200 U<br />
chr1 250 U<br />
where the columns contain chromosome, bin starting position, and corresponding label, and are separated by tabs.<br />
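<br />
As a quick sanity check of such a label file, one might count how often each label occurs. The following Python sketch does this for the three-column format described above; the function name and the in-memory sample lines (including the non-U labels) are made up for illustration.<br />

```python
import csv
import io
from collections import Counter


def count_labels(handle):
    """Count label occurrences in a tab-separated (chromosome, start, label) file."""
    counts = Counter()
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) != 3:
            continue  # ignore blank or malformed lines
        counts[row[2]] += 1
    return counts


# Made-up sample in the format shown above.
sample = "chr1\t0\tU\nchr1\t50\tU\nchr1\t100\tA\nchr1\t150\tB\nchr1\t200\tS\n"
counts = count_labels(io.StringIO(sample))
print(dict(counts))

# For the real file, open it in text mode through gzip instead:
#   import gzip
#   with gzip.open("astrocytes/labels/Labels.tsv.gz", "rt") as fh:
#       print(count_labels(fh))
```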
<br />
=== Preparing DNase data from bigwig format ===<br />
<br />
We further derive DNase-seq features from the bigwig file that we downloaded in the first step. Again, we specify a bin width of 50 bp.<br />
<br />
java -jar Catchitt.jar access d="Bigwig" i=astrocytes/ENCFF901UBX.bigWig f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\<br />
b=50 outdir=astrocytes/access<br />
The result is a file astrocytes/access/Chromatin_accessibility.tsv.gz with the following format<br />
<br />
chr1 1033400 0.03954650089144707 0.05627769976854324 0.009126120246946812 0.030420400202274323 0.06692489981651306 1.03125 3.0 1.0 0.0<br />
chr1 1033450 0.030420400202274323 0.03650449961423874 0.009126120246946812 0.030420400202274323 0.045630600303411484 1.03125 2.0 0.0 0.0<br />
chr1 1033500 0.024336300790309906 0.03346240147948265 0.009126120246946812 0.030420400202274323 0.045630600303411484 1.03125 2.0 1.0 0.0<br />
chr1 1033550 0.01825219951570034 0.024336300790309906 0.009126120246946812 0.024336300790309906 0.060840800404548645 1.03125 2.0 0.0 1.0<br />
<br />
where the first two columns, again, correspond to chromosome and starting position, and the remaining columns are<br />
* minimum DNase value in bin,<br />
* median DNase value in bin,<br />
* minimum in 1000 bp after bin start,<br />
* minimum in 1000 bp before bin start,<br />
* maximum in 1000 bp after bin start,<br />
* maximum in 1000 bp before bin start,<br />
* the number of steps in the bin profile,<br />
* the length of the longest monotonically increasing range in the bin,<br />
* the length of the longest monotonically decreasing range in the bin.<br />
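<br />
To work with these features by name, each line can be split into chromosome, bin start, and the nine values listed above. In the following Python sketch, the field names are hypothetical labels chosen for this example (Catchitt itself does not write a header), and the sample line is the first line of the excerpt above.<br />

```python
# Hypothetical names for the nine feature columns, in the order listed above.
FIELDS = (
    "min_in_bin",
    "median_in_bin",
    "min_after",
    "min_before",
    "max_after",
    "max_before",
    "n_steps",
    "longest_increasing",
    "longest_decreasing",
)


def parse_access_line(line):
    """Split one line into (chromosome, bin start, {feature name: value})."""
    parts = line.split()
    chrom, start = parts[0], int(parts[1])
    values = [float(v) for v in parts[2:]]
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} feature columns, got {len(values)}")
    return chrom, start, dict(zip(FIELDS, values))


line = (
    "chr1 1033400 0.03954650089144707 0.05627769976854324 "
    "0.009126120246946812 0.030420400202274323 0.06692489981651306 "
    "1.03125 3.0 1.0 0.0"
)
chrom, start, features = parse_access_line(line)
print(chrom, start, features["median_in_bin"], features["n_steps"])
```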
<br />
=== Preparing motif scores ===<br />
<br />
We also compute motif scores along the genome for the PWM we downloaded from HOCOMOCO:<br />
<br />
java -jar Catchitt.jar motif m="HOCOMOCO" h=motifs/CTCF/CTCF_HUMAN.H11MO.0.A.pwm g=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta\<br />
f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai b=50 outdir=motifs/CTCF threads=3<br />
The result is a file motifs/CTCF/Motif_scores.tsv.gz with the following format<br />
<br />
chr1 46950 -4.996643 -4.9543528358429105<br />
chr1 47000 -5.984124 -5.451674735652041<br />
chr1 47050 -0.8633305 -0.4596223585537509<br />
chr1 47100 -4.9379983 -4.813470561120627<br />
<br />
where the first two columns, again, correspond to chromosome and starting position, and the remaining two columns are<br />
* the maximum motif score within the bin,<br />
* the logarithm of the sum of the exponentials of the individual scores within the bin; for scores that are log-likelihoods, this is proportional to the log-likelihood of the complete sequence.<br />
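<br />
The two per-bin aggregates can be illustrated with a small Python sketch that reads the second column as a log-sum-exp over the individual scores of a bin. This reading is an assumption based on the description and on the example values, where the second column is never smaller than the first; it is not taken from the Catchitt source code. The function name and the toy scores are made up.<br />

```python
import math


def bin_features(scores):
    """Return (maximum score, log of the sum of exp(score)) for one bin."""
    top = max(scores)
    # Subtracting the maximum before exponentiating avoids underflow
    # for strongly negative log-scores.
    log_sum_exp = top + math.log(sum(math.exp(s - top) for s in scores))
    return top, log_sum_exp


# Made-up scores for the positions of one bin.
toy_scores = [-5.0, -7.5, -6.0, -9.0]
top, lse = bin_features(toy_scores)
print(top, lse)
```

By construction, the second value is at least as large as the first and exceeds it only slightly if a single position dominates the bin.<br />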
<br />
=== Iterative training ===<br />
<br />
With all the feature files prepared, we may now run the iterative training procedure. Here, we use all main chromosomes for training, use five of those chromosomes also for generating new negative examples in each of the iterations, and use 8 computation threads for the numeric optimization of model parameters.<br />
''At this stage, it is critical that all feature files have been generated from the same reference. This way, we may sweep in parallel over all feature files that, at each line, represent the identical genomic location. Otherwise, the iterative training will throw an error stating that the chromosomes do not match at a certain line of the input files.''<br />
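<br />
Before starting a long training run, the alignment of the input files can be checked independently. The following Python sketch compares chromosome and start position line by line across several open files and reports the first line where they disagree. It is a stand-alone illustration with a hypothetical function name and made-up sample data, not the check performed by Catchitt itself.<br />

```python
import io
from itertools import zip_longest


def first_mismatch(*handles):
    """Return the 1-based line number where the files disagree, or None."""
    for line_no, rows in enumerate(zip_longest(*handles), start=1):
        if any(row is None for row in rows):
            return line_no  # one file ended before the others
        keys = {tuple(row.split()[:2]) for row in rows}
        if len(keys) != 1:
            return line_no  # chromosome or start position differs
    return None


# Made-up excerpts: the second pair is shifted by one bin on line 2.
labels = "chr1\t0\tU\nchr1\t50\tU\n"
scores_ok = "chr1\t0\t-4.9\t-4.8\nchr1\t50\t-5.9\t-5.4\n"
scores_shifted = "chr1\t0\t-4.9\t-4.8\nchr1\t100\t-5.9\t-5.4\n"

aligned = first_mismatch(io.StringIO(labels), io.StringIO(scores_ok))
shifted = first_mismatch(io.StringIO(labels), io.StringIO(scores_shifted))
print(aligned, shifted)
```

For the real, gzipped files, the handles would be opened with gzip.open(path, "rt").<br />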
<br />
We start iterative training by calling<br />
java -jar Catchitt.jar itrain a=astrocytes/access/Chromatin_accessibility.tsv.gz m=motifs/CTCF/Motif_scores.tsv.gz\<br />
l=astrocytes/labels/Labels.tsv.gz f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\<br />
     b=50 t='chr2,chr3,chr4,chr5,chr6,chr7,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr22'\<br />
itc='chr10,chr11,chr12,chr13,chr14' outdir=astrocytes/itrain threads=8<br />
which results in a file astrocytes/itrain/Classifiers.xml containing the trained classifiers.<br />
<br />
=== Predicting binding in new cell types ===<br />
Using the trained classifier from the previous step and the DNase data for fibroblasts prepared before, we may now predict binding in the fibroblast cell type. In the example, we generate predictions only for chromosome 8, which could be extended to other chromosomes using parameter "p":<br />
java -jar Catchitt.jar predict c=astrocytes/itrain/Classifiers.xml a=fibroblasts/access/Chromatin_accessibility.tsv.gz\<br />
m=motifs/CTCF/Motif_scores.tsv.gz f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\<br />
p="chr8" outdir=fibroblasts/predict<br />
This finally results in a file fibroblasts/predict/Predictions.tsv.gz containing the predicted binding probabilities per region.<br />
This file has three columns, corresponding to chromosome, starting position, and binding probability:<br />
<br />
chr8 265850 0.9866555574053496<br />
chr8 265900 0.9865107771922306<br />
chr8 265950 0.9864837006927715<br />
chr8 266000 0.8041139249973046<br />
chr8 266050 0.19870629729482686<br />
chr8 266100 0.1302269536110939<br />
chr8 266150 0.09693322015563202<br />
<br />
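A typical follow-up step is to extract the bins with a high predicted binding probability. The following Python sketch filters the three-column format shown above; the function name and the threshold of 0.5 are arbitrary choices for illustration and not a recommendation from Catchitt. The sample lines are those of the excerpt above.<br />

```python
import io


def confident_bins(handle, threshold=0.5):
    """Collect (chromosome, start, probability) with probability >= threshold."""
    hits = []
    for line in handle:
        parts = line.split()
        if len(parts) != 3:
            continue  # ignore blank or malformed lines
        chrom, start, prob = parts[0], int(parts[1]), float(parts[2])
        if prob >= threshold:
            hits.append((chrom, start, prob))
    return hits


sample = """chr8 265850 0.9866555574053496
chr8 265900 0.9865107771922306
chr8 265950 0.9864837006927715
chr8 266000 0.8041139249973046
chr8 266050 0.19870629729482686
chr8 266100 0.1302269536110939
chr8 266150 0.09693322015563202
"""
hits = confident_bins(io.StringIO(sample))
for chrom, start, prob in hits:
    print(chrom, start, round(prob, 3))
```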
<br />
=== Using DNase-seq BAM files and multiple motifs ===<br />
<br />
Instead of bigwig files, the "access" tool of Catchitt also accepts BAM files of mapped DNase-seq (or ATAC-seq) data. Internally, this tool counts 5' ends of reads, and performs local normalization of read depth and average smoothing.<br />
Here, we download the BAM files corresponding to the previous bigwig files from ENCODE<br />
https://www.encodeproject.org/files/ENCFF384CCQ/@@download/ENCFF384CCQ.bam<br />
https://www.encodeproject.org/files/ENCFF368XNE/@@download/ENCFF368XNE.bam<br />
<br />
and sort them into the directory structure.<br />
<br />
In addition, we use four motifs from the ''used-for-all-TFs'' directory of our [http://www.jstacs.de/downloads/motifs.tgz motif collection].<br />
<br />
Afterwards, the directory structure should look like<br />
<br />
.:<br />
Catchitt.jar<br />
GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta<br />
GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai<br />
<br />
./astrocytes:<br />
ENCFF183YLB.bed<br />
ENCFF600CYD.bed<br />
ENCFF901UBX.bigWig<br />
ENCFF384CCQ.bam<br />
<br />
./fibroblasts:<br />
ENCFF652HJH.bigWig<br />
ENCFF368XNE.bam<br />
<br />
./motifs/CTCF/:<br />
CTCF_HUMAN.H11MO.0.A.pwm<br />
<br />
./motifs/CTCF_Slim:<br />
Ctcf_H1hesc_shift20_bdeu_order-20_comp1-model-1.xml<br />
<br />
./motifs/JUND_Slim:<br />
Jund_K562_shift20_bdeu_order-20_comp1-model-1.xml<br />
<br />
./motifs/MAX_Slim:<br />
Max_K562_shift20_bdeu_order-20_comp1-model-1.xml<br />
<br />
./motifs/SP1:<br />
ENCSR000BHK_SP1-human_1_hg19-model-2.xml<br />
<br />
<br />
Now, we first compute the DNase-seq features from the BAM files using the "access" tool:<br />
<br />
java -jar Catchitt.jar access i=astrocytes/ENCFF384CCQ.bam b=50 outdir=astrocytes/access_bam/<br />
java -jar Catchitt.jar access i=fibroblasts/ENCFF368XNE.bam b=50 outdir=fibroblasts/access_bam/<br />
<br />
We also compute the motif-based features from the additional motif files. For the PWM model of SP1, we switch the input format to Dimont XMLs but still use the low-memory version of "motif" that we also used for the HOCOMOCO PWM:<br />
<br />
java -jar Catchitt.jar motif d=motifs/SP1/ENCSR000BHK_SP1-human_1_hg19-model-2.xml\<br />
g=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\<br />
b=50 outdir=motifs/SP1 threads=3<br />
<br />
The remaining motif models are [[Slim]] models, which are substantially more complex than PWMs. While scans for these models could be accomplished by the low-memory version of "motif" as well, this would require substantial runtime. Hence, we switch off the low-memory option in this case, which, in turn, requires increasing the memory reserved by Java:<br />
<br />
java -jar -Xms512M -Xmx64G Catchitt.jar motif d=motifs/CTCF_Slim/Ctcf_H1hesc_shift20_bdeu_order-20_comp1-model-1.xml\<br />
g=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\<br />
b=50 outdir=motifs/CTCF_Slim l=false threads=3<br />
java -jar -Xms512M -Xmx64G Catchitt.jar motif d=motifs/JUND_Slim/Jund_K562_shift20_bdeu_order-20_comp1-model-1.xml\<br />
g=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\<br />
b=50 outdir=motifs/JUND_Slim l=false threads=3<br />
 java -jar -Xms512M -Xmx64G Catchitt.jar motif d=motifs/MAX_Slim/Max_K562_shift20_bdeu_order-20_comp1-model-1.xml\<br />
g=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\<br />
b=50 outdir=motifs/MAX_Slim l=false threads=3<br />
<br />
Finally, we start the iterative training using the new feature files:<br />
java -jar Catchitt.jar itrain a=astrocytes/access_bam/Chromatin_accessibility.tsv.gz\<br />
m=motifs/CTCF/Motif_scores.tsv.gz m=motifs/CTCF_Slim/Motif_scores.tsv.gz m=motifs/JUND_Slim/Motif_scores.tsv.gz\<br />
m=motifs/MAX_Slim/Motif_scores.tsv.gz m=motifs/SP1/Motif_scores.tsv.gz l=astrocytes/labels/Labels.tsv.gz\<br />
f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai b=50\<br />
     t='chr2,chr3,chr4,chr5,chr6,chr7,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr22'\<br />
itc='chr10,chr11,chr12,chr13,chr14' outdir=astrocytes/itrain_bam_5motifs threads=8<br />
Please note that we used the parameter "m" multiple times to specify the different motif-based feature files.<br />
<br />
It is important to specify these motifs in the same order when calling "predict" afterwards, i.e.<br />
java -jar Catchitt.jar predict c=astrocytes/itrain_bam_5motifs/Classifiers.xml a=fibroblasts/access_bam/Chromatin_accessibility.tsv.gz\<br />
m=motifs/CTCF/Motif_scores.tsv.gz m=motifs/CTCF_Slim/Motif_scores.tsv.gz m=motifs/JUND_Slim/Motif_scores.tsv.gz\<br />
m=motifs/MAX_Slim/Motif_scores.tsv.gz m=motifs/SP1/Motif_scores.tsv.gz\<br />
f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai p="chr8" outdir=fibroblasts/predict_bam_5motifs<br />
<br />
The predictions based on the BAM files and the five motifs are then available from the file fibroblasts/predict_bam_5motifs/Predictions.tsv.gz in the format explained previously.<br />
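<br />
Since the order of the "m" parameters must agree between "itrain" and "predict", one option is to generate both command lines from a single list of motif directories. The following Python sketch only assembles the argument lists from the parameters used in this tutorial and does not run Catchitt; the function names are hypothetical.<br />

```python
GENOME_INDEX = "GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai"

# Single source of truth for the motif order.
MOTIF_DIRS = [
    "motifs/CTCF",
    "motifs/CTCF_Slim",
    "motifs/JUND_Slim",
    "motifs/MAX_Slim",
    "motifs/SP1",
]


def motif_args(motif_dirs):
    """One m=... argument per motif directory, in the given order."""
    return [f"m={d}/Motif_scores.tsv.gz" for d in motif_dirs]


def itrain_command(access, labels, train_chroms, iter_chroms, outdir, threads=8):
    return (
        ["java", "-jar", "Catchitt.jar", "itrain", f"a={access}"]
        + motif_args(MOTIF_DIRS)
        + [f"l={labels}", f"f={GENOME_INDEX}", "b=50", f"t={train_chroms}",
           f"itc={iter_chroms}", f"outdir={outdir}", f"threads={threads}"]
    )


def predict_command(classifiers, access, chroms, outdir):
    return (
        ["java", "-jar", "Catchitt.jar", "predict", f"c={classifiers}", f"a={access}"]
        + motif_args(MOTIF_DIRS)
        + [f"f={GENOME_INDEX}", f"p={chroms}", f"outdir={outdir}"]
    )


train = itrain_command(
    "astrocytes/access_bam/Chromatin_accessibility.tsv.gz",
    "astrocytes/labels/Labels.tsv.gz",
    "chr2,chr3,chr4",  # shortened list, for illustration only
    "chr3,chr4",
    "astrocytes/itrain_bam_5motifs",
)
predict = predict_command(
    "astrocytes/itrain_bam_5motifs/Classifiers.xml",
    "fibroblasts/access_bam/Chromatin_accessibility.tsv.gz",
    "chr8",
    "fibroblasts/predict_bam_5motifs",
)
print(" ".join(train))
print(" ".join(predict))
```

The resulting lists could be passed to subprocess.run(..., check=True) to execute the two steps.<br />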
<br />
== Version history ==<br />
<br />
* Catchitt v0.1.3: Bugfix to load Catchitt classifiers learned with older Catchitt versions<br />
<br />
* [http://www.jstacs.de/downloads/Catchitt_0.1.2.jar Catchitt v0.1.2]: Bugfixes, new experimental tools for handling methylation levels<br />
<br />
* [http://www.jstacs.de/downloads/Catchitt_0.1.1.jar Catchitt v0.1.1]: Bugfixes for border cases; reduced debugging output<br />
<br />
* Catchitt v0.1: [http://www.jstacs.de/downloads/Catchitt_0.1.jar Initial release]</div>Grauhttps://www.jstacs.de/index.php?title=AnnoTALE&diff=1151AnnoTALE2021-08-17T13:53:50Z<p>Grau: /* Class Builders */</p>
<hr />
<div>[[File:AnnoTALE.png|130px|left]]<br />
Transcription activator-like effectors (TALEs) are virulence factors of plant-pathogenic Xanthomonas spp. that function as gene activators inside plant host cells.<br />
<br />
AnnoTALE is a suite of applications for identifying and analysing TALEs in Xanthomonas genomes, for clustering TALEs into classes by their RVD sequences, for assigning novel TALEs to existing classes, for proposing TALE names using a unified nomenclature, and for predicting targets of individual TALEs and TALE classes.<br />
<br />
AnnoTALE is available as a JavaFX-based stand-alone application with graphical user interface for interactive analysis sessions. <br />
In addition, we provide a command line application that may be integrated into other pipelines. <br />
Both use identical code for the actual analysis, ensuring consistent results between both versions.<br />
<br />
<br />
<br />
If you use AnnoTALE, please cite:<br />
<br />
Jan Grau, Maik Reschke, Annett Erkes, Jana Streubel, Richard D. Morgan, Geoffrey G. Wilson, Ralf Koebnik and Jens Boch. [http://www.nature.com/articles/srep21077 AnnoTALE: bioinformatics tools for identification, annotation, and nomenclature of TALEs from ''Xanthomonas'' genomic sequences]. Scientific Reports 6:21077, DOI: 10.1038/srep21077, 2016.<br />
<br />
<br />
For evolution-related studies using the comparative features of AnnoTALE, please also cite:<br />
<br />
Annett Erkes, Maik Reschke, Jens Boch, and Jan Grau. [https://doi.org/10.1093/gbe/evx108 Evolution of transcription activator-like effectors in Xanthomonas oryzae]. Genome Biology and Evolution, 9(6):1599–1615, 2017.<br />
<br />
<br />
If you use PrediTALE for predicting TALE targets, please also cite:<br />
<br />
Annett Erkes, Stefanie Mücke, Maik Reschke, Jens Boch, and Jan Grau. [https://doi.org/10.1371/journal.pcbi.1007206 PrediTALE: A novel model learned from quantitative data allows for new perspectives on TALE targeting]. PLOS Computational Biology, 15(7):1–31, 2019.<br />
<br />
<br />
'''Important:''' If you would like to use the unified nomenclature of AnnoTALE in one of your publications including new TALEs or sequenced genomes, please contact us (grau@informatik.uni-halle.de) to organize the inclusion of your TALEs into the official class definition of AnnoTALE and to create stable TALE names that are unique to your TALEs.<br />
<br />
<br />
== AnnoTALE with GUI ==<br />
<br />
[[File:AnnoTALEscreenshot.jpg]]<br />
<br />
AnnoTALE is based on the implementation of JavaFX in Java >=8.<br />
<br />
We provide AnnoTALE as a runnable JAR file for those with a current version of Java 8 (at least update 45) on their machine.<br />
<br />
For users' convenience, we also provide pre-packaged versions of AnnoTALE for Mac OS X and Windows, which include Java in the required version. Each of these is available in two variants with different memory requirements (2GB and 6GB). As long as the main memory (RAM) of your machine is sufficient, we recommend using the 6GB version of AnnoTALE.<br />
<br />
<br />
=== Download ===<br />
<br />
''AnnoTALE is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.''<br />
<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.5.jar Runnable Jar] (requires installed Java >= 8, update 45), may be run under Linux, macOS and Windows<br />
* macOS app: [http://www.jstacs.de/downloads/AnnoTALE-1.5.app-2GB.zip 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.5.app-6GB.zip 6GB version], ZIP archive containing a macOS app including AnnoTALE and all required Java modules. For running this app, it might be required to explicitly give it running permissions in "System Preferences" -> "Security & Privacy" -> "General", which should list AnnoTALE after the first (possibly unsuccessful) starting attempt. Approve opening AnnoTALE by clicking on the button "Open Anyway" next to it.<br />
* Windows installer of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.5-2GB.exe 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.5-6GB.exe 6GB version, 64bit Java]<br />
* Windows version without installer: [http://www.jstacs.de/downloads/AnnoTALE-1.5-win.zip 6GB version, 64bit Java], ZIP archive containing AnnoTALE, all required Java modules, and a Windows batch file. For starting AnnoTALE, double-click AnnoTALE.bat.<br />
<br />
=== Source code ===<br />
<br />
The AnnoTALE source code is available from [https://github.com/Jstacs/Jstacs/tree/master/projects/xanthogenomes github].<br />
<br />
<br />
=== User Guide ===<br />
<br />
We provide an [http://www.jstacs.de/downloads/AnnoTALE-UserGuide-1.0.pdf AnnoTALE User Guide] in PDF format, including a detailed description of all AnnoTALE tools and installation instructions.<br />
<br />
== AnnoTALE command line application ==<br />
<br />
The AnnoTALE command line application is available as a [http://www.jstacs.de/downloads/AnnoTALEcli-1.5.jar runnable Jar]. For running the program and a quick help, type<br />
<br />
java -jar AnnoTALEcli-1.5.jar<br />
<br />
For larger analyses, it might be necessary to increase the memory allocated by the JavaVM using the <code>-Xms</code> and <code>-Xmx</code> parameters, for instance<br />
java -Xms512M -Xmx6G -jar AnnoTALEcli-1.5.jar<br />
<br />
There is no separate User Guide for the AnnoTALE command line application, but the [http://www.jstacs.de/downloads/AnnoTALE-UserGuide-1.0.pdf User Guide for the GUI version] describes all AnnoTALE tools, their parameters and outputs, and those of the CLI version are identical.<br />
<br />
You obtain a list of all AnnoTALE tools by calling<br />
<br />
java -jar AnnoTALEcli-1.5.jar<br />
<br />
Output:<br />
<br />
Available tools:<br />
<br />
predict - TALE Prediction<br />
analyze - TALE Analysis<br />
build - TALE Class Builder<br />
loadAndView - Load and View TALE Classes<br />
assign - TALE Class Assignment<br />
rename - Rename TALEs in File<br />
targets - Predict and Intersect Targets<br />
presence - TALE Class Presence<br />
repdiff - TALE Repeat Differences<br />
preditale - PrediTALE<br />
dertale - DerTALE<br />
<br />
Syntax: java -jar AnnoTALEcli-1.5.jar <toolname> [<parameter=value> ...]<br />
<br />
Further info about the tools is given with<br />
java -jar AnnoTALEcli-1.5.jar <toolname> info<br />
<br />
Tool parameters are listed with<br />
java -jar AnnoTALEcli-1.5.jar <toolname><br />
<br />
You get a list of input parameters by calling AnnoTALEcli-1.5.jar with the corresponding tool name, e.g.,<br />
<br />
java -jar AnnoTALEcli-1.5.jar predict<br />
<br />
Output:<br />
<br />
At least one parameter has not been set (correctly):<br />
<br />
Parameters of tool "TALE Prediction" (predict):<br />
g - Genome (The input Xanthomonas genome in FastA or Genbank format) = null<br />
s - Strain (The name of the strain, will be used for annotated TALEs, OPTIONAL) = null<br />
outdir - The output directory, defaults to the current working directory (.) = .<br />
<br />
You get a description of each tool by calling AnnoTALEcli-1.5.jar with the corresponding tool name and keyword "info", e.g.,<br />
<br />
java -jar AnnoTALEcli-1.5.jar predict info<br />
<br />
Output:<br />
A detailed description of all tools is available in the AnnoTALE User Guide (http://www.jstacs.de/downloads/AnnoTALE-UserGuide-1.0.pdf).<br />
<br />
*TALE Prediction* predicts transcription activator-like effector (TALE) genes in an input sequence, typically a 'Xanthomonas' genome.<br />
<br />
'TALE Prediction' is based in HMMer nucleotide HMM models that describe N-terminus, repeat region, and C-terminus of TALEs.<br />
<br />
The input 'Genome' may be provided in FastA or Genbank format. <br />
Optionally, you may provide a strain name that will be used in the temporary TALE names and names of output files.<br />
<br />
Regardless of the input format, 'TALE Prediction' generates output in Genbank format containing the annotations of TALE genes. If the original input has already been a Genbank file, TALE annotations are added to the existing ones.<br />
In addition, 'TALE Prediction' generates annotations in GFF format, and also outputs the DNA and AS sequences of the predicted TALEs in FastA format.<br />
<br />
'TALE Prediction' tries hard to make the CDS annotation a proper gene model, starting from a start codon and ending with a Stop. If either start or stop codon are located within the originally predicted region that is homologous to TALE genes, this original hit region is still reported as mRNA.<br />
Putative pseudo genes, e.g., with premature stop codons, are marked accordingly.<br />
<br />
The TALE DNA sequences output of 'TALE Prediction' may serve as input of the 'TALE Analysis', 'TALE Class Builder', and 'TALE Class Assignment' tools.<br />
<br />
If you experience problems using 'TALE Prediction', please contact us.<br />
<br />
=== Standard pipeline ===<br />
<br />
Assuming that your current working directory contains the AnnoTALEcli Jar file, a genome of interest (of a hypothetical 'Xoo' strain PXO999 with accession CP1234567) in a FastA file "genome.fa", all rice promoters in a FastA file "Rice-promoters.fa", and a directory "out" designated to hold all output files, a typical AnnoTALE pipeline could look like<br />
<br />
java -jar AnnoTALEcli-1.5.jar predict g=genome.fa outdir=out<br />
<br />
java -jar AnnoTALEcli-1.5.jar analyze t=out/TALE_DNA_sequences.fasta outdir=out<br />
<br />
java -jar AnnoTALEcli-1.5.jar loadAndView outdir=out<br />
<br />
java -jar AnnoTALEcli-1.5.jar assign c=out/Class_builder_download.xml t=out/TALE_DNA_parts.fasta s="Xoo PXO999" a="CP1234567" outdir=out<br />
<br />
java -jar AnnoTALEcli-1.5.jar rename r=out/TALE_names_\(Xoo_PXO999\).tsv i=out/Genbank__TALE_predictions.gb outdir=out<br />
<br />
java -jar AnnoTALEcli-1.5.jar targets i=Rice-promoters.fa p="TALEs in class builder" c=out/Augmented_class_builder_\(Xoo_PXO999\).xml outdir=out<br />
<br />
Afterwards, you find all output files of all those tools in the directory "out". The output files and directories are named in analogy to the names in the AnnoTALE GUI version (see [http://www.jstacs.de/downloads/AnnoTALE-UserGuide-1.0.pdf User Guide for the GUI version])<br />
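<br />
When integrating the command line application into other pipelines, the calls above can be assembled programmatically following the syntax "java -jar AnnoTALEcli-1.5.jar &lt;toolname&gt; [&lt;parameter=value&gt; ...]". The following Python sketch builds such argument lists without executing them; the helper name is hypothetical, and the parameters are those of the first two pipeline steps above.<br />

```python
import shlex


def annotale_command(tool, jar="AnnoTALEcli-1.5.jar", **params):
    """Build the argument list for one AnnoTALE CLI call (parameter=value pairs)."""
    return ["java", "-jar", jar, tool] + [f"{key}={value}" for key, value in params.items()]


steps = [
    annotale_command("predict", g="genome.fa", outdir="out"),
    annotale_command("analyze", t="out/TALE_DNA_sequences.fasta", outdir="out"),
]
for cmd in steps:
    # shlex.join quotes arguments where needed, so the line can be pasted into a shell.
    print(shlex.join(cmd))

# To actually run a step, one could use:
#   import subprocess
#   subprocess.run(steps[0], check=True)
```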
<br />
==Version history==<br />
<br />
===AnnoTALE===<br />
'''Version 1.5'''<br />
* new "sensitive" mode of TALE Prediction tool, which may annotate TALEs in a wider range of Xanthomonas strains at the expense of an increased runtime; turned off by default<br />
* significantly improved speed of TALE Class Assignment tool<br />
* citation information for individual AnnoTALE tools available under a dedicated button in the GUI version and from the "info" command issued for individual tools in the command line version<br />
* bugfix for TALE Prediction in rather fragmented genome assemblies, where TALE predictions may extend to the ends of contigs/sequences<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.5.jar Runnable Jar] (requires installed Java >= 8, update 45), may be run under Linux, macOS and Windows<br />
* macOS app: [http://www.jstacs.de/downloads/AnnoTALE-1.5.app-2GB.zip 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.5.app-6GB.zip 6GB version], ZIP archive containing a macOS app including AnnoTALE and all required Java modules. For running this app, it might be required to explicitly give it running permissions in "System Preferences" -> "Security & Privacy" -> "General", which should list AnnoTALE after the first (possibly unsuccessful) starting attempt.<br />
* Windows installer of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.5-2GB.exe 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.5-6GB.exe 6GB version, 64bit Java]<br />
* Windows version without installer: [http://www.jstacs.de/downloads/AnnoTALE-1.5-win.zip 6GB version, 64bit Java], ZIP archive containing AnnoTALE, all required Java modules, and a Windows batch file. For starting AnnoTALE, double-click AnnoTALE.bat.<br />
<br />
<br />
'''Version 1.4.1'''<br />
* first version to use the updated Class Builder including a large number of recently sequenced strains<br />
* minor changes to the output of the 'Load and View TALE Classes' tool, now including the accessions in the TALE sequence output<br />
* changes to the Class Builder format to account for the increased size of class hierarchy, which previously resulted in unnecessarily large files<br />
* 32bit/1GB Windows version no longer included<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.4.1.jar Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-2GB.dmg 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-6GB.dmg 6GB version]<br />
* Windows installer of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-2GB.exe 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-6GB.exe 6GB version, 64bit Java]<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.4.1.jar AnnoTALE 1.4.1 command line application]<br />
<br />
<br />
'''Version 1.4:'''<br />
* first version containing [[PrediTALE]] and DerTALE tools for target site prediction<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.4.jar Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.4-2GB.dmg 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.4-6GB.dmg 6GB version]<br />
* Windows installer of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.4-2GB.exe 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.4-6GB.exe 6GB version, 64bit Java]; in addition, we provide a [http://www.jstacs.de/downloads/AnnoTALE-1.4-1GB.exe 1GB version with 32bit Java] for earlier and 32bit versions of Windows. Please use this version only if absolutely necessary, as some tools may not work due to memory restrictions.<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.4.jar AnnoTALE 1.4 command line application]<br />
<br />
<br />
'''Version 1.3:'''<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.3.jar AnnoTALE 1.3 Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE 1.3 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.3-2GB.dmg AnnoTALE 1.3 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.3-6GB.dmg AnnoTALE 1.3 6GB version]<br />
* Windows installer of AnnoTALE 1.3 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.3-2GB.exe AnnoTALE 1.3 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.3-6GB.exe AnnoTALE 1.3 6GB version, 64bit Java]; [http://www.jstacs.de/downloads/AnnoTALE-1.3-1GB.exe AnnoTALE 1.3 1GB version with 32bit Java]<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.3.jar AnnoTALE 1.3 command line application]<br />
<br />
Changes:<br />
* modified format of Class Builder files allowing for faster download using the "Load and View TALE Classes" tool; old Class Builder files can still be loaded<br />
* "TALE Class Presence" now also outputs a phylogenetic tree of strains based on TALEome similarities<br />
<br />
<br />
'''Version 1.2:'''<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.2.jar AnnoTALE 1.2 Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE 1.2 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.2-2GB.dmg AnnoTALE 1.2 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.2-6GB.dmg AnnoTALE 1.2 6GB version]<br />
* Windows installer of AnnoTALE 1.2 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.2-2GB.exe AnnoTALE 1.2 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.2-6GB.exe AnnoTALE 1.2 6GB version, 64bit Java]; [http://www.jstacs.de/downloads/AnnoTALE-1.2-1GB.exe AnnoTALE 1.2 1GB version with 32bit Java]<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.2.jar AnnoTALE 1.2 command line application]<br />
<br />
Changes:<br />
* Results and loaded files may now be renamed in the GUI by clicking on the corresponding name in the "Data" panel<br />
* Minor bugfixes and improvements of the GUI (Protocol may be erased, columns in "Data" panel renamed for clarity, consistency of paths in the open/save dialogs under Linux)<br />
* Two new tools: "TALE Class Presence" and "TALE Repeat differences"<br />
<br />
'''Version 1.1:'''<br />
<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.1.jar AnnoTALE 1.1 Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE 1.1 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.1-2GB.dmg AnnoTALE 1.1 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.1-6GB.dmg AnnoTALE 1.1 6GB version]<br />
* Windows installer of AnnoTALE 1.1 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.1-2GB.exe AnnoTALE 1.1 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.1-6GB.exe AnnoTALE 1.1 6GB version, 64bit Java]; [http://www.jstacs.de/downloads/AnnoTALE-1.1-1GB.exe AnnoTALE 1.1 1GB version with 32bit Java]<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.1.jar AnnoTALE 1.1 command line application]<br />
<br />
Changes:<br />
* Additional output for the "Load and View TALE Classes" tool<br />
* "TALE Class Builder" and "TALE Class Assignment" now also accept RVD sequences (separated by dashes) as input. However, this is not recommended and some features (e.g., highlighting of aberrant repeats) will not be available. Only complete TALE DNA sequences will be accepted for inclusion into the official Class Builder.<br />
* The internal help pages now link to the PDF User Guide<br />
<br />
'''Version 1.0:'''<br />
<br />
''Initial AnnoTALE release''<br />
<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.0.jar AnnoTALE 1.0 Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.0-2GB.dmg AnnoTALE 1.0 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.0-6GB.dmg AnnoTALE 1.0 6GB version]<br />
* Windows installer of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.0-2GB.exe AnnoTALE 1.0 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.0-6GB.exe AnnoTALE 1.0 6GB version, 64bit Java]; [http://www.jstacs.de/downloads/AnnoTALE-1.0-1GB.exe AnnoTALE 1.0 1GB version with 32bit Java]<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.0.jar AnnoTALE 1.0 command line application]<br />
<br />
=== Class Builders ===<br />
<br />
* [http://www.jstacs.de/downloads/class_definitions_17_08_2021.xml.gz Version 17/08/2021]: used for "Download current definition" in "Load and View TALE Classes" within AnnoTALE version 1.4.1 and later<br />
* [http://www.jstacs.de/downloads/class_definitions_09_05_2021.xml.gz Version 09/05/2021]: compatible with AnnoTALE version 1.4.1 and later<br />
* [http://www.jstacs.de/downloads/class_definitions_10_10_2020.xml.gz Version 10/10/2020]: compatible with AnnoTALE version 1.4.1 and later<br />
* [http://www.jstacs.de/downloads/class_definitions_20_06_2019.xml.gz Version 20/06/2019]: compatible with AnnoTALE version 1.4.1 and later<br />
* [http://www.jstacs.de/downloads/class_definitions_29_09_2018.xml.gz Version 29/09/2018]: used for "Download current definition" in "Load and View TALE Classes" within AnnoTALE version 1.3 and later<br />
* [http://www.jstacs.de/downloads/class_definitions_09_03_2017.xml Version 09/03/2017]: used for "Download current definition" in "Load and View TALE Classes" within AnnoTALE version 1.2 and earlier<br />
* [http://www.jstacs.de/downloads/class_definitions_11_03_2016.xml Version 11/03/2016]<br />
* [http://www.jstacs.de/downloads/class_definitions_29_01_2016.xml Version 29/01/2016]<br />
* [http://www.jstacs.de/downloads/class_definitions_19_10.xml Version 19/10/2015]: used in the AnnoTALE publication (Grau ''et al.'', Sci Rep, 2016)</div>Grauhttps://www.jstacs.de/index.php?title=EpiTALE&diff=1150EpiTALE2021-05-12T15:16:03Z<p>Grau: /* GUI version */</p>
<hr />
<div>[[File:EpiTALE_256.png|130px|left]] EpiTALE predicts binding sites of transcription activator-like effectors (TALEs) in promoteromes or genomes. EpiTALE not only considers the DNA sequence of putative binding sites but also epigenetic determinants of TALE binding, namely DNA methylation and chromatin accessibility. The prediction is based on the same basic model as [[PrediTALE]] but with specific parameters for methylated cytosines reflecting the binding preferences of RVDs.<br />
<br />
Here, we provide a suite of tools including the EpiTALE program itself but also auxiliary tools for converting methylation data and chromatin accessibility data to the required formats, and for converting genomic coordinates to promoter-wise coordinates for promoterome-wide predictions.<br />
<br />
Genome-wide predictions of EpiTALE may further be combined with evidence from RNA-seq data using the DerTALE tool of [[AnnoTALE]].<br />
<br />
The EpiTALE suite is provided in a version with a graphical user interface and in a command line version, which may serve the needs of different user groups. Both versions use the identical code base.<br />
<br />
In the following, we describe how to obtain the EpiTALE suite and how to use its individual tools. While parameters are described in terms of command line arguments, the same parameters are available in the version with graphical user interface.<br />
<br />
== Download ==<br />
<br />
=== GUI version ===<br />
<br />
* [http://www.jstacs.de/downloads/EpiTALE-0.1.jar Runnable Jar]: requires an installed Java >= 8 including JavaFX; may be run under Linux, Windows and macOS.<br />
* [http://www.jstacs.de/downloads/EpiTALE-0.1.app.zip macOS app]: ZIP archive containing a macOS app including EpiTALE and all required Java modules. For running this app, it might be required to explicitly give it running permissions in "System Preferences" -> "Security & Privacy" -> "General", which should list EpiTALE after the first (possibly unsuccessful) starting attempt. Approve opening EpiTALE by clicking on the button "Open Anyway" next to it.<br />
* [http://www.jstacs.de/downloads/EpiTALE-0.1-win.zip Windows program]: ZIP archive containing the EpiTALE Jar, all required Java modules, and a Windows batch file. For starting EpiTALE, double-click EpiTALE.bat.<br />
<br />
=== Command line version ===<br />
<br />
* [http://www.jstacs.de/downloads/EpiTALEcli-0.1.jar Runnable Jar]: requires Java >= 8, may be run under Linux, Windows and macOS. Started with<br />
java -jar EpiTALEcli-0.1.jar<br />
from the command line (for tools and arguments, see below).<br />
<br />
=== Source code ===<br />
<br />
EpiTALE source code is available from [https://github.com/Jstacs/Jstacs/tree/master/projects/tals/epigenetic GitHub].<br />
<br />
== Example data ==<br />
<br />
We provide an archive with example data at [https://doi.org/10.5281/zenodo.4749294 Zenodo]. Besides the data, this archive contains the command line version of the EpiTALE suite v0.1 and a bash script demonstrating the complete EpiTALE pipeline.<br />
<br />
== Tools ==<br />
<br />
=== Bed2Bismark ===<br />
<br />
'''Bed2Bismark''' converts methylation information in bedMethyl format to Bismark format.<br />
<br />
The input of '''Bed2Bismark''' is a file in bedMethyl format.<br />
<br />
If you experience problems using '''Bed2Bismark''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Bed2Bismark'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar bed2bismark<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>BedMethyl file (Methylation information in bedMethyl format, type = bed.gz,bed)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar bed2bismark b=&lt;BedMethyl_file&gt;<br />
<br />
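The kind of conversion performed here can be illustrated with a minimal Python sketch. It is not the Bed2Bismark implementation: it assumes the common tab-separated bedMethyl layout (0-based start in column 2, read coverage in column 10, percent methylation in column 11) and the six-column Bismark coverage layout described for the other tools on this page. The function name <code>bedmethyl_to_bismark</code> and the example line are hypothetical.<br />

```python
def bedmethyl_to_bismark(line):
    """Convert one bedMethyl line to one Bismark coverage line.

    Assumed layouts (not taken from the Bed2Bismark source code):
    bedMethyl: chrom, start (0-based), end, name, score, strand,
               thickStart, thickEnd, itemRgb, coverage, percent methylated
    Bismark:   chrom, start (1-based), end, percent, methylated, unmethylated
    """
    fields = line.rstrip("\n").split("\t")
    chrom = fields[0]
    pos = int(fields[1]) + 1          # shift the 0-based BED start to 1-based
    coverage = int(fields[9])         # column 10: number of reads
    percent = float(fields[10])       # column 11: percent methylated reads
    methylated = round(coverage * percent / 100.0)
    unmethylated = coverage - methylated
    return f"{chrom}\t{pos}\t{pos}\t{percent:g}\t{methylated}\t{unmethylated}"


# Hypothetical example: 20 reads cover the cytosine, 25% of them methylated.
example = "Chr1\t99\t100\t.\t20\t+\t99\t100\t0,0,0\t20\t25"
print(bedmethyl_to_bismark(example))
```

Rounding the methylated count is a design choice of this sketch, needed because bedMethyl stores a percentage rather than raw counts.<br />
<br />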
<br />
=== BismarkMerge2Files ===<br />
<br />
'''BismarkMerge2Files''' merges files generated by [https://www.bioinformatics.babraham.ac.uk/projects/bismark/ Bismark methylation extractor] with parameters <code>--bedGraph --CX -p</code>.<br />
The output is a coverage file with the tab-separated columns:<br />
<code>chromosome, start_position, end_position, methylation_percentage, count_methylated, count_unmethylated</code>.<br />
<br />
The input of '''BismarkMerge2Files''' consists of two Bismark coverage files.<br />
<br />
If you experience problems using '''BismarkMerge2Files''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
<br />
''BismarkMerge2Files'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar bismerger<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bismark file 1 (Methylation information in bismark format file 1, type = cov.gz,cov)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bf2</font></td><br />
<td>Bismark file 2 (Methylation information in bismark format file 2, type = cov.gz,cov)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar bismerger b=&lt;Bismark_file_1&gt; bf2=&lt;Bismark_file_2&gt;<br />
<br />
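One plausible way to merge two coverage files is sketched below in Python. This is an assumption for illustration only, not the documented behaviour of BismarkMerge2Files: the sketch sums the methylated and unmethylated counts per position and recomputes the methylation percentage from the summed counts. The function name <code>merge_coverage</code> and the example lines are hypothetical.<br />

```python
from collections import defaultdict


def merge_coverage(lines_a, lines_b):
    """Merge two lists of Bismark coverage lines by summing read counts.

    Assumption of this sketch: positions present in both inputs are combined
    by adding their counts; the percentage is recomputed afterwards.
    """
    counts = defaultdict(lambda: [0, 0])  # (chrom, start, end) -> [meth, unmeth]
    for line in list(lines_a) + list(lines_b):
        chrom, start, end, _pct, meth, unmeth = line.rstrip("\n").split("\t")
        key = (chrom, int(start), int(end))
        counts[key][0] += int(meth)
        counts[key][1] += int(unmeth)

    merged = []
    for (chrom, start, end), (meth, unmeth) in sorted(counts.items()):
        total = meth + unmeth
        pct = 100.0 * meth / total if total else 0.0
        merged.append(f"{chrom}\t{start}\t{end}\t{pct:.2f}\t{meth}\t{unmeth}")
    return merged


# Hypothetical input: one shared position and one position unique to file 2.
file_1 = ["Chr1\t100\t100\t50\t2\t2"]
file_2 = ["Chr1\t100\t100\t100\t4\t0", "Chr1\t200\t200\t0\t0\t3"]
for row in merge_coverage(file_1, file_2):
    print(row)
```
<br />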
<br />
=== BismarkConvertToPromoter ===<br />
<br />
'''BismarkConvertToPromoter''' converts the Bismark output file to promoter coordinates.<br />
<br />
The input of '''BismarkConvertToPromoter''' is <br />
1. a Bismark coverage output file, which contains tab-separated columns: <br />
<code>chromosome, start_position, end_position, methylation_percentage, count_methylated, count_unmethylated</code> and <br />
2. the promoter sequences in FastA format with headers like:<br />
<code>> id chromosomeName:start-end:strand</code><br />
e.g.<br />
<code>> Os01g01010.1 Chr1:2602-3102:+</code>.<br />
<br />
If you experience problems using '''BismarkConvertToPromoter''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''BismarkConvertToPromoter'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar bis2prom<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bismark file (Methylation information in bismark format, type = cov.gz,cov)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Promoter fasta file (Promoter fastA file, type = fa,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar bis2prom b=&lt;Bismark_file&gt; p=&lt;Promoter_fasta_file&gt;<br />
<br />
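The promoter header format described above can be parsed, and a genomic position mapped into a promoter, as in the following Python sketch. It is an illustration under stated assumptions, not the BismarkConvertToPromoter code: coordinates are taken to be 1-based and inclusive, and positions in reverse-strand promoters are counted from the promoter's end. The names <code>parse_header</code> and <code>to_promoter_position</code> are hypothetical.<br />

```python
import re

# Matches headers such as "> Os01g01010.1 Chr1:2602-3102:+"
HEADER = re.compile(r"^>\s*(\S+)\s+(\S+):(\d+)-(\d+):([+-])$")


def parse_header(header):
    """Split a promoter FastA header into id, chromosome, start, end, strand."""
    match = HEADER.match(header.strip())
    if match is None:
        raise ValueError(f"unexpected header: {header!r}")
    gene_id, chrom, start, end, strand = match.groups()
    return gene_id, chrom, int(start), int(end), strand


def to_promoter_position(pos, start, end, strand):
    """Map a genomic position to a 1-based position within the promoter.

    Assumptions of this sketch: start and end are 1-based and inclusive, and
    reverse-strand promoters are counted from their genomic end.
    Returns None if the position lies outside the promoter.
    """
    if not start <= pos <= end:
        return None
    return pos - start + 1 if strand == "+" else end - pos + 1


gene_id, chrom, start, end, strand = parse_header("> Os01g01010.1 Chr1:2602-3102:+")
print(gene_id, chrom, to_promoter_position(2700, start, end, strand))
```
<br />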
<br />
=== Chromatin pileup ===<br />
<br />
'''Chromatin pileup''' takes as input a BAM file of mapped reads from a DNase-seq or ATAC-seq experiment <br />
and computes a coverage pileup of 5' ends of mapped reads, <br />
and outputs a simple tab-separated file with columns: <br />
<code>chromosome, position,</code> and <code>pileup value</code> (number of reads with a 5' end at this position).<br />
<br />
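The idea of a pileup of 5' ends can be sketched in Python as follows. To stay self-contained, the sketch does not read BAM files; it takes reads as hypothetical <code>(chromosome, start, end, strand)</code> tuples with 1-based inclusive coordinates and assumes that the 5' end of a reverse-strand read is its rightmost mapped position. It is not the implementation used by '''Chromatin pileup'''.<br />

```python
from collections import Counter


def five_prime_pileup(reads):
    """Count, per position, the number of reads whose 5' end maps there.

    reads: iterable of (chrom, start, end, strand) tuples, 1-based inclusive
    (a simplification of BAM records, assumed for this sketch).
    """
    pileup = Counter()
    for chrom, start, end, strand in reads:
        # Forward reads start at their leftmost base, reverse reads at their rightmost.
        five_prime = start if strand == "+" else end
        pileup[(chrom, five_prime)] += 1
    return pileup


# Hypothetical reads: three 5' ends fall on position 100, one on position 130.
reads = [
    ("Chr1", 100, 150, "+"),
    ("Chr1", 100, 140, "+"),
    ("Chr1", 60, 100, "-"),
    ("Chr1", 90, 130, "-"),
]
for (chrom, pos), value in sorted(five_prime_pileup(reads).items()):
    print(f"{chrom}\t{pos}\t{value}")
```
<br />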
If you experience problems using '''Chromatin pileup''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Chromatin pileup'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar pileup<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>BAM file (Mapped reads from DNase-seq or ATAC-seq experiment, type = bam)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar pileup b=&lt;BAM_file&gt;<br />
<br />
<br />
=== NormalizePileupOutput ===<br />
<br />
'''NormalizePileupOutput''' normalizes the pileup output file, which contains the coverage by 5' ends of ATAC-seq or DNase-seq reads at each position. It normalizes the coverage relative to the mean of a 10000 bp sliding window.<br />
<br />
The input of '''NormalizePileupOutput''' is a pileup output file from the '''Chromatin pileup''' tool.<br />
<br />
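Normalization relative to a sliding-window mean can be sketched in Python as below. The details are assumptions of this sketch rather than documented behaviour of NormalizePileupOutput: the window is centered on each position (so it spans <code>window // 2</code> positions on either side), it is truncated at the sequence ends, and positions with a window mean of zero are set to zero. The function name <code>normalize_pileup</code> is hypothetical.<br />

```python
def normalize_pileup(coverage, window=10000):
    """Divide each pileup value by the mean of a window centered on it.

    coverage: list of pileup values for consecutive positions of one sequence.
    Assumptions of this sketch: centered window, truncated at the sequence
    ends, and 0.0 where the window mean is zero.
    """
    n = len(coverage)
    # Prefix sums make every window sum an O(1) lookup.
    prefix = [0]
    for value in coverage:
        prefix.append(prefix[-1] + value)

    half = window // 2
    normalized = []
    for i, value in enumerate(coverage):
        lo = max(0, i - half)
        hi = min(n, i + half + 1)
        mean = (prefix[hi] - prefix[lo]) / (hi - lo)
        normalized.append(value / mean if mean > 0 else 0.0)
    return normalized


# Tiny example with a window of 2 so that the effect is visible by eye.
print(normalize_pileup([0, 4, 0, 0], window=2))
```
<br />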
If you experience problems using '''NormalizePileupOutput''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''NormalizePileupOutput'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar normpileup<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Pileup output file (Pileup output file., type = tsv.gz,tsv,txt)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar normpileup p=&lt;Pileup_output_file&gt;<br />
<br />
<br />
=== PileupConvertToPromoter ===<br />
<br />
'''PileupConvertToPromoter''' converts the pileup output file to promoter coordinates.<br />
<br />
The input of '''PileupConvertToPromoter''' is <br />
1. a normalized pileup output file from the '''NormalizePileupOutput''' tool and <br />
2. the promoter sequences in FastA format with headers like:<br />
<code>> id chromosomeName:start-end:strand</code><br />
e.g.<br />
<code>> Os01g01010.1 Chr1:2602-3102:+</code>.<br />
<br />
If you experience problems using '''PileupConvertToPromoter''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''PileupConvertToPromoter'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar pile2prom<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Normalized pileup output file (Normalized pileup output file., type = tsv.gz,tsv)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Promoter fasta file (Promoter fastA file, type = fa,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar pile2prom n=&lt;Normalized_pileup_output_file&gt; p=&lt;Promoter_fasta_file&gt;<br />
<br />
<br />
=== NarrowPeakConvertToPromoter ===<br />
<br />
'''NarrowPeakConvertToPromoter''' converts a narrowPeak file containing peaks of chromatin accessibility to promoter coordinates.<br />
<br />
The input of '''NarrowPeakConvertToPromoter''' is <br />
1. a narrowPeak file and <br />
2. the promoter sequences in FastA format with headers like:<br />
<code>> id chromosomeName:start-end:strand</code><br />
e.g.<br />
<code>> Os01g01010.1 Chr1:2602-3102:+</code>.<br />
<br />
If you experience problems using '''NarrowPeakConvertToPromoter''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''NarrowPeakConvertToPromoter'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar peak2Prom<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>NarrowPeak file (Peak-calling output in narrowPeak format., type = narrowPeak,narrowPeak.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Promoter fasta file (Promoter fastA file, type = fa,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar peak2Prom n=&lt;NarrowPeak_file&gt; p=&lt;Promoter_fasta_file&gt;<br />
<br />
<br />
=== EpiTALE prediction ===<br />
<br />
'''EpiTALE''' predicts TALE target boxes based on the RVD sequence of a TALE, using a novel model learned from quantitative data. It optionally considers the methylation state of the target box during prediction, as DNA methylation affects the binding specificity of RVDs. <br />
Additionally, '''EpiTALE''' optionally annotates predicted target sites with chromatin accessibility, using the output of the '''NormalizePileupOutput''' tool and the results of peak calling on DNase-seq or ATAC-seq data.<br />
<br />
As input, '''EpiTALE''' requires<br />
<br />
1. a set of sequences that are scanned for putative TALE target boxes. These sequences could be promoters of genes but also complete genomic sequences (FastA format). <br />
<br />
2. For computing p-values, EpiTALE additionally needs a background set of sequences, which is by default generated as a sub-sample of the original input data.<br />
<br />
3. The prediction threshold may be defined either by means of a p-value or by an approximate number of expected sites. The latter will also be converted to a p-value internally, and the defined number of expected sites is, in general, not met exactly.<br />
<br />
4. TALEs are specified by a FastA file containing their RVD sequences, where individual RVDs are separated by dashes (-). This is the same format also output by the ''TALE Analysis'' tool of [http://www.jstacs.de/index.php/AnnoTALE AnnoTALE].<br />
<br />
5. It can be specified if both strands or only one of the strands are scanned where, in the former case, a penalty may be assigned to predictions on the reverse strand. While this penalty may be reasonable when scanning promoters, it should usually be set to <code>0</code> in case of genome-wide predictions.<br />
<br />
6. Optionally, '''EpiTALE''' considers methylation during prediction if Bismark output is provided. Running the [https://www.bioinformatics.babraham.ac.uk/projects/bismark/ Bismark methylation extractor] with parameters <code>--bedGraph --CX -p</code> generates a coverage file, which contains the tab-separated columns: <br />
<code>chromosome, start_position, end_position, methylation_percentage, count_methylated, count_unmethylated</code> (file.cov.gz). <br />
You can alternatively use the tool '''Bed2Bismark''', which converts data in BedMethyl format to Bismark format. <br />
<br />
7.<br />
(i) The chromatin accessibility of the input sequences can optionally be provided in narrowPeak format. Such a file can be obtained by mapping ATAC-seq or DNase-seq data to the corresponding genome and then performing peak calling, e.g., with [https://github.com/mahmoudibrahim/JAMM JAMM]. In case of promoter sequences as input, you should run the tool '''NarrowPeakConvertToPromoter''' to convert the narrowPeak file to promoter positions. <br />
(ii) Additionally, you can calculate a coverage pileup of 5' ends of mapped reads with '''Chromatin pileup''' and normalize it with '''NormalizePileupOutput'''. In case of promoter sequences as input, you should run the tool '''PileupConvertToPromoter''' to convert to promoter coordinates. <br />
<br />
8.<br />
(i) In case of '''genomic search''', the parameter ''calculate coverage area'' should be <code>surround target site</code>, and you can set the number of positions before the target site with <code>coverage before value</code> (default: 300) and the number of positions after the target site with <code>coverage after value</code> (default: 200). <br />
(ii) In case of '''promoter search''', the parameter ''calculate coverage area'' may be set to <code>on complete sequence</code> or <code>surround target site</code>. The number of positions before and after the binding site in the peak profile can be set by <code>Peak before value</code> (default: 300) and <code>Peak after value</code> (default: 50).<br />
<br />
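The TALE input described in item 4 (a FastA file with dash-separated RVDs) can be read with a few lines of Python. This is a minimal sketch for checking or preparing such files, not part of the EpiTALE suite; the function name <code>read_tale_rvds</code> and the TALE name in the example are hypothetical.<br />

```python
def read_tale_rvds(lines):
    """Parse FastA lines with dash-separated RVDs into {name: [RVD, ...]}."""
    tales = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: start a new TALE entry.
            name = line[1:].strip()
            tales[name] = []
        elif name is not None:
            # Sequence line: split at dashes, skipping empty pieces.
            tales[name].extend(rvd for rvd in line.split("-") if rvd)
    return tales


# Hypothetical TALE with five repeats.
example = [">TalExample", "NI-HD-NN-NG-HD"]
for tale, rvds in read_tale_rvds(example).items():
    print(tale, len(rvds), rvds)
```
<br />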
In case of '''genomic search''', you can filter predictions of TALE target boxes by the presence of differentially expressed regions in a defined vicinity around a predicted target box with the tool '''DerTALE''' of the [http://www.jstacs.de/index.php/AnnoTALE AnnoTALE suite].<br />
<br />
If you experience problems using '''EpiTALE''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
''EpiTALE prediction'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar epitale<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Sequences (The sequences (e.g., a genome) to scan for binding sites, type = fa,fas,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Background sample (The sequences for determining the prediction threshold. Either a sub-sample of the input sequences or a dedicated background data set., range={sub-sample, background sequences}, default = sub-sample)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;sub-sample&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;background sequences&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bs</font></td><br />
<td>Background sequences (The sequences (e.g., a genome) for determining the prediction threshold, type = fa,fas,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>Threshold specification (The way of defining the prediction threshold. Either by explicitly defining a significance level or by specifying the number of expected sites, range={significance level, number of sites}, default = significance level)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;significance level&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sl</font></td><br />
<td>Significance level (The significance level for determining the prediction threshold, valid range = [0.0, 0.01], default = 1.0E-4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;number of sites&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Number of sites (The number of expected binding sites for determining the prediction threshold, valid range = [1, 1000000], default = 10000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">TALEs</font></td><br />
<td>TALEs (The RVD sequences of the TALE, separated by dashes, in FastA format, type = fasta,fas,fa)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Strand</font></td><br />
<td>Strand (Prediction target sites on both strands, or the forward or reverse strand, range={both strands, forward strand, reverse strand}, default = both strands)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;both strands&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>Reverse penalty (Penalty for predictions on the reverse strand, valid range = [0.0, 1.7976931348623157E308], default = 0.01)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;forward strand&quot;</b></td></tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;reverse strand&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bf</font></td><br />
<td>Bismark file (The bedGraph output of bismark (file.cov.gz) containing &lt;chromosome&gt; &lt;start position&gt; &lt;end position&gt; &lt;methylation percentage&gt; &lt;count methylated&gt; &lt;count unmethylated&gt;, type = cov,cov.gz, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">nf</font></td><br />
<td>NarrowPeak file (The output of a peak caller (all.peaks.narrowPeak), type = narrowPeak,narrowPeak.gz, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">npo</font></td><br />
<td>Normalized pileup output (The normalized output of pileup with values larger than zero (file.txt) containing &lt;chromosome&gt; &lt;position&gt; &lt;coverage&gt;, type = tsv,tsv.gz, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>Calculate coverage area (Calculate coverage area surround target site, or on complete sequence, range={surround target site, on complete sequence}, default = surround target site, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;surround target site&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">cbv</font></td><br />
<td>Coverage before value (Number of positions before target site in coverage profile, valid range = [1, 500], default = 300, OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">cav</font></td><br />
<td>Coverage after value (Number of positions after target site in coverage profile, valid range = [1, 500], default = 200, OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;on complete sequence&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Peak before value (Number of positions before target site in peak profile, valid range = [1, 500], default = 300, OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pav</font></td><br />
<td>Peak after value (Number of positions after target site in peak profile, valid range = [1, 500], default = 50, OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar epitale s=&lt;Sequences&gt; TALEs=&lt;TALEs&gt;</div>Grauhttps://www.jstacs.de/index.php?title=AnnoTALE&diff=1149AnnoTALE2021-05-12T15:15:47Z<p>Grau: /* Download */</p>
<hr />
<div>[[File:AnnoTALE.png|130px|left]]<br />
Transcription activator-like effectors (TALEs) are virulence factors of plant-pathogenic Xanthomonas spp. that function as gene activators inside plant host cells.<br />
<br />
AnnoTALE is a suite of applications for identifying and analysing TALEs in Xanthomonas genomes, for clustering TALEs into classes by their RVD sequences, for assigning novel TALEs to existing classes, for proposing TALE names using a unified nomenclature, and for predicting targets of individual TALEs and TALE classes.<br />
<br />
AnnoTALE is available as a JavaFX-based stand-alone application with graphical user interface for interactive analysis sessions. <br />
In addition, we provide a command line application that may be integrated into other pipelines. <br />
Both use identical code for the actual analysis, ensuring consistent results between both versions.<br />
<br />
<br />
<br />
If you use AnnoTALE, please cite:<br />
<br />
Jan Grau, Maik Reschke, Annett Erkes, Jana Streubel, Richard D. Morgan, Geoffrey G. Wilson, Ralf Koebnik and Jens Boch. [http://www.nature.com/articles/srep21077 AnnoTALE: bioinformatics tools for identification, annotation, and nomenclature of TALEs from ''Xanthomonas'' genomic sequences]. Scientific Reports 6:21077, DOI: 10.1038/srep21077, 2016.<br />
<br />
<br />
For evolution-related studies using the comparative features of AnnoTALE, please also cite:<br />
<br />
Annett Erkes, Maik Reschke, Jens Boch, and Jan Grau. [https://doi.org/10.1093/gbe/evx108 Evolution of transcription activator-like effectors in Xanthomonas oryzae]. Genome Biology and Evolution, 9(6):1599–1615, 2017.<br />
<br />
<br />
If you use PrediTALE for predicting TALE targets, please also cite:<br />
<br />
Annett Erkes, Stefanie Mücke, Maik Reschke, Jens Boch, and Jan Grau. [https://doi.org/10.1371/journal.pcbi.1007206 PrediTALE: A novel model learned from quantitative data allows for new perspectives on TALE targeting]. PLOS Computational Biology, 15(7):1–31, 2019.<br />
<br />
<br />
'''Important:''' If you would like to use the unified nomenclature of AnnoTALE in one of your publications including new TALEs or sequenced genomes, please contact us (grau@informatik.uni-halle.de) to organize the inclusion of your TALEs into the official class definition of AnnoTALE and to create stable TALE names that are unique to your TALEs.<br />
<br />
<br />
== AnnoTALE with GUI ==<br />
<br />
[[File:AnnoTALEscreenshot.jpg]]<br />
<br />
AnnoTALE is based on the implementation of JavaFX in Java >=8.<br />
<br />
We provide AnnoTALE as a runnable JAR file for those with a current version of Java 8 (at least update 45) on their machine.<br />
<br />
For users' convenience, we also provide pre-packaged versions of AnnoTALE for macOS and Windows, which include Java in the required version. Each of these is available in two versions with different memory requirements (2GB and 6GB). As long as the main memory (RAM) of your machine is sufficient, we recommend using the 6GB version of AnnoTALE.<br />
<br />
<br />
=== Download ===<br />
<br />
''AnnoTALE is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.''<br />
<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.5.jar Runnable Jar] (requires installed Java >= 8, update 45), may be run under Linux, macOS and Windows<br />
* macOS app: [http://www.jstacs.de/downloads/AnnoTALE-1.5.app-2GB.zip 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.5.app-6GB.zip 6GB version], ZIP archive containing a macOS app including AnnoTALE and all required Java modules. For running this app, it might be required to explicitly give it running permissions in "System Preferences" -> "Security & Privacy" -> "General", which should list AnnoTALE after the first (possibly unsuccessful) starting attempt. Approve opening AnnoTALE by clicking on the button "Open Anyway" next to it.<br />
* Windows installer of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.5-2GB.exe 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.5-6GB.exe 6GB version, 64bit Java]<br />
* Windows version without installer: [http://www.jstacs.de/downloads/AnnoTALE-1.5-win.zip 6GB version, 64bit Java], ZIP archive containing AnnoTALE, all required Java modules, and a Windows batch file. For starting AnnoTALE, double-click AnnoTALE.bat.<br />
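<br />
Before starting the runnable Jar, it can help to check that the installed Java meets the stated requirement (Java 8, update 45 or later). The following is a minimal sketch, not part of AnnoTALE: the function names <code>installed_java_version</code> and <code>java_version_ok</code> are hypothetical, and the parsing assumes the usual <code>java -version</code> output with a quoted version string such as "1.8.0_45" or "11.0.2".<br />
<br />
```shell
# Sketch only: function names are hypothetical and not part of AnnoTALE.

# Print the version string reported by the `java` found on the PATH,
# e.g. 1.8.0_45 or 11.0.2 (`java -version` writes to standard error).
installed_java_version() {
  java -version 2>&1 | sed -n 's/.* version "\([^"]*\)".*/\1/p' | head -n 1
}

# Succeed if the given version string is at least Java 8, update 45.
java_version_ok() {
  v=$1
  case "$v" in
    1.8.0_*)
      update=${v#1.8.0_}            # strip the "1.8.0_" prefix
      update=${update%%[!0-9]*}     # drop suffixes such as "-b14"
      [ "${update:-0}" -ge 45 ] ;;
    1.*)
      return 1 ;;                   # Java 7 and older, or Java 8 without update number
    *)
      major=${v%%[!0-9]*}           # "11.0.2" -> "11"
      [ "${major:-0}" -ge 9 ] ;;
  esac
}
```
<br />
A possible use is <code>java_version_ok "$(installed_java_version)" && java -jar AnnoTALE-1.5.jar</code>, assuming the Jar file is located in the current working directory.<br />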
<br />
=== Source code ===<br />
<br />
The AnnoTALE source code is available from [https://github.com/Jstacs/Jstacs/tree/master/projects/xanthogenomes GitHub].<br />
<br />
<br />
=== User Guide ===<br />
<br />
We provide an [http://www.jstacs.de/downloads/AnnoTALE-UserGuide-1.0.pdf AnnoTALE User Guide] in PDF format, including a detailed description of all AnnoTALE tools and installation instructions.<br />
<br />
== AnnoTALE command line application ==<br />
<br />
The AnnoTALE command line application is available as a [http://www.jstacs.de/downloads/AnnoTALEcli-1.5.jar runnable Jar]. For running the program and a quick help, type<br />
<br />
java -jar AnnoTALEcli-1.5.jar<br />
<br />
For larger analyses, it might be necessary to increase the memory allocated by the Java VM using the <code>-Xms</code> and <code>-Xmx</code> parameters, for instance<br />
java -Xms512M -Xmx6G -jar AnnoTALEcli-1.5.jar<br />
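<br />
If the memory settings are needed for every call, they may be collected in a small shell wrapper. The following is a sketch under assumptions: the function name <code>annotale</code> and the variables <code>ANNOTALE_JAR</code>, <code>ANNOTALE_XMS</code> and <code>ANNOTALE_XMX</code> are hypothetical names chosen for illustration and are not defined by AnnoTALE itself.<br />
<br />
```shell
# Hypothetical wrapper: the function and variable names are illustrative only.
ANNOTALE_JAR="${ANNOTALE_JAR:-AnnoTALEcli-1.5.jar}"
ANNOTALE_XMS="${ANNOTALE_XMS:-512M}"   # initial heap size
ANNOTALE_XMX="${ANNOTALE_XMX:-6G}"     # maximum heap size

# Run an AnnoTALE tool with the configured heap settings;
# all arguments are passed on unchanged.
annotale() {
  java "-Xms${ANNOTALE_XMS}" "-Xmx${ANNOTALE_XMX}" -jar "$ANNOTALE_JAR" "$@"
}
```
<br />
With this wrapper, <code>annotale predict g=genome.fa outdir=out</code> would run the prediction with the default heap sizes, and setting <code>ANNOTALE_XMX=2G</code> beforehand would lower the maximum heap on machines with less RAM.<br />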
<br />
There is no separate User Guide for the AnnoTALE command line application. However, the [http://www.jstacs.de/downloads/AnnoTALE-UserGuide-1.0.pdf User Guide for the GUI version] describes all AnnoTALE tools with their parameters and outputs, which are identical in the CLI version.<br />
<br />
You obtain a list of all AnnoTALE tools by calling<br />
<br />
java -jar AnnoTALEcli-1.5.jar<br />
<br />
Output:<br />
<br />
Available tools:<br />
<br />
predict - TALE Prediction<br />
analyze - TALE Analysis<br />
build - TALE Class Builder<br />
loadAndView - Load and View TALE Classes<br />
assign - TALE Class Assignment<br />
rename - Rename TALEs in File<br />
targets - Predict and Intersect Targets<br />
presence - TALE Class Presence<br />
repdiff - TALE Repeat Differences<br />
preditale - PrediTALE<br />
dertale - DerTALE<br />
<br />
Syntax: java -jar AnnoTALEcli-1.5.jar <toolname> [<parameter=value> ...]<br />
<br />
Further info about the tools is given with<br />
java -jar AnnoTALEcli-1.5.jar <toolname> info<br />
<br />
Tool parameters are listed with<br />
java -jar AnnoTALEcli-1.5.jar <toolname><br />
<br />
You get a list of input parameters by calling AnnoTALEcli-1.5.jar with the corresponding tool name, e.g.,<br />
<br />
java -jar AnnoTALEcli-1.5.jar predict<br />
<br />
Output:<br />
<br />
At least one parameter has not been set (correctly):<br />
<br />
Parameters of tool "TALE Prediction" (predict):<br />
g - Genome (The input Xanthomonas genome in FastA or Genbank format) = null<br />
s - Strain (The name of the strain, will be used for annotated TALEs, OPTIONAL) = null<br />
outdir - The output directory, defaults to the current working directory (.) = .<br />
<br />
You get a description of each tool by calling AnnoTALEcli-1.5.jar with the corresponding tool name and keyword "info", e.g.,<br />
<br />
java -jar AnnoTALEcli-1.5.jar predict info<br />
<br />
Output:<br />
A detailed description of all tools is available in the AnnoTALE User Guide (http://www.jstacs.de/downloads/AnnoTALE-UserGuide-1.0.pdf).<br />
<br />
*TALE Prediction* predicts transcription activator-like effector (TALE) genes in an input sequence, typically a 'Xanthomonas' genome.<br />
<br />
'TALE Prediction' is based in HMMer nucleotide HMM models that describe N-terminus, repeat region, and C-terminus of TALEs.<br />
<br />
The input 'Genome' may be provided in FastA or Genbank format. <br />
Optionally, you may provide a strain name that will be used in the temporary TALE names and names of output files.<br />
<br />
Regardless of the input format, 'TALE Prediction' generates output in Genbank format containing the annotations of TALE genes. If the original input has already been a Genbank file, TALE annotations are added to the existing ones.<br />
In addition, 'TALE Prediction' generates annotations in GFF format, and also outputs the DNA and AS sequences of the predicted TALEs in FastA format.<br />
<br />
'TALE Prediction' tries hard to make the CDS annotation a proper gene model, starting from a start codon and ending with a Stop. If either start or stop codon are located within the originally predicted region that is homologous to TALE genes, this original hit region is still reported as mRNA.<br />
Putative pseudo genes, e.g., with premature stop codons, are marked accordingly.<br />
<br />
The TALE DNA sequences output of 'TALE Prediction' may serve as input of the 'TALE Analysis', 'TALE Class Builder', and 'TALE Class Assignment' tools.<br />
<br />
If you experience problems using 'TALE Prediction', please contact us.<br />
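<br />
Since every tool accepts the keyword "info", the descriptions of all tools can be collected in one pass. The sketch below loops over the tool names listed above and stores each description in a text file. The function name <code>annotale_dump_info</code> and the file naming scheme are hypothetical, and both output streams are captured because it is an assumption here which stream the CLI writes the description to.<br />
<br />
```shell
# Tool names as listed by AnnoTALEcli-1.5.jar; the function name is hypothetical.
ANNOTALE_TOOLS="predict analyze build loadAndView assign rename targets presence repdiff preditale dertale"

# Write the "info" text of every tool to <dir>/<toolname>.info.txt.
annotale_dump_info() {
  dir=${1:-.}
  mkdir -p "$dir" || return 1
  for tool in $ANNOTALE_TOOLS; do
    # Capture standard output and standard error in the same file.
    java -jar AnnoTALEcli-1.5.jar "$tool" info > "$dir/$tool.info.txt" 2>&1
  done
}
```
<br />
Calling <code>annotale_dump_info help</code> in the directory containing the Jar file would then create one file per tool in the directory "help".<br />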
<br />
=== Standard pipeline ===<br />
<br />
Assuming that your current working directory contains the AnnoTALEcli Jar file, a genome of interest (of a hypothetical 'Xoo' strain PXO999 with accession CP1234567) in a FastA file "genome.fa", all rice promoters in a FastA file "Rice-promoters.fa", and a directory "out" designated to hold all output files, a typical AnnoTALE pipeline could look like this:<br />
<br />
java -jar AnnoTALEcli-1.5.jar predict g=genome.fa outdir=out<br />
<br />
java -jar AnnoTALEcli-1.5.jar analyze t=out/TALE_DNA_sequences.fasta outdir=out<br />
<br />
java -jar AnnoTALEcli-1.5.jar loadAndView outdir=out<br />
<br />
java -jar AnnoTALEcli-1.5.jar assign c=out/Class_builder_download.xml t=out/TALE_DNA_parts.fasta s="Xoo PXO999" a="CP1234567" outdir=out<br />
<br />
java -jar AnnoTALEcli-1.5.jar rename r=out/TALE_names_\(Xoo_PXO999\).tsv i=out/Genbank__TALE_predictions.gb outdir=out<br />
<br />
java -jar AnnoTALEcli-1.5.jar targets i=Rice-promoters.fa p="TALEs in class builder" c=out/Augmented_class_builder_\(Xoo_PXO999\).xml outdir=out<br />
<br />
Afterwards, you find the output files of all those tools in the directory "out". The output files and directories are named in analogy to the names in the AnnoTALE GUI version (see [http://www.jstacs.de/downloads/AnnoTALE-UserGuide-1.0.pdf User Guide for the GUI version]).<br />
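<br />
For repeated runs on several genomes, the six calls of the standard pipeline may be wrapped in a shell function. This is a sketch under assumptions rather than an official script: the function name <code>run_annotale_pipeline</code> and the variable <code>ANNOTALE_JAR</code> are hypothetical, the strain-specific file names are derived by replacing blanks in the strain name with underscores (as the example file names suggest), and the chain of <code>&&</code> only stops early if the CLI reports failures through a non-zero exit status.<br />
<br />
```shell
# Sketch of the standard pipeline as a reusable function; names are hypothetical.
# Usage: run_annotale_pipeline <genome> <promoters> <strain> <accession> <outdir>
run_annotale_pipeline() (
  genome=$1 promoters=$2 strain=$3 accession=$4 out=$5
  jar=${ANNOTALE_JAR:-AnnoTALEcli-1.5.jar}
  # Assumption: output file names contain the strain name with blanks
  # replaced by underscores, e.g. "Xoo PXO999" -> "Xoo_PXO999".
  tag=$(printf '%s' "$strain" | tr ' ' '_')

  mkdir -p "$out" &&
  java -jar "$jar" predict g="$genome" outdir="$out" &&
  java -jar "$jar" analyze t="$out/TALE_DNA_sequences.fasta" outdir="$out" &&
  java -jar "$jar" loadAndView outdir="$out" &&
  java -jar "$jar" assign c="$out/Class_builder_download.xml" t="$out/TALE_DNA_parts.fasta" s="$strain" a="$accession" outdir="$out" &&
  java -jar "$jar" rename r="$out/TALE_names_($tag).tsv" i="$out/Genbank__TALE_predictions.gb" outdir="$out" &&
  java -jar "$jar" targets i="$promoters" p="TALEs in class builder" c="$out/Augmented_class_builder_($tag).xml" outdir="$out"
)
```
<br />
The example from this section would then correspond to <code>run_annotale_pipeline genome.fa Rice-promoters.fa "Xoo PXO999" CP1234567 out</code>. Quoting the file names inside the function makes the backslash escapes for the parentheses unnecessary.<br />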
<br />
==Version history==<br />
<br />
===AnnoTALE===<br />
'''Version 1.5'''<br />
* new "sensitive" mode of TALE Prediction tool, which may annotate TALEs in a wider range of Xanthomonas strains at the expense of an increased runtime; turned off by default<br />
* significantly improved speed of TALE Class Assignment tool<br />
* citation information for individual AnnoTALE tools available under a dedicated button in the GUI version and from the "info" command issued for individual tools in the command line version<br />
* bugfix for TALE Prediction in rather fragmented genome assemblies, where TALE predictions may extend to the ends of contigs/sequences<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.5.jar Runnable Jar] (requires installed Java >= 8, update 45), may be run under Linux, macOS and Windows<br />
* macOS app: [http://www.jstacs.de/downloads/AnnoTALE-1.5.app-2GB.zip 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.5.app-6GB.zip 6GB version], ZIP archive containing a macOS app including AnnoTALE and all required Java modules. For running this app, it might be required to explicitly give it running permissions in "System Preferences" -> "Security & Privacy" -> "General", which should list AnnoTALE after the first (possibly unsuccessful) starting attempt.<br />
* Windows installer of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.5-2GB.exe 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.5-6GB.exe 6GB version, 64bit Java]<br />
* Windows version without installer: [http://www.jstacs.de/downloads/AnnoTALE-1.5-win.zip 6GB version, 64bit Java], ZIP archive containing AnnoTALE, all required Java modules, and a Windows batch file. For starting AnnoTALE, double-click AnnoTALE.bat.<br />
<br />
<br />
'''Version 1.4.1'''<br />
* first version to use the updated Class Builder including a large number of recently sequenced strains<br />
* minor changes to the output of the 'Load and View TALE Classes' tool, now including the accessions in the TALE sequence output<br />
* changes to the Class Builder format to account for the increased size of class hierarchy, which previously resulted in unnecessarily large files<br />
* 32bit/1GB Windows version no longer included<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.4.1.jar Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-2GB.dmg 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-6GB.dmg 6GB version]<br />
* Windows installer of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-2GB.exe 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-6GB.exe 6GB version, 64bit Java]<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.4.1.jar AnnoTALE 1.4.1 command line application]<br />
<br />
<br />
'''Version 1.4:'''<br />
* first version containing [[PrediTALE]] and DerTALE tools for target site prediction<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.4.jar Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.4-2GB.dmg 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.4-6GB.dmg 6GB version]<br />
* Windows installer of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.4-2GB.exe 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.4-6GB.exe 6GB version, 64bit Java]; in addition, we provide a [http://www.jstacs.de/downloads/AnnoTALE-1.4-1GB.exe 1GB version with 32bit Java] for earlier and 32bit versions of Windows. Please use this version only if absolutely necessary, as some tools may not work due to memory restrictions.<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.4.jar AnnoTALE 1.4 command line application]<br />
<br />
<br />
'''Version 1.3:'''<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.3.jar AnnoTALE 1.3 Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE 1.3 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.3-2GB.dmg AnnoTALE 1.3 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.3-6GB.dmg AnnoTALE 1.3 6GB version]<br />
* Windows installer of AnnoTALE 1.3 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.3-2GB.exe AnnoTALE 1.3 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.3-6GB.exe AnnoTALE 1.3 6GB version, 64bit Java]; [http://www.jstacs.de/downloads/AnnoTALE-1.3-1GB.exe AnnoTALE 1.3 1GB version with 32bit Java]<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.3.jar AnnoTALE 1.3 command line application]<br />
<br />
Changes:<br />
* modified format of Class Builder files allowing for faster download using the "Load and View TALE Classes" tool; old Class Builder files can still be loaded<br />
* "TALE Class Presence" now also outputs a phylogenetic tree of strains based on TALEome similarities<br />
<br />
<br />
'''Version 1.2:'''<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.2.jar AnnoTALE 1.2 Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE 1.2 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.2-2GB.dmg AnnoTALE 1.2 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.2-6GB.dmg AnnoTALE 1.2 6GB version]<br />
* Windows installer of AnnoTALE 1.2 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.2-2GB.exe AnnoTALE 1.2 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.2-6GB.exe AnnoTALE 1.2 6GB version, 64bit Java]; [http://www.jstacs.de/downloads/AnnoTALE-1.2-1GB.exe AnnoTALE 1.2 1GB version with 32bit Java]<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.2.jar AnnoTALE 1.2 command line application]<br />
<br />
Changes:<br />
* Results and loaded files may now be renamed in the GUI by clicking on the corresponding name in the "Data" panel<br />
* Minor bugfixes and improvements of the GUI (Protocol may be erased, columns in "Data" panel renamed for clarity, consistency of paths in the open/save dialogs under Linux)<br />
* Two new tools: "TALE Class Presence" and "TALE Repeat Differences"<br />
<br />
'''Version 1.1:'''<br />
<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.1.jar AnnoTALE 1.1 Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE 1.1 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.1-2GB.dmg AnnoTALE 1.1 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.1-6GB.dmg AnnoTALE 1.1 6GB version]<br />
* Windows installer of AnnoTALE 1.1 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.1-2GB.exe AnnoTALE 1.1 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.1-6GB.exe AnnoTALE 1.1 6GB version, 64bit Java]; [http://www.jstacs.de/downloads/AnnoTALE-1.1-1GB.exe AnnoTALE 1.1 1GB version with 32bit Java]<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.1.jar AnnoTALE 1.1 command line application]<br />
<br />
Changes:<br />
* Additional output for the "Load and View TALE Classes" tool<br />
* "TALE Class Builder" and "TALE Class Assignment" now also accept RVD sequences (separated by dashes) as input. However, this is not recommended and some features (e.g., highlighting of aberrant repeats) will not be available. Only complete TALE DNA sequences will be accepted for inclusion into the official Class Builder.<br />
* The internal help pages now link to the PDF User Guide<br />
<br />
'''Version 1.0:'''<br />
<br />
''Initial AnnoTALE release''<br />
<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.0.jar AnnoTALE 1.0 Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.0-2GB.dmg AnnoTALE 1.0 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.0-6GB.dmg AnnoTALE 1.0 6GB version]<br />
* Windows installer of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.0-2GB.exe AnnoTALE 1.0 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.0-6GB.exe AnnoTALE 1.0 6GB version, 64bit Java]; [http://www.jstacs.de/downloads/AnnoTALE-1.0-1GB.exe AnnoTALE 1.0 1GB version with 32bit Java]<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.0.jar AnnoTALE 1.0 command line application]<br />
<br />
=== Class Builders ===<br />
<br />
* [http://www.jstacs.de/downloads/class_definitions_09_05_2021.xml.gz Version 09/05/2021]: used for "Download current definition" in "Load and View TALE Classes" within AnnoTALE version 1.4.1 and later<br />
* [http://www.jstacs.de/downloads/class_definitions_10_10_2020.xml.gz Version 10/10/2020]: compatible with AnnoTALE version 1.4.1 and later<br />
* [http://www.jstacs.de/downloads/class_definitions_20_06_2019.xml.gz Version 20/06/2019]: compatible with AnnoTALE version 1.4.1 and later<br />
* [http://www.jstacs.de/downloads/class_definitions_29_09_2018.xml.gz Version 29/09/2018]: used for "Download current definition" in "Load and View TALE Classes" within AnnoTALE version 1.3 and later<br />
* [http://www.jstacs.de/downloads/class_definitions_09_03_2017.xml Version 09/03/2017]: used for "Download current definition" in "Load and View TALE Classes" within AnnoTALE version 1.2 and earlier<br />
* [http://www.jstacs.de/downloads/class_definitions_11_03_2016.xml Version 11/03/2016]<br />
* [http://www.jstacs.de/downloads/class_definitions_29_01_2016.xml Version 29/01/2016]<br />
* [http://www.jstacs.de/downloads/class_definitions_19_10.xml Version 19/10/2015]: used in the AnnoTALE publication (Grau ''et al.'', Sci Rep, 2016)</div>Grauhttps://www.jstacs.de/index.php?title=AnnoTALE&diff=1148AnnoTALE2021-05-12T15:04:10Z<p>Grau: /* AnnoTALE */</p>
<hr />
<div>[[File:AnnoTALE.png|130px|left]]<br />
Transcription activator-like effectors (TALEs) are virulence factors of plant-pathogenic Xanthomonas spp. that function as gene activators inside plant host cells.<br />
<br />
AnnoTALE is a suite of applications for identifying and analysing TALEs in Xanthomonas genomes, for clustering TALEs into classes by their RVD sequences, for assigning novel TALEs to existing classes, for proposing TALE names using a unified nomenclature, and for predicting targets of individual TALEs and TALE classes.<br />
<br />
AnnoTALE is available as a JavaFX-based stand-alone application with graphical user interface for interactive analysis sessions. <br />
In addition, we provide a command line application that may be integrated into other pipelines. <br />
Both use identical code for the actual analysis, ensuring consistent results between both versions.<br />
<br />
<br />
<br />
If you use AnnoTALE, please cite:<br />
<br />
Jan Grau, Maik Reschke, Annett Erkes, Jana Streubel, Richard D. Morgan, Geoffrey G. Wilson, Ralf Koebnik and Jens Boch. [http://www.nature.com/articles/srep21077 AnnoTALE: bioinformatics tools for identification, annotation, and nomenclature of TALEs from ''Xanthomonas'' genomic sequences]. Scientific Reports 6:21077, DOI: 10.1038/srep21077, 2016.<br />
<br />
<br />
For evolution-related studies using the comparative features of AnnoTALE, please also cite:<br />
<br />
Annett Erkes, Maik Reschke, Jens Boch, and Jan Grau. [https://doi.org/10.1093/gbe/evx108 Evolution of transcription activator-like effectors in Xanthomonas oryzae]. Genome Biology and Evolution, 9(6):1599–1615, 2017.<br />
<br />
<br />
If you use PrediTALE for predicting TALE targets, please also cite:<br />
<br />
Annett Erkes, Stefanie Mücke, Maik Reschke, Jens Boch, and Jan Grau. [https://doi.org/10.1371/journal.pcbi.1007206 PrediTALE: A novel model learned from quantitative data allows for new perspectives on TALE targeting]. PLOS Computational Biology, 15(7):1–31, 2019.<br />
<br />
<br />
'''Important:''' If you would like to use the unified nomenclature of AnnoTALE in one of your publications including new TALEs or sequenced genomes, please contact us (grau@informatik.uni-halle.de) to organize the inclusion of your TALEs into the official class definition of AnnoTALE and to create stable TALE names that are unique to your TALEs.<br />
<br />
<br />
== AnnoTALE with GUI ==<br />
<br />
[[File:AnnoTALEscreenshot.jpg]]<br />
<br />
AnnoTALE is based on the implementation of JavaFX in Java >=8.<br />
<br />
We provide AnnoTALE as a runnable JAR file for those with a current version of Java 8 (at least update 45) on their machine.<br />
<br />
For users' convenience, we also provide pre-packaged versions of AnnoTALE for macOS and Windows, which include Java in the required version. Each of these is available in two versions with different memory requirements (2GB and 6GB). As long as the main memory (RAM) of your machine is sufficient, we recommend using the 6GB version of AnnoTALE.<br />
<br />
<br />
=== Download ===<br />
<br />
''AnnoTALE is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.''<br />
<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.5.jar Runnable Jar] (requires installed Java >= 8, update 45), may be run under Linux, macOS and Windows<br />
* macOS app: [http://www.jstacs.de/downloads/AnnoTALE-1.5.app-2GB.zip 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.5.app-6GB.zip 6GB version], ZIP archive containing a macOS app including AnnoTALE and all required Java modules. For running this app, it might be required to explicitly give it running permissions in "System Preferences" -> "Security & Privacy" -> "General", which should list AnnoTALE after the first (possibly unsuccessful) starting attempt.<br />
* Windows installer of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.5-2GB.exe 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.5-6GB.exe 6GB version, 64bit Java]<br />
* Windows version without installer: [http://www.jstacs.de/downloads/AnnoTALE-1.5-win.zip 6GB version, 64bit Java], ZIP archive containing AnnoTALE, all required Java modules, and a Windows batch file. For starting AnnoTALE, double-click AnnoTALE.bat.<br />
<br />
=== Source code ===<br />
<br />
The AnnoTALE source code is available from [https://github.com/Jstacs/Jstacs/tree/master/projects/xanthogenomes GitHub].<br />
<br />
<br />
=== User Guide ===<br />
<br />
We provide an [http://www.jstacs.de/downloads/AnnoTALE-UserGuide-1.0.pdf AnnoTALE User Guide] in PDF format, including a detailed description of all AnnoTALE tools and installation instructions.<br />
<br />
== AnnoTALE command line application ==<br />
<br />
The AnnoTALE command line application is available as a [http://www.jstacs.de/downloads/AnnoTALEcli-1.5.jar runnable Jar]. For running the program and a quick help, type<br />
<br />
java -jar AnnoTALEcli-1.5.jar<br />
<br />
For larger analyses, it might be necessary to increase the memory allocated by the Java VM using the <code>-Xms</code> and <code>-Xmx</code> parameters, for instance<br />
java -Xms512M -Xmx6G -jar AnnoTALEcli-1.5.jar<br />
<br />
There is no separate User Guide for the AnnoTALE command line application. However, the [http://www.jstacs.de/downloads/AnnoTALE-UserGuide-1.0.pdf User Guide for the GUI version] describes all AnnoTALE tools with their parameters and outputs, which are identical in the CLI version.<br />
<br />
You obtain a list of all AnnoTALE tools by calling<br />
<br />
java -jar AnnoTALEcli-1.5.jar<br />
<br />
Output:<br />
<br />
Available tools:<br />
<br />
predict - TALE Prediction<br />
analyze - TALE Analysis<br />
build - TALE Class Builder<br />
loadAndView - Load and View TALE Classes<br />
assign - TALE Class Assignment<br />
rename - Rename TALEs in File<br />
targets - Predict and Intersect Targets<br />
presence - TALE Class Presence<br />
repdiff - TALE Repeat Differences<br />
preditale - PrediTALE<br />
dertale - DerTALE<br />
<br />
Syntax: java -jar AnnoTALEcli-1.5.jar <toolname> [<parameter=value> ...]<br />
<br />
Further info about the tools is given with<br />
java -jar AnnoTALEcli-1.5.jar <toolname> info<br />
<br />
Tool parameters are listed with<br />
java -jar AnnoTALEcli-1.5.jar <toolname><br />
<br />
You get a list of input parameters by calling AnnoTALEcli-1.5.jar with the corresponding tool name, e.g.,<br />
<br />
java -jar AnnoTALEcli-1.5.jar predict<br />
<br />
Output:<br />
<br />
At least one parameter has not been set (correctly):<br />
<br />
Parameters of tool "TALE Prediction" (predict):<br />
g - Genome (The input Xanthomonas genome in FastA or Genbank format) = null<br />
s - Strain (The name of the strain, will be used for annotated TALEs, OPTIONAL) = null<br />
outdir - The output directory, defaults to the current working directory (.) = .<br />
<br />
You get a description of each tool by calling AnnoTALEcli-1.5.jar with the corresponding tool name and keyword "info", e.g.,<br />
<br />
java -jar AnnoTALEcli-1.5.jar predict info<br />
<br />
Output:<br />
A detailed description of all tools is available in the AnnoTALE User Guide (http://www.jstacs.de/downloads/AnnoTALE-UserGuide-1.0.pdf).<br />
<br />
*TALE Prediction* predicts transcription activator-like effector (TALE) genes in an input sequence, typically a 'Xanthomonas' genome.<br />
<br />
'TALE Prediction' is based in HMMer nucleotide HMM models that describe N-terminus, repeat region, and C-terminus of TALEs.<br />
<br />
The input 'Genome' may be provided in FastA or Genbank format. <br />
Optionally, you may provide a strain name that will be used in the temporary TALE names and names of output files.<br />
<br />
Regardless of the input format, 'TALE Prediction' generates output in Genbank format containing the annotations of TALE genes. If the original input has already been a Genbank file, TALE annotations are added to the existing ones.<br />
In addition, 'TALE Prediction' generates annotations in GFF format, and also outputs the DNA and AS sequences of the predicted TALEs in FastA format.<br />
<br />
'TALE Prediction' tries hard to make the CDS annotation a proper gene model, starting from a start codon and ending with a Stop. If either start or stop codon are located within the originally predicted region that is homologous to TALE genes, this original hit region is still reported as mRNA.<br />
Putative pseudo genes, e.g., with premature stop codons, are marked accordingly.<br />
<br />
The TALE DNA sequences output of 'TALE Prediction' may serve as input of the 'TALE Analysis', 'TALE Class Builder', and 'TALE Class Assignment' tools.<br />
<br />
If you experience problems using 'TALE Prediction', please contact us.<br />
<br />
=== Standard pipeline ===<br />
<br />
Assuming that your current working directory contains the AnnoTALEcli Jar file, a genome of interest (of a hypothetical 'Xoo' strain PXO999 with accession CP1234567) in a FastA file "genome.fa", all rice promoters in a FastA file "Rice-promoters.fa", and a directory "out" designated to hold all output files, a typical AnnoTALE pipeline could look like this:<br />
<br />
java -jar AnnoTALEcli-1.5.jar predict g=genome.fa outdir=out<br />
<br />
java -jar AnnoTALEcli-1.5.jar analyze t=out/TALE_DNA_sequences.fasta outdir=out<br />
<br />
java -jar AnnoTALEcli-1.5.jar loadAndView outdir=out<br />
<br />
java -jar AnnoTALEcli-1.5.jar assign c=out/Class_builder_download.xml t=out/TALE_DNA_parts.fasta s="Xoo PXO999" a="CP1234567" outdir=out<br />
<br />
java -jar AnnoTALEcli-1.5.jar rename r=out/TALE_names_\(Xoo_PXO999\).tsv i=out/Genbank__TALE_predictions.gb outdir=out<br />
<br />
java -jar AnnoTALEcli-1.5.jar targets i=Rice-promoters.fa p="TALEs in class builder" c=out/Augmented_class_builder_\(Xoo_PXO999\).xml outdir=out<br />
<br />
Afterwards, you find the output files of all those tools in the directory "out". The output files and directories are named in analogy to the names in the AnnoTALE GUI version (see [http://www.jstacs.de/downloads/AnnoTALE-UserGuide-1.0.pdf User Guide for the GUI version]).<br />
<br />
==Version history==<br />
<br />
===AnnoTALE===<br />
'''Version 1.5'''<br />
* new "sensitive" mode of TALE Prediction tool, which may annotate TALEs in a wider range of Xanthomonas strains at the expense of an increased runtime; turned off by default<br />
* significantly improved speed of TALE Class Assignment tool<br />
* citation information for individual AnnoTALE tools available under a dedicated button in the GUI version and from the "info" command issued for individual tools in the command line version<br />
* bugfix for TALE Prediction in rather fragmented genome assemblies, where TALE predictions may extend to the ends of contigs/sequences<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.5.jar Runnable Jar] (requires installed Java >= 8, update 45), may be run under Linux, macOS and Windows<br />
* macOS app: [http://www.jstacs.de/downloads/AnnoTALE-1.5.app-2GB.zip 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.5.app-6GB.zip 6GB version], ZIP archive containing a macOS app including AnnoTALE and all required Java modules. For running this app, it might be required to explicitly give it running permissions in "System Preferences" -> "Security & Privacy" -> "General", which should list AnnoTALE after the first (possibly unsuccessful) starting attempt.<br />
* Windows installer of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.5-2GB.exe 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.5-6GB.exe 6GB version, 64bit Java]<br />
* Windows version without installer: [http://www.jstacs.de/downloads/AnnoTALE-1.5-win.zip 6GB version, 64bit Java], ZIP archive containing AnnoTALE, all required Java modules, and a Windows batch file. For starting AnnoTALE, double-click AnnoTALE.bat.<br />
<br />
<br />
'''Version 1.4.1'''<br />
* first version to use the updated Class Builder including a large number of recently sequenced strains<br />
* minor changes to the output of the 'Load and View TALE Classes' tool, now including the accessions in the TALE sequence output<br />
* changes to the Class Builder format to account for the increased size of class hierarchy, which previously resulted in unnecessarily large files<br />
* 32bit/1GB Windows version no longer included<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.4.1.jar Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-2GB.dmg 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-6GB.dmg 6GB version]<br />
* Windows installer of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-2GB.exe 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-6GB.exe 6GB version, 64bit Java]<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.4.1.jar AnnoTALE 1.4.1 command line application]<br />
<br />
<br />
'''Version 1.4:'''<br />
* first version containing [[PrediTALE]] and DerTALE tools for target site prediction<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.4.jar Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.4-2GB.dmg 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.4-6GB.dmg 6GB version]<br />
* Windows installer of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.4-2GB.exe 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.4-6GB.exe 6GB version, 64bit Java]; in addition, we provide a [http://www.jstacs.de/downloads/AnnoTALE-1.4-1GB.exe 1GB version with 32bit Java] for earlier and 32bit versions of Windows. Please use this version only if absolutely necessary, as some tools may not work due to memory restrictions.<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.4.jar AnnoTALE 1.4 command line application]<br />
<br />
<br />
'''Version 1.3:'''<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.3.jar AnnoTALE 1.3 Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE 1.3 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.3-2GB.dmg AnnoTALE 1.3 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.3-6GB.dmg AnnoTALE 1.3 6GB version]<br />
* Windows installer of AnnoTALE 1.3 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.3-2GB.exe AnnoTALE 1.3 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.3-6GB.exe AnnoTALE 1.3 6GB version, 64bit Java]; [http://www.jstacs.de/downloads/AnnoTALE-1.3-1GB.exe AnnoTALE 1.3 1GB version with 32bit Java]<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.3.jar AnnoTALE 1.3 command line application]<br />
<br />
Changes:<br />
* modified format of Class Builder files allowing for faster download using the "Load and View TALE Classes" tool; old Class Builder files can still be loaded<br />
* "TALE Class Presence" now also outputs a phylogenetic tree of strains based on TALEome similarities<br />
<br />
<br />
'''Version 1.2:'''<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.2.jar AnnoTALE 1.2 Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE 1.2 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.2-2GB.dmg AnnoTALE 1.2 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.2-6GB.dmg AnnoTALE 1.2 6GB version]<br />
* Windows installer of AnnoTALE 1.2 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.2-2GB.exe AnnoTALE 1.2 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.2-6GB.exe AnnoTALE 1.2 6GB version, 64bit Java]; [http://www.jstacs.de/downloads/AnnoTALE-1.2-1GB.exe AnnoTALE 1.2 1GB version with 32bit Java]<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.2.jar AnnoTALE 1.2 command line application]<br />
<br />
Changes:<br />
* Results and loaded files may now be renamed in the GUI by clicking on the corresponding name in the "Data" panel<br />
* Minor bugfixes and improvements of the GUI (Protocol may be erased, columns in "Data" panel renamed for clarity, consistency of paths in the open/save dialogs under Linux)<br />
* Two new tools: "TALE Class Presence" and "TALE Repeat Differences"<br />
<br />
'''Version 1.1:'''<br />
<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.1.jar AnnoTALE 1.1 Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE 1.1 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.1-2GB.dmg AnnoTALE 1.1 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.1-6GB.dmg AnnoTALE 1.1 6GB version]<br />
* Windows installer of AnnoTALE 1.1 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.1-2GB.exe AnnoTALE 1.1 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.1-6GB.exe AnnoTALE 1.1 6GB version, 64bit Java]; [http://www.jstacs.de/downloads/AnnoTALE-1.1-1GB.exe AnnoTALE 1.1 1GB version with 32bit Java]<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.1.jar AnnoTALE 1.1 command line application]<br />
<br />
Changes:<br />
* Additional output for the "Load and View TALE Classes" tool<br />
* "TALE Class Builder" and "TALE Class Assignment" now also accept RVD sequences (separated by dashes) as input. However, this is not recommended and some features (e.g., highlighting of aberrant repeats) will not be available. Only complete TALE DNA sequences will be accepted for inclusion into the official Class Builder.<br />
* The internal help pages now link to the PDF User Guide<br />
<br />
'''Version 1.0:'''<br />
<br />
''Initial AnnoTALE release''<br />
<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.0.jar AnnoTALE 1.0 Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.0-2GB.dmg AnnoTALE 1.0 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.0-6GB.dmg AnnoTALE 1.0 6GB version]<br />
* Windows installer of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.0-2GB.exe AnnoTALE 1.0 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.0-6GB.exe AnnoTALE 1.0 6GB version, 64bit Java]; [http://www.jstacs.de/downloads/AnnoTALE-1.0-1GB.exe AnnoTALE 1.0 1GB version with 32bit Java]<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.0.jar AnnoTALE 1.0 command line application]<br />
<br />
=== Class Builders ===<br />
<br />
* [http://www.jstacs.de/downloads/class_definitions_09_05_2021.xml.gz Version 09/05/2021]: used for "Download current definition" in "Load and View TALE Classes" within AnnoTALE version 1.4.1 and later<br />
* [http://www.jstacs.de/downloads/class_definitions_10_10_2020.xml.gz Version 10/10/2020]: compatible with AnnoTALE version 1.4.1 and later<br />
* [http://www.jstacs.de/downloads/class_definitions_20_06_2019.xml.gz Version 20/06/2019]: compatible with AnnoTALE version 1.4.1 and later<br />
* [http://www.jstacs.de/downloads/class_definitions_29_09_2018.xml.gz Version 29/09/2018]: used for "Download current definition" in "Load and View TALE Classes" within AnnoTALE version 1.3 and later<br />
* [http://www.jstacs.de/downloads/class_definitions_09_03_2017.xml Version 09/03/2017]: used for "Download current definition" in "Load and View TALE Classes" within AnnoTALE version 1.2 and earlier<br />
* [http://www.jstacs.de/downloads/class_definitions_11_03_2016.xml Version 11/03/2016]<br />
* [http://www.jstacs.de/downloads/class_definitions_29_01_2016.xml Version 29/01/2016]<br />
* [http://www.jstacs.de/downloads/class_definitions_19_10.xml Version 19/10/2015]: used in the AnnoTALE publication (Grau ''et al.'', Sci Rep, 2016)</div>Grauhttps://www.jstacs.de/index.php?title=AnnoTALE&diff=1147AnnoTALE2021-05-12T15:03:47Z<p>Grau: /* AnnoTALE */</p>
<hr />
<div>[[File:AnnoTALE.png|130px|left]]<br />
Transcription activator-like effectors (TALEs) are virulence factors of plant-pathogenic Xanthomonas spp. that function as gene activators inside plant host cells.<br />
<br />
AnnoTALE is a suite of applications for identifying and analysing TALEs in Xanthomonas genomes, for clustering TALEs into classes by their RVD sequences, for assigning novel TALEs to existing classes, for proposing TALE names using a unified nomenclature, and for predicting targets of individual TALEs and TALE classes.<br />
<br />
AnnoTALE is available as a JavaFX-based stand-alone application with graphical user interface for interactive analysis sessions. <br />
In addition, we provide a command line application that may be integrated into other pipelines. <br />
Both use identical code for the actual analysis, ensuring consistent results between both versions.<br />
<br />
<br />
<br />
If you use AnnoTALE, please cite:<br />
<br />
Jan Grau, Maik Reschke, Annett Erkes, Jana Streubel, Richard D. Morgan, Geoffrey G. Wilson, Ralf Koebnik and Jens Boch. [http://www.nature.com/articles/srep21077 AnnoTALE: bioinformatics tools for identification, annotation, and nomenclature of TALEs from ''Xanthomonas'' genomic sequences]. Scientific Reports 6:21077, DOI: 10.1038/srep21077, 2016.<br />
<br />
<br />
For evolution-related studies using the comparative features of AnnoTALE, please also cite:<br />
<br />
Annett Erkes, Maik Reschke, Jens Boch, and Jan Grau. [https://doi.org/10.1093/gbe/evx108 Evolution of transcription activator-like effectors in Xanthomonas oryzae]. Genome Biology and Evolution, 9(6):1599–1615, 2017.<br />
<br />
<br />
If you use PrediTALE for predicting TALE targets, please also cite:<br />
<br />
Annett Erkes, Stefanie Mücke, Maik Reschke, Jens Boch, and Jan Grau. [https://doi.org/10.1371/journal.pcbi.1007206 PrediTALE: A novel model learned from quantitative data allows for new perspectives on TALE targeting]. PLOS Computational Biology, 15(7):1–31, 2019.<br />
<br />
<br />
'''Important:''' If you would like to use the unified nomenclature of AnnoTALE in one of your publications including new TALEs or sequenced genomes, please contact us (grau@informatik.uni-halle.de) to organize the inclusion of your TALEs into the official class definition of AnnoTALE and to create stable TALE names that are unique to your TALEs.<br />
<br />
<br />
== AnnoTALE with GUI ==<br />
<br />
[[File:AnnoTALEscreenshot.jpg]]<br />
<br />
AnnoTALE is based on the implementation of JavaFX in Java >=8.<br />
<br />
We provide AnnoTALE as a runnable JAR file for those with a current version of Java 8 (at least update 45) on their machine.<br />
<br />
For convenience, we also provide pre-packaged versions of AnnoTALE for Mac OS X and Windows that include Java in the required version. Each of these is available in two variants with different memory requirements (2GB and 6GB). As long as the main memory (RAM) of your machine is sufficient, we recommend using the 6GB version of AnnoTALE.<br />
<br />
<br />
=== Download ===<br />
<br />
''AnnoTALE is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.''<br />
<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.5.jar Runnable Jar] (requires installed Java >= 8, update 45), may be run under Linux, macOS and Windows<br />
* macOS app: [http://www.jstacs.de/downloads/AnnoTALE-1.5.app-2GB.zip 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.5.app-6GB.zip 6GB version], ZIP archive containing a macOS app including AnnoTALE and all required Java modules. To run this app, you may need to explicitly grant it permission to run in "System Preferences" -> "Security & Privacy" -> "General", which should list AnnoTALE after the first (possibly unsuccessful) attempt to start it.<br />
* Windows installer of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.5-2GB.exe 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.5-6GB.exe 6GB version, 64bit Java]<br />
* Windows version without installer: [http://www.jstacs.de/downloads/AnnoTALE-1.5-win.zip 6GB version, 64bit Java], ZIP archive containing AnnoTALE, all required Java modules, and a Windows batch file. For starting AnnoTALE, double-click AnnoTALE.bat.<br />
<br />
=== Source code ===<br />
<br />
The AnnoTALE source code is available from [https://github.com/Jstacs/Jstacs/tree/master/projects/xanthogenomes GitHub].<br />
<br />
<br />
=== User Guide ===<br />
<br />
We provide an [http://www.jstacs.de/downloads/AnnoTALE-UserGuide-1.0.pdf AnnoTALE User Guide] in PDF format, including a detailed description of all AnnoTALE tools and installation instructions.<br />
<br />
== AnnoTALE command line application ==<br />
<br />
The AnnoTALE command line application is available as a [http://www.jstacs.de/downloads/AnnoTALEcli-1.5.jar runnable Jar]. For running the program and a quick help, type<br />
<br />
java -jar AnnoTALEcli-1.5.jar<br />
<br />
For larger analyses, it might be necessary to increase the memory allocated by the Java VM using the <code>-Xms</code> and <code>-Xmx</code> parameters, for instance<br />
java -Xms512M -Xmx6G -jar AnnoTALEcli-1.5.jar<br />
<br />
There is no separate User Guide for the AnnoTALE command line application. However, the [http://www.jstacs.de/downloads/AnnoTALE-UserGuide-1.0.pdf User Guide for the GUI version] describes all AnnoTALE tools together with their parameters and outputs, which are identical in the CLI version.<br />
<br />
You obtain a list of all AnnoTALE tools by calling<br />
<br />
java -jar AnnoTALEcli-1.5.jar<br />
<br />
Output:<br />
<br />
Available tools:<br />
<br />
predict - TALE Prediction<br />
analyze - TALE Analysis<br />
build - TALE Class Builder<br />
loadAndView - Load and View TALE Classes<br />
assign - TALE Class Assignment<br />
rename - Rename TALEs in File<br />
targets - Predict and Intersect Targets<br />
presence - TALE Class Presence<br />
repdiff - TALE Repeat Differences<br />
preditale - PrediTALE<br />
dertale - DerTALE<br />
<br />
Syntax: java -jar AnnoTALEcli-1.5.jar <toolname> [<parameter=value> ...]<br />
<br />
Further info about the tools is given with<br />
java -jar AnnoTALEcli-1.5.jar <toolname> info<br />
<br />
Tool parameters are listed with<br />
java -jar AnnoTALEcli-1.5.jar <toolname><br />
<br />
You get a list of input parameters by calling AnnoTALEcli-1.5.jar with the corresponding tool name, e.g.,<br />
<br />
java -jar AnnoTALEcli-1.5.jar predict<br />
<br />
Output:<br />
<br />
At least one parameter has not been set (correctly):<br />
<br />
Parameters of tool "TALE Prediction" (predict):<br />
g - Genome (The input Xanthomonas genome in FastA or Genbank format) = null<br />
s - Strain (The name of the strain, will be used for annotated TALEs, OPTIONAL) = null<br />
outdir - The output directory, defaults to the current working directory (.) = .<br />
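<br />
When the CLI is integrated into another pipeline, the parameters listed above can be filled in from a script. The following is a minimal sketch, assuming a hypothetical directory "genomes" with one FastA file (ending in ".fa") per strain and taking the strain name from the file name. By default it only prints the commands; the <code>DRY_RUN</code> switch and the per-strain output directories are assumptions of this sketch, not AnnoTALE options.<br />

```shell
#!/bin/sh
# Sketch: run 'predict' for every FastA file in a (hypothetical) directory "genomes".
set -eu

JAR="${JAR:-AnnoTALEcli-1.5.jar}"   # path to the command line jar
DRY_RUN="${DRY_RUN:-1}"             # assumption of this sketch: 1 = print commands only

# Derive a strain name from a file path, e.g. genomes/PXO999.fa -> PXO999
strain_from_path() {
  basename "$1" .fa
}

# Print the predict call for one genome (dry run) or execute it.
predict_genome() {
  genome="$1"
  strain=$(strain_from_path "$genome")
  outdir="out/$strain"              # one output directory per strain
  if [ "$DRY_RUN" = "1" ]; then
    echo "java -jar $JAR predict g=$genome s=$strain outdir=$outdir"
  else
    mkdir -p "$outdir"
    java -jar "$JAR" predict "g=$genome" "s=$strain" "outdir=$outdir"
  fi
}

for genome in genomes/*.fa; do
  [ -e "$genome" ] || continue      # skip if the pattern matched nothing
  predict_genome "$genome"
done
```

Setting <code>DRY_RUN=0</code> in the environment executes the calls instead of printing them.<br />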
<br />
You get a description of each tool by calling AnnoTALEcli-1.5.jar with the corresponding tool name and keyword "info", e.g.,<br />
<br />
java -jar AnnoTALEcli-1.5.jar predict info<br />
<br />
Output:<br />
A detailed description of all tools is available in the AnnoTALE User Guide (http://www.jstacs.de/downloads/AnnoTALE-UserGuide-1.0.pdf).<br />
<br />
*TALE Prediction* predicts transcription activator-like effector (TALE) genes in an input sequence, typically a 'Xanthomonas' genome.<br />
<br />
'TALE Prediction' is based in HMMer nucleotide HMM models that describe N-terminus, repeat region, and C-terminus of TALEs.<br />
<br />
The input 'Genome' may be provided in FastA or Genbank format. <br />
Optionally, you may provide a strain name that will be used in the temporary TALE names and names of output files.<br />
<br />
Regardless of the input format, 'TALE Prediction' generates output in Genbank format containing the annotations of TALE genes. If the original input has already been a Genbank file, TALE annotations are added to the existing ones.<br />
In addition, 'TALE Prediction' generates annotations in GFF format, and also outputs the DNA and AS sequences of the predicted TALEs in FastA format.<br />
<br />
'TALE Prediction' tries hard to make the CDS annotation a proper gene model, starting from a start codon and ending with a Stop. If either start or stop codon are located within the originally predicted region that is homologous to TALE genes, this original hit region is still reported as mRNA.<br />
Putative pseudo genes, e.g., with premature stop codons, are marked accordingly.<br />
<br />
The TALE DNA sequences output of 'TALE Prediction' may serve as input of the 'TALE Analysis', 'TALE Class Builder', and 'TALE Class Assignment' tools.<br />
<br />
If you experience problems using 'TALE Prediction', please contact us.<br />
<br />
=== Standard pipeline ===<br />
<br />
Assuming that your current working directory contains the AnnoTALEcli Jar file, a genome of interest (of a hypothetical 'Xoo' strain PXO999 with accession CP1234567) in a FastA file "genome.fa", all rice promoters in a FastA file "Rice-promoters.fa", and a directory "out" designated to hold all output files, a typical AnnoTALE pipeline could look like<br />
<br />
java -jar AnnoTALEcli-1.5.jar predict g=genome.fa outdir=out<br />
<br />
java -jar AnnoTALEcli-1.5.jar analyze t=out/TALE_DNA_sequences.fasta outdir=out<br />
<br />
java -jar AnnoTALEcli-1.5.jar loadAndView outdir=out<br />
<br />
java -jar AnnoTALEcli-1.5.jar assign c=out/Class_builder_download.xml t=out/TALE_DNA_parts.fasta s="Xoo PXO999" a="CP1234567" outdir=out<br />
<br />
java -jar AnnoTALEcli-1.5.jar rename r=out/TALE_names_\(Xoo_PXO999\).tsv i=out/Genbank__TALE_predictions.gb outdir=out<br />
<br />
java -jar AnnoTALEcli-1.5.jar targets i=Rice-promoters.fa p="TALEs in class builder" c=out/Augmented_class_builder_\(Xoo_PXO999\).xml outdir=out<br />
<br />
Afterwards, you will find the output files of all these tools in the directory "out". The output files and directories are named analogously to those in the AnnoTALE GUI version (see [http://www.jstacs.de/downloads/AnnoTALE-UserGuide-1.0.pdf User Guide for the GUI version]).<br />
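<br />
The six calls of the standard pipeline can also be collected in one script that stops at the first failing step. This is a sketch under stated assumptions: the variable names and the <code>DRY_RUN</code> switch (on by default, so the commands are only printed) are made up for illustration, and the rule that output file names contain the strain name with spaces replaced by underscores is inferred from the single example above. The JVM memory options from the previous section are placed before <code>-jar</code>.<br />

```shell
#!/bin/sh
# Sketch: the standard pipeline as one script that stops at the first failing step.
set -eu

JAR="${JAR:-AnnoTALEcli-1.5.jar}"          # path to the command line jar
JAVA_OPTS="${JAVA_OPTS:--Xms512M -Xmx6G}"  # JVM memory options, placed before -jar
OUT="${OUT:-out}"                          # output directory
STRAIN="${STRAIN:-Xoo PXO999}"             # hypothetical strain from the text
ACC="${ACC:-CP1234567}"                    # hypothetical accession from the text
DRY_RUN="${DRY_RUN:-1}"                    # assumption of this sketch: 1 = print only

# Assumption: output file names contain the strain name with spaces as underscores
TAG=$(printf '%s' "$STRAIN" | tr ' ' '_')

annotale() {
  if [ "$DRY_RUN" = "1" ]; then
    # Printed without shell quoting; for inspection only
    echo "java $JAVA_OPTS -jar $JAR $*"
  else
    # JAVA_OPTS is intentionally unquoted so that it splits into separate options
    java $JAVA_OPTS -jar "$JAR" "$@"
  fi
}

annotale predict g=genome.fa outdir="$OUT"
annotale analyze t="$OUT/TALE_DNA_sequences.fasta" outdir="$OUT"
annotale loadAndView outdir="$OUT"
annotale assign c="$OUT/Class_builder_download.xml" t="$OUT/TALE_DNA_parts.fasta" s="$STRAIN" a="$ACC" outdir="$OUT"
annotale rename r="$OUT/TALE_names_($TAG).tsv" i="$OUT/Genbank__TALE_predictions.gb" outdir="$OUT"
annotale targets i=Rice-promoters.fa p="TALEs in class builder" c="$OUT/Augmented_class_builder_($TAG).xml" outdir="$OUT"
```

Because the file names are quoted, the parentheses do not need the backslash escapes used in the interactive calls above.<br />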
<br />
==Version history==<br />
<br />
===AnnoTALE===<br />
'''Version 1.5'''<br />
* new "sensitive" mode of TALE Prediction tool, which may annotate TALEs in a wider range of Xanthomonas strains at the expense of an increased runtime; turned off by default<br />
* significantly improved speed of TALE Class Assignment tool<br />
* citation information for individual AnnoTALE tools available under a dedicated button in the GUI version and from the "info" command issued for individual tools in the command line version<br />
* bugfix for TALE Prediction in rather fragmented genome assemblies, where TALE predictions may extend to the ends of contigs/sequences<br />
<br />
<br />
'''Version 1.4.1'''<br />
* first version to use the updated Class Builder including a large number of recently sequenced strains<br />
* minor changes to the output of the 'Load and View TALE Classes' tool, now including the accessions in the TALE sequence output<br />
* changes to the Class Builder format to account for the increased size of class hierarchy, which previously resulted in unnecessarily large files<br />
* 32bit/1GB Windows version no longer included<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.4.1.jar Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-2GB.dmg 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-6GB.dmg 6GB version]<br />
* Windows installer of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-2GB.exe 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-6GB.exe 6GB version, 64bit Java]<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.4.1.jar AnnoTALE 1.4.1 command line application]<br />
<br />
<br />
'''Version 1.4:'''<br />
* first version containing [[PrediTALE]] and DerTALE tools for target site prediction<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.4.jar Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.4-2GB.dmg 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.4-6GB.dmg 6GB version]<br />
* Windows installer of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.4-2GB.exe 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.4-6GB.exe 6GB version, 64bit Java]; in addition, we provide a [http://www.jstacs.de/downloads/AnnoTALE-1.4-1GB.exe 1GB version with 32bit Java] for earlier and 32bit versions of Windows. Please use this version only if absolutely necessary, as some tools may not work due to memory restrictions.<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.4.jar AnnoTALE 1.4 command line application]<br />
<br />
<br />
'''Version 1.3:'''<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.3.jar AnnoTALE 1.3 Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE 1.3 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.3-2GB.dmg AnnoTALE 1.3 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.3-6GB.dmg AnnoTALE 1.3 6GB version]<br />
* Windows installer of AnnoTALE 1.3 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.3-2GB.exe AnnoTALE 1.3 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.3-6GB.exe AnnoTALE 1.3 6GB version, 64bit Java]; [http://www.jstacs.de/downloads/AnnoTALE-1.3-1GB.exe AnnoTALE 1.3 1GB version with 32bit Java]<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.3.jar AnnoTALE 1.3 command line application]<br />
<br />
Changes:<br />
* modified format of Class Builder files allowing for faster download using the "Load and View TALE Classes" tool; old Class Builder files can still be loaded<br />
* "TALE Class Presence" now also outputs a phylogenetic tree of strains based on TALEome similarities<br />
<br />
<br />
'''Version 1.2:'''<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.2.jar AnnoTALE 1.2 Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE 1.2 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.2-2GB.dmg AnnoTALE 1.2 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.2-6GB.dmg AnnoTALE 1.2 6GB version]<br />
* Windows installer of AnnoTALE 1.2 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.2-2GB.exe AnnoTALE 1.2 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.2-6GB.exe AnnoTALE 1.2 6GB version, 64bit Java]; [http://www.jstacs.de/downloads/AnnoTALE-1.2-1GB.exe AnnoTALE 1.2 1GB version with 32bit Java]<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.2.jar AnnoTALE 1.2 command line application]<br />
<br />
Changes:<br />
* Results and loaded files may now be renamed in the GUI by clicking on the corresponding name in the "Data" panel<br />
* Minor bugfixes and improvements of the GUI (Protocol may be erased, columns in "Data" panel renamed for clarity, consistency of paths in the open/save dialogs under Linux)<br />
* Two new tools: "TALE Class Presence" and "TALE Repeat Differences"<br />
<br />
'''Version 1.1:'''<br />
<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.1.jar AnnoTALE 1.1 Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE 1.1 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.1-2GB.dmg AnnoTALE 1.1 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.1-6GB.dmg AnnoTALE 1.1 6GB version]<br />
* Windows installer of AnnoTALE 1.1 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.1-2GB.exe AnnoTALE 1.1 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.1-6GB.exe AnnoTALE 1.1 6GB version, 64bit Java]; [http://www.jstacs.de/downloads/AnnoTALE-1.1-1GB.exe AnnoTALE 1.1 1GB version with 32bit Java]<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.1.jar AnnoTALE 1.1 command line application]<br />
<br />
Changes:<br />
* Additional output for the "Load and View TALE Classes" tool<br />
* "TALE Class Builder" and "TALE Class Assignment" now also accept RVD sequences (separated by dashes) as input. However, this is not recommended and some features (e.g., highlighting of aberrant repeats) will not be available. Only complete TALE DNA sequences will be accepted for inclusion into the official Class Builder.<br />
* The internal help pages now link to the PDF User Guide<br />
<br />
'''Version 1.0:'''<br />
<br />
''Initial AnnoTALE release''<br />
<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.0.jar AnnoTALE 1.0 Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.0-2GB.dmg AnnoTALE 1.0 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.0-6GB.dmg AnnoTALE 1.0 6GB version]<br />
* Windows installer of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.0-2GB.exe AnnoTALE 1.0 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.0-6GB.exe AnnoTALE 1.0 6GB version, 64bit Java]; [http://www.jstacs.de/downloads/AnnoTALE-1.0-1GB.exe AnnoTALE 1.0 1GB version with 32bit Java]<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.0.jar AnnoTALE 1.0 command line application]<br />
<br />
=== Class Builders ===<br />
<br />
* [http://www.jstacs.de/downloads/class_definitions_09_05_2021.xml.gz Version 09/05/2021]: used for "Download current definition" in "Load and View TALE Classes" within AnnoTALE version 1.4.1 and later<br />
* [http://www.jstacs.de/downloads/class_definitions_10_10_2020.xml.gz Version 10/10/2020]: compatible with AnnoTALE version 1.4.1 and later<br />
* [http://www.jstacs.de/downloads/class_definitions_20_06_2019.xml.gz Version 20/06/2019]: compatible with AnnoTALE version 1.4.1 and later<br />
* [http://www.jstacs.de/downloads/class_definitions_29_09_2018.xml.gz Version 29/09/2018]: used for "Download current definition" in "Load and View TALE Classes" within AnnoTALE version 1.3 and later<br />
* [http://www.jstacs.de/downloads/class_definitions_09_03_2017.xml Version 09/03/2017]: used for "Download current definition" in "Load and View TALE Classes" within AnnoTALE version 1.2 and earlier<br />
* [http://www.jstacs.de/downloads/class_definitions_11_03_2016.xml Version 11/03/2016]<br />
* [http://www.jstacs.de/downloads/class_definitions_29_01_2016.xml Version 29/01/2016]<br />
* [http://www.jstacs.de/downloads/class_definitions_19_10.xml Version 19/10/2015]: used in the AnnoTALE publication (Grau ''et al.'', Sci Rep, 2016)</div>Grauhttps://www.jstacs.de/index.php?title=AnnoTALE&diff=1146AnnoTALE2021-05-12T14:58:52Z<p>Grau: /* AnnoTALE command line application */</p>
<hr />
<div>[[File:AnnoTALE.png|130px|left]]<br />
Transcription activator-like effectors (TALEs) are virulence factors of plant-pathogenic Xanthomonas spp. that function as gene activators inside plant host cells.<br />
<br />
AnnoTALE is a suite of applications for identifying and analysing TALEs in Xanthomonas genomes, for clustering TALEs into classes by their RVD sequences, for assigning novel TALEs to existing classes, for proposing TALE names using a unified nomenclature, and for predicting targets of individual TALEs and TALE classes.<br />
<br />
AnnoTALE is available as a JavaFX-based stand-alone application with graphical user interface for interactive analysis sessions. <br />
In addition, we provide a command line application that may be integrated into other pipelines. <br />
Both use identical code for the actual analysis, ensuring consistent results between both versions.<br />
<br />
<br />
<br />
If you use AnnoTALE, please cite:<br />
<br />
Jan Grau, Maik Reschke, Annett Erkes, Jana Streubel, Richard D. Morgan, Geoffrey G. Wilson, Ralf Koebnik and Jens Boch. [http://www.nature.com/articles/srep21077 AnnoTALE: bioinformatics tools for identification, annotation, and nomenclature of TALEs from ''Xanthomonas'' genomic sequences]. Scientific Reports 6:21077, DOI: 10.1038/srep21077, 2016.<br />
<br />
<br />
For evolution-related studies using the comparative features of AnnoTALE, please also cite:<br />
<br />
Annett Erkes, Maik Reschke, Jens Boch, and Jan Grau. [https://doi.org/10.1093/gbe/evx108 Evolution of transcription activator-like effectors in Xanthomonas oryzae]. Genome Biology and Evolution, 9(6):1599–1615, 2017.<br />
<br />
<br />
If you use PrediTALE for predicting TALE targets, please also cite:<br />
<br />
Annett Erkes, Stefanie Mücke, Maik Reschke, Jens Boch, and Jan Grau. [https://doi.org/10.1371/journal.pcbi.1007206 PrediTALE: A novel model learned from quantitative data allows for new perspectives on TALE targeting]. PLOS Computational Biology, 15(7):1–31, 2019.<br />
<br />
<br />
'''Important:''' If you would like to use the unified nomenclature of AnnoTALE in one of your publications including new TALEs or sequenced genomes, please contact us (grau@informatik.uni-halle.de) to organize the inclusion of your TALEs into the official class definition of AnnoTALE and to create stable TALE names that are unique to your TALEs.<br />
<br />
<br />
== AnnoTALE with GUI ==<br />
<br />
[[File:AnnoTALEscreenshot.jpg]]<br />
<br />
AnnoTALE is based on the implementation of JavaFX in Java >=8.<br />
<br />
We provide AnnoTALE as a runnable JAR file for those with a current version of Java 8 (at least update 45) on their machine.<br />
<br />
For convenience, we also provide pre-packaged versions of AnnoTALE for Mac OS X and Windows that include Java in the required version. Each of these is available in two variants with different memory requirements (2GB and 6GB). As long as the main memory (RAM) of your machine is sufficient, we recommend using the 6GB version of AnnoTALE.<br />
<br />
<br />
=== Download ===<br />
<br />
''AnnoTALE is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.''<br />
<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.5.jar Runnable Jar] (requires installed Java >= 8, update 45), may be run under Linux, macOS and Windows<br />
* macOS app: [http://www.jstacs.de/downloads/AnnoTALE-1.5.app-2GB.zip 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.5.app-6GB.zip 6GB version], ZIP archive containing a macOS app including AnnoTALE and all required Java modules. To run this app, you may need to explicitly grant it permission to run in "System Preferences" -> "Security & Privacy" -> "General", which should list AnnoTALE after the first (possibly unsuccessful) attempt to start it.<br />
* Windows installer of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.5-2GB.exe 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.5-6GB.exe 6GB version, 64bit Java]<br />
* Windows version without installer: [http://www.jstacs.de/downloads/AnnoTALE-1.5-win.zip 6GB version, 64bit Java], ZIP archive containing AnnoTALE, all required Java modules, and a Windows batch file. For starting AnnoTALE, double-click AnnoTALE.bat.<br />
<br />
=== Source code ===<br />
<br />
The AnnoTALE source code is available from [https://github.com/Jstacs/Jstacs/tree/master/projects/xanthogenomes GitHub].<br />
<br />
<br />
=== User Guide ===<br />
<br />
We provide an [http://www.jstacs.de/downloads/AnnoTALE-UserGuide-1.0.pdf AnnoTALE User Guide] in PDF format, including a detailed description of all AnnoTALE tools and installation instructions.<br />
<br />
== AnnoTALE command line application ==<br />
<br />
The AnnoTALE command line application is available as a [http://www.jstacs.de/downloads/AnnoTALEcli-1.5.jar runnable Jar]. For running the program and a quick help, type<br />
<br />
java -jar AnnoTALEcli-1.5.jar<br />
<br />
For larger analyses, it might be necessary to increase the memory allocated by the Java VM using the <code>-Xms</code> and <code>-Xmx</code> parameters, for instance<br />
java -Xms512M -Xmx6G -jar AnnoTALEcli-1.5.jar<br />
<br />
There is no separate User Guide for the AnnoTALE command line application. However, the [http://www.jstacs.de/downloads/AnnoTALE-UserGuide-1.0.pdf User Guide for the GUI version] describes all AnnoTALE tools together with their parameters and outputs, which are identical in the CLI version.<br />
<br />
You obtain a list of all AnnoTALE tools by calling<br />
<br />
java -jar AnnoTALEcli-1.5.jar<br />
<br />
Output:<br />
<br />
Available tools:<br />
<br />
predict - TALE Prediction<br />
analyze - TALE Analysis<br />
build - TALE Class Builder<br />
loadAndView - Load and View TALE Classes<br />
assign - TALE Class Assignment<br />
rename - Rename TALEs in File<br />
targets - Predict and Intersect Targets<br />
presence - TALE Class Presence<br />
repdiff - TALE Repeat Differences<br />
preditale - PrediTALE<br />
dertale - DerTALE<br />
<br />
Syntax: java -jar AnnoTALEcli-1.5.jar <toolname> [<parameter=value> ...]<br />
<br />
Further info about the tools is given with<br />
java -jar AnnoTALEcli-1.5.jar <toolname> info<br />
<br />
Tool parameters are listed with<br />
java -jar AnnoTALEcli-1.5.jar <toolname><br />
<br />
You get a list of input parameters by calling AnnoTALEcli-1.5.jar with the corresponding tool name, e.g.,<br />
<br />
java -jar AnnoTALEcli-1.5.jar predict<br />
<br />
Output:<br />
<br />
At least one parameter has not been set (correctly):<br />
<br />
Parameters of tool "TALE Prediction" (predict):<br />
g - Genome (The input Xanthomonas genome in FastA or Genbank format) = null<br />
s - Strain (The name of the strain, will be used for annotated TALEs, OPTIONAL) = null<br />
outdir - The output directory, defaults to the current working directory (.) = .<br />
<br />
You get a description of each tool by calling AnnoTALEcli-1.5.jar with the corresponding tool name and keyword "info", e.g.,<br />
<br />
java -jar AnnoTALEcli-1.5.jar predict info<br />
<br />
Output:<br />
A detailed description of all tools is available in the AnnoTALE User Guide (http://www.jstacs.de/downloads/AnnoTALE-UserGuide-1.0.pdf).<br />
<br />
*TALE Prediction* predicts transcription activator-like effector (TALE) genes in an input sequence, typically a 'Xanthomonas' genome.<br />
<br />
'TALE Prediction' is based in HMMer nucleotide HMM models that describe N-terminus, repeat region, and C-terminus of TALEs.<br />
<br />
The input 'Genome' may be provided in FastA or Genbank format. <br />
Optionally, you may provide a strain name that will be used in the temporary TALE names and names of output files.<br />
<br />
Regardless of the input format, 'TALE Prediction' generates output in Genbank format containing the annotations of TALE genes. If the original input has already been a Genbank file, TALE annotations are added to the existing ones.<br />
In addition, 'TALE Prediction' generates annotations in GFF format, and also outputs the DNA and AS sequences of the predicted TALEs in FastA format.<br />
<br />
'TALE Prediction' tries hard to make the CDS annotation a proper gene model, starting from a start codon and ending with a Stop. If either start or stop codon are located within the originally predicted region that is homologous to TALE genes, this original hit region is still reported as mRNA.<br />
Putative pseudo genes, e.g., with premature stop codons, are marked accordingly.<br />
<br />
The TALE DNA sequences output of 'TALE Prediction' may serve as input of the 'TALE Analysis', 'TALE Class Builder', and 'TALE Class Assignment' tools.<br />
<br />
If you experience problems using 'TALE Prediction', please contact us.<br />
<br />
=== Standard pipeline ===<br />
<br />
Assuming that your current working directory contains the AnnoTALEcli Jar file, a genome of interest (of a hypothetical 'Xoo' strain PXO999 with accession CP1234567) in a FastA file "genome.fa", all rice promoters in a FastA file "Rice-promoters.fa", and a directory "out" designated to hold all output files, a typical AnnoTALE pipeline could look like<br />
<br />
java -jar AnnoTALEcli-1.5.jar predict g=genome.fa outdir=out<br />
<br />
java -jar AnnoTALEcli-1.5.jar analyze t=out/TALE_DNA_sequences.fasta outdir=out<br />
<br />
java -jar AnnoTALEcli-1.5.jar loadAndView outdir=out<br />
<br />
java -jar AnnoTALEcli-1.5.jar assign c=out/Class_builder_download.xml t=out/TALE_DNA_parts.fasta s="Xoo PXO999" a="CP1234567" outdir=out<br />
<br />
java -jar AnnoTALEcli-1.5.jar rename r=out/TALE_names_\(Xoo_PXO999\).tsv i=out/Genbank__TALE_predictions.gb outdir=out<br />
<br />
java -jar AnnoTALEcli-1.5.jar targets i=Rice-promoters.fa p="TALEs in class builder" c=out/Augmented_class_builder_\(Xoo_PXO999\).xml outdir=out<br />
<br />
Afterwards, you will find the output files of all these tools in the directory "out". The output files and directories are named in analogy to the names in the AnnoTALE GUI version (see [http://www.jstacs.de/downloads/AnnoTALE-UserGuide-1.0.pdf User Guide for the GUI version]).<br />
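<br />
The calls of the standard pipeline could be collected in a small shell script. The following is only a sketch under the assumptions stated above (Jar file, "genome.fa", "Rice-promoters.fa" and the directory "out" in the current working directory). The <code>run</code> helper and the <code>DRY_RUN</code> variable are hypothetical and not part of AnnoTALE: with the default <code>DRY_RUN=1</code> the script only prints the calls for inspection, with <code>DRY_RUN=0</code> it executes them. The pattern of replacing blanks in the strain name by underscores in output file names is inferred from the example calls above.<br />

```shell
#!/bin/sh
# Sketch of the standard pipeline; tool names and parameters are copied from the calls above.
set -eu

JAR="AnnoTALEcli-1.5.jar"   # assumed to be in the current working directory
OUT="out"                   # output directory from the example
STRAIN="Xoo PXO999"         # hypothetical strain from the example
ACCESSION="CP1234567"       # hypothetical accession from the example
DRY_RUN="${DRY_RUN:-1}"     # hypothetical switch: 1 = print only, 0 = also execute

# Hypothetical helper: print one AnnoTALE call and, unless in dry-run mode, execute it.
# The printed line is for inspection only (quotes around values with blanks are not shown).
run() {
  echo "java -jar $JAR $*"
  if [ "$DRY_RUN" = "0" ]; then
    java -jar "$JAR" "$@"
  fi
}

# "Xoo PXO999" -> "Xoo_PXO999", as in the output file names of the example
TAG=$(printf '%s' "$STRAIN" | tr ' ' '_')

run predict g=genome.fa outdir="$OUT"
run analyze t="$OUT/TALE_DNA_sequences.fasta" outdir="$OUT"
run loadAndView outdir="$OUT"
run assign c="$OUT/Class_builder_download.xml" t="$OUT/TALE_DNA_parts.fasta" s="$STRAIN" a="$ACCESSION" outdir="$OUT"
run rename r="$OUT/TALE_names_($TAG).tsv" i="$OUT/Genbank__TALE_predictions.gb" outdir="$OUT"
run targets i=Rice-promoters.fa p="TALEs in class builder" c="$OUT/Augmented_class_builder_($TAG).xml" outdir="$OUT"
```

Because of <code>set -eu</code>, a real run (<code>DRY_RUN=0</code>) stops at the first failing tool, so later steps never operate on missing input files.<br />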
<br />
==Version history==<br />
<br />
===AnnoTALE===<br />
'''Version 1.4.1'''<br />
* first version to use the updated Class Builder including a large number of recently sequenced strains<br />
* minor changes to the output of the 'Load and View TALE Classes' tool, now including the accessions in the TALE sequence output<br />
* changes to the Class Builder format to account for the increased size of class hierarchy, which previously resulted in unnecessarily large files<br />
* 32bit/1GB Windows version no longer included<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.4.1.jar Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-2GB.dmg 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-6GB.dmg 6GB version]<br />
* Windows installer of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-2GB.exe 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-6GB.exe 6GB version, 64bit Java]<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.4.1.jar AnnoTALE 1.4.1 command line application]<br />
<br />
<br />
'''Version 1.4:'''<br />
* first version containing [[PrediTALE]] and DerTALE tools for target site prediction<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.4.jar Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.4-2GB.dmg 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.4-6GB.dmg 6GB version]<br />
* Windows installer of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.4-2GB.exe 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.4-6GB.exe 6GB version, 64bit Java]; in addition, we provide a [http://www.jstacs.de/downloads/AnnoTALE-1.4-1GB.exe 1GB version with 32bit Java] for earlier and 32bit versions of Windows. Please use this version only if absolutely necessary, as some tools may not work due to memory restrictions.<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.4.jar AnnoTALE 1.4 command line application]<br />
<br />
<br />
'''Version 1.3:'''<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.3.jar AnnoTALE 1.3 Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE 1.3 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.3-2GB.dmg AnnoTALE 1.3 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.3-6GB.dmg AnnoTALE 1.3 6GB version]<br />
* Windows installer of AnnoTALE 1.3 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.3-2GB.exe AnnoTALE 1.3 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.3-6GB.exe AnnoTALE 1.3 6GB version, 64bit Java]; [http://www.jstacs.de/downloads/AnnoTALE-1.3-1GB.exe AnnoTALE 1.3 1GB version with 32bit Java]<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.3.jar AnnoTALE 1.3 command line application]<br />
<br />
Changes:<br />
* modified format of Class Builder files allowing for faster download using the "Load and View TALE Classes" tool; old Class Builder files can still be loaded<br />
* "TALE Class Presence" now also outputs a phylogenetic tree of strains based on TALEome similarities<br />
<br />
<br />
'''Version 1.2:'''<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.2.jar AnnoTALE 1.2 Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE 1.2 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.2-2GB.dmg AnnoTALE 1.2 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.2-6GB.dmg AnnoTALE 1.2 6GB version]<br />
* Windows installer of AnnoTALE 1.2 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.2-2GB.exe AnnoTALE 1.2 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.2-6GB.exe AnnoTALE 1.2 6GB version, 64bit Java]; [http://www.jstacs.de/downloads/AnnoTALE-1.2-1GB.exe AnnoTALE 1.2 1GB version with 32bit Java]<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.2.jar AnnoTALE 1.2 command line application]<br />
<br />
Changes:<br />
* Results and loaded files may now be renamed in the GUI by clicking on the corresponding name in the "Data" panel<br />
* Minor bugfixes and improvements of the GUI (Protocol may be erased, columns in "Data" panel renamed for clarity, consistency of paths in the open/save dialogs under Linux)<br />
* Two new tools: "TALE Class Presence" and "TALE Repeat differences"<br />
<br />
'''Version 1.1:'''<br />
<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.1.jar AnnoTALE 1.1 Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE 1.1 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.1-2GB.dmg AnnoTALE 1.1 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.1-6GB.dmg AnnoTALE 1.1 6GB version]<br />
* Windows installer of AnnoTALE 1.1 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.1-2GB.exe AnnoTALE 1.1 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.1-6GB.exe AnnoTALE 1.1 6GB version, 64bit Java]; [http://www.jstacs.de/downloads/AnnoTALE-1.1-1GB.exe AnnoTALE 1.1 1GB version with 32bit Java]<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.1.jar AnnoTALE 1.1 command line application]<br />
<br />
Changes:<br />
* Additional output for the "Load and View TALE Classes" tool<br />
* "TALE Class Builder" and "TALE Class Assignment" now also accept RVD sequences (separated by dashes) as input. However, this is not recommended and some features (e.g., highlighting of aberrant repeats) will not be available. Only complete TALE DNA sequences will be accepted for inclusion into the official Class Builder.<br />
* The internal help pages now link to the PDF User Guide<br />
<br />
'''Version 1.0:'''<br />
<br />
''Initial AnnoTALE release''<br />
<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.0.jar AnnoTALE 1.0 Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.0-2GB.dmg AnnoTALE 1.0 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.0-6GB.dmg AnnoTALE 1.0 6GB version]<br />
* Windows installer of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.0-2GB.exe AnnoTALE 1.0 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.0-6GB.exe AnnoTALE 1.0 6GB version, 64bit Java]; [http://www.jstacs.de/downloads/AnnoTALE-1.0-1GB.exe AnnoTALE 1.0 1GB version with 32bit Java]<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.0.jar AnnoTALE 1.0 command line application]<br />
<br />
=== Class Builders ===<br />
<br />
* [http://www.jstacs.de/downloads/class_definitions_09_05_2021.xml.gz Version 09/05/2021]: used for "Download current definition" in "Load and View TALE Classes" within AnnoTALE version 1.4.1 and later<br />
* [http://www.jstacs.de/downloads/class_definitions_10_10_2020.xml.gz Version 10/10/2020]: compatible with AnnoTALE version 1.4.1 and later<br />
* [http://www.jstacs.de/downloads/class_definitions_20_06_2019.xml.gz Version 20/06/2019]: compatible with AnnoTALE version 1.4.1 and later<br />
* [http://www.jstacs.de/downloads/class_definitions_29_09_2018.xml.gz Version 29/09/2018]: used for "Download current definition" in "Load and View TALE Classes" within AnnoTALE version 1.3 and later<br />
* [http://www.jstacs.de/downloads/class_definitions_09_03_2017.xml Version 09/03/2017]: used for "Download current definition" in "Load and View TALE Classes" within AnnoTALE version 1.2 and earlier<br />
* [http://www.jstacs.de/downloads/class_definitions_11_03_2016.xml Version 03/11/2016]<br />
* [http://www.jstacs.de/downloads/class_definitions_29_01_2016.xml Version 01/29/2016]<br />
* [http://www.jstacs.de/downloads/class_definitions_19_10.xml Version 10/19/2015]: used in the AnnoTALE publication (Grau ''et al.'', Sci Rep, 2016)</div>Grauhttps://www.jstacs.de/index.php?title=AnnoTALE&diff=1145AnnoTALE2021-05-12T14:57:00Z<p>Grau: /* Download */</p>
<hr />
<div>[[File:AnnoTALE.png|130px|left]]<br />
Transcription activator-like effectors (TALEs) are virulence factors of plant-pathogenic Xanthomonas spp. that function as gene activators inside plant host cells.<br />
<br />
AnnoTALE is a suite of applications for identifying and analysing TALEs in Xanthomonas genomes, for clustering TALEs into classes by their RVD sequences, for assigning novel TALEs to existing classes, for proposing TALE names using a unified nomenclature, and for predicting targets of individual TALEs and TALE classes.<br />
<br />
AnnoTALE is available as a JavaFX-based stand-alone application with graphical user interface for interactive analysis sessions. <br />
In addition, we provide a command line application that may be integrated into other pipelines. <br />
Both use identical code for the actual analysis, ensuring consistent results between both versions.<br />
<br />
<br />
<br />
If you use AnnoTALE, please cite:<br />
<br />
Jan Grau, Maik Reschke, Annett Erkes, Jana Streubel, Richard D. Morgan, Geoffrey G. Wilson, Ralf Koebnik and Jens Boch. [http://www.nature.com/articles/srep21077 AnnoTALE: bioinformatics tools for identification, annotation, and nomenclature of TALEs from ''Xanthomonas'' genomic sequences]. Scientific Reports 6:21077, DOI: 10.1038/srep21077, 2016.<br />
<br />
<br />
For evolution-related studies using the comparative features of AnnoTALE, please also cite:<br />
<br />
Annett Erkes, Maik Reschke, Jens Boch, and Jan Grau. [https://doi.org/10.1093/gbe/evx108 Evolution of transcription activator-like effectors in Xanthomonas oryzae]. Genome Biology and Evolution, 9(6):1599–1615, 2017.<br />
<br />
<br />
If you use PrediTALE for predicting TALE targets, please also cite:<br />
<br />
Annett Erkes, Stefanie Mücke, Maik Reschke, Jens Boch, and Jan Grau. [https://doi.org/10.1371/journal.pcbi.1007206 PrediTALE: A novel model learned from quantitative data allows for new perspectives on TALE targeting]. PLOS Computational Biology, 15(7):1–31, 2019.<br />
<br />
<br />
'''Important:''' If you would like to use the unified nomenclature of AnnoTALE in one of your publications including new TALEs or sequenced genomes, please contact us (grau@informatik.uni-halle.de) to organize the inclusion of your TALEs into the official class definition of AnnoTALE and to create stable TALE names that are unique to your TALEs.<br />
<br />
<br />
== AnnoTALE with GUI ==<br />
<br />
[[File:AnnoTALEscreenshot.jpg]]<br />
<br />
AnnoTALE is based on the implementation of JavaFX in Java >=8.<br />
<br />
We provide AnnoTALE as a runnable JAR file for those with a current version of Java 8 (at least update 45) on their machine.<br />
<br />
For users' convenience, we also provide pre-packaged versions of AnnoTALE for Mac OS X and Windows, which include Java in the required version. Each of these is available in two versions with different memory requirements (2GB and 6GB). As long as the main memory (RAM) of your machine is sufficient, we recommend using the 6GB version of AnnoTALE.<br />
<br />
<br />
=== Download ===<br />
<br />
''AnnoTALE is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.''<br />
<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.5.jar Runnable Jar] (requires installed Java >= 8, update 45), may be run under Linux, macOS and Windows<br />
* macOS app: [http://www.jstacs.de/downloads/AnnoTALE-1.5.app-2GB.zip 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.5.app-6GB.zip 6GB version], ZIP archive containing a macOS app including AnnoTALE and all required Java modules. For running this app, it might be required to explicitly give it running permissions in "System Preferences" -> "Security & Privacy" -> "General", which should list AnnoTALE after the first (possibly unsuccessful) starting attempt.<br />
* Windows installer of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.5-2GB.exe 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.5-6GB.exe 6GB version, 64bit Java]<br />
* Windows version without installer: [http://www.jstacs.de/downloads/AnnoTALE-1.5-win.zip 6GB version, 64bit Java], ZIP archive containing AnnoTALE, all required Java modules, and a Windows batch file. For starting AnnoTALE, double-click AnnoTALE.bat.<br />
<br />
=== Source code ===<br />
<br />
The AnnoTALE source code is available from [https://github.com/Jstacs/Jstacs/tree/master/projects/xanthogenomes github].<br />
<br />
<br />
=== User Guide ===<br />
<br />
We provide an [http://www.jstacs.de/downloads/AnnoTALE-UserGuide-1.0.pdf AnnoTALE User Guide] in PDF format, including a detailed description of all AnnoTALE tools and installation instructions.<br />
<br />
== AnnoTALE command line application ==<br />
<br />
The AnnoTALE command line application is available as a [http://www.jstacs.de/downloads/AnnoTALEcli-1.4.1.jar runnable Jar]. For running the program and a quick help, type<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar<br />
<br />
For larger analyses, it might be necessary to increase the memory allocated by the JavaVM using the <code>-Xms</code> and <code>-Xmx</code> parameters, for instance<br />
java -Xms512M -Xmx6G -jar AnnoTALEcli-1.4.1.jar<br />
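<br />
The JavaVM options go before <code>-jar</code>, while the tool name and its parameters follow the Jar file name as usual. A minimal sketch, combining the memory settings above with a <code>predict</code> call; the input file "genome.fa" and the output directory "out" are placeholders, and the call is only printed if the Jar or the genome file is missing:<br />

```shell
# -Xms sets the initial and -Xmx the maximum heap size of the JavaVM;
# everything after the Jar file name is passed on to AnnoTALE.
JAR="AnnoTALEcli-1.4.1.jar"
JAVA_OPTS="-Xms512M -Xmx6G"

if [ -f "$JAR" ] && [ -f genome.fa ]; then
  # $JAVA_OPTS is intentionally unquoted so that it expands to two separate options
  java $JAVA_OPTS -jar "$JAR" predict g=genome.fa outdir=out
else
  echo "would run: java $JAVA_OPTS -jar $JAR predict g=genome.fa outdir=out"
fi
```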
<br />
There is no separate User Guide for the AnnoTALE command line application. The [http://www.jstacs.de/downloads/AnnoTALE-UserGuide-1.0.pdf User Guide for the GUI version] describes all AnnoTALE tools with their parameters and outputs, which are identical in the CLI version.<br />
<br />
You obtain a list of all AnnoTALE tools by calling<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar<br />
<br />
Output:<br />
<br />
Available tools:<br />
<br />
predict - TALE Prediction<br />
analyze - TALE Analysis<br />
build - TALE Class Builder<br />
loadAndView - Load and View TALE Classes<br />
assign - TALE Class Assignment<br />
rename - Rename TALEs in File<br />
targets - Predict and Intersect Targets<br />
presence - TALE Class Presence<br />
repdiff - TALE Repeat Differences<br />
preditale - PrediTALE<br />
dertale - DerTALE<br />
<br />
Syntax: java -jar AnnoTALEcli-1.4.1.jar <toolname> [<parameter=value> ...]<br />
<br />
Further info about the tools is given with<br />
java -jar AnnoTALEcli-1.4.1.jar <toolname> info<br />
<br />
Tool parameters are listed with<br />
java -jar AnnoTALEcli-1.4.1.jar <toolname><br />
<br />
You get a list of input parameters by calling AnnoTALEcli-1.4.1.jar with the corresponding tool name, e.g.,<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar predict<br />
<br />
Output:<br />
<br />
At least one parameter has not been set (correctly):<br />
<br />
Parameters of tool "TALE Prediction" (predict):<br />
g - Genome (The input Xanthomonas genome in FastA or Genbank format) = null<br />
s - Strain (The name of the strain, will be used for annotated TALEs, OPTIONAL) = null<br />
outdir - The output directory, defaults to the current working directory (.) = .<br />
<br />
You get a description of each tool by calling AnnoTALEcli-1.4.1.jar with the corresponding tool name and keyword "info", e.g.,<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar predict info<br />
<br />
Output:<br />
A detailed description of all tools is available in the AnnoTALE User Guide (http://www.jstacs.de/downloads/AnnoTALE-UserGuide-1.0.pdf).<br />
<br />
*TALE Prediction* predicts transcription activator-like effector (TALE) genes in an input sequence, typically a 'Xanthomonas' genome.<br />
<br />
'TALE Prediction' is based in HMMer nucleotide HMM models that describe N-terminus, repeat region, and C-terminus of TALEs.<br />
<br />
The input 'Genome' may be provided in FastA or Genbank format. <br />
Optionally, you may provide a strain name that will be used in the temporary TALE names and names of output files.<br />
<br />
Regardless of the input format, 'TALE Prediction' generates output in Genbank format containing the annotations of TALE genes. If the original input has already been a Genbank file, TALE annotations are added to the existing ones.<br />
In addition, 'TALE Prediction' generates annotations in GFF format, and also outputs the DNA and AS sequences of the predicted TALEs in FastA format.<br />
<br />
'TALE Prediction' tries hard to make the CDS annotation a proper gene model, starting from a start codon and ending with a Stop. If either start or stop codon are located within the originally predicted region that is homologous to TALE genes, this original hit region is still reported as mRNA.<br />
Putative pseudo genes, e.g., with premature stop codons, are marked accordingly.<br />
<br />
The TALE DNA sequences output of 'TALE Prediction' may serve as input of the 'TALE Analysis', 'TALE Class Builder', and 'TALE Class Assignment' tools.<br />
<br />
If you experience problems using 'TALE Prediction', please contact us.<br />
<br />
=== Standard pipeline ===<br />
<br />
Assuming that your current working directory contains the AnnoTALEcli Jar file, a genome of interest (of a hypothetical 'Xoo' strain PXO999 with accession CP1234567) in a FastA file "genome.fa", all rice promoters in a FastA file "Rice-promoters.fa", and a directory "out" designated to hold all output files, a typical AnnoTALE pipeline could look like<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar predict g=genome.fa outdir=out<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar analyze t=out/TALE_DNA_sequences.fasta outdir=out<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar loadAndView outdir=out<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar assign c=out/Class_builder_download.xml t=out/TALE_DNA_parts.fasta s="Xoo PXO999" a="CP1234567" outdir=out<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar rename r=out/TALE_names_\(Xoo_PXO999\).tsv i=out/Genbank__TALE_predictions.gb outdir=out<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar targets i=Rice-promoters.fa p="TALEs in class builder" c=out/Augmented_class_builder_\(Xoo_PXO999\).xml outdir=out<br />
<br />
Afterwards, you will find the output files of all these tools in the directory "out". The output files and directories are named in analogy to the names in the AnnoTALE GUI version (see [http://www.jstacs.de/downloads/AnnoTALE-UserGuide-1.0.pdf User Guide for the GUI version]).<br />
<br />
==Version history==<br />
<br />
===AnnoTALE===<br />
'''Version 1.4.1'''<br />
* first version to use the updated Class Builder including a large number of recently sequenced strains<br />
* minor changes to the output of the 'Load and View TALE Classes' tool, now including the accessions in the TALE sequence output<br />
* changes to the Class Builder format to account for the increased size of class hierarchy, which previously resulted in unnecessarily large files<br />
* 32bit/1GB Windows version no longer included<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.4.1.jar Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-2GB.dmg 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-6GB.dmg 6GB version]<br />
* Windows installer of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-2GB.exe 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-6GB.exe 6GB version, 64bit Java]<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.4.1.jar AnnoTALE 1.4.1 command line application]<br />
<br />
<br />
'''Version 1.4:'''<br />
* first version containing [[PrediTALE]] and DerTALE tools for target site prediction<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.4.jar Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.4-2GB.dmg 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.4-6GB.dmg 6GB version]<br />
* Windows installer of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.4-2GB.exe 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.4-6GB.exe 6GB version, 64bit Java]; in addition, we provide a [http://www.jstacs.de/downloads/AnnoTALE-1.4-1GB.exe 1GB version with 32bit Java] for earlier and 32bit versions of Windows. Please use this version only if absolutely necessary, as some tools may not work due to memory restrictions.<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.4.jar AnnoTALE 1.4 command line application]<br />
<br />
<br />
'''Version 1.3:'''<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.3.jar AnnoTALE 1.3 Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE 1.3 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.3-2GB.dmg AnnoTALE 1.3 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.3-6GB.dmg AnnoTALE 1.3 6GB version]<br />
* Windows installer of AnnoTALE 1.3 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.3-2GB.exe AnnoTALE 1.3 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.3-6GB.exe AnnoTALE 1.3 6GB version, 64bit Java]; [http://www.jstacs.de/downloads/AnnoTALE-1.3-1GB.exe AnnoTALE 1.3 1GB version with 32bit Java]<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.3.jar AnnoTALE 1.3 command line application]<br />
<br />
Changes:<br />
* modified format of Class Builder files allowing for faster download using the "Load and View TALE Classes" tool; old Class Builder files can still be loaded<br />
* "TALE Class Presence" now also outputs a phylogenetic tree of strains based on TALEome similarities<br />
<br />
<br />
'''Version 1.2:'''<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.2.jar AnnoTALE 1.2 Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE 1.2 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.2-2GB.dmg AnnoTALE 1.2 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.2-6GB.dmg AnnoTALE 1.2 6GB version]<br />
* Windows installer of AnnoTALE 1.2 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.2-2GB.exe AnnoTALE 1.2 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.2-6GB.exe AnnoTALE 1.2 6GB version, 64bit Java]; [http://www.jstacs.de/downloads/AnnoTALE-1.2-1GB.exe AnnoTALE 1.2 1GB version with 32bit Java]<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.2.jar AnnoTALE 1.2 command line application]<br />
<br />
Changes:<br />
* Results and loaded files may now be renamed in the GUI by clicking on the corresponding name in the "Data" panel<br />
* Minor bugfixes and improvements of the GUI (Protocol may be erased, columns in "Data" panel renamed for clarity, consistency of paths in the open/save dialogs under Linux)<br />
* Two new tools: "TALE Class Presence" and "TALE Repeat differences"<br />
<br />
'''Version 1.1:'''<br />
<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.1.jar AnnoTALE 1.1 Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE 1.1 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.1-2GB.dmg AnnoTALE 1.1 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.1-6GB.dmg AnnoTALE 1.1 6GB version]<br />
* Windows installer of AnnoTALE 1.1 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.1-2GB.exe AnnoTALE 1.1 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.1-6GB.exe AnnoTALE 1.1 6GB version, 64bit Java]; [http://www.jstacs.de/downloads/AnnoTALE-1.1-1GB.exe AnnoTALE 1.1 1GB version with 32bit Java]<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.1.jar AnnoTALE 1.1 command line application]<br />
<br />
Changes:<br />
* Additional output for the "Load and View TALE Classes" tool<br />
* "TALE Class Builder" and "TALE Class Assignment" now also accept RVD sequences (separated by dashes) as input. However, this is not recommended and some features (e.g., highlighting of aberrant repeats) will not be available. Only complete TALE DNA sequences will be accepted for inclusion into the official Class Builder.<br />
* The internal help pages now link to the PDF User Guide<br />
<br />
'''Version 1.0:'''<br />
<br />
''Initial AnnoTALE release''<br />
<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.0.jar AnnoTALE 1.0 Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.0-2GB.dmg AnnoTALE 1.0 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.0-6GB.dmg AnnoTALE 1.0 6GB version]<br />
* Windows installer of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.0-2GB.exe AnnoTALE 1.0 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.0-6GB.exe AnnoTALE 1.0 6GB version, 64bit Java]; [http://www.jstacs.de/downloads/AnnoTALE-1.0-1GB.exe AnnoTALE 1.0 1GB version with 32bit Java]<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.0.jar AnnoTALE 1.0 command line application]<br />
<br />
=== Class Builders ===<br />
<br />
* [http://www.jstacs.de/downloads/class_definitions_09_05_2021.xml.gz Version 09/05/2021]: used for "Download current definition" in "Load and View TALE Classes" within AnnoTALE version 1.4.1 and later<br />
* [http://www.jstacs.de/downloads/class_definitions_10_10_2020.xml.gz Version 10/10/2020]: compatible with AnnoTALE version 1.4.1 and later<br />
* [http://www.jstacs.de/downloads/class_definitions_20_06_2019.xml.gz Version 20/06/2019]: compatible with AnnoTALE version 1.4.1 and later<br />
* [http://www.jstacs.de/downloads/class_definitions_29_09_2018.xml.gz Version 29/09/2018]: used for "Download current definition" in "Load and View TALE Classes" within AnnoTALE version 1.3 and later<br />
* [http://www.jstacs.de/downloads/class_definitions_09_03_2017.xml Version 09/03/2017]: used for "Download current definition" in "Load and View TALE Classes" within AnnoTALE version 1.2 and earlier<br />
* [http://www.jstacs.de/downloads/class_definitions_11_03_2016.xml Version 03/11/2016]<br />
* [http://www.jstacs.de/downloads/class_definitions_29_01_2016.xml Version 01/29/2016]<br />
* [http://www.jstacs.de/downloads/class_definitions_19_10.xml Version 10/19/2015]: used in the AnnoTALE publication (Grau ''et al.'', Sci Rep, 2016)</div>Grauhttps://www.jstacs.de/index.php?title=EpiTALE&diff=1144EpiTALE2021-05-12T14:48:19Z<p>Grau: /* Command line version */</p>
<hr />
<div>[[File:EpiTALE_256.png|130px|left]] EpiTALE predicts binding sites of transcription activator-like effectors (TALEs) in promoteromes or genomes. EpiTALE not only considers the DNA sequence of putative binding sites but also epigenetic determinants of TALE binding, namely DNA methylation and chromatin accessibility. The prediction is based on the same basic model as [[PrediTALE]] but with specific parameters for methylated cytosines reflecting the binding preferences of RVDs.<br />
<br />
Here, we provide a suite of tools including the EpiTALE program itself but also auxiliary tools for converting methylation data and chromatin accessibility data to the required formats, and for converting genomic coordinates to promoter-wise coordinates for promoterome-wide predictions.<br />
<br />
Genome-wide predictions of EpiTALE may further be combined with evidence from RNA-seq data using the DerTALE tool of [[AnnoTALE]].<br />
<br />
The EpiTALE suite is provided in a version with a graphical user interface and in a command line version, which may serve the needs of specific user groups, both using the identical code base.<br />
<br />
In the following, we describe how to obtain the EpiTALE suite and how to use its individual tools. While parameters are described in terms of command line arguments, the same parameters are available in the version with graphical user interface.<br />
<br />
== Download ==<br />
<br />
=== GUI version ===<br />
<br />
* [http://www.jstacs.de/downloads/EpiTALE-0.1.jar Runnable Jar]: requires Java >= 8 including JavaFX installed, may be run under Linux, Windows and macOS.<br />
* [http://www.jstacs.de/downloads/EpiTALE-0.1.app.zip macOS app]: ZIP archive containing a macOS app including EpiTALE and all required Java modules. For running this app, it might be required to explicitly give it running permissions in "System Preferences" -> "Security & Privacy" -> "General", which should list EpiTALE after the first (possibly unsuccessful) starting attempt.<br />
* [http://www.jstacs.de/downloads/EpiTALE-0.1-win.zip Windows program]: ZIP archive containing the EpiTALE Jar, all required Java modules, and a Windows batch file. For starting EpiTALE, double-click EpiTALE.bat.<br />
<br />
=== Command line version ===<br />
<br />
* [http://www.jstacs.de/downloads/EpiTALEcli-0.1.jar Runnable Jar]: requires Java >= 8, may be run under Linux, Windows and macOS. Started with<br />
java -jar EpiTALEcli-0.1.jar<br />
from the command line (for tools and arguments, see below).<br />
<br />
=== Source code ===<br />
<br />
EpiTALE source code is available from [https://github.com/Jstacs/Jstacs/tree/master/projects/tals/epigenetic github].<br />
<br />
== Example data ==<br />
<br />
We provide an archive with example data at [https://doi.org/10.5281/zenodo.4749294 zenodo]. Besides the data, this archive contains the command line version of the EpiTALE suite v0.1 and a bash script demonstrating the complete EpiTALE pipeline.<br />
<br />
== Tools ==<br />
<br />
=== Bed2Bismark ===<br />
<br />
'''Bed2Bismark''' converts methylation information in bedMethyl format to Bismark format.<br />
<br />
The input of '''Bed2Bismark''' is a file in bedMethyl format.<br />
<br />
If you experience problems using '''Bed2Bismark''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Bed2Bismark'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar bed2bismark<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>BedMethyl file (Methylation information in bedMethyl format, type = bed.gz,bed)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar bed2bismark b=&lt;BedMethyl_file&gt;<br />
<br />
<br />
=== BismarkMerge2Files ===<br />
<br />
'''BismarkMerge2Files''' merges files generated by [https://www.bioinformatics.babraham.ac.uk/projects/bismark/ Bismark methylation extractor] with parameters <code>--bedGraph --CX -p</code>.<br />
The output is a coverage file with the tab-separated columns:<br />
<code>chromosome, start_position, end_position, methylation_percentage, count_methylated, count_unmethylated</code>.<br />
<br />
The input of '''BismarkMerge2Files''' consists of two Bismark coverage files.<br />
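<br />
The six-column coverage layout described above can be read with a few lines of Python. This is only an illustrative sketch and not part of the EpiTALE suite: the names <code>CovRecord</code>, <code>parse_cov_line</code> and <code>recomputed_percentage</code> as well as the example line are assumptions made for this example.<br />

```python
from typing import NamedTuple


class CovRecord(NamedTuple):
    """One line of a Bismark coverage file (column names as listed above)."""
    chromosome: str
    start_position: int
    end_position: int
    methylation_percentage: float
    count_methylated: int
    count_unmethylated: int


def parse_cov_line(line: str) -> CovRecord:
    """Split one tab-separated coverage line into typed fields."""
    chrom, start, end, pct, meth, unmeth = line.rstrip("\n").split("\t")
    return CovRecord(chrom, int(start), int(end), float(pct), int(meth), int(unmeth))


def recomputed_percentage(rec: CovRecord) -> float:
    """Recompute the methylation percentage from the two count columns."""
    total = rec.count_methylated + rec.count_unmethylated
    return 100.0 * rec.count_methylated / total if total else 0.0


# Hypothetical example line: 3 methylated and 1 unmethylated read at Chr1:2650
rec = parse_cov_line("Chr1\t2650\t2650\t75.0\t3\t1\n")
print(rec.chromosome, rec.start_position, recomputed_percentage(rec))
```

Recomputing the percentage from the counts is a simple consistency check for a coverage file before it is passed to other tools.<br />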
<br />
If you experience problems using '''BismarkMerge2Files''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
<br />
''BismarkMerge2Files'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar bismerger<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bismark file 1 (Methylation information in Bismark format file 1, type = cov.gz,cov)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bf2</font></td><br />
<td>Bismark file 2 (Methylation information in Bismark format file 2, type = cov.gz,cov)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar bismerger b=&lt;Bismark_file_1&gt; bf2=&lt;Bismark_file_2&gt;<br />
<br />
<br />
=== BismarkConvertToPromoter ===<br />
<br />
'''BismarkConvertToPromoter''' converts the Bismark output file to promoter coordinates.<br />
<br />
The input of '''BismarkConvertToPromoter''' is <br />
1. a Bismark coverage output file, which contains tab-separated columns: <br />
<code>chromosome, start_position, end_position, methylation_percentage, count_methylated, count_unmethylated</code> and <br />
2. the promoter sequences in FastA format with headers like:<br />
<code>> id chromosomeName:start-end:strand</code><br />
e.g.<br />
<code>> Os01g01010.1 Chr1:2602-3102:+</code>.<br />
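<br />
The header format above carries everything needed to relate a genomic position to a promoter. The following Python sketch parses such a header and maps a position to an offset within the promoter. It is an illustration only: the helper names are hypothetical, and the convention used here (0-based offset from the promoter start, strand ignored) is an assumption, not necessarily what '''BismarkConvertToPromoter''' does internally.<br />

```python
import re
from typing import NamedTuple, Optional


class Promoter(NamedTuple):
    gene_id: str
    chromosome: str
    start: int
    end: int
    strand: str


# Matches headers like "> Os01g01010.1 Chr1:2602-3102:+"
HEADER = re.compile(r"^>\s*(\S+)\s+([^:\s]+):(\d+)-(\d+):([+-])\s*$")


def parse_header(header: str) -> Promoter:
    """Parse '> id chromosomeName:start-end:strand' into its parts."""
    match = HEADER.match(header)
    if match is None:
        raise ValueError(f"unexpected promoter header: {header!r}")
    gene_id, chrom, start, end, strand = match.groups()
    return Promoter(gene_id, chrom, int(start), int(end), strand)


def to_promoter_offset(p: Promoter, chromosome: str, position: int) -> Optional[int]:
    """Return the offset of a genomic position from the promoter start,
    or None if the position lies outside the promoter (assumed convention)."""
    if chromosome != p.chromosome or not (p.start <= position <= p.end):
        return None
    return position - p.start


promoter = parse_header("> Os01g01010.1 Chr1:2602-3102:+")
print(promoter)
print(to_promoter_offset(promoter, "Chr1", 2650))
```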
<br />
If you experience problems using '''BismarkConvertToPromoter''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''BismarkConvertToPromoter'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar bis2prom<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bismark file (Methylation information in Bismark format, type = cov.gz,cov)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Promoter fasta file (Promoter fastA file, type = fa,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar bis2prom b=&lt;Bismark_file&gt; p=&lt;Promoter_fasta_file&gt;<br />
<br />
<br />
=== Chromatin pileup ===<br />
<br />
'''Chromatin pileup''' takes as input a BAM file of mapped reads from a DNase-seq or ATAC-seq experiment <br />
and computes a coverage pileup of 5' ends of mapped reads, <br />
and outputs a simple tab-separated file with columns: <br />
<code>chromosome, position,</code> and <code>pileup value</code> (number of reads with a 5' end at this position).<br />
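<br />
The idea of a 5' end pileup can be sketched in Python as follows. The real tool reads a BAM file. To stay self-contained, this sketch instead takes reads as plain tuples, and it assumes that the 5' end of a reverse-strand read is its rightmost mapped position. The function name and the example reads are hypothetical.<br />

```python
from collections import Counter


def five_prime_pileup(reads):
    """Count reads by the position of their 5' end.

    reads: iterable of (chromosome, start, end, strand) tuples with
    inclusive coordinates; strand is "+" or "-".
    """
    pileup = Counter()
    for chrom, start, end, strand in reads:
        # Forward reads begin at their leftmost position; for reverse
        # reads the 5' end is assumed to be the rightmost position.
        five_prime = start if strand == "+" else end
        pileup[(chrom, five_prime)] += 1
    return pileup


# Hypothetical mapped reads
reads = [
    ("Chr1", 100, 150, "+"),
    ("Chr1", 100, 149, "+"),
    ("Chr1", 60, 100, "-"),
    ("Chr1", 200, 250, "-"),
]
pileup = five_prime_pileup(reads)

# Tab-separated output: chromosome, position, pileup value
for (chrom, pos), count in sorted(pileup.items()):
    print(f"{chrom}\t{pos}\t{count}")
```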
<br />
If you experience problems using '''Chromatin pileup''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Chromatin pileup'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar pileup<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>BAM file (Mapped reads from DNase-seq or ATAC-seq experiment, type = bam)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar pileup b=&lt;BAM_file&gt;<br />
<br />
<br />
=== NormalizePileupOutput ===<br />
<br />
'''NormalizePileupOutput''' normalizes the pileup output file, which contains the number of 5' ends of ATAC-seq or DNase-seq reads at each position. The coverage is normalized relative to the mean of a 10000 bp sliding window.<br />
<br />
The input of '''NormalizePileupOutput''' is a pileup output file from the '''Chromatin pileup''' tool.<br />
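<br />
Normalizing relative to a sliding-window mean can be sketched in Python as below. This is a rough illustration under stated assumptions, not the tool's actual code: the window is taken to be approximately centred on each position, it is clipped at the sequence ends, and positions with a window mean of zero are set to 0. The function name and the toy values are hypothetical.<br />

```python
def normalize_pileup(values, window=10000):
    """Divide each per-position value by the mean of a sliding window.

    values: per-position pileup values of one chromosome, in order.
    window: window width in positions (10000 in the description above).
    """
    n = len(values)
    # Prefix sums allow each window sum to be computed in constant time.
    prefix = [0]
    for v in values:
        prefix.append(prefix[-1] + v)

    half = window // 2
    normalized = []
    for i, v in enumerate(values):
        lo = max(0, i - half)            # window start, clipped at 0
        hi = min(n, i - half + window)   # window end (exclusive), clipped at n
        mean = (prefix[hi] - prefix[lo]) / (hi - lo)
        normalized.append(v / mean if mean > 0 else 0.0)
    return normalized


# Toy example with a window of 3 positions
result = normalize_pileup([0, 2, 4, 2, 0], window=3)
print(result)
```

Values above 1 then mark positions with more 5' ends than their local surroundings.<br />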
<br />
If you experience problems using '''NormalizePileupOutput''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''NormalizePileupOutput'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar normpileup<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Pileup output file (Pileup output file., type = tsv.gz,tsv,txt)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar normpileup p=&lt;Pileup_output_file&gt;<br />
<br />
<br />
=== PileupConvertToPromoter ===<br />
<br />
'''PileupConvertToPromoter''' converts the pileup output file to promoter coordinates.<br />
<br />
The input of '''PileupConvertToPromoter''' is <br />
1. a normalized pileup output file from the '''NormalizePileupOutput''' tool and <br />
2. the promoter sequences in FastA format with headers like:<br />
<code>> id chromosomeName:start-end:strand</code><br />
e.g.<br />
<code>> Os01g01010.1 Chr1:2602-3102:+</code>.<br />
<br />
If you experience problems using '''PileupConvertToPromoter''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''PileupConvertToPromoter'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar pile2prom<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Normalized pileup output file (Normalized pileup output file., type = tsv.gz,tsv)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Promoter fasta file (Promoter fastA file, type = fa,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar pile2prom n=&lt;Normalized_pileup_output_file&gt; p=&lt;Promoter_fasta_file&gt;<br />
<br />
<br />
=== NarrowPeakConvertToPromoter ===<br />
<br />
'''NarrowPeakConvertToPromoter''' converts a narrowPeak file containing peaks of chromatin accessibility to promoter coordinates.<br />
<br />
The input of '''NarrowPeakConvertToPromoter''' is <br />
1. a narrowPeak file and <br />
2. the promoter sequences in FastA format with headers like:<br />
<code>> id chromosomeName:start-end:strand</code><br />
e.g.<br />
<code>> Os01g01010.1 Chr1:2602-3102:+</code>.<br />
<br />
If you experience problems using '''NarrowPeakConvertToPromoter''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''NarrowPeakConvertToPromoter'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar peak2Prom<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>NarrowPeak file (Peak-calling output in narrowPeak format., type = narrowPeak,narrowPeak.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Promoter fasta file (Promoter fastA file, type = fa,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar peak2Prom n=&lt;NarrowPeak_file&gt; p=&lt;Promoter_fasta_file&gt;<br />
<br />
<br />
=== EpiTALE prediction ===<br />
<br />
'''EpiTALE''' predicts TALE target boxes from the RVD sequence of a TALE, using a novel model learned from quantitative data. Optionally, it considers the methylation state of the target box during prediction, as DNA methylation affects the binding specificity of RVDs. <br />
Additionally, '''EpiTALE''' optionally annotates its predicted target sites with chromatin accessibility, using the output of the '''NormalizePileupOutput''' tool and the results of peak calling on DNase-seq or ATAC-seq data.<br />
<br />
As input, '''EpiTALE''' requires<br />
<br />
1. a set of sequences that are scanned for putative TALE target boxes. These sequences could be promoters of genes but also complete genomic sequences (FastA format). <br />
<br />
2. For computing p-values, EpiTALE additionally needs a background set of sequences, which is by default generated as a sub-sample of the original input data.<br />
<br />
3. The prediction threshold may be defined either by means of a p-value or by an approximate number of expected sites. The latter is converted to a p-value internally, so the defined number of expected sites is, in general, not met exactly.<br />
<br />
4. TALEs are specified by a FastA file containing their RVD sequences, where individual RVDs are separated by dashes (-). This is the same format also output by the ''TALE Analysis'' tool of [http://www.jstacs.de/index.php/AnnoTALE AnnoTALE].<br />
<br />
5. It can be specified if both strands or only one of the strands are scanned where, in the former case, a penalty may be assigned to predictions on the reverse strand. While this penalty may be reasonable when scanning promoters, it should usually be set to <code>0</code> in case of genome-wide predictions.<br />
<br />
6. Optionally, '''EpiTALE''' considers methylation during prediction if Bismark output is provided. Running the [https://www.bioinformatics.babraham.ac.uk/projects/bismark/ Bismark methylation extractor] with parameters <code>--bedGraph --CX -p</code> generates a coverage file, which contains the tab-separated columns: <br />
<code>chromosome, start_position, end_position, methylation_percentage, count_methylated, count_unmethylated</code> (file.cov.gz). <br />
You can alternatively use the tool '''Bed2Bismark''', which converts data in BedMethyl format to Bismark format. <br />
<br />
7.<br />
(i) The chromatin accessibility of the input sequences can optionally be provided in narrowPeak format. Such a file can be obtained by mapping ATAC-seq or DNase-seq data to the corresponding genome and then performing peak calling, e.g. with [https://github.com/mahmoudibrahim/JAMM JAMM]. In case of promoter sequences as input, you should run the tool '''NarrowPeakConvertToPromoter''' to convert the narrowPeak file to promoter positions. <br />
(ii) Additionally, you can calculate a coverage pileup of 5' ends of mapped reads with '''Chromatin pileup''' and normalize it with '''NormalizePileupOutput'''. In case of promoter sequences as input, you should run the tool '''PileupConvertToPromoter''' to convert to promoter coordinates. <br />
<br />
8.<br />
(i) In case of '''genomic search''', the parameter ''calculate coverage area'' should be set to <code>surround target site</code>. The number of positions before the target site can be set with <code>coverage before value</code> (default: 300) and the number of positions after the target site with <code>coverage after value</code> (default: 200). <br />
(ii) In case of '''promoter search''', the parameter ''calculate coverage area'' may be set to <code>on complete sequence</code> or <code>surround target site</code>. The number of positions before and after the binding site in the peak profile can be set with <code>Peak before value</code> (default: 300) and <code>Peak after value</code> (default: 50).<br />
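<br />
The TALE input format from item 4 (FastA with RVDs separated by dashes) can be illustrated with a short Python sketch. The reader function and the TALE name <code>TalExample</code> with its RVD sequence are made up for this example and do not refer to a real TALE or to EpiTALE's own parser.<br />

```python
def read_tale_fasta(text):
    """Parse FastA text into a dict mapping TALE name -> list of RVDs.

    RVDs are expected to be separated by dashes, e.g. "NI-HD-NG".
    """
    tales = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: start a new TALE entry
            name = line[1:].strip()
            tales[name] = []
        elif name is not None:
            # Sequence line: split into individual RVDs
            tales[name].extend(rvd for rvd in line.split("-") if rvd)
    return tales


# Hypothetical TALE with five repeats
example = ">TalExample\nNI-HD-NG-NN-HD\n"
tales = read_tale_fasta(example)
print(tales)
```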
<br />
In case of '''genomic search''', you can filter predictions of TALE target boxes by the presence of differentially expressed regions in a defined vicinity around a predicted target box, using the tool '''DerTALE''' of the [http://www.jstacs.de/index.php/AnnoTALE AnnoTALE suite].<br />
<br />
If you experience problems using '''EpiTALE''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
''EpiTALE prediction'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar epitale<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Sequences (The sequences (e.g., a genome) to scan for binding sites, type = fa,fas,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Background sample (The sequences for determining the prediction threshold. Either a sub-sample of the input sequences or a dedicated background data set., range={sub-sample, background sequences}, default = sub-sample)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;sub-sample&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;background sequences&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bs</font></td><br />
<td>Background sequences (The sequences (e.g., a genome) for determining the prediction threshold, type = fa,fas,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>Threshold specification (The way of defining the prediction threshold. Either by explicitly defining a significance level or by specifying the number of expected sites, range={significance level, number of sites}, default = significance level)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;significance level&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sl</font></td><br />
<td>Significance level (The significance level for determining the prediction threshold, valid range = [0.0, 0.01], default = 1.0E-4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;number of sites&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Number of sites (The number of expected binding sites for determining the prediction threshold, valid range = [1, 1000000], default = 10000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">TALEs</font></td><br />
<td>TALEs (The RVD sequences of the TALE, separated by dashes, in FastA format, type = fasta,fas,fa)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Strand</font></td><br />
<td>Strand (Prediction target sites on both strands, or the forward or reverse strand, range={both strands, forward strand, reverse strand}, default = both strands)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;both strands&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>Reverse penalty (Penalty for predictions on the reverse strand, valid range = [0.0, 1.7976931348623157E308], default = 0.01)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;forward strand&quot;</b></td></tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;reverse strand&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bf</font></td><br />
<td>Bismark file (The bedGraph output of Bismark (file.cov.gz) containing &lt;chromosome&gt; &lt;start position&gt; &lt;end position&gt; &lt;methylation percentage&gt; &lt;count methylated&gt; &lt;count unmethylated&gt;, type = cov,cov.gz, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">nf</font></td><br />
<td>NarrowPeak file (The output of a peak caller (all.peaks.narrowPeak), type = narrowPeak,narrowPeak.gz, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">npo</font></td><br />
<td>Normalized pileup output (The normalized output of pileup with values larger than zero (file.txt) containing &lt;chromosome&gt; &lt;position&gt; &lt;coverage&gt;, type = tsv,tsv.gz, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>Calculate coverage area (Calculate coverage area surround target site, or on complete sequence, range={surround target site, on complete sequence}, default = surround target site, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;surround target site&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">cbv</font></td><br />
<td>Coverage before value (Number of positions before target site in coverage profile, valid range = [1, 500], default = 300, OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">cav</font></td><br />
<td>Coverage after value (Number of positions after target site in coverage profile, valid range = [1, 500], default = 200, OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;on complete sequence&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Peak before value (Number of positions before target site in peak profile, valid range = [1, 500], default = 300, OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pav</font></td><br />
<td>Peak after value (Number of positions after target site in peak profile, valid range = [1, 500], default = 50, OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar epitale s=&lt;Sequences&gt; TALEs=&lt;TALEs&gt;</div>Grauhttps://www.jstacs.de/index.php?title=EpiTALE&diff=1143EpiTALE2021-05-12T14:43:43Z<p>Grau: /* Example data */</p>
<hr />
<div>[[File:EpiTALE_256.png|130px|left]] EpiTALE predicts binding sites of transcription activator-like effectors (TALEs) in promoteromes or genomes. EpiTALE not only considers the DNA sequence of putative binding sites but also epigenetic determinants of TALE binding, namely DNA methylation and chromatin accessibility. The prediction is based on the same basic model as [[PrediTALE]] but with specific parameters for methylated cytosines reflecting the binding preferences of RVDs.<br />
<br />
Here, we provide a suite of tools including the EpiTALE program itself but also auxiliary tools for converting methylation data and chromatin accessibility data to the required formats, and for converting genomic coordinates to promoter-wise coordinates for promoterome-wide predictions.<br />
<br />
Genome-wide predictions of EpiTALE may further be combined with evidence from RNA-seq data using the DerTALE tool of [[AnnoTALE]].<br />
<br />
The EpiTALE suite is provided in a version with a graphical user interface and in a command line version to serve the needs of different user groups. Both versions use the identical code base.<br />
<br />
In the following, we describe how to obtain the EpiTALE suite and how to use its individual tools. While parameters are described in terms of command line arguments, the same parameters are available in the version with graphical user interface.<br />
<br />
== Download ==<br />
<br />
=== GUI version ===<br />
<br />
* [http://www.jstacs.de/downloads/EpiTALE-0.1.jar Runnable Jar]: requires Java >= 8 including JavaFX installed, may be run under Linux, Windows and macOS.<br />
* [http://www.jstacs.de/downloads/EpiTALE-0.1.app.zip macOS app]: ZIP archive containing a macOS app including EpiTALE and all required Java modules. For running this app, it might be required to explicitly give it running permissions in "System Preferences" -> "Security & Privacy" -> "General", which should list EpiTALE after the first (possibly unsuccessful) starting attempt.<br />
* [http://www.jstacs.de/downloads/EpiTALE-0.1-win.zip Windows program]: ZIP archive containing the EpiTALE Jar, all required Java modules, and a Windows batch file. For starting EpiTALE, double-click EpiTALE.bat.<br />
<br />
=== Command line version ===<br />
<br />
* [http://www.jstacs.de/downloads/EpiTALEcli-0.1.jar Runnable Jar]: requires Java >= 8, may be run under Linux, Windows and macOS. May be started with<br />
java -jar EpiTALEcli-0.1.jar<br />
from the command line (for tools and arguments, see below).<br />
<br />
=== Source code ===<br />
<br />
EpiTALE source code is available from [https://github.com/Jstacs/Jstacs/tree/master/projects/tals/epigenetic github].<br />
<br />
== Example data ==<br />
<br />
We provide an archive with example data at [https://doi.org/10.5281/zenodo.4749294 zenodo]. Besides the data, this archive contains the command line version of the EpiTALE suite v0.1 and a bash script demonstrating the complete EpiTALE pipeline.<br />
<br />
== Tools ==<br />
<br />
=== Bed2Bismark ===<br />
<br />
'''Bed2Bismark''' converts methylation information in bedMethyl format to Bismark format.<br />
<br />
The input of '''Bed2Bismark''' is a file in bedMethyl format.<br />
<br />
If you experience problems using '''Bed2Bismark''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Bed2Bismark'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar bed2bismark<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>BedMethyl file (Methylation information in bedMethyl format, type = bed.gz,bed)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar bed2bismark b=&lt;BedMethyl_file&gt;<br />
<br />
<br />
=== BismarkMerge2Files ===<br />
<br />
'''BismarkMerge2Files''' merges files generated by [https://www.bioinformatics.babraham.ac.uk/projects/bismark/ Bismark methylation extractor] with parameters <code>--bedGraph --CX -p</code>.<br />
The output is a coverage file with the tab-separated columns:<br />
<code>chromosome, start_position, end_position, methylation_percentage, count_methylated, count_unmethylated</code>.<br />
<br />
The input of '''BismarkMerge2Files''' consists of two Bismark coverage files.<br />
<br />
If you experience problems using '''BismarkMerge2Files''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
<br />
''BismarkMerge2Files'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar bismerger<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bismark file 1 (Methylation information in Bismark format file 1, type = cov.gz,cov)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bf2</font></td><br />
<td>Bismark file 2 (Methylation information in Bismark format file 2, type = cov.gz,cov)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar bismerger b=&lt;Bismark_file_1&gt; bf2=&lt;Bismark_file_2&gt;<br />
<br />
<br />
=== BismarkConvertToPromoter ===<br />
<br />
'''BismarkConvertToPromoter''' converts the Bismark output file to promoter coordinates.<br />
<br />
The input of '''BismarkConvertToPromoter''' is <br />
1. a Bismark coverage output file, which contains tab-separated columns: <br />
<code>chromosome, start_position, end_position, methylation_percentage, count_methylated, count_unmethylated</code> and <br />
2. the promoter sequences in FastA format with headers like:<br />
<code>> id chromosomeName:start-end:strand</code><br />
e.g.<br />
<code>> Os01g01010.1 Chr1:2602-3102:+</code>.<br />
<br />
If you experience problems using '''BismarkConvertToPromoter''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''BismarkConvertToPromoter'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar bis2prom<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bismark file (Methylation information in Bismark format, type = cov.gz,cov)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Promoter fasta file (Promoter fastA file, type = fa,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar bis2prom b=&lt;Bismark_file&gt; p=&lt;Promoter_fasta_file&gt;<br />
<br />
<br />
=== Chromatin pileup ===<br />
<br />
'''Chromatin pileup''' takes as input a BAM file of mapped reads from a DNase-seq or ATAC-seq experiment <br />
and computes a coverage pileup of 5' ends of mapped reads, <br />
and outputs a simple tab-separated file with columns: <br />
<code>chromosome, position,</code> and <code>pileup value</code> (number of reads with a 5' end at this position).<br />
<br />
If you experience problems using '''Chromatin pileup''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Chromatin pileup'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar pileup<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>BAM file (Mapped reads from DNase-seq or ATAC-seq experiment, type = bam)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar pileup b=&lt;BAM_file&gt;<br />
<br />
<br />
=== NormalizePileupOutput ===<br />
<br />
'''NormalizePileupOutput''' normalizes the pileup output file, which contains the coverage by 5' ends of ATAC-seq or DNase-seq reads at each position. The coverage is normalized relative to the mean of a 10000 bp sliding window.<br />
<br />
The input of '''NormalizePileupOutput''' is a pileup output file from the '''Chromatin pileup''' tool.<br />
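<br />
A sliding-window normalization of this kind might look like the following Python sketch. How '''NormalizePileupOutput''' places the window and treats uncovered positions is not specified here, so the sketch assumes a window centered on each position, counts uncovered positions as zero, and uses hypothetical function and variable names.<br />

```python
from bisect import bisect_left, bisect_right
from itertools import accumulate

def normalize_pileup(positions, values, window=10000):
    """Divide each pileup value by the mean coverage of a surrounding window.

    `positions` must be sorted ascending and belong to one chromosome;
    positions missing from the list are assumed to have coverage 0.
    """
    prefix = [0] + list(accumulate(values))  # prefix sums for fast window sums
    half = window // 2
    normalized = []
    for pos, val in zip(positions, values):
        # Window [pos - half, pos + half - 1] spans `window` positions.
        lo = bisect_left(positions, pos - half)
        hi = bisect_right(positions, pos + half - 1)
        window_sum = prefix[hi] - prefix[lo]
        mean = window_sum / window
        normalized.append(val / mean if mean > 0 else 0.0)
    return normalized

# Hypothetical toy data with a small window for readability
positions = [10, 12, 30]
values = [4, 2, 6]
norm = normalize_pileup(positions, values, window=10)
```

Prefix sums keep the cost per position logarithmic in the number of covered positions, which matters for genome-scale pileup files.<br />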
<br />
If you experience problems using '''NormalizePileupOutput''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''NormalizePileupOutput'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar normpileup<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Pileup output file (Pileup output file., type = tsv.gz,tsv,txt)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar normpileup p=&lt;Pileup_output_file&gt;<br />
<br />
<br />
=== PileupConvertToPromoter ===<br />
<br />
'''PileupConvertToPromoter''' converts the pileup output file to promoter coordinates.<br />
<br />
The input of '''PileupConvertToPromoter''' is <br />
1. a normalized pileup output file from the '''NormalizePileupOutput''' tool and <br />
2. the promoter sequences in FastA format with headers like:<br />
<code>> id chromosomeName:start-end:strand</code><br />
e.g.<br />
<code>> Os01g01010.1 Chr1:2602-3102:+</code>.<br />
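<br />
As an illustration of the header format above, the following Python sketch parses a promoter header and maps a genomic position to a promoter-relative position. It is not the implementation of '''PileupConvertToPromoter''': the inclusive interval, the 0-based offsets, and the handling of the minus strand (offsets counted from the promoter end) are assumptions, and all names are hypothetical.<br />

```python
import re
from typing import NamedTuple, Optional

class Promoter(NamedTuple):
    gene_id: str
    chrom: str
    start: int
    end: int
    strand: str

# Matches headers like "> id chromosomeName:start-end:strand"
HEADER = re.compile(r"^>\s*(\S+)\s+(\S+):(\d+)-(\d+):([+-])\s*$")

def parse_header(line: str) -> Promoter:
    m = HEADER.match(line)
    if m is None:
        raise ValueError(f"unexpected promoter header: {line!r}")
    gene_id, chrom, start, end, strand = m.groups()
    return Promoter(gene_id, chrom, int(start), int(end), strand)

def to_promoter_position(prom: Promoter, chrom: str, pos: int) -> Optional[int]:
    """Return the 0-based offset of a genomic position within the promoter,
    or None if the position lies outside of it (interval assumed inclusive)."""
    if chrom != prom.chrom or not (prom.start <= pos <= prom.end):
        return None
    if prom.strand == "+":
        return pos - prom.start
    # Assumption: minus-strand promoters are read from their genomic end.
    return prom.end - pos

prom = parse_header("> Os01g01010.1 Chr1:2602-3102:+")
offset = to_promoter_position(prom, "Chr1", 2702)
outside = to_promoter_position(prom, "Chr1", 5000)
```

Positions outside every promoter yield <code>None</code> and would simply be dropped from a promoter-wise output.<br />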
<br />
If you experience problems using '''PileupConvertToPromoter''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''PileupConvertToPromoter'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar pile2prom<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Normalized pileup output file (Normalized pileup output file., type = tsv.gz,tsv)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Promoter fasta file (Promoter fastA file, type = fa,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar pile2prom n=&lt;Normalized_pileup_output_file&gt; p=&lt;Promoter_fasta_file&gt;<br />
<br />
<br />
=== NarrowPeakConvertToPromoter ===<br />
<br />
'''NarrowPeakConvertToPromoter''' converts a narrowPeak file containing peaks of chromatin accessibility to promoter coordinates.<br />
<br />
The input of '''NarrowPeakConvertToPromoter''' is <br />
1. a narrowPeak file and <br />
2. the promoter sequences in FastA format with headers like:<br />
<code>> id chromosomeName:start-end:strand</code><br />
e.g.<br />
<code>> Os01g01010.1 Chr1:2602-3102:+</code>.<br />
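<br />
Converting peaks to promoter coordinates amounts to intersecting each peak interval with a promoter and shifting the result. The Python sketch below shows this idea under stated assumptions: narrowPeak intervals are 0-based and half-open (as in BED), promoter coordinates from the FastA header are taken to be 1-based and inclusive, strand is ignored, and the function name is hypothetical. The actual tool may differ in these details.<br />

```python
def clip_peak_to_promoter(peak_line, prom_chrom, prom_start, prom_end):
    """Clip one narrowPeak line to a promoter and return the overlap as a
    0-based half-open interval relative to the promoter start, or None."""
    fields = peak_line.rstrip("\n").split("\t")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    if chrom != prom_chrom:
        return None
    # Assumption: promoter header coordinates are 1-based inclusive.
    p_start, p_end = prom_start - 1, prom_end
    lo, hi = max(start, p_start), min(end, p_end)
    if lo >= hi:
        return None  # no overlap between peak and promoter
    return (lo - p_start, hi - p_start)

# Hypothetical narrowPeak lines (10 tab-separated columns)
peak_inside = "Chr1\t2500\t2700\tpeak1\t900\t.\t12.5\t8.1\t6.3\t120"
peak_outside = "Chr1\t4000\t4100\tpeak2\t500\t.\t7.0\t4.2\t3.1\t40"

overlap = clip_peak_to_promoter(peak_inside, "Chr1", 2602, 3102)
no_overlap = clip_peak_to_promoter(peak_outside, "Chr1", 2602, 3102)
```

Here the first peak overlaps the first 99 positions of the example promoter <code>Chr1:2602-3102</code>, while the second peak does not overlap it at all.<br />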
<br />
If you experience problems using '''NarrowPeakConvertToPromoter''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''NarrowPeakConvertToPromoter'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar peak2Prom<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>NarrowPeak file (Peak-calling output in narrowPeak format., type = narrowPeak,narrowPeak.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Promoter fasta file (Promoter fastA file, type = fa,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar peak2Prom n=&lt;NarrowPeak_file&gt; p=&lt;Promoter_fasta_file&gt;<br />
<br />
<br />
=== EpiTALE prediction ===<br />
<br />
'''EpiTALE''' predicts TALE target boxes using a novel model learned from quantitative data based on the RVD sequence of a TALE and optionally considers the methylation state of the target box during prediction, as DNA methylation affects the binding specificity of RVDs. <br />
Additionally, '''EpiTALE''' can annotate its predictions with the chromatin accessibility of predicted target sites, using the output of the '''NormalizePileupOutput''' tool and the results of peak calling on DNase-seq or ATAC-seq data.<br />
<br />
As input, '''EpiTALE''' requires<br />
<br />
1. a set of sequences that are scanned for putative TALE target boxes. These sequences could be promoters of genes but also complete genomic sequences (FastA format). <br />
<br />
2. For computing p-values, EpiTALE additionally needs a background set of sequences, which is by default generated as a sub-sample of the original input data.<br />
<br />
3. The prediction threshold may be defined either by means of a p-value or an approximate number of expected sites. The latter will also be converted to a p-value internally, so the defined number of expected sites is, in general, not met exactly.<br />
<br />
4. TALEs are specified by a FastA file containing their RVD sequences, where individual RVDs are separated by dashes (-). This is the same format also output by the ''TALE Analysis'' tool of [http://www.jstacs.de/index.php/AnnoTALE AnnoTALE].<br />
<br />
5. It can be specified if both strands or only one of the strands are scanned where, in the former case, a penalty may be assigned to predictions on the reverse strand. While this penalty may be reasonable when scanning promoters, it should usually be set to <code>0</code> in case of genome-wide predictions.<br />
<br />
6. Optionally, '''EpiTALE''' considers methylation during prediction if Bismark output is provided. Running the [https://www.bioinformatics.babraham.ac.uk/projects/bismark/ Bismark methylation extractor] with parameters <code>--bedGraph --CX -p</code> generates a coverage file, which contains the tab-separated columns: <br />
<code>chromosome, start_position, end_position, methylation_percentage, count_methylated, count_unmethylated</code> (file.cov.gz). <br />
You can alternatively use the tool '''Bed2Bismark''', which converts data in BedMethyl format to Bismark format. <br />
<br />
7.<br />
(i) The chromatin accessibility of the input sequences can optionally be provided in narrowPeak format. Such a file may be obtained by mapping ATAC-seq or DNase-seq data to the corresponding genome and then performing peak calling, e.g. with [https://github.com/mahmoudibrahim/JAMM JAMM]. In case of promoter sequences as input, you should run the tool '''NarrowPeakConvertToPromoter''' to convert the narrowPeak file to promoter positions. <br />
(ii) Additionally, you can calculate a coverage pileup of 5' ends of mapped reads with '''Chromatin pileup''' and normalize it with '''NormalizePileupOutput'''. In case of promoter sequences as input, you should run the tool '''PileupConvertToPromoter''' to convert to promoter coordinates. <br />
<br />
8.<br />
(i) In case of '''genomic search''', the parameter ''calculate coverage area'' should be <code>surround target site</code>; you can set the number of positions before the target site with <code>coverage before value</code> (default: 300) and the number of positions after the target site with <code>coverage after value</code> (default: 200). <br />
(ii) In case of '''promoter search''', the parameter ''calculate coverage area'' may be set to <code>on complete sequence</code> or <code>surround target site</code>. The number of positions before and after the binding site in the peak profile can be set with <code>Peak before value</code> (default: 300) and <code>Peak after value</code> (default: 50).<br />
<br />
In case of '''genomic search''', you can filter predictions of TALE target boxes by the presence of differentially expressed regions in a defined vicinity around a predicted target box, using the tool '''DerTALE''' of the [http://www.jstacs.de/index.php/AnnoTALE AnnoTALE suite].<br />
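<br />
The TALE input format from item 4 above (FastA with dash-separated RVDs) can be read with a few lines of Python. This is only a sketch for checking or preparing input files, not part of EpiTALE; the TALE names in the example are hypothetical.<br />

```python
def read_rvd_fasta(text):
    """Parse FastA text with dash-separated RVD sequences.

    Returns a dict mapping each TALE name to its list of RVDs.
    """
    tales = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            tales[name] = []
        elif name is not None:
            # RVDs may continue over several lines; skip empty fragments.
            tales[name].extend(rvd for rvd in line.split("-") if rvd)
    return tales

# Hypothetical example input
example = """>TalExample1
NI-HD-NG-NN-HD
>TalExample2
NN-NG-NI-N*
"""
tales = read_rvd_fasta(example)
```

Each entry then maps a TALE name to its ordered RVDs, e.g. <code>TalExample1</code> to five RVDs starting with <code>NI</code>.<br />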
<br />
If you experience problems using '''EpiTALE''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
''EpiTALE prediction'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar epitale<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Sequences (The sequences (e.g., a genome) to scan for binding sites, type = fa,fas,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Background sample (The sequences for determining the prediction threshold. Either a sub-sample of the input sequences or a dedicated background data set., range={sub-sample, background sequences}, default = sub-sample)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;sub-sample&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;background sequences&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bs</font></td><br />
<td>Background sequences (The sequences (e.g., a genome) for determining the prediction threshold, type = fa,fas,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>Threshold specification (The way of defining the prediction threshold. Either by explicitly defining a significance level or by specifying the number of expected sites, range={significance level, number of sites}, default = significance level)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;significance level&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sl</font></td><br />
<td>Significance level (The significance level for determining the prediction threshold, valid range = [0.0, 0.01], default = 1.0E-4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;number of sites&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Number of sites (The number of expected binding sites for determining the prediction threshold, valid range = [1, 1000000], default = 10000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">TALEs</font></td><br />
<td>TALEs (The RVD sequences of the TALE, separated by dashes, in FastA format, type = fasta,fas,fa)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Strand</font></td><br />
<td>Strand (Predict target sites on both strands, or on the forward or reverse strand only, range={both strands, forward strand, reverse strand}, default = both strands)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;both strands&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>Reverse penalty (Penalty for predictions on the reverse strand, valid range = [0.0, 1.7976931348623157E308], default = 0.01)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;forward strand&quot;</b></td></tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;reverse strand&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bf</font></td><br />
<td>Bismark file (The bedGraph output of Bismark (file.cov.gz) containing &lt;chromosome&gt; &lt;start position&gt; &lt;end position&gt; &lt;methylation percentage&gt; &lt;count methylated&gt; &lt;count unmethylated&gt;, type = cov,cov.gz, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">nf</font></td><br />
<td>NarrowPeak file (The output of a peak caller (all.peaks.narrowPeak), type = narrowPeak,narrowPeak.gz, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">npo</font></td><br />
<td>Normalized pileup output (The normalized output of pileup with values larger than zero (file.txt) containing &lt;chromosome&gt; &lt;position&gt; &lt;coverage&gt;, type = tsv,tsv.gz, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>Calculate coverage area (Calculate coverage area surround target site, or on complete sequence, range={surround target site, on complete sequence}, default = surround target site, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;surround target site&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">cbv</font></td><br />
<td>Coverage before value (Number of positions before target site in coverage profile, valid range = [1, 500], default = 300, OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">cav</font></td><br />
<td>Coverage after value (Number of positions after target site in coverage profile, valid range = [1, 500], default = 200, OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;on complete sequence&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Peak before value (Number of positions before target site in peak profile, valid range = [1, 500], default = 300, OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pav</font></td><br />
<td>Peak after value (Number of positions after target site in peak profile, valid range = [1, 500], default = 50, OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar epitale s=&lt;Sequences&gt; TALEs=&lt;TALEs&gt;</div>Grauhttps://www.jstacs.de/index.php?title=EpiTALE&diff=1142EpiTALE2021-05-10T22:59:08Z<p>Grau: /* Download */</p>
<hr />
<div>[[File:EpiTALE_256.png|130px|left]] EpiTALE predicts binding sites of transcription activator-like effectors (TALEs) in promoteromes or genomes. EpiTALE not only considers the DNA sequence of putative binding sites but also epigenetic determinants of TALE binding, namely DNA methylation and chromatin accessibility. The prediction is based on the same basic model as [[PrediTALE]] but with specific parameters for methylated cytosines reflecting the binding preferences of RVDs.<br />
<br />
Here, we provide a suite of tools including the EpiTALE program itself but also auxiliary tools for converting methylation data and chromatin accessibility data to the required formats, and for converting genomic coordinates to promoter-wise coordinates for promoterome-wide predictions.<br />
<br />
Genome-wide predictions of EpiTALE may further be combined with evidence from RNA-seq data using the DerTALE tool of [[AnnoTALE]].<br />
<br />
The EpiTALE suite is provided in a version with a graphical user interface and in a command line version, which may serve the needs of specific user groups, both using the identical code base.<br />
<br />
In the following, we describe how to obtain the EpiTALE suite and how to use its individual tools. While parameters are described in terms of command line arguments, the same parameters are available in the version with graphical user interface.<br />
<br />
== Download ==<br />
<br />
=== GUI version ===<br />
<br />
* [http://www.jstacs.de/downloads/EpiTALE-0.1.jar Runnable Jar]: requires Java >= 8 with JavaFX installed; may be run under Linux, Windows and macOS.<br />
* [http://www.jstacs.de/downloads/EpiTALE-0.1.app.zip macOS app]: ZIP archive containing a macOS app including EpiTALE and all required Java modules. For running this app, it might be required to explicitly give it running permissions in "System Preferences" -> "Security & Privacy" -> "General", which should list EpiTALE after the first (possibly unsuccessful) starting attempt.<br />
* [http://www.jstacs.de/downloads/EpiTALE-0.1-win.zip Windows program]: ZIP archive containing the EpiTALE Jar, all required Java modules, and a Windows batch file. For starting EpiTALE, double-click EpiTALE.bat.<br />
<br />
=== Command line version ===<br />
<br />
* [http://www.jstacs.de/downloads/EpiTALEcli-0.1.jar Runnable Jar]: requires Java >= 8, may be run under Linux, Windows and macOS. May be started with<br />
java -jar EpiTALEcli-0.1.jar<br />
from the command line (for tools and arguments, see below).<br />
<br />
=== Source code ===<br />
<br />
EpiTALE source code is available from [https://github.com/Jstacs/Jstacs/tree/master/projects/tals/epigenetic GitHub].<br />
<br />
== Example data ==<br />
<br />
We provide an archive with example data at [https://zenodo.org zenodo]. Besides the data, this archive contains the command line version of the EpiTALE suite v0.1 and a bash script demonstrating the complete EpiTALE pipeline.<br />
<br />
== Tools ==<br />
<br />
=== Bed2Bismark ===<br />
<br />
'''Bed2Bismark''' converts methylation information in bedMethyl format to Bismark format.<br />
<br />
The input of '''Bed2Bismark''' is a file in bedMethyl format.<br />
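<br />
A conversion of this kind could be sketched in Python as follows. The sketch is not the code of '''Bed2Bismark''': it assumes the ENCODE bedMethyl layout (column 10 holding the read coverage and column 11 the methylation percentage), assumes that the 0-based bedMethyl start is converted to a 1-based Bismark position, and uses a hypothetical function name.<br />

```python
def bedmethyl_to_bismark(line):
    """Convert one bedMethyl line to a Bismark coverage line with the columns
    chromosome, start, end, percentage, count methylated, count unmethylated."""
    f = line.rstrip("\n").split("\t")
    chrom, start = f[0], int(f[1])
    coverage, percent = int(f[9]), float(f[10])
    # Recover read counts from coverage and percentage (rounded to integers).
    methylated = round(coverage * percent / 100.0)
    unmethylated = coverage - methylated
    pos = start + 1  # assumption: 0-based BED start -> 1-based position
    return f"{chrom}\t{pos}\t{pos}\t{percent:g}\t{methylated}\t{unmethylated}"

# Hypothetical bedMethyl line: 20 reads, 75 % methylated
bed_line = "Chr1\t999\t1000\t.\t20\t+\t999\t1000\t0,255,0\t20\t75"
cov_line = bedmethyl_to_bismark(bed_line)
```

Because bedMethyl only stores coverage and percentage, the methylated count is reconstructed by rounding, which may differ slightly from the original counts when the stored percentage was itself rounded.<br />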
<br />
If you experience problems using '''Bed2Bismark''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Bed2Bismark'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar bed2bismark<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>BedMethyl file (Methylation information in bedMethyl format, type = bed.gz,bed)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar bed2bismark b=&lt;BedMethyl_file&gt;<br />
<br />
<br />
=== BismarkMerge2Files ===<br />
<br />
'''BismarkMerge2Files''' merges two files generated by the [https://www.bioinformatics.babraham.ac.uk/projects/bismark/ Bismark methylation extractor] with parameters <code>--bedGraph --CX -p</code>.<br />
The output is a coverage file with the tab-separated columns:<br />
<code>chromosome, start_position, end_position, methylation_percentage, count_methylated, count_unmethylated</code>.<br />
<br />
The inputs of '''BismarkMerge2Files''' are two Bismark coverage files.<br />
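<br />
One plausible way of merging two coverage files is to sum the methylated and unmethylated counts per position and to recompute the percentage. The Python sketch below shows this under the explicit assumption that merging means summing counts; the merge rule actually used by '''BismarkMerge2Files''' is not described here, and the function name is hypothetical.<br />

```python
def merge_bismark(lines1, lines2):
    """Merge two lists of Bismark coverage lines by summing counts per position."""
    counts = {}
    for line in list(lines1) + list(lines2):
        chrom, start, end, _pct, meth, unmeth = line.rstrip("\n").split("\t")
        key = (chrom, int(start), int(end))
        m, u = counts.get(key, (0, 0))
        counts[key] = (m + int(meth), u + int(unmeth))
    merged = []
    for (chrom, start, end), (m, u) in sorted(counts.items()):
        total = m + u
        # Recompute the percentage from the summed counts.
        pct = 100.0 * m / total if total else 0.0
        merged.append(f"{chrom}\t{start}\t{end}\t{pct:.4g}\t{m}\t{u}")
    return merged

# Hypothetical toy coverage lines
file1 = ["Chr1\t100\t100\t50\t2\t2", "Chr1\t200\t200\t100\t3\t0"]
file2 = ["Chr1\t100\t100\t100\t4\t0"]
merged = merge_bismark(file1, file2)
```

Recomputing the percentage from summed counts weights each file by its read depth, in contrast to averaging the two percentages directly.<br />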
<br />
If you experience problems using '''BismarkMerge2Files''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
<br />
''BismarkMerge2Files'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar bismerger<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bismark file 1 (Methylation information in Bismark format file 1, type = cov.gz,cov)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bf2</font></td><br />
<td>Bismark file 2 (Methylation information in Bismark format file 2, type = cov.gz,cov)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar bismerger b=&lt;Bismark_file_1&gt; bf2=&lt;Bismark_file_2&gt;<br />
<br />
<br />
=== BismarkConvertToPromoter ===<br />
<br />
'''BismarkConvertToPromoter''' converts the Bismark output file to promoter coordinates.<br />
<br />
The input of '''BismarkConvertToPromoter''' is <br />
1. a Bismark coverage output file, which contains tab-separated columns: <br />
<code>chromosome, start_position, end_position, methylation_percentage, count_methylated, count_unmethylated</code> and <br />
2. the promoter sequences in FastA format with headers like:<br />
<code>> id chromosomeName:start-end:strand</code><br />
e.g.<br />
<code>> Os01g01010.1 Chr1:2602-3102:+</code>.<br />
<br />
If you experience problems using '''BismarkConvertToPromoter''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''BismarkConvertToPromoter'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar bis2prom<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bismark file (Methylation information in Bismark format, type = cov.gz,cov)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Promoter fasta file (Promoter fastA file, type = fa,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar bis2prom b=&lt;Bismark_file&gt; p=&lt;Promoter_fasta_file&gt;<br />
<br />
<br />
=== Chromatin pileup ===<br />
<br />
'''Chromatin pileup''' takes as input a BAM file of mapped reads from a DNase-seq or ATAC-seq experiment, <br />
computes a coverage pileup of the 5' ends of the mapped reads, <br />
and outputs a simple tab-separated file with the columns <br />
<code>chromosome</code>, <code>position</code>, and <code>pileup value</code> (number of reads with a 5' end at this position).<br />
<br />
If you experience problems using '''Chromatin pileup''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Chromatin pileup'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar pileup<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>BAM file (Mapped reads from DNase-seq or ATAC-seq experiment, type = bam)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar pileup b=&lt;BAM_file&gt;<br />
<br />
<br />
=== NormalizePileupOutput ===<br />
<br />
'''NormalizePileupOutput''' normalizes the pileup output file, which contains the coverage by 5' ends of ATAC-seq or DNase-seq reads at each position. The coverage is normalized relative to the mean of a 10000 bp sliding window.<br />
<br />
The input of '''NormalizePileupOutput''' is a pileup output file from the '''Chromatin pileup''' tool.<br />
<br />
If you experience problems using '''NormalizePileupOutput''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''NormalizePileupOutput'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar normpileup<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Pileup output file (Pileup output file., type = tsv.gz,tsv,txt)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar normpileup p=&lt;Pileup_output_file&gt;<br />
<br />
<br />
=== PileupConvertToPromoter ===<br />
<br />
'''PileupConvertToPromoter''' converts the pileup output file to promoter coordinates.<br />
<br />
The input of '''PileupConvertToPromoter''' is <br />
1. a normalized pileup output file from the '''NormalizePileupOutput''' tool and <br />
2. the promoter sequences in FastA format with headers like:<br />
<code>> id chromosomeName:start-end:strand</code><br />
e.g.<br />
<code>> Os01g01010.1 Chr1:2602-3102:+</code>.<br />
<br />
If you experience problems using '''PileupConvertToPromoter''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''PileupConvertToPromoter'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar pile2prom<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Normalized pileup output file (Normalized pileup output file., type = tsv.gz,tsv)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Promoter fasta file (Promoter fastA file, type = fa,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar pile2prom n=&lt;Normalized_pileup_output_file&gt; p=&lt;Promoter_fasta_file&gt;<br />
<br />
<br />
=== NarrowPeakConvertToPromoter ===<br />
<br />
'''NarrowPeakConvertToPromoter''' converts a narrowPeak file containing peaks of chromatin accessibility to promoter coordinates.<br />
<br />
The input of '''NarrowPeakConvertToPromoter''' is <br />
1. a narrowPeak file and <br />
2. the promoter sequences in FastA format with headers like:<br />
<code>> id chromosomeName:start-end:strand</code><br />
e.g.<br />
<code>> Os01g01010.1 Chr1:2602-3102:+</code>.<br />
<br />
If you experience problems using '''NarrowPeakConvertToPromoter''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''NarrowPeakConvertToPromoter'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar peak2Prom<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>NarrowPeak file (Peak-calling output in narrowPeak format., type = narrowPeak,narrowPeak.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Promoter fasta file (Promoter fastA file, type = fa,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar peak2Prom n=&lt;NarrowPeak_file&gt; p=&lt;Promoter_fasta_file&gt;<br />
<br />
<br />
=== EpiTALE prediction ===<br />
<br />
'''EpiTALE''' predicts TALE target boxes using a novel model learned from quantitative data based on the RVD sequence of a TALE and optionally considers the methylation state of the target box during prediction, as DNA methylation affects the binding specificity of RVDs. <br />
Additionally, '''EpiTALE''' can annotate its predictions with the chromatin accessibility of predicted target sites, using the output of the '''NormalizePileupOutput''' tool and the results of peak calling on DNase-seq or ATAC-seq data.<br />
<br />
As input, '''EpiTALE''' requires<br />
<br />
1. a set of sequences that are scanned for putative TALE target boxes. These sequences could be promoters of genes but also complete genomic sequences (FastA format). <br />
<br />
2. For computing p-values, EpiTALE additionally needs a background set of sequences, which is by default generated as a sub-sample of the original input data.<br />
<br />
3. The prediction threshold may be defined either by means of a p-value or by an approximate number of expected sites. The latter is also converted to a p-value internally, so the specified number of expected sites is, in general, not met exactly.<br />
<br />
4. TALEs are specified by a FastA file containing their RVD sequences, where individual RVDs are separated by dashes (-). This is the same format also output by the ''TALE Analysis'' tool of [http://www.jstacs.de/index.php/AnnoTALE AnnoTALE].<br />
<br />
5. It can be specified if both strands or only one of the strands are scanned where, in the former case, a penalty may be assigned to predictions on the reverse strand. While this penalty may be reasonable when scanning promoters, it should usually be set to <code>0</code> in case of genome-wide predictions.<br />
<br />
6. Optionally, '''EpiTALE''' considers methylation during prediction if Bismark output is provided. Running the [https://www.bioinformatics.babraham.ac.uk/projects/bismark/ Bismark methylation extractor] with parameters <code>--bedGraph --CX -p</code> generates a coverage file, which contains the tab-separated columns: <br />
<code>chromosome, start_position, end_position, methylation_percentage, count_methylated, count_unmethylated</code> (file.cov.gz). <br />
You can alternatively use the tool '''Bed2Bismark''', which converts data in BedMethyl format to Bismark format. <br />
<br />
7.<br />
(i) The chromatin accessibility of the input sequences can optionally be provided in narrowPeak format. Such a file can be obtained by mapping ATAC-seq or DNase-seq data to the corresponding genome and then performing peak calling, e.g., with [https://github.com/mahmoudibrahim/JAMM JAMM]. If promoter sequences are used as input, you should run the tool '''NarrowPeakConvertToPromoter''' to convert the narrowPeak file to promoter coordinates. <br />
(ii) Additionally, you can calculate a coverage pileup of 5' ends of mapped reads with '''Chromatin pileup''' and normalize it with '''NormalizePileupOutput'''. In case of promoter sequences as input, you should run the tool '''PileupConvertToPromoter''' to convert to promoter coordinates. <br />
<br />
8.<br />
(i) In case of '''genomic search''', the parameter ''calculate coverage area'' should be set to <code>surround target site</code>. You can set the number of positions before the target site with <code>coverage before value</code> (default: 300) and the number of positions after the target site with <code>coverage after value</code> (default: 200). <br />
(ii) In case of '''promoter search''', the parameter ''calculate coverage area'' may be set to <code>on complete sequence</code> or <code>surround target site</code>. The number of positions before and after the binding site in the peak profile can be set with <code>Peak before value</code> (default: 300) and <code>Peak after value</code> (default: 50).<br />
<br />
In case of '''genomic search''', you can filter predictions of TALE target boxes by the presence of differentially expressed regions in a defined vicinity around a predicted target box with the tool '''DerTALE''' of the [http://www.jstacs.de/index.php/AnnoTALE AnnoTALE suite].<br />
<br />
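The TALE input format described in item 4 (FastA records whose sequence lines list RVDs separated by dashes) can be illustrated with a small parser. This is a minimal sketch for checking input files, not code taken from EpiTALE; the function name <code>read_rvd_fasta</code> and the TALE name <code>TalExample</code> are hypothetical.<br />

```python
def read_rvd_fasta(text):
    """Parse FastA-formatted TALE records whose sequences are dash-separated RVDs.

    Returns a dict mapping each record name to its list of RVDs.
    Assumption: a record may span several lines, joined by dashes.
    """
    tales = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: everything after '>' is used as the TALE name.
            name = line[1:].strip()
            tales[name] = []
        elif name is not None:
            # Drop empty tokens caused by leading or trailing dashes.
            tales[name].extend(tok for tok in line.split("-") if tok)
    return tales


# Hypothetical example record (the name and RVD order are made up).
example = """>TalExample
NI-HD-NG-NN-HD
-NG-N*
"""
tales = read_rvd_fasta(example)
```

Each value in <code>tales</code> is the ordered list of RVDs for one TALE, which makes it easy to check the number of repeats before running a prediction.<br />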
If you experience problems using '''EpiTALE''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
''EpiTALE prediction'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar epitale<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Sequences (The sequences (e.g., a genome) to scan for binding sites, type = fa,fas,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Background sample (The sequences for determining the prediction threshold. Either a sub-sample of the input sequences or a dedicated background data set., range={sub-sample, background sequences}, default = sub-sample)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;sub-sample&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;background sequences&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bs</font></td><br />
<td>Background sequences (The sequences (e.g., a genome) for determining the prediction threshold, type = fa,fas,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>Threshold specification (The way of defining the prediction threshold. Either by explicitly defining a significance level or by specifying the number of expected sites, range={significance level, number of sites}, default = significance level)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;significance level&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sl</font></td><br />
<td>Significance level (The significance level for determining the prediction threshold, valid range = [0.0, 0.01], default = 1.0E-4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;number of sites&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Number of sites (The number of expected binding sites for determining the prediction threshold, valid range = [1, 1000000], default = 10000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">TALEs</font></td><br />
<td>TALEs (The RVD sequences of the TALE, separated by dashes, in FastA format, type = fasta,fas,fa)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Strand</font></td><br />
<td>Strand (Predict target sites on both strands, or only on the forward or reverse strand, range={both strands, forward strand, reverse strand}, default = both strands)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;both strands&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>Reverse penalty (Penalty for predictions on the reverse strand, valid range = [0.0, 1.7976931348623157E308], default = 0.01)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;forward strand&quot;</b></td></tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;reverse strand&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bf</font></td><br />
<td>Bismark file (The bedGraph output of bismark (file.cov.gz) containing &lt;chromosome&gt; &lt;start position&gt; &lt;end position&gt; &lt;methylation percentage&gt; &lt;count methylated&gt; &lt;count unmethylated&gt;, type = cov,cov.gz, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">nf</font></td><br />
<td>NarrowPeak file (The output of a peak caller (all.peaks.narrowPeak), type = narrowPeak,narrowPeak.gz, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">npo</font></td><br />
<td>Normalized pileup output (The normalized output of pileup with values larger than zero (file.txt) containing &lt;chromosome&gt; &lt;position&gt; &lt;coverage&gt;, type = tsv,tsv.gz, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>Calculate coverage area (Calculate coverage area surround target site, or on complete sequence, range={surround target site, on complete sequence}, default = surround target site, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;surround target site&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">cbv</font></td><br />
<td>Coverage before value (Number of positions before target site in coverage profile, valid range = [1, 500], default = 300, OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">cav</font></td><br />
<td>Coverage after value (Number of positions after target site in coverage profile, valid range = [1, 500], default = 200, OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;on complete sequence&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Peak before value (Number of positions before target site in peak profile, valid range = [1, 500], default = 300, OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pav</font></td><br />
<td>Peak after value (Number of positions after target site in peak profile, valid range = [1, 500], default = 50, OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar epitale s=&lt;Sequences&gt; TALEs=&lt;TALEs&gt;</div>Grauhttps://www.jstacs.de/index.php?title=EpiTALE&diff=1141EpiTALE2021-05-10T22:50:04Z<p>Grau: /* Example data */</p>
<hr />
<div>[[File:EpiTALE_256.png|130px|left]] EpiTALE predicts binding sites of transcription activator-like effectors (TALEs) in promoteromes or genomes. EpiTALE not only considers the DNA sequence of putative binding sites but also epigenetic determinants of TALE binding, namely DNA methylation and chromatin accessibility. The prediction is based on the same basic model as [[PrediTALE]] but with specific parameters for methylated cytosines reflecting the binding preferences of RVDs.<br />
<br />
Here, we provide a suite of tools including the EpiTALE program itself but also auxiliary tools for converting methylation data and chromatin accessibility data to the required formats, and for converting genomic coordinates to promoter-wise coordinates for promoterome-wide predictions.<br />
<br />
Genome-wide predictions of EpiTALE may further be combined with evidence from RNA-seq data using the DerTALE tool of [[AnnoTALE]].<br />
<br />
The EpiTALE suite is provided in a version with a graphical user interface and in a command line version, which may serve the needs of specific user groups, both using the identical code base.<br />
<br />
In the following, we describe how to obtain the EpiTALE suite and how to use its individual tools. While parameters are described in terms of command line arguments, the same parameters are available in the version with graphical user interface.<br />
<br />
== Download ==<br />
<br />
=== GUI version ===<br />
<br />
* [http://www.jstacs.de/downloads/EpiTALE-0.1.jar Runnable Jar]: requires Java >= 8 including JavaFX installed, may be run under Linux, Windows and macOS.<br />
* [http://www.jstacs.de/downloads/EpiTALE-0.1.app.zip macOS app]: ZIP archive containing a macOS app including EpiTALE and all required Java modules. For running this app, it might be required to explicitly give it running permissions in "System Preferences" -> "Security & Privacy" -> "General", which should list EpiTALE after the first (possibly unsuccessful) starting attempt.<br />
* [http://www.jstacs.de/downloads/EpiTALE-0.1-win.zip Windows program]: ZIP archive containing the EpiTALE Jar, all required Java modules, and a Windows batch file. For starting EpiTALE, double-click EpiTALE.bat.<br />
<br />
=== Command line version ===<br />
<br />
* [http://www.jstacs.de/downloads/EpiTALEcli-0.1.jar Runnable Jar]: requires Java >= 8, may be run under Linux, Windows and macOS. May be started with<br />
java -jar EpiTALEcli-0.1.jar<br />
from the command line (for tools and arguments, see below).<br />
<br />
== Example data ==<br />
<br />
We provide an archive with example data at [https://zenodo.org Zenodo]. Besides the data, this archive contains the command line version of the EpiTALE suite v0.1 and a bash script demonstrating the complete EpiTALE pipeline.<br />
<br />
== Tools ==<br />
<br />
=== Bed2Bismark ===<br />
<br />
'''Bed2Bismark''' converts methylation information in bedMethyl format to Bismark format.<br />
<br />
The input of '''Bed2Bismark''' is a file in bedMethyl format.<br />
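To make the relation between the two formats concrete, the following sketch converts a single bedMethyl line into a Bismark-style coverage line. It assumes the ENCODE-style bedMethyl layout (read coverage in column 10, methylation percentage in column 11) and assumes that Bismark coverage positions are 1-based; the function <code>bedmethyl_to_bismark</code> is hypothetical and need not match what '''Bed2Bismark''' does internally.<br />

```python
def bedmethyl_to_bismark(line):
    """Convert one bedMethyl line to one Bismark coverage line (sketch).

    Assumptions (not confirmed by the EpiTALE documentation):
    - column 10 holds the read coverage, column 11 the methylation percentage
    - BED start coordinates are 0-based, coverage positions are 1-based
    """
    fields = line.rstrip("\n").split("\t")
    chrom, start = fields[0], int(fields[1])
    coverage, percent = int(fields[9]), float(fields[10])
    # Reconstruct the read counts from coverage and percentage.
    methylated = round(coverage * percent / 100.0)
    unmethylated = coverage - methylated
    pos = start + 1  # shift from 0-based BED start to a 1-based position
    columns = [chrom, pos, pos, percent, methylated, unmethylated]
    return "\t".join(str(c) for c in columns)


# Made-up input line: 20 reads cover the cytosine, 75 % of them methylated.
bed_line = "Chr1\t99\t100\t.\t20\t+\t99\t100\t0,0,0\t20\t75"
cov_line = bedmethyl_to_bismark(bed_line)
```

Because the counts are rebuilt from a rounded percentage, they may differ slightly from the counts of the original experiment.<br />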
<br />
If you experience problems using '''Bed2Bismark''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Bed2Bismark'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar bed2bismark<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>BedMethyl file (Methylation information in bedMethyl format, type = bed.gz,bed)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar bed2bismark b=&lt;BedMethyl_file&gt;<br />
<br />
<br />
=== BismarkMerge2Files ===<br />
<br />
'''BismarkMerge2Files''' merges two files generated by the [https://www.bioinformatics.babraham.ac.uk/projects/bismark/ Bismark methylation extractor] with parameters <code>--bedGraph --CX -p</code>.<br />
The output is a coverage file containing the tab-separated columns:<br />
<code>chromosome, start_position, end_position, methylation_percentage, count_methylated, count_unmethylated</code>.<br />
<br />
The input of '''BismarkMerge2Files''' consists of two Bismark coverage files.<br />
<br />
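One plausible way to merge two coverage files is to add up the methylated and unmethylated read counts per position and to recompute the methylation percentage from the sums. The sketch below illustrates this idea under that assumption; the merge rule and the function <code>merge_coverage</code> are hypothetical and are not taken from the '''BismarkMerge2Files''' source code.<br />

```python
from collections import defaultdict


def merge_coverage(lines_a, lines_b):
    """Merge two Bismark coverage files given as iterables of lines (sketch).

    Assumed merge rule: counts at identical positions are summed and the
    methylation percentage is recomputed from the summed counts.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [methylated, unmethylated]
    for line in list(lines_a) + list(lines_b):
        chrom, start, end, _pct, meth, unmeth = line.rstrip("\n").split("\t")
        key = (chrom, int(start), int(end))
        counts[key][0] += int(meth)
        counts[key][1] += int(unmeth)
    merged = []
    for (chrom, start, end), (meth, unmeth) in sorted(counts.items()):
        total = meth + unmeth
        pct = 100.0 * meth / total if total else 0.0
        merged.append((chrom, start, end, pct, meth, unmeth))
    return merged


# Made-up example records.
file_1 = ["Chr1\t10\t10\t50.0\t1\t1"]
file_2 = ["Chr1\t10\t10\t100.0\t2\t0", "Chr1\t20\t20\t0.0\t0\t4"]
merged = merge_coverage(file_1, file_2)
```

Summing counts instead of averaging percentages gives positions with deeper coverage more weight in the merged value.<br />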
If you experience problems using '''BismarkMerge2Files''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
<br />
''BismarkMerge2Files'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar bismerger<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bismark file 1 (Methylation information in bismark format file 1, type = cov.gz,cov)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bf2</font></td><br />
<td>Bismark file 2 (Methylation information in bismark format file 2, type = cov.gz,cov)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar bismerger b=&lt;Bismark_file_1&gt; bf2=&lt;Bismark_file_2&gt;<br />
<br />
<br />
=== BismarkConvertToPromoter ===<br />
<br />
'''BismarkConvertToPromoter''' converts the Bismark output file to promoter coordinates.<br />
<br />
The input of '''BismarkConvertToPromoter''' is <br />
1. a Bismark coverage output file, which contains tab-separated columns: <br />
<code>chromosome, start_position, end_position, methylation_percentage, count_methylated, count_unmethylated</code> and <br />
2. the promoter sequences in FastA format with headers like:<br />
<code>> id chromosomeName:start-end:strand</code><br />
e.g.<br />
<code>> Os01g01010.1 Chr1:2602-3102:+</code>.<br />
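The header format above can be checked before running the conversion tools. The following sketch parses such a header with a regular expression; the pattern and the function <code>parse_promoter_header</code> are illustrative assumptions and may be stricter or more lenient than the parser used by the EpiTALE suite.<br />

```python
import re

# Pattern for headers of the form "> id chromosomeName:start-end:strand".
HEADER = re.compile(r"^>\s*(\S+)\s+(\S+):(\d+)-(\d+):([+-])\s*$")


def parse_promoter_header(header):
    """Split a promoter FastA header into id, chromosome, start, end, strand."""
    match = HEADER.match(header)
    if match is None:
        raise ValueError(f"unexpected promoter header: {header!r}")
    gene_id, chrom, start, end, strand = match.groups()
    return {
        "id": gene_id,
        "chrom": chrom,
        "start": int(start),
        "end": int(end),
        "strand": strand,
    }


# The example header from the text above.
promoter = parse_promoter_header("> Os01g01010.1 Chr1:2602-3102:+")
```

Headers that lack the coordinate part raise a <code>ValueError</code>, which helps to spot malformed promoter files early.<br />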
<br />
If you experience problems using '''BismarkConvertToPromoter''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''BismarkConvertToPromoter'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar bis2prom<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bismark file (Methylation information in bismark format, type = cov.gz,cov)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Promoter fasta file (Promoter fastA file, type = fa,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar bis2prom b=&lt;Bismark_file&gt; p=&lt;Promoter_fasta_file&gt;<br />
<br />
<br />
=== Chromatin pileup ===<br />
<br />
'''Chromatin pileup''' takes as input a BAM file of mapped reads from a DNase-seq or ATAC-seq experiment, <br />
computes a coverage pileup of the 5' ends of the mapped reads, <br />
and outputs a simple tab-separated file with the columns <br />
<code>chromosome</code>, <code>position</code>, and <code>pileup value</code> (number of reads with a 5' end at this position).<br />
<br />
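The notion of a 5' end pileup can be sketched without a BAM library by representing each mapped read as a tuple. The code below assumes 1-based inclusive read coordinates and assumes that the 5' end of a reverse-strand read is its rightmost mapped position; the function <code>five_prime_pileup</code> is hypothetical and only illustrates the counting step.<br />

```python
from collections import Counter


def five_prime_pileup(reads):
    """Count read 5' ends per position (sketch).

    reads: iterable of (chromosome, start, end, strand) tuples with
    1-based inclusive coordinates (an assumption of this sketch).
    Returns sorted (chromosome, position, pileup value) tuples.
    """
    pileup = Counter()
    for chrom, start, end, strand in reads:
        # Forward reads start at their leftmost base, reverse reads at
        # their rightmost base.
        five_prime = start if strand == "+" else end
        pileup[(chrom, five_prime)] += 1
    return sorted((chrom, pos, n) for (chrom, pos), n in pileup.items())


# Made-up reads: three of them share the 5' end at position 100.
reads = [
    ("Chr1", 100, 149, "+"),
    ("Chr1", 100, 130, "+"),
    ("Chr1", 60, 100, "-"),
    ("Chr1", 200, 249, "-"),
]
rows = five_prime_pileup(reads)
```

Only positions with at least one 5' end appear in the result, which keeps the output sparse.<br />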
If you experience problems using '''Chromatin pileup''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Chromatin pileup'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar pileup<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>BAM file (Mapped reads from DNase-seq or ATAC-seq experiment, type = bam)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar pileup b=&lt;BAM_file&gt;<br />
<br />
<br />
=== NormalizePileupOutput ===<br />
<br />
'''NormalizePileupOutput''' normalizes the pileup output file, which contains the coverage by 5' ends of ATAC-seq or DNase-seq reads at each position. The coverage is normalized relative to the mean of a 10000 bp sliding window.<br />
<br />
The input of '''NormalizePileupOutput''' is a pileup output file from the '''Chromatin pileup''' tool.<br />
<br />
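The normalization relative to a sliding-window mean can be sketched as follows for a single chromosome. The sketch assumes a window that is roughly centred on each position and counts positions without reads as zero; how '''NormalizePileupOutput''' places the window and treats chromosome ends is not specified here, so <code>normalize_pileup</code> is an illustration only.<br />

```python
def normalize_pileup(pileup, window=10000):
    """Divide each pileup value by the mean coverage of a surrounding window.

    pileup: dict mapping position -> raw pileup value for ONE chromosome.
    Assumption: the window spans `window` positions, starting window // 2
    positions before the current one; missing positions count as zero.
    """
    positions = sorted(pileup)
    normalized = {}
    lo = hi = 0          # indices delimiting the positions inside the window
    window_sum = 0
    for pos in positions:
        start = pos - window // 2
        end = start + window  # exclusive upper bound
        # Add positions that entered the window on the right.
        while hi < len(positions) and positions[hi] < end:
            window_sum += pileup[positions[hi]]
            hi += 1
        # Remove positions that left the window on the left.
        while positions[lo] < start:
            window_sum -= pileup[positions[lo]]
            lo += 1
        mean = window_sum / window
        normalized[pos] = pileup[pos] / mean if mean > 0 else 0.0
    return normalized


# Tiny made-up example with a window of 4 positions instead of 10000.
result = normalize_pileup({10: 4, 11: 2, 100: 6}, window=4)
```

Maintaining a running sum avoids rescanning the whole window for every position, which matters for genome-scale pileup files.<br />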
If you experience problems using '''NormalizePileupOutput''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''NormalizePileupOutput'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar normpileup<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Pileup output file (Pileup output file, type = tsv.gz,tsv,txt)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar normpileup p=&lt;Pileup_output_file&gt;<br />
<br />
<br />
=== PileupConvertToPromoter ===<br />
<br />
'''PileupConvertToPromoter''' converts the pileup output file to promoter coordinates.<br />
<br />
The input of '''PileupConvertToPromoter''' is <br />
1. a normalized pileup output file from the '''NormalizePileupOutput''' tool and <br />
2. the promoter sequences in FastA format with headers like:<br />
<code>> id chromosomeName:start-end:strand</code><br />
e.g.<br />
<code>> Os01g01010.1 Chr1:2602-3102:+</code>.<br />
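Converting genomic pileup positions to promoter coordinates amounts to shifting each position by the promoter start and discarding positions outside the promoter. The sketch below assumes 1-based inclusive coordinates and assumes that positions of reverse-strand promoters are counted from the promoter end; both conventions and the function names are hypothetical, not documented behaviour of '''PileupConvertToPromoter'''.<br />

```python
def to_promoter_position(pos, prom_start, prom_end, strand):
    """Map a 1-based genomic position to a 1-based promoter position.

    Returns None if the position lies outside the promoter.
    Assumption: reverse-strand promoters are counted from their end.
    """
    if not prom_start <= pos <= prom_end:
        return None
    if strand == "+":
        return pos - prom_start + 1
    return prom_end - pos + 1


def convert_pileup(rows, promoter):
    """Keep pileup rows inside one promoter and express them relative to it.

    rows: iterable of (chromosome, position, value)
    promoter: (id, chromosome, start, end, strand)
    """
    prom_id, chrom, start, end, strand = promoter
    converted = []
    for row_chrom, pos, value in rows:
        if row_chrom != chrom:
            continue
        rel = to_promoter_position(pos, start, end, strand)
        if rel is not None:
            converted.append((prom_id, rel, value))
    return converted


# Promoter from the example header above, with made-up pileup values.
promoter = ("Os01g01010.1", "Chr1", 2602, 3102, "+")
rows = [("Chr1", 2601, 1.0), ("Chr1", 2602, 2.5), ("Chr2", 2700, 3.0), ("Chr1", 3102, 0.5)]
converted = convert_pileup(rows, promoter)
```

For many promoters, an interval index per chromosome would be preferable to testing every row against every promoter.<br />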
<br />
If you experience problems using '''PileupConvertToPromoter''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''PileupConvertToPromoter'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar pile2prom<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Normalized pileup output file (Normalized pileup output file, type = tsv.gz,tsv)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Promoter fasta file (Promoter fastA file, type = fa,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar pile2prom n=&lt;Normalized_pileup_output_file&gt; p=&lt;Promoter_fasta_file&gt;<br />
<br />
<br />
=== NarrowPeakConvertToPromoter ===<br />
<br />
'''NarrowPeakConvertToPromoter''' converts a narrowPeak file containing peaks of chromatin accessibility to promoter coordinates.<br />
<br />
The input of '''NarrowPeakConvertToPromoter''' is <br />
1. a narrowPeak file and <br />
2. the promoter sequences in FastA format with headers like:<br />
<code>> id chromosomeName:start-end:strand</code><br />
e.g.<br />
<code>> Os01g01010.1 Chr1:2602-3102:+</code>.<br />
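For peaks, the conversion additionally requires clipping each peak to the promoter boundaries. The sketch below uses the standard narrowPeak convention (0-based start, exclusive end in columns 2 and 3) and, as a simplifying assumption, reports the clipped interval in forward orientation only; the function <code>clip_peak_to_promoter</code> is hypothetical and not part of the EpiTALE suite.<br />

```python
def clip_peak_to_promoter(peak_line, prom_chrom, prom_start, prom_end):
    """Clip a narrowPeak interval to a promoter (sketch).

    prom_start / prom_end: 1-based inclusive promoter coordinates.
    Returns a 0-based half-open (start, end) interval relative to the
    promoter start, or None if peak and promoter do not overlap.
    """
    fields = peak_line.rstrip("\n").split("\t")
    chrom, peak_start, peak_end = fields[0], int(fields[1]), int(fields[2])
    if chrom != prom_chrom:
        return None
    # Express the promoter as a 0-based half-open interval as well.
    p_start, p_end = prom_start - 1, prom_end
    start, end = max(peak_start, p_start), min(peak_end, p_end)
    if start >= end:
        return None
    return (start - p_start, end - p_start)


# Made-up peak that overlaps the first 99 bases of the example promoter.
peak = "Chr1\t2500\t2700\tpeak1\t500\t.\t12.3\t5.1\t3.2\t80"
clipped = clip_peak_to_promoter(peak, "Chr1", 2602, 3102)
```

A complete conversion would also shift the summit offset in column 10, which this sketch leaves out.<br />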
<br />
If you experience problems using '''NarrowPeakConvertToPromoter''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''NarrowPeakConvertToPromoter'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar peak2Prom<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>NarrowPeak file (Peak-calling output in narrowPeak format, type = narrowPeak,narrowPeak.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Promoter fasta file (Promoter fastA file, type = fa,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar peak2Prom n=&lt;NarrowPeak_file&gt; p=&lt;Promoter_fasta_file&gt;<br />
<br />
<br />
=== EpiTALE prediction ===<br />
<br />
'''EpiTALE''' predicts TALE target boxes using a novel model learned from quantitative data based on the RVD sequence of a TALE and optionally considers the methylation state of the target box during prediction, as DNA methylation affects the binding specificity of RVDs. <br />
Additionally, EpiTALE can optionally annotate predicted target sites with chromatin accessibility, using the output of the '''NormalizePileupOutput''' tool and the results of peak calling on DNase-seq or ATAC-seq data.<br />
<br />
As input, '''EpiTALE''' requires<br />
<br />
1. a set of sequences that are scanned for putative TALE target boxes. These sequences could be promoters of genes but also complete genomic sequences (FastA format). <br />
<br />
2. For computing p-values, EpiTALE additionally needs a background set of sequences, which is by default generated as a sub-sample of the original input data.<br />
<br />
3. The prediction threshold may be defined either by means of a p-value or by an approximate number of expected sites. The latter is also converted to a p-value internally, so the specified number of expected sites is, in general, not met exactly.<br />
<br />
4. TALEs are specified by a FastA file containing their RVD sequences, where individual RVDs are separated by dashes (-). This is the same format also output by the ''TALE Analysis'' tool of [http://www.jstacs.de/index.php/AnnoTALE AnnoTALE].<br />
<br />
5. It can be specified if both strands or only one of the strands are scanned where, in the former case, a penalty may be assigned to predictions on the reverse strand. While this penalty may be reasonable when scanning promoters, it should usually be set to <code>0</code> in case of genome-wide predictions.<br />
<br />
6. Optionally, '''EpiTALE''' considers methylation during prediction if Bismark output is provided. Running the [https://www.bioinformatics.babraham.ac.uk/projects/bismark/ Bismark methylation extractor] with parameters <code>--bedGraph --CX -p</code> generates a coverage file, which contains the tab-separated columns: <br />
<code>chromosome, start_position, end_position, methylation_percentage, count_methylated, count_unmethylated</code> (file.cov.gz). <br />
You can alternatively use the tool '''Bed2Bismark''', which converts data in BedMethyl format to Bismark format. <br />
<br />
7.<br />
(i) The chromatin accessibility of the input sequences can optionally be provided in narrowPeak format. Such a file can be obtained by mapping ATAC-seq or DNase-seq data to the corresponding genome and then performing peak calling, e.g., with [https://github.com/mahmoudibrahim/JAMM JAMM]. If promoter sequences are used as input, you should run the tool '''NarrowPeakConvertToPromoter''' to convert the narrowPeak file to promoter coordinates. <br />
(ii) Additionally, you can calculate a coverage pileup of 5' ends of mapped reads with '''Chromatin pileup''' and normalize it with '''NormalizePileupOutput'''. In case of promoter sequences as input, you should run the tool '''PileupConvertToPromoter''' to convert to promoter coordinates. <br />
<br />
8.<br />
(i) In case of '''genomic search''', the parameter ''calculate coverage area'' should be set to <code>surround target site</code>. You can set the number of positions before the target site with <code>coverage before value</code> (default: 300) and the number of positions after the target site with <code>coverage after value</code> (default: 200). <br />
(ii) In case of '''promoter search''', the parameter ''calculate coverage area'' may be set to <code>on complete sequence</code> or <code>surround target site</code>. The number of positions before and after the binding site in the peak profile can be set with <code>Peak before value</code> (default: 300) and <code>Peak after value</code> (default: 50).<br />
<br />
In case of '''genomic search''', you can filter predictions of TALE target boxes by the presence of differentially expressed regions in a defined vicinity around a predicted target box with the tool '''DerTALE''' of the [http://www.jstacs.de/index.php/AnnoTALE AnnoTALE suite].<br />
<br />
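The effect of the <code>coverage before value</code> and <code>coverage after value</code> parameters from item 8 can be illustrated by computing the window around a predicted target site. The sketch uses the documented defaults (300 and 200) but assumes 1-based inclusive coordinates and assumes that "before" and "after" are interpreted relative to the strand of the site; the functions <code>coverage_window</code> and <code>coverage_profile</code> are hypothetical.<br />

```python
def coverage_window(site_start, site_end, strand="+", before=300, after=200):
    """Return the (first, last) genomic position of the coverage window.

    Assumption: for reverse-strand sites, 'before' extends towards larger
    coordinates. The window is truncated at the sequence start (position 1).
    """
    if strand == "+":
        return max(1, site_start - before), site_end + after
    return max(1, site_start - after), site_end + before


def coverage_profile(coverage, window):
    """Extract normalized coverage values inside a window.

    coverage: dict position -> normalized pileup value; positions that are
    missing from the dict are treated as zero.
    """
    first, last = window
    return [coverage.get(pos, 0.0) for pos in range(first, last + 1)]


# Made-up target site of length 20 on the forward strand.
window = coverage_window(1000, 1019)
profile = coverage_profile({700: 1.5, 1219: 2.0}, window)
```

With the defaults, a site of length 20 yields a profile of 520 positions: 300 before the site, the 20 positions of the site itself, and 200 after it.<br />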
If you experience problems using '''EpiTALE''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
''EpiTALE prediction'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar epitale<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Sequences (The sequences (e.g., a genome) to scan for binding sites, type = fa,fas,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Background sample (The sequences for determining the prediction threshold. Either a sub-sample of the input sequences or a dedicated background data set., range={sub-sample, background sequences}, default = sub-sample)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;sub-sample&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;background sequences&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bs</font></td><br />
<td>Background sequences (The sequences (e.g., a genome) for determining the prediction threshold, type = fa,fas,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>Threshold specification (The way of defining the prediction threshold. Either by explicitly defining a significance level or by specifying the number of expected sites, range={significance level, number of sites}, default = significance level)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;significance level&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sl</font></td><br />
<td>Significance level (The significance level for determining the prediction threshold, valid range = [0.0, 0.01], default = 1.0E-4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;number of sites&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Number of sites (The number of expected binding sites for determining the prediction threshold, valid range = [1, 1000000], default = 10000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">TALEs</font></td><br />
<td>TALEs (The RVD sequences of the TALE, separated by dashes, in FastA format, type = fasta,fas,fa)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Strand</font></td><br />
<td>Strand (Predict target sites on both strands, or on the forward or reverse strand only, range={both strands, forward strand, reverse strand}, default = both strands)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;both strands&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>Reverse penalty (Penalty for predictions on the reverse strand, valid range = [0.0, 1.7976931348623157E308], default = 0.01)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;forward strand&quot;</b></td></tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;reverse strand&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bf</font></td><br />
<td>Bismark file (The bedGraph output of Bismark (file.cov.gz) containing &lt;chromosome&gt; &lt;start position&gt; &lt;end position&gt; &lt;methylation percentage&gt; &lt;count methylated&gt; &lt;count unmethylated&gt;, type = cov,cov.gz, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">nf</font></td><br />
<td>NarrowPeak file (The output of a peak caller (all.peaks.narrowPeak), type = narrowPeak,narrowPeak.gz, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">npo</font></td><br />
<td>Normalized pileup output (The normalized output of pileup with values larger than zero (file.txt) containing &lt;chromosome&gt; &lt;position&gt; &lt;coverage&gt;, type = tsv,tsv.gz, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>Calculate coverage area (Calculate coverage area surround target site, or on complete sequence, range={surround target site, on complete sequence}, default = surround target site, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;surround target site&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">cbv</font></td><br />
<td>Coverage before value (Number of positions before target site in coverage profile, valid range = [1, 500], default = 300, OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">cav</font></td><br />
<td>Coverage after value (Number of positions after target site in coverage profile, valid range = [1, 500], default = 200, OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;on complete sequence&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Peak before value (Number of positions before target site in peak profile, valid range = [1, 500], default = 300, OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pav</font></td><br />
<td>Peak after value (Number of positions after target site in peak profile, valid range = [1, 500], default = 50, OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar epitale s=&lt;Sequences&gt; TALEs=&lt;TALEs&gt;</div>Grauhttps://www.jstacs.de/index.php?title=EpiTALE&diff=1140EpiTALE2021-05-10T22:49:53Z<p>Grau: /* Example data */</p>
<hr />
<div>[[File:EpiTALE_256.png|130px|left]] EpiTALE predicts binding sites of transcription activator-like effectors (TALEs) in promoteromes or genomes. EpiTALE not only considers the DNA sequence of putative binding sites but also epigenetic determinants of TALE binding, namely DNA methylation and chromatin accessibility. The prediction is based on the same basic model as [[PrediTALE]] but with specific parameters for methylated cytosines reflecting the binding preferences of RVDs.<br />
<br />
Here, we provide a suite of tools including the EpiTALE program itself but also auxiliary tools for converting methylation data and chromatin accessibility data to the required formats, and for converting genomic coordinates to promoter-wise coordinates for promoterome-wide predictions.<br />
<br />
Genome-wide predictions of EpiTALE may further be combined with evidence from RNA-seq data using the DerTALE tool of [[AnnoTALE]].<br />
<br />
The EpiTALE suite is provided both with a graphical user interface and as a command line version to serve the needs of different user groups. Both versions use the identical code base.<br />
<br />
In the following, we describe how to obtain the EpiTALE suite and how to use its individual tools. While parameters are described in terms of command line arguments, the same parameters are available in the version with graphical user interface.<br />
<br />
== Download ==<br />
<br />
=== GUI version ===<br />
<br />
* [http://www.jstacs.de/downloads/EpiTALE-0.1.jar Runnable Jar]: requires Java >= 8 including JavaFX installed, may be run under Linux, Windows and macOS.<br />
* [http://www.jstacs.de/downloads/EpiTALE-0.1.app.zip macOS app]: ZIP archive containing a macOS app including EpiTALE and all required Java modules. For running this app, it might be required to explicitly give it running permissions in "System Preferences" -> "Security & Privacy" -> "General", which should list EpiTALE after the first (possibly unsuccessful) starting attempt.<br />
* [http://www.jstacs.de/downloads/EpiTALE-0.1-win.zip Windows program]: ZIP archive containing the EpiTALE Jar, all required Java modules, and a Windows batch file. For starting EpiTALE, double-click EpiTALE.bat.<br />
<br />
=== Command line version ===<br />
<br />
* [http://www.jstacs.de/downloads/EpiTALEcli-0.1.jar Runnable Jar]: requires Java >= 8, may be run under Linux, Windows and macOS. May be started with<br />
java -jar EpiTALEcli-0.1.jar<br />
from the command line (for tools and arguments, see below).<br />
<br />
== Example data ==<br />
<br />
We provide an archive with example data at [https://zenodo.org zenodo]. Besides the data, this archive contains the command line version of the EpiTALE suite v0.1 and a bash script demonstrating the complete EpiTALE pipeline.<br />
<br />
== Tools ==<br />
<br />
=== Bed2Bismark ===<br />
<br />
'''Bed2Bismark''' converts methylation information in bedMethyl format to Bismark format.<br />
<br />
The input of '''Bed2Bismark''' is a file in bedMethyl format.<br />
<br />
If you experience problems using '''Bed2Bismark''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Bed2Bismark'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar bed2bismark<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>BedMethyl file (Methylation information in bedMethyl format, type = bed.gz,bed)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar bed2bismark b=&lt;BedMethyl_file&gt;<br />
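<br />
To make the two formats concrete, the following Python sketch converts a single bedMethyl record into a Bismark coverage record. It is an illustration only and not the code used by '''Bed2Bismark''': it assumes the ENCODE bedMethyl layout (column 10 = read coverage, column 11 = percentage of methylated reads), a 0-based BED start and 1-based Bismark positions, and the function name <code>bedmethyl_to_bismark</code> is hypothetical.<br />
<br />
```python
def bedmethyl_to_bismark(line):
    """Convert one tab-separated bedMethyl line to a Bismark coverage line.

    Illustrative sketch; column meanings follow the ENCODE bedMethyl
    description and are an assumption about the input file.
    """
    fields = line.rstrip("\n").split("\t")
    chrom = fields[0]
    start0 = int(fields[1])        # BED start coordinate is 0-based
    coverage = int(fields[9])      # column 10: number of reads covering the site
    percent = float(fields[10])    # column 11: percentage of methylated reads
    methylated = round(coverage * percent / 100.0)
    unmethylated = coverage - methylated
    pos = start0 + 1               # Bismark coverage positions are 1-based
    columns = (chrom, pos, pos, percent, methylated, unmethylated)
    return "\t".join(str(value) for value in columns)


example = "Chr1\t99\t100\t.\t20\t+\t99\t100\t0,255,0\t20\t25"
print(bedmethyl_to_bismark(example))
```
<br />
The resulting line has the six columns <code>chromosome, start_position, end_position, methylation_percentage, count_methylated, count_unmethylated</code> that the other tools of the suite expect.<br />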
<br />
<br />
=== BismarkMerge2Files ===<br />
<br />
'''BismarkMerge2Files''' merges files generated by the [https://www.bioinformatics.babraham.ac.uk/projects/bismark/ Bismark methylation extractor] with parameters <code>--bedGraph --CX -p</code>.<br />
The output contains a coverage file, which contains the tab-separated columns:<br />
<code>chromosome, start_position, end_position, methylation_percentage, count_methylated, count_unmethylated</code>.<br />
<br />
The input of '''BismarkMerge2Files''' consists of two Bismark coverage files.<br />
<br />
If you experience problems using '''BismarkMerge2Files''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
<br />
''BismarkMerge2Files'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar bismerger<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bismark file 1 (Methylation information in Bismark format, file 1, type = cov.gz,cov)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bf2</font></td><br />
<td>Bismark file 2 (Methylation information in Bismark format, file 2, type = cov.gz,cov)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar bismerger b=&lt;Bismark_file_1&gt; bf2=&lt;Bismark_file_2&gt;<br />
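<br />
One plausible way to merge two coverage files is to add up the methylated and unmethylated read counts per position and to recompute the methylation percentage from the sums. The Python sketch below illustrates this idea; whether '''BismarkMerge2Files''' merges records in exactly this way is an assumption, and the function name <code>merge_coverage</code> is hypothetical.<br />
<br />
```python
from collections import defaultdict


def merge_coverage(lines_a, lines_b):
    """Merge two lists of Bismark coverage lines by summing read counts.

    Illustrative sketch: positions present in only one input are kept,
    and the percentage is recomputed from the summed counts.
    """
    counts = defaultdict(lambda: [0, 0])
    for line in list(lines_a) + list(lines_b):
        chrom, start, end, _pct, meth, unmeth = line.rstrip("\n").split("\t")
        key = (chrom, int(start), int(end))
        counts[key][0] += int(meth)
        counts[key][1] += int(unmeth)
    merged = []
    for (chrom, start, end), (meth, unmeth) in sorted(counts.items()):
        total = meth + unmeth
        pct = 100.0 * meth / total if total else 0.0
        merged.append((chrom, start, end, pct, meth, unmeth))
    return merged


file_1 = ["Chr1\t100\t100\t50.0\t1\t1"]
file_2 = ["Chr1\t100\t100\t100.0\t2\t0", "Chr1\t200\t200\t0.0\t0\t4"]
for record in merge_coverage(file_1, file_2):
    print("\t".join(str(value) for value in record))
```
<br />
Summing counts rather than averaging percentages keeps positions with many reads from being outweighed by positions with few reads.<br />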
<br />
<br />
=== BismarkConvertToPromoter ===<br />
<br />
'''BismarkConvertToPromoter''' converts the Bismark output file to promoter coordinates.<br />
<br />
The input of '''BismarkConvertToPromoter''' is <br />
1. a Bismark coverage output file, which contains tab-separated columns: <br />
<code>chromosome, start_position, end_position, methylation_percentage, count_methylated, count_unmethylated</code> and <br />
2. the promoter sequences in FastA format with headers like:<br />
<code>> id chromosomeName:start-end:strand</code><br />
e.g.<br />
<code>> Os01g01010.1 Chr1:2602-3102:+</code>.<br />
<br />
If you experience problems using '''BismarkConvertToPromoter''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''BismarkConvertToPromoter'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar bis2prom<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bismark file (Methylation information in Bismark format, type = cov.gz,cov)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Promoter fasta file (Promoter fastA file, type = fa,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar bis2prom b=&lt;Bismark_file&gt; p=&lt;Promoter_fasta_file&gt;<br />
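<br />
The promoter header format described above can be parsed with a short regular expression, and a genomic position can then be expressed relative to the promoter. The following Python sketch shows the idea. It is not the implementation of '''BismarkConvertToPromoter''': the function names are hypothetical, and the choice of 0-based offsets counted from the promoter start (or from the promoter end on the reverse strand) is an assumption made for illustration.<br />
<br />
```python
import re

# Matches headers like "> Os01g01010.1 Chr1:2602-3102:+"
HEADER = re.compile(r"^>\s*(\S+)\s+(\S+):(\d+)-(\d+):([+-])\s*$")


def parse_promoter_header(header):
    """Split a promoter FastA header into id, chromosome, start, end, strand."""
    match = HEADER.match(header)
    if match is None:
        raise ValueError("unexpected promoter header: " + header)
    gene_id, chrom, start, end, strand = match.groups()
    return gene_id, chrom, int(start), int(end), strand


def to_promoter_position(pos, start, end, strand):
    """Return the offset of a genomic position within the promoter.

    Returns None if the position lies outside the promoter. Offsets are
    0-based and follow the promoter's strand (an illustrative convention).
    """
    if not start <= pos <= end:
        return None
    return pos - start if strand == "+" else end - pos


gene_id, chrom, start, end, strand = parse_promoter_header("> Os01g01010.1 Chr1:2602-3102:+")
print(gene_id, chrom, to_promoter_position(2702, start, end, strand))
```
<br />
Methylation records whose position falls outside every promoter interval would simply be dropped in such a conversion.<br />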
<br />
<br />
=== Chromatin pileup ===<br />
<br />
'''Chromatin pileup''' takes as input a BAM file of mapped reads from a DNase-seq or ATAC-seq experiment, <br />
computes a coverage pileup of 5' ends of mapped reads, <br />
and outputs a simple tab-separated file with columns: <br />
<code>chromosome, position,</code> and <code>pileup value</code> (number of reads with a 5' end at this position).<br />
<br />
If you experience problems using '''Chromatin pileup''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Chromatin pileup'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar pileup<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>BAM file (Mapped reads from DNase-seq or ATAC-seq experiment, type = bam)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar pileup b=&lt;BAM_file&gt;<br />
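<br />
The notion of a pileup of 5' ends can be illustrated without reading a BAM file. The Python sketch below works on already extracted alignments given as <code>(chromosome, start, end, strand)</code> tuples with 1-based inclusive coordinates; this input representation and the function name <code>five_prime_pileup</code> are assumptions for illustration and do not reflect how '''Chromatin pileup''' reads its input.<br />
<br />
```python
from collections import Counter


def five_prime_pileup(reads):
    """Count, per position, the reads whose 5' end maps there.

    The 5' end of a forward-strand read is its leftmost aligned position,
    that of a reverse-strand read is its rightmost aligned position.
    """
    pileup = Counter()
    for chrom, start, end, strand in reads:
        five_prime = start if strand == "+" else end
        pileup[(chrom, five_prime)] += 1
    return pileup


reads = [
    ("Chr1", 100, 150, "+"),
    ("Chr1", 100, 140, "+"),
    ("Chr1", 60, 100, "-"),
    ("Chr1", 120, 170, "-"),
]
for (chrom, position), value in sorted(five_prime_pileup(reads).items()):
    print(chrom, position, value, sep="\t")
```
<br />
Only positions with at least one 5' end appear in the result, which matches the sparse three-column layout described above.<br />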
<br />
<br />
=== NormalizePileupOutput ===<br />
<br />
'''NormalizePileupOutput''' normalizes the pileup output file, which contains the coverage by 5' ends of ATAC-seq or DNase-seq reads at each position. The coverage is normalized relative to the mean of a 10000 bp sliding window.<br />
<br />
The input of '''NormalizePileupOutput''' is a pileup output file from the '''Chromatin pileup''' tool.<br />
<br />
If you experience problems using '''NormalizePileupOutput''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''NormalizePileupOutput'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar normpileup<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Pileup output file (Pileup output file., type = tsv.gz,tsv,txt)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar normpileup p=&lt;Pileup_output_file&gt;<br />
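<br />
Normalization relative to a sliding-window mean can be sketched as follows in Python. The sketch assumes a dense list of pileup values (one per position, zeros included) and a window that is roughly centred on the current position and truncated at the sequence ends; the exact window placement used by '''NormalizePileupOutput''' is not specified here, so these details and the function name <code>normalize_pileup</code> are assumptions.<br />
<br />
```python
def normalize_pileup(values, window=10000):
    """Divide each value by the mean of the window surrounding it.

    Uses prefix sums so that each window mean is computed in constant time.
    Positions whose window mean is zero are reported as 0.0.
    """
    n = len(values)
    half = window // 2
    prefix = [0.0]
    for value in values:
        prefix.append(prefix[-1] + value)
    normalized = []
    for i in range(n):
        lo = max(0, i - half)            # window start (inclusive)
        hi = min(n, i - half + window)   # window end (exclusive)
        mean = (prefix[hi] - prefix[lo]) / (hi - lo)
        normalized.append(values[i] / mean if mean > 0 else 0.0)
    return normalized


print(normalize_pileup([0, 2, 4, 2, 0, 0], window=4))
```
<br />
A value of 1.0 then corresponds to a position whose pileup equals the local average, while larger values indicate locally enriched cut sites.<br />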
<br />
<br />
=== PileupConvertToPromoter ===<br />
<br />
'''PileupConvertToPromoter''' converts the pileup output file to promoter coordinates.<br />
<br />
The input of '''PileupConvertToPromoter''' is <br />
1. a normalized pileup output file from the '''NormalizePileupOutput''' tool and <br />
2. the promoter sequences in FastA format with headers like:<br />
<code>> id chromosomeName:start-end:strand</code><br />
e.g.<br />
<code>> Os01g01010.1 Chr1:2602-3102:+</code>.<br />
<br />
If you experience problems using '''PileupConvertToPromoter''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''PileupConvertToPromoter'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar pile2prom<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Normalized pileup output file (Normalized pileup output file., type = tsv.gz,tsv)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Promoter fasta file (Promoter fastA file, type = fa,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar pile2prom n=&lt;Normalized_pileup_output_file&gt; p=&lt;Promoter_fasta_file&gt;<br />
<br />
<br />
=== NarrowPeakConvertToPromoter ===<br />
<br />
'''NarrowPeakConvertToPromoter''' converts a narrowPeak file containing peaks of chromatin accessibility to promoter coordinates.<br />
<br />
The input of '''NarrowPeakConvertToPromoter''' is <br />
1. a narrowPeak file and <br />
2. the promoter sequences in FastA format with headers like:<br />
<code>> id chromosomeName:start-end:strand</code><br />
e.g.<br />
<code>> Os01g01010.1 Chr1:2602-3102:+</code>.<br />
<br />
If you experience problems using '''NarrowPeakConvertToPromoter''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''NarrowPeakConvertToPromoter'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar peak2Prom<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>NarrowPeak file (Peak-calling output in narrowPeak format., type = narrowPeak,narrowPeak.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Promoter fasta file (Promoter fastA file, type = fa,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar peak2Prom n=&lt;NarrowPeak_file&gt; p=&lt;Promoter_fasta_file&gt;<br />
<br />
<br />
=== EpiTALE prediction ===<br />
<br />
'''EpiTALE''' predicts TALE target boxes using a novel model learned from quantitative data based on the RVD sequence of a TALE and optionally considers the methylation state of the target box during prediction, as DNA methylation affects the binding specificity of RVDs. <br />
Additionally, '''EpiTALE''' can optionally annotate its predicted target sites with chromatin accessibility, using the output of the '''NormalizePileupOutput''' tool and the results of peak calling on DNase-seq or ATAC-seq data.<br />
<br />
As input, '''EpiTALE''' requires<br />
<br />
1. a set of sequences that are scanned for putative TALE target boxes. These sequences could be promoters of genes but also complete genomic sequences (FastA format). <br />
<br />
2. For computing p-values, EpiTALE additionally needs a background set of sequences, which is by default generated as a sub-sample of the original input data.<br />
<br />
3. The prediction threshold may be defined either by means of a p-value or by an approximate number of expected sites. The latter is also converted to a p-value internally, so the defined number of expected sites is, in general, not met exactly.<br />
<br />
4. TALEs are specified by a FastA file containing their RVD sequences, where individual RVDs are separated by dashes (-). This is the same format also output by the ''TALE Analysis'' tool of [http://www.jstacs.de/index.php/AnnoTALE AnnoTALE].<br />
<br />
5. It can be specified whether both strands or only one of the strands are scanned. In the former case, a penalty may be assigned to predictions on the reverse strand. While this penalty may be reasonable when scanning promoters, it should usually be set to <code>0</code> in case of genome-wide predictions.<br />
<br />
6. Optionally, '''EpiTALE''' considers methylation during prediction if Bismark output is provided. Running the [https://www.bioinformatics.babraham.ac.uk/projects/bismark/ Bismark methylation extractor] with parameters <code>--bedGraph --CX -p</code> generates a coverage file, which contains the tab-separated columns: <br />
<code>chromosome, start_position, end_position, methylation_percentage, count_methylated, count_unmethylated</code> (file.cov.gz). <br />
You can alternatively use the tool '''Bed2Bismark''', which converts data in BedMethyl format to Bismark format. <br />
<br />
7.<br />
(i) The chromatin accessibility of the input sequences can optionally be provided in narrowPeak format. Such a file may be obtained by mapping ATAC-seq or DNase-seq data to the corresponding genome and then performing peak calling, e.g., with [https://github.com/mahmoudibrahim/JAMM JAMM]. In case of promoter sequences as input, you should run the tool '''NarrowPeakConvertToPromoter''' to convert the narrowPeak file to promoter coordinates. <br />
(ii) Additionally, you can calculate a coverage pileup of 5' ends of mapped reads with '''Chromatin pileup''' and normalize it with '''NormalizePileupOutput'''. In case of promoter sequences as input, you should run the tool '''PileupConvertToPromoter''' to convert to promoter coordinates. <br />
<br />
8.<br />
(i) In case of '''genomic search''', the parameter ''calculate coverage area'' should be set to <code>surround target site</code>. You can then set the number of positions before the target site with <code>coverage before value</code> (default: 300) and the number of positions after the target site with <code>coverage after value</code> (default: 200). <br />
(ii) In case of '''promoter search''', the parameter ''calculate coverage area'' may be set to <code>on complete sequence</code> or <code>surround target site</code>. The number of positions before and after the binding site in the peak profile can be set with <code>Peak before value</code> (default: 300) and <code>Peak after value</code> (default: 50).<br />
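<br />
The TALE input format mentioned in item 4 (FastA with RVDs separated by dashes) is easy to inspect before starting a prediction. The Python sketch below reads such a file into a dictionary of RVD lists. It is a minimal illustration under the assumption that each record consists of a header line followed by one or more lines of dash-separated RVDs; the TALE name <code>TalExample</code> and the function name <code>read_tales</code> are hypothetical.<br />
<br />
```python
def read_tales(fasta_text):
    """Parse FastA text with dash-separated RVDs into {name: [RVD, ...]}."""
    tales = {}
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            tales[name] = []
        elif name is not None:
            # Skip empty fields that arise from leading or trailing dashes
            tales[name].extend(rvd for rvd in line.split("-") if rvd)
    return tales


example = ">TalExample\nNI-HD-NN-NG-N*-HD\n"
for tale_name, rvds in read_tales(example).items():
    print(tale_name, len(rvds), rvds)
```
<br />
Checking the number of RVDs per TALE in this way can help to spot truncated or malformed records before they are passed to the <code>TALEs</code> parameter.<br />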
<br />
In case of '''genomic search''', you can filter predictions of TALE target boxes by the presence of differentially expressed regions in a defined vicinity around a predicted target box with the tool '''DerTALE''' of the [http://www.jstacs.de/index.php/AnnoTALE AnnoTALE suite].<br />
<br />
If you experience problems using '''EpiTALE''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
''EpiTALE prediction'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar epitale<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Sequences (The sequences (e.g., a genome) to scan for binding sites, type = fa,fas,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Background sample (The sequences for determining the prediction threshold. Either a sub-sample of the input sequences or a dedicated background data set., range={sub-sample, background sequences}, default = sub-sample)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;sub-sample&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;background sequences&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bs</font></td><br />
<td>Background sequences (The sequences (e.g., a genome) for determining the prediction threshold, type = fa,fas,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>Threshold specification (The way of defining the prediction threshold. Either by explicitly defining a significance level or by specifying the number of expected sites, range={significance level, number of sites}, default = significance level)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;significance level&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sl</font></td><br />
<td>Significance level (The significance level for determining the prediction threshold, valid range = [0.0, 0.01], default = 1.0E-4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;number of sites&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Number of sites (The number of expected binding sites for determining the prediction threshold, valid range = [1, 1000000], default = 10000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">TALEs</font></td><br />
<td>TALEs (The RVD sequences of the TALE, separated by dashes, in FastA format, type = fasta,fas,fa)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Strand</font></td><br />
<td>Strand (Predict target sites on both strands, or on the forward or reverse strand only, range={both strands, forward strand, reverse strand}, default = both strands)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;both strands&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>Reverse penalty (Penalty for predictions on the reverse strand, valid range = [0.0, 1.7976931348623157E308], default = 0.01)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;forward strand&quot;</b></td></tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;reverse strand&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bf</font></td><br />
<td>Bismark file (The bedGraph output of Bismark (file.cov.gz) containing &lt;chromosome&gt; &lt;start position&gt; &lt;end position&gt; &lt;methylation percentage&gt; &lt;count methylated&gt; &lt;count unmethylated&gt;, type = cov,cov.gz, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">nf</font></td><br />
<td>NarrowPeak file (The output of a peak caller (all.peaks.narrowPeak), type = narrowPeak,narrowPeak.gz, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">npo</font></td><br />
<td>Normalized pileup output (The normalized output of pileup with values larger than zero (file.txt) containing &lt;chromosome&gt; &lt;position&gt; &lt;coverage&gt;, type = tsv,tsv.gz, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>Calculate coverage area (Calculate coverage area surround target site, or on complete sequence, range={surround target site, on complete sequence}, default = surround target site, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;surround target site&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">cbv</font></td><br />
<td>Coverage before value (Number of positions before target site in coverage profile, valid range = [1, 500], default = 300, OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">cav</font></td><br />
<td>Coverage after value (Number of positions after target site in coverage profile, valid range = [1, 500], default = 200, OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;on complete sequence&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Peak before value (Number of positions before target site in peak profile, valid range = [1, 500], default = 300, OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pav</font></td><br />
<td>Peak after value (Number of positions after target site in peak profile, valid range = [1, 500], default = 50, OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar epitale s=&lt;Sequences&gt; TALEs=&lt;TALEs&gt;</div>Grauhttps://www.jstacs.de/index.php?title=EpiTALE&diff=1139EpiTALE2021-05-10T22:49:29Z<p>Grau: </p>
<hr />
<div>[[File:EpiTALE_256.png|130px|left]] EpiTALE predicts binding sites of transcription activator-like effectors (TALEs) in promoteromes or genomes. EpiTALE not only considers the DNA sequence of putative binding sites but also epigenetic determinants of TALE binding, namely DNA methylation and chromatin accessibility. The prediction is based on the same basic model as [[PrediTALE]] but with specific parameters for methylated cytosines reflecting the binding preferences of RVDs.<br />
<br />
Here, we provide a suite of tools including the EpiTALE program itself but also auxiliary tools for converting methylation data and chromatin accessibility data to the required formats, and for converting genomic coordinates to promoter-wise coordinates for promoterome-wide predictions.<br />
<br />
Genome-wide predictions of EpiTALE may further be combined with evidence from RNA-seq data using the DerTALE tool of [[AnnoTALE]].<br />
<br />
The EpiTALE suite is provided both with a graphical user interface and as a command line version to serve the needs of different user groups. Both versions use the identical code base.<br />
<br />
In the following, we describe how to obtain the EpiTALE suite and how to use its individual tools. While parameters are described in terms of command line arguments, the same parameters are available in the version with graphical user interface.<br />
<br />
== Download ==<br />
<br />
=== GUI version ===<br />
<br />
* [http://www.jstacs.de/downloads/EpiTALE-0.1.jar Runnable Jar]: requires Java >= 8 with JavaFX installed; may be run under Linux, Windows and macOS.<br />
* [http://www.jstacs.de/downloads/EpiTALE-0.1.app.zip macOS app]: ZIP archive containing a macOS app including EpiTALE and all required Java modules. For running this app, it might be required to explicitly give it running permissions in "System Preferences" -> "Security & Privacy" -> "General", which should list EpiTALE after the first (possibly unsuccessful) starting attempt.<br />
* [http://www.jstacs.de/downloads/EpiTALE-0.1-win.zip Windows program]: ZIP archive containing the EpiTALE Jar, all required Java modules, and a Windows batch file. For starting EpiTALE, double-click EpiTALE.bat.<br />
<br />
=== Command line version ===<br />
<br />
* [http://www.jstacs.de/downloads/EpiTALEcli-0.1.jar Runnable Jar]: requires Java >= 8, may be run under Linux, Windows and macOS. May be started with<br />
java -jar EpiTALEcli-0.1.jar<br />
from the command line (for tools and arguments, see below).<br />
<br />
== Example data ==<br />
<br />
We provide an archive with example data at [http:// zenodo]. Besides the data, this archive contains the command line version of the EpiTALE suite v0.1 and a bash script demonstrating the complete EpiTALE pipeline.<br />
<br />
== Tools ==<br />
<br />
=== Bed2Bismark ===<br />
<br />
'''Bed2Bismark''' converts methylation information in bedMethyl format to Bismark format.<br />
<br />
The input of '''Bed2Bismark''' is a file in bedMethyl format.<br />
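<br />
The following Python sketch illustrates what such a conversion might look like for a single line. It is not the code used by '''Bed2Bismark'''. It assumes the ENCODE-style bedMethyl layout (0-based start in column 2, read coverage in column 10, percentage of methylated reads in column 11) and a 1-based Bismark coverage line. The function name <code>bedmethyl_to_bismark</code> is hypothetical.<br />

```python
def bedmethyl_to_bismark(line):
    """Convert one bedMethyl line to one Bismark coverage line (assumed mapping)."""
    fields = line.rstrip("\n").split("\t")
    chrom = fields[0]
    start0 = int(fields[1])      # assumption: bedMethyl start is 0-based
    coverage = int(fields[9])    # assumption: column 10 holds the read coverage
    percent = float(fields[10])  # assumption: column 11 holds percent methylated
    methylated = round(coverage * percent / 100.0)
    unmethylated = coverage - methylated
    pos = start0 + 1             # assumption: Bismark coverage positions are 1-based
    columns = (chrom, pos, pos, percent, methylated, unmethylated)
    return "\t".join(str(value) for value in columns)
```

The counts of methylated and unmethylated reads are reconstructed from coverage and percentage here, so they are only as exact as the rounding of the percentage in the input file.<br />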
<br />
If you experience problems using '''Bed2Bismark''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Bed2Bismark'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar bed2bismark<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>BedMethyl file (Methylation information in bedMethyl format, type = bed.gz,bed)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar bed2bismark b=&lt;BedMethyl_file&gt;<br />
<br />
<br />
=== BismarkMerge2Files ===<br />
<br />
'''BismarkMerge2Files''' merges files generated by [https://www.bioinformatics.babraham.ac.uk/projects/bismark/ Bismark methylation extractor] with parameters <code>--bedGraph --CX -p</code>.<br />
The output is a coverage file with the tab-separated columns:<br />
<code>chromosome, start_position, end_position, methylation_percentage, count_methylated, count_unmethylated</code>.<br />
<br />
The input of '''BismarkMerge2Files''' consists of two Bismark coverage files.<br />
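<br />
The page does not state how positions present in both files are combined. As a minimal sketch, assuming that read counts are summed per position and the methylation percentage is recomputed from the summed counts, a merge could look like this in Python (the function name <code>merge_coverage</code> is hypothetical and this is not the tool's implementation):<br />

```python
from collections import defaultdict


def merge_coverage(lines_a, lines_b):
    """Merge two lists of Bismark coverage lines by summing read counts."""
    counts = defaultdict(lambda: [0, 0])
    for line in list(lines_a) + list(lines_b):
        chrom, start, end, _pct, meth, unmeth = line.rstrip("\n").split("\t")
        key = (chrom, int(start), int(end))
        counts[key][0] += int(meth)    # methylated read count
        counts[key][1] += int(unmeth)  # unmethylated read count
    merged = []
    for (chrom, start, end), (meth, unmeth) in sorted(counts.items()):
        total = meth + unmeth
        # Assumption: the percentage is recomputed from the summed counts.
        pct = 100.0 * meth / total if total else 0.0
        merged.append((chrom, start, end, pct, meth, unmeth))
    return merged
```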
<br />
If you experience problems using '''BismarkMerge2Files''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
<br />
''BismarkMerge2Files'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar bismerger<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bismark file 1 (Methylation information in bismark format file 1, type = cov.gz,cov)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bf2</font></td><br />
<td>Bismark file 2 (Methylation information in bismark format file 2, type = cov.gz,cov)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar bismerger b=&lt;Bismark_file_1&gt; bf2=&lt;Bismark_file_2&gt;<br />
<br />
<br />
=== BismarkConvertToPromoter ===<br />
<br />
'''BismarkConvertToPromoter''' converts the Bismark output file to promoter coordinates.<br />
<br />
The input of '''BismarkConvertToPromoter''' is <br />
1. a Bismark coverage output file, which contains tab-separated columns: <br />
<code>chromosome, start_position, end_position, methylation_percentage, count_methylated, count_unmethylated</code> and <br />
2. the promoter sequences in FastA format with headers like:<br />
<code>> id chromosomeName:start-end:strand</code><br />
e.g.<br />
<code>> Os01g01010.1 Chr1:2602-3102:+</code>.<br />
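<br />
To make the header format concrete, the Python sketch below parses such a header and maps a genomic position to a position within the promoter. It assumes 1-based inclusive coordinates and that promoters on the minus strand are counted from their genomic end. Both points are assumptions, not statements about the internals of '''BismarkConvertToPromoter''', and the function names are hypothetical.<br />

```python
import re

# Matches headers like "> id chromosomeName:start-end:strand"
HEADER = re.compile(r"^>\s*(\S+)\s+(\S+):(\d+)-(\d+):([+-])\s*$")


def parse_promoter_header(header):
    """Split a promoter FastA header into id, chromosome, start, end, strand."""
    match = HEADER.match(header)
    if match is None:
        raise ValueError("unexpected promoter header: " + header)
    gene_id, chrom, start, end, strand = match.groups()
    return gene_id, chrom, int(start), int(end), strand


def to_promoter_position(pos, start, end, strand):
    """Map a genomic position to a 1-based promoter position, or None if outside."""
    if not start <= pos <= end:
        return None
    # Assumption: minus-strand promoters are numbered from their genomic end.
    return pos - start + 1 if strand == "+" else end - pos + 1
```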
<br />
If you experience problems using '''BismarkConvertToPromoter''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''BismarkConvertToPromoter'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar bis2prom<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bismark file (Methylation information in bismark format, type = cov.gz,cov)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Promoter fasta file (Promoter fastA file, type = fa,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar bis2prom b=&lt;Bismark_file&gt; p=&lt;Promoter_fasta_file&gt;<br />
<br />
<br />
=== Chromatin pileup ===<br />
<br />
'''Chromatin pileup''' takes as input a BAM file of mapped reads from a DNase-seq or ATAC-seq experiment, <br />
computes a coverage pileup of the 5' ends of the mapped reads, <br />
and outputs a simple tab-separated file with the columns <br />
<code>chromosome, position,</code> and <code>pileup value</code> (number of reads with a 5' end at this position).<br />
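<br />
The idea of a 5' end pileup can be sketched in a few lines of Python. The sketch works on already parsed alignments given as tuples rather than on a BAM file, assumes 1-based inclusive coordinates, and takes the 5' end of a reverse-strand read to be its rightmost mapped position. The function name <code>five_prime_pileup</code> is hypothetical and the code is an illustration only.<br />

```python
from collections import Counter


def five_prime_pileup(alignments):
    """Count read 5' ends per position.

    alignments: iterable of (chromosome, start, end, is_reverse) tuples,
    with 1-based inclusive start/end (an assumption of this sketch).
    """
    pileup = Counter()
    for chrom, start, end, is_reverse in alignments:
        # The 5' end of a reverse-strand read is its rightmost position.
        five_prime = end if is_reverse else start
        pileup[(chrom, five_prime)] += 1
    return sorted((chrom, pos, n) for (chrom, pos), n in pileup.items())
```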
<br />
If you experience problems using '''Chromatin pileup''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Chromatin pileup'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar pileup<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>BAM file (Mapped reads from DNase-seq or ATAC-seq experiment, type = bam)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar pileup b=&lt;BAM_file&gt;<br />
<br />
<br />
=== NormalizePileupOutput ===<br />
<br />
'''NormalizePileupOutput''' normalizes the pileup output file, which contains the coverage by 5' ends of ATAC-seq or DNase-seq reads at each position. The coverage is normalized relative to the mean of a 10000 bp sliding window.<br />
<br />
The input of '''NormalizePileupOutput''' is a pileup output file from the '''Chromatin pileup''' tool.<br />
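<br />
A sliding-window normalization of this kind might be sketched as follows in Python. The sketch assumes a window centred on each position and truncated at the sequence ends, and it divides each value by the window mean. How the tool places the window and treats the ends is not documented here, so these details and the function name <code>normalize_pileup</code> are assumptions.<br />

```python
def normalize_pileup(values, window=10000):
    """Divide each pileup value by the mean of a window centred on it.

    values: per-position pileup counts of one chromosome, in genomic order.
    The window spans window // 2 positions to each side (assumption).
    """
    n = len(values)
    half = window // 2
    prefix = [0]  # prefix sums give each window sum in constant time
    for value in values:
        prefix.append(prefix[-1] + value)
    normalized = []
    for i in range(n):
        lo = max(0, i - half)
        hi = min(n, i + half + 1)
        mean = (prefix[hi] - prefix[lo]) / (hi - lo)
        normalized.append(values[i] / mean if mean > 0 else 0.0)
    return normalized
```

With this definition, a region of uniform coverage yields values of 1, and local enrichment of 5' ends yields values above 1.<br />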
<br />
If you experience problems using '''NormalizePileupOutput''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''NormalizePileupOutput'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar normpileup<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Pileup output file (Pileup output file., type = tsv.gz,tsv,txt)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar normpileup p=&lt;Pileup_output_file&gt;<br />
<br />
<br />
=== PileupConvertToPromoter ===<br />
<br />
'''PileupConvertToPromoter''' converts the pileup output file to promoter coordinates.<br />
<br />
The input of '''PileupConvertToPromoter''' is <br />
1. a normalized pileup output file from the '''NormalizePileupOutput''' tool and <br />
2. the promoter sequences in FastA format with headers like:<br />
<code>> id chromosomeName:start-end:strand</code><br />
e.g.<br />
<code>> Os01g01010.1 Chr1:2602-3102:+</code>.<br />
<br />
If you experience problems using '''PileupConvertToPromoter''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''PileupConvertToPromoter'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar pile2prom<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Normalized pileup output file (Normalized pileup output file., type = tsv.gz,tsv)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Promoter fasta file (Promoter fastA file, type = fa,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar pile2prom n=&lt;Normalized_pileup_output_file&gt; p=&lt;Promoter_fasta_file&gt;<br />
<br />
<br />
=== NarrowPeakConvertToPromoter ===<br />
<br />
'''NarrowPeakConvertToPromoter''' converts a narrowPeak file containing peaks of chromatin accessibility to promoter coordinates.<br />
<br />
The input of '''NarrowPeakConvertToPromoter''' is <br />
1. a narrowPeak file and <br />
2. the promoter sequences in FastA format with headers like:<br />
<code>> id chromosomeName:start-end:strand</code><br />
e.g.<br />
<code>> Os01g01010.1 Chr1:2602-3102:+</code>.<br />
<br />
If you experience problems using '''NarrowPeakConvertToPromoter''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''NarrowPeakConvertToPromoter'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar peak2Prom<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>NarrowPeak file (Peak-calling output in narrowPeak format., type = narrowPeak,narrowPeak.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Promoter fasta file (Promoter fastA file, type = fa,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar peak2Prom n=&lt;NarrowPeak_file&gt; p=&lt;Promoter_fasta_file&gt;<br />
<br />
<br />
=== EpiTALE prediction ===<br />
<br />
'''EpiTALE''' predicts TALE target boxes using a novel model learned from quantitative data based on the RVD sequence of a TALE and optionally considers the methylation state of the target box during prediction, as DNA methylation affects the binding specificity of RVDs. <br />
Additionally, '''EpiTALE''' optionally annotates its predictions with the chromatin accessibility of the predicted target sites, using the output of the '''NormalizePileupOutput''' tool and the results of peak calling on DNase-seq or ATAC-seq data.<br />
<br />
As input, '''EpiTALE''' requires<br />
<br />
1. a set of sequences that are scanned for putative TALE target boxes. These sequences could be promoters of genes but also complete genomic sequences (FastA format). <br />
<br />
2. For computing p-values, EpiTALE additionally needs a background set of sequences, which is by default generated as a sub-sample of the original input data.<br />
<br />
3. The prediction threshold may be defined either by means of a p-value or by an approximate number of expected sites. The latter is also converted to a p-value internally, and in general the defined number of expected sites is not met exactly.<br />
<br />
4. TALEs are specified by a FastA file containing their RVD sequences, where individual RVDs are separated by dashes (-). This is the same format also output by the ''TALE Analysis'' tool of [http://www.jstacs.de/index.php/AnnoTALE AnnoTALE].<br />
<br />
5. It can be specified if both strands or only one of the strands are scanned where, in the former case, a penalty may be assigned to predictions on the reverse strand. While this penalty may be reasonable when scanning promoters, it should usually be set to <code>0</code> in case of genome-wide predictions.<br />
<br />
6. Optionally, '''EpiTALE''' considers methylation during prediction if Bismark output is provided. Using the [https://www.bioinformatics.babraham.ac.uk/projects/bismark/ Bismark methylation extractor] with parameters <code>--bedGraph --CX -p</code>, you can generate a coverage file, which contains the tab-separated columns: <br />
<code>chromosome, start_position, end_position, methylation_percentage, count_methylated, count_unmethylated</code> (file.cov.gz). <br />
You can alternatively use the tool '''Bed2Bismark''', which converts data in BedMethyl format to Bismark format. <br />
<br />
7.<br />
(i) The chromatin accessibility of the input sequences can optionally be provided in narrowPeak format. Such a file may be obtained by mapping ATAC-seq or DNase-seq data to the corresponding genome and then performing peak calling, e.g. with [https://github.com/mahmoudibrahim/JAMM JAMM]. In case of promoter sequences as input, you should run the tool '''NarrowPeakConvertToPromoter''' to convert the narrowPeak file to promoter positions. <br />
(ii) Additionally, you can calculate a coverage pileup of 5' ends of mapped reads with '''Chromatin pileup''' and normalize it with '''NormalizePileupOutput'''. In case of promoter sequences as input, you should run the tool '''PileupConvertToPromoter''' to convert to promoter coordinates. <br />
<br />
8.<br />
(i) In case of '''genomic search''', the parameter ''calculate coverage area'' should be set to <code>surround target site</code>, and you can set the number of positions before the target site with <code>coverage before value</code> (default: 300) and the number of positions after the target site with <code>coverage after value</code> (default: 200). <br />
(ii) In case of '''promoter search''', the parameter ''calculate coverage area'' may be set to <code>on complete sequence</code> or <code>surround target site</code>. The number of positions before and after the binding site in the peak profile can be set by <code>Peak before value</code> (default: 300) and <code>Peak after value</code> (default: 50).<br />
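<br />
The region implied by the "before" and "after" values in item 8 can be illustrated with a small Python sketch. It assumes that "before" means upstream of the target site relative to its strand, that coordinates are 1-based and inclusive, and that the region is clipped to the sequence. None of this is confirmed by the tool's documentation, and the function name <code>coverage_window</code> is hypothetical.<br />

```python
def coverage_window(site_start, site_end, strand, before=300, after=200, seq_len=None):
    """Return (first, last) position of the region around a target site.

    Assumption: 'before' extends upstream and 'after' downstream of the
    site with respect to the strand of the predicted target box.
    """
    if strand == "+":
        first, last = site_start - before, site_end + after
    else:
        first, last = site_start - after, site_end + before
    first = max(1, first)  # clip at the start of the sequence
    if seq_len is not None:
        last = min(seq_len, last)  # clip at the end of the sequence
    return first, last
```

The defaults in the signature mirror the documented defaults of <code>coverage before value</code> and <code>coverage after value</code>.<br />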
<br />
In case of '''genomic search''', you can filter predictions of TALE target boxes by the presence of differentially expressed regions in a defined vicinity around a predicted target box with the tool '''DerTALE''' of the [http://www.jstacs.de/index.php/AnnoTALE AnnoTALE suite].<br />
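<br />
The TALE input format described in item 4 above (FastA with RVDs separated by dashes) can be read with a few lines of Python. This is a minimal sketch for inspecting such files, not part of EpiTALE. The function name <code>read_tale_fasta</code> and the TALE name used in the usage note are hypothetical.<br />

```python
def read_tale_fasta(lines):
    """Read RVD sequences from FastA lines; returns {name: [RVD, ...]}."""
    tales = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            tales[name] = []
        elif name is not None:
            # RVDs are separated by dashes; ignore empty pieces from trailing dashes.
            tales[name].extend(rvd for rvd in line.split("-") if rvd)
    return tales
```

For instance, a record with the header <code>>TalExample</code> and the sequence <code>NI-HD-NG-NN-N*</code> is read as a list of five RVDs.<br />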
<br />
If you experience problems using '''EpiTALE''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
''EpiTALE prediction'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar epitale<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Sequences (The sequences (e.g., a genome) to scan for binding sites, type = fa,fas,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Background sample (The sequences for determining the prediction threshold. Either a sub-sample of the input sequences or a dedicated background data set., range={sub-sample, background sequences}, default = sub-sample)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;sub-sample&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;background sequences&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bs</font></td><br />
<td>Background sequences (The sequences (e.g., a genome) for determining the prediction threshold, type = fa,fas,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>Threshold specification (The way of defining the prediction threshold. Either by explicitly defining a significance level or by specifying the number of expected sites, range={significance level, number of sites}, default = significance level)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;significance level&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sl</font></td><br />
<td>Significance level (The significance level for determining the prediction threshold, valid range = [0.0, 0.01], default = 1.0E-4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;number of sites&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Number of sites (The number of expected binding sites for determining the prediction threshold, valid range = [1, 1000000], default = 10000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">TALEs</font></td><br />
<td>TALEs (The RVD sequences of the TALE, separated by dashes, in FastA format, type = fasta,fas,fa)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Strand</font></td><br />
<td>Strand (Prediction target sites on both strands, or the forward or reverse strand, range={both strands, forward strand, reverse strand}, default = both strands)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;both strands&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>Reverse penalty (Penalty for predictions on the reverse strand, valid range = [0.0, 1.7976931348623157E308], default = 0.01)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;forward strand&quot;</b></td></tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;reverse strand&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bf</font></td><br />
<td>Bismark file (The bedGraph output of bismark (file.cov.gz) containing <chromosome> <start position> <end position> <methylation percentage> <count methylated> <count unmethylated>, type = cov,cov.gz, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">nf</font></td><br />
<td>NarrowPeak file (The output of a peak caller (all.peaks.narrowPeak), type = narrowPeak,narrowPeak.gz, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">npo</font></td><br />
<td>Normalized pileup output (The normalized output of pileup with values larger than zero (file.txt) containing <chromosome> <position> <coverage>, type = tsv,tsv.gz, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>Calculate coverage area (Calculate coverage area surround target site, or on complete sequence, range={surround target site, on complete sequence}, default = surround target site, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;surround target site&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">cbv</font></td><br />
<td>Coverage before value (Number of positions before target site in coverage profile, valid range = [1, 500], default = 300, OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">cav</font></td><br />
<td>Coverage after value (Number of positions after target site in coverage profile, valid range = [1, 500], default = 200, OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;on complete sequence&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Peak before value (Number of positions before target site in peak profile, valid range = [1, 500], default = 300, OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pav</font></td><br />
<td>Peak after value (Number of positions after target site in peak profile, valid range = [1, 500], default = 50, OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar epitale s=&lt;Sequences&gt; TALEs=&lt;TALEs&gt;</div>Grauhttps://www.jstacs.de/index.php?title=EpiTALE&diff=1138EpiTALE2021-05-10T22:47:44Z<p>Grau: /* Command line version */</p>
<hr />
<div>[[File:EpiTALE_256.png|130px|left]] EpiTALE predicts binding sites of transcription activator-like effectors (TALEs) in promoteromes or genomes. EpiTALE not only considers the DNA sequence of putative binding sites but also epigenetic determinants of TALE binding, namely DNA methylation and chromatin accessibility. The prediction is based on the same basic model as [[PrediTALE]] but with specific parameters for methylated cytosines reflecting the binding preferences of RVDs.<br />
<br />
Here, we provide a suite of tools including the EpiTALE program itself but also auxiliary tools for converting methylation data and chromatin accessibility data to the required formats, and for converting genomic coordinates to promoter-wise coordinates for promoterome-wide predictions.<br />
<br />
Genome-wide predictions of EpiTALE may further be combined with evidence from RNA-seq data using the DerTALE tool of [[AnnoTALE]].<br />
<br />
The EpiTALE suite is provided both in a version with a graphical user interface and in a command line version to serve the needs of different user groups. Both versions use the identical code base.<br />
<br />
In the following, we describe how to obtain the EpiTALE suite and how to use its individual tools. While parameters are described in terms of command line arguments, the same parameters are available in the version with graphical user interface.<br />
<br />
== Download ==<br />
<br />
=== GUI version ===<br />
<br />
* [http://www.jstacs.de/downloads/EpiTALE-0.1.jar Runnable Jar]: requires Java >= 8 with JavaFX installed; may be run under Linux, Windows and macOS.<br />
* [http://www.jstacs.de/downloads/EpiTALE-0.1.app.zip macOS app]: ZIP archive containing a macOS app including EpiTALE and all required Java modules. For running this app, it might be required to explicitly give it running permissions in "System Preferences" -> "Security & Privacy" -> "General", which should list EpiTALE after the first (possibly unsuccessful) starting attempt.<br />
* [http://www.jstacs.de/downloads/EpiTALE-0.1-win.zip Windows program]: ZIP archive containing the EpiTALE Jar, all required Java modules, and a Windows batch file. For starting EpiTALE, double-click EpiTALE.bat.<br />
<br />
=== Command line version ===<br />
<br />
* [http://www.jstacs.de/downloads/EpiTALEcli-0.1.jar Runnable Jar]: requires Java >= 8, may be run under Linux, Windows and macOS. May be started with<br />
java -jar EpiTALEcli-0.1.jar<br />
from the command line (for tools and arguments, see below).<br />
<br />
== Tools ==<br />
<br />
=== Bed2Bismark ===<br />
<br />
'''Bed2Bismark''' converts methylation information in bedMethyl format to Bismark format.<br />
<br />
The input of '''Bed2Bismark''' is a file in bedMethyl format.<br />
<br />
If you experience problems using '''Bed2Bismark''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Bed2Bismark'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar bed2bismark<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>BedMethyl file (Methylation information in bedMethyl format, type = bed.gz,bed)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar bed2bismark b=&lt;BedMethyl_file&gt;<br />
<br />
<br />
=== BismarkMerge2Files ===<br />
<br />
'''BismarkMerge2Files''' merges files generated by [https://www.bioinformatics.babraham.ac.uk/projects/bismark/ Bismark methylation extractor] with parameters <code>--bedGraph --CX -p</code>.<br />
The output is a coverage file with the tab-separated columns:<br />
<code>chromosome, start_position, end_position, methylation_percentage, count_methylated, count_unmethylated</code>.<br />
<br />
The input of '''BismarkMerge2Files''' consists of two Bismark coverage files.<br />
<br />
If you experience problems using '''BismarkMerge2Files''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
<br />
''BismarkMerge2Files'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar bismerger<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bismark file 1 (Methylation information in bismark format file 1, type = cov.gz,cov)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bf2</font></td><br />
<td>Bismark file 2 (Methylation information in bismark format file 2, type = cov.gz,cov)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar bismerger b=&lt;Bismark_file_1&gt; bf2=&lt;Bismark_file_2&gt;<br />
<br />
<br />
=== BismarkConvertToPromoter ===<br />
<br />
'''BismarkConvertToPromoter''' converts the Bismark output file to promoter coordinates.<br />
<br />
The input of '''BismarkConvertToPromoter''' is <br />
1. a Bismark coverage output file, which contains tab-separated columns: <br />
<code>chromosome, start_position, end_position, methylation_percentage, count_methylated, count_unmethylated</code> and <br />
2. the promoter sequences in FastA format with headers like:<br />
<code>> id chromosomeName:start-end:strand</code><br />
e.g.<br />
<code>> Os01g01010.1 Chr1:2602-3102:+</code>.<br />
<br />
If you experience problems using '''BismarkConvertToPromoter''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''BismarkConvertToPromoter'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar bis2prom<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bismark file (Methylation information in bismark format, type = cov.gz,cov)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Promoter fasta file (Promoter fastA file, type = fa,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar bis2prom b=&lt;Bismark_file&gt; p=&lt;Promoter_fasta_file&gt;<br />
<br />
<br />
=== Chromatin pileup ===<br />
<br />
'''Chromatin pileup''' takes as input a BAM file of mapped reads from a DNase-seq or ATAC-seq experiment, <br />
computes a coverage pileup of the 5' ends of the mapped reads, <br />
and outputs a simple tab-separated file with the columns <br />
<code>chromosome, position,</code> and <code>pileup value</code> (number of reads with a 5' end at this position).<br />
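<br />
The idea of a 5' end pileup can be illustrated without reading a BAM file. The sketch below is an illustration under stated assumptions, not the code of '''Chromatin pileup''': reads are given as hypothetical tuples <code>(chromosome, start, end, strand)</code> with inclusive alignment coordinates, and the 5' end of a reverse-strand read is taken to be its alignment end.<br />

```python
from collections import Counter


def five_prime_pileup(reads):
    """Count, per position, the reads whose 5' end maps to that position.

    reads: iterable of (chrom, start, end, strand) with inclusive coordinates.
    Assumption: the 5' end is the alignment start for '+' reads and the
    alignment end for '-' reads.
    """
    counts = Counter()
    for chrom, start, end, strand in reads:
        five_prime = start if strand == "+" else end
        counts[(chrom, five_prime)] += 1
    return counts


def format_pileup(counts):
    """Render the counts as tab-separated lines: chromosome, position, pileup value."""
    return [f"{chrom}\t{pos}\t{n}" for (chrom, pos), n in sorted(counts.items())]
```

Only positions with at least one 5' end appear in the result, which keeps the output compact for sparse data.<br />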
<br />
If you experience problems using '''Chromatin pileup''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Chromatin pileup'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar pileup<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>BAM file (Mapped reads from DNase-seq or ATAC-seq experiment, type = bam)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar pileup b=&lt;BAM_file&gt;<br />
<br />
<br />
=== NormalizePileupOutput ===<br />
<br />
'''NormalizePileupOutput''' normalizes the pileup output file, which contains the coverage by 5' ends of ATAC-seq or DNase-seq reads at each position. The coverage is normalized relative to the mean of a sliding window of 10000 bp.<br />
<br />
The input of '''NormalizePileupOutput''' is a pileup output file from the '''Chromatin pileup''' tool.<br />
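<br />
Normalization relative to a sliding-window mean can be sketched as follows. This is one plausible reading of the description, not the tool's actual code: the window placement around each position, the inclusion of zero-coverage positions in the mean, and the handling of sequence ends are all assumptions, and <code>normalize_pileup</code> is a hypothetical name.<br />

```python
from itertools import accumulate


def normalize_pileup(values, window=10000):
    """Divide each pileup value by the mean of the window around its position.

    values: dense list of pileup values for one chromosome (zeros included).
    Assumption: the window starts window // 2 positions before the current
    position and is clipped at the sequence ends.
    """
    prefix = [0] + list(accumulate(values))  # prefix sums give O(1) window sums
    half = window // 2
    normalized = []
    for i, value in enumerate(values):
        lo = max(0, i - half)
        hi = min(len(values), i - half + window)  # exclusive upper bound
        mean = (prefix[hi] - prefix[lo]) / (hi - lo)
        normalized.append(value / mean if mean > 0 else 0.0)
    return normalized
```

A value above 1 then marks a position with more 5' ends than the local average, which makes profiles comparable between regions of different sequencing depth.<br />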
<br />
If you experience problems using '''NormalizePileupOutput''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''NormalizePileupOutput'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar normpileup<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Pileup output file (Pileup output file., type = tsv.gz,tsv,txt)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar normpileup p=&lt;Pileup_output_file&gt;<br />
<br />
<br />
=== PileupConvertToPromoter ===<br />
<br />
'''PileupConvertToPromoter''' converts the pileup output file to promoter coordinates.<br />
<br />
The input of '''PileupConvertToPromoter''' is <br />
1. a normalized pileup output file from the '''NormalizePileupOutput''' tool and <br />
2. the promoter sequences in FastA format with headers like:<br />
<code>> id chromosomeName:start-end:strand</code><br />
e.g.<br />
<code>> Os01g01010.1 Chr1:2602-3102:+</code>.<br />
<br />
If you experience problems using '''PileupConvertToPromoter''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''PileupConvertToPromoter'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar pile2prom<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Normalized pileup output file (Normalized pileup output file., type = tsv.gz,tsv)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Promoter fasta file (Promoter fastA file, type = fa,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar pile2prom n=&lt;Normalized_pileup_output_file&gt; p=&lt;Promoter_fasta_file&gt;<br />
<br />
<br />
=== NarrowPeakConvertToPromoter ===<br />
<br />
'''NarrowPeakConvertToPromoter''' converts a narrowPeak file containing peaks of chromatin accessibility to promoter coordinates.<br />
<br />
The input of '''NarrowPeakConvertToPromoter''' is <br />
1. a narrowPeak file and <br />
2. the promoter sequences in FastA format with headers like:<br />
<code>> id chromosomeName:start-end:strand</code><br />
e.g.<br />
<code>> Os01g01010.1 Chr1:2602-3102:+</code>.<br />
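<br />
Converting a peak to promoter coordinates amounts to intersecting two intervals and shifting the result. The sketch below illustrates this under assumptions that are not taken from the tool: narrowPeak coordinates are treated as BED-style (0-based start, exclusive end), promoter coordinates as 1-based and inclusive, only the forward-strand case is handled, and <code>peak_to_promoter</code> is a hypothetical helper.<br />

```python
def peak_to_promoter(peak_line, promoter_chrom, promoter_start, promoter_end):
    """Clip one narrowPeak record to a promoter and return promoter-relative offsets.

    Returns (first, last) as 0-based inclusive offsets from the promoter start,
    or None if the peak does not overlap the promoter.
    """
    fields = peak_line.rstrip("\n").split("\t")
    chrom, peak_start, peak_end = fields[0], int(fields[1]), int(fields[2])
    if chrom != promoter_chrom:
        return None
    # Convert the BED-style interval to 1-based inclusive coordinates.
    first, last = peak_start + 1, peak_end
    # Keep only the part of the peak that lies inside the promoter.
    first = max(first, promoter_start)
    last = min(last, promoter_end)
    if first > last:
        return None
    return first - promoter_start, last - promoter_start
```

Peaks that only partially overlap a promoter are truncated rather than discarded in this sketch; whether the tool does the same is not stated here.<br />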
<br />
If you experience problems using '''NarrowPeakConvertToPromoter''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''NarrowPeakConvertToPromoter'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar peak2Prom<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>NarrowPeak file (Peak-calling output in narrowPeak format., type = narrowPeak,narrowPeak.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Promoter fasta file (Promoter fastA file, type = fa,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar peak2Prom n=&lt;NarrowPeak_file&gt; p=&lt;Promoter_fasta_file&gt;<br />
<br />
<br />
=== EpiTALE prediction ===<br />
<br />
'''EpiTALE''' predicts TALE target boxes using a novel model learned from quantitative data based on the RVD sequence of a TALE and optionally considers the methylation state of the target box during prediction, as DNA methylation affects the binding specificity of RVDs. <br />
Additionally, '''EpiTALE''' optionally annotates its predicted target sites with chromatin accessibility, using the output of the '''NormalizePileupOutput''' tool and the results of peak calling on DNase-seq or ATAC-seq data.<br />
<br />
As input, '''EpiTALE''' requires<br />
<br />
1. a set of sequences that are scanned for putative TALE target boxes. These sequences could be promoters of genes but also complete genomic sequences (FastA format). <br />
<br />
2. For computing p-values, EpiTALE additionally needs a background set of sequences, which is by default generated as a sub-sample of the original input data.<br />
<br />
3. The prediction threshold may be defined either by means of a p-value or by an approximate number of expected sites. The latter is also converted to a p-value internally, so the defined number of expected sites is, in general, not met exactly.<br />
<br />
4. TALEs are specified by a FastA file containing their RVD sequences, where individual RVDs are separated by dashes (-). This is the same format also output by the ''TALE Analysis'' tool of [http://www.jstacs.de/index.php/AnnoTALE AnnoTALE].<br />
<br />
5. It can be specified if both strands or only one of the strands are scanned where, in the former case, a penalty may be assigned to predictions on the reverse strand. While this penalty may be reasonable when scanning promoters, it should usually be set to <code>0</code> in case of genome-wide predictions.<br />
<br />
6. Optionally, '''EpiTALE''' considers methylation during prediction if Bismark output is provided. Running the [https://www.bioinformatics.babraham.ac.uk/projects/bismark/ Bismark methylation extractor] with parameters <code>--bedGraph --CX -p</code> generates a coverage file, which contains the tab-separated columns: <br />
<code>chromosome, start_position, end_position, methylation_percentage, count_methylated, count_unmethylated</code> (file.cov.gz). <br />
You can alternatively use the tool '''Bed2Bismark''', which converts data in BedMethyl format to Bismark format. <br />
<br />
7.<br />
(i) The chromatin accessibility of the input sequences can optionally be provided in narrowPeak format. Such a file may be obtained by mapping ATAC-seq or DNase-seq data to the corresponding genome and then performing peak calling, e.g. with [https://github.com/mahmoudibrahim/JAMM JAMM]. In case of promoter sequences as input, you should run the tool '''NarrowPeakConvertToPromoter''' to convert the narrowPeak file to promoter positions. <br />
(ii) Additionally, you can calculate a coverage pileup of 5' ends of mapped reads with '''Chromatin pileup''' and normalize it with '''NormalizePileupOutput'''. In case of promoter sequences as input, you should run the tool '''PileupConvertToPromoter''' to convert to promoter coordinates. <br />
<br />
8.<br />
(i) In case of '''genomic search''', the parameter ''calculate coverage area'' should be <code>surround target site</code>. You can set the number of positions before the target site with <code>coverage before value</code> (default: 300) and the number of positions after the target site with <code>coverage after value</code> (default: 200). <br />
(ii) In case of '''promoter search''', the parameter ''calculate coverage area'' may be set to <code>on complete sequence</code> or <code>surround target site</code>. The number of positions before and after the binding site in the peak profile can be set by <code>Peak before value</code> (default: 300) and <code>Peak after value</code> (default: 50).<br />
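<br />
The TALE input described in item 4 is a FastA file whose sequence lines hold RVDs separated by dashes. A minimal reader might look like the sketch below; it is an illustration rather than the parser used by '''EpiTALE''', the function name <code>read_rvd_fasta</code> is hypothetical, and joining RVDs that continue over several lines is an assumption.<br />

```python
def read_rvd_fasta(lines):
    """Read TALE names and their RVD lists from FastA-formatted lines.

    Sequence lines contain RVDs separated by dashes, e.g. "NI-HD-NG-NN".
    Assumption: a sequence may continue over several lines.
    """
    tales = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            tales[name] = []
        elif name is None:
            raise ValueError("sequence line before first FastA header")
        else:
            # Drop empty items caused by a trailing dash at a line break.
            tales[name].extend(rvd for rvd in line.split("-") if rvd)
    return tales
```

For a file with the header <code>>exampleTALE</code> (a made-up name) and the line <code>NI-HD-NG-NN</code>, the function returns one entry with four RVDs, so the number of repeats of each TALE can be checked before starting a prediction run.<br />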
<br />
In case of '''genomic search''', you can filter predictions of TALE target boxes by the presence of differentially expressed regions in a defined vicinity around a predicted target box, using the tool '''DerTALE''' of the [http://www.jstacs.de/index.php/AnnoTALE AnnoTALE suite].<br />
<br />
If you experience problems using '''EpiTALE''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
''EpiTALE prediction'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar epitale<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Sequences (The sequences (e.g., a genome) to scan for binding sites, type = fa,fas,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Background sample (The sequences for determining the prediction threshold. Either a sub-sample of the input sequences or a dedicated background data set., range={sub-sample, background sequences}, default = sub-sample)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;sub-sample&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;background sequences&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bs</font></td><br />
<td>Background sequences (The sequences (e.g., a genome) for determining the prediction threshold, type = fa,fas,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>Threshold specification (The way of defining the prediction threshold. Either by explicitly defining a significance level or by specifying the number of expected sites, range={significance level, number of sites}, default = significance level)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;significance level&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sl</font></td><br />
<td>Significance level (The significance level for determining the prediction threshold, valid range = [0.0, 0.01], default = 1.0E-4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;number of sites&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Number of sites (The number of expected binding sites for determining the prediction threshold, valid range = [1, 1000000], default = 10000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">TALEs</font></td><br />
<td>TALEs (The RVD sequences of the TALE, separated by dashes, in FastA format, type = fasta,fas,fa)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Strand</font></td><br />
<td>Strand (Prediction target sites on both strands, or the forward or reverse strand, range={both strands, forward strand, reverse strand}, default = both strands)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;both strands&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>Reverse penalty (Penalty for predictions on the reverse strand, valid range = [0.0, 1.7976931348623157E308], default = 0.01)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;forward strand&quot;</b></td></tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;reverse strand&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bf</font></td><br />
<td>Bismark file (The bedGraph output of Bismark (file.cov.gz) containing &lt;chromosome&gt; &lt;start position&gt; &lt;end position&gt; &lt;methylation percentage&gt; &lt;count methylated&gt; &lt;count unmethylated&gt;, type = cov,cov.gz, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">nf</font></td><br />
<td>NarrowPeak file (The output of a peak caller (all.peaks.narrowPeak), type = narrowPeak,narrowPeak.gz, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">npo</font></td><br />
<td>Normalized pileup output (The normalized output of pileup with values larger than zero (file.txt) containing &lt;chromosome&gt; &lt;position&gt; &lt;coverage&gt;, type = tsv,tsv.gz, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>Calculate coverage area (Calculate coverage area surround target site, or on complete sequence, range={surround target site, on complete sequence}, default = surround target site, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;surround target site&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">cbv</font></td><br />
<td>Coverage before value (Number of positions before target site in coverage profile, valid range = [1, 500], default = 300, OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">cav</font></td><br />
<td>Coverage after value (Number of positions after target site in coverage profile, valid range = [1, 500], default = 200, OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;on complete sequence&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Peak before value (Number of positions before target site in peak profile, valid range = [1, 500], default = 300, OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pav</font></td><br />
<td>Peak after value (Number of positions after target site in peak profile, valid range = [1, 500], default = 50, OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar epitale s=&lt;Sequences&gt; TALEs=&lt;TALEs&gt;</div>Grauhttps://www.jstacs.de/index.php?title=EpiTALE&diff=1137EpiTALE2021-05-10T22:47:29Z<p>Grau: </p>
<hr />
<div>[[File:EpiTALE_256.png|130px|left]] EpiTALE predicts binding sites of transcription activator-like effectors (TALEs) in promoteromes or genomes. EpiTALE not only considers the DNA sequence of putative binding sites but also epigenetic determinants of TALE binding, namely DNA methylation and chromatin accessibility. The prediction is based on the same basic model as [[PrediTALE]] but with specific parameters for methylated cytosines reflecting the binding preferences of RVDs.<br />
<br />
Here, we provide a suite of tools including the EpiTALE program itself but also auxiliary tools for converting methylation data and chromatin accessibility data to the required formats, and for converting genomic coordinates to promoter-wise coordinates for promoterome-wide predictions.<br />
<br />
Genome-wide predictions of EpiTALE may further be combined with evidence from RNA-seq data using the DerTALE tool of [[AnnoTALE]].<br />
<br />
The EpiTALE suite is provided in a version with a graphical user interface and in a command line version to serve the needs of different user groups. Both versions use an identical code base.<br />
<br />
In the following, we describe how to obtain the EpiTALE suite and how to use its individual tools. While parameters are described in terms of command line arguments, the same parameters are available in the version with graphical user interface.<br />
<br />
== Download ==<br />
<br />
=== GUI version ===<br />
<br />
* [http://www.jstacs.de/downloads/EpiTALE-0.1.jar Runnable Jar]: requires Java >= 8 including JavaFX installed, may be run under Linux, Windows and macOS.<br />
* [http://www.jstacs.de/downloads/EpiTALE-0.1.app.zip macOS app]: ZIP archive containing a macOS app including EpiTALE and all required Java modules. For running this app, it might be required to explicitly give it running permissions in "System Preferences" -> "Security & Privacy" -> "General", which should list EpiTALE after the first (possibly unsuccessful) starting attempt.<br />
* [http://www.jstacs.de/downloads/EpiTALE-0.1-win.zip Windows program]: ZIP archive containing the EpiTALE Jar, all required Java modules, and a Windows batch file. For starting EpiTALE, double-click EpiTALE.bat.<br />
<br />
=== Command line version ===<br />
<br />
* [http://www.jstacs.de/downloads/EpiTALEcli-0.1.jar Runnable Jar]: requires Java >= 8 including JavaFX installed, may be run under Linux, Windows and macOS. May be started with<br />
java -jar EpiTALEcli-0.1.jar<br />
from the command line (for tools and arguments, see below).<br />
<br />
== Tools ==<br />
<br />
=== Bed2Bismark ===<br />
<br />
'''Bed2Bismark''' converts methylation information in bedMethyl format to Bismark format.<br />
<br />
The input of '''Bed2Bismark''' is a file in bedMethyl format.<br />
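<br />
The conversion can be pictured as a per-line mapping between the two formats. The following sketch is not the code of '''Bed2Bismark''': it assumes the ENCODE bedMethyl layout (column 10 holds the read coverage, column 11 the methylation percentage), assumes that the Bismark coverage format uses 1-based positions, and derives the methylated count by rounding. The function name <code>bedmethyl_to_bismark</code> is hypothetical.<br />

```python
def bedmethyl_to_bismark(line):
    """Convert one bedMethyl record to a Bismark-style coverage line.

    Assumptions: ENCODE bedMethyl columns (10 = coverage, 11 = percent
    methylated), 0-based BED start, 1-based Bismark coverage positions.
    """
    fields = line.split()
    chrom = fields[0]
    start0 = int(fields[1])
    end = int(fields[2])
    coverage = int(fields[9])
    percent = float(fields[10])
    # Recover the counts from coverage and percentage (rounded to whole reads).
    methylated = round(coverage * percent / 100.0)
    unmethylated = coverage - methylated
    return f"{chrom}\t{start0 + 1}\t{end}\t{percent:g}\t{methylated}\t{unmethylated}"
```

Because bedMethyl stores only coverage and percentage, the two read counts are reconstructed and may differ from the original counts by rounding.<br />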
<br />
If you experience problems using '''Bed2Bismark''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Bed2Bismark'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar bed2bismark<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>BedMethyl file (Methylation information in bedMethyl format, type = bed.gz,bed)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar bed2bismark b=&lt;BedMethyl_file&gt;<br />
<br />
<br />
=== BismarkMerge2Files ===<br />
<br />
'''BismarkMerge2Files''' merges files generated by the [https://www.bioinformatics.babraham.ac.uk/projects/bismark/ Bismark methylation extractor] with parameters <code>--bedGraph --CX -p</code>.<br />
The output is a coverage file containing the tab-separated columns:<br />
<code>chromosome, start_position, end_position, methylation_percentage, count_methylated, count_unmethylated</code>.<br />
<br />
The input of '''BismarkMerge2Files''' consists of two Bismark coverage files.<br />
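<br />
How the two files are combined is not specified here. One plausible reading, shown in the sketch below purely as an assumption, is to add the methylated and unmethylated counts per position and to recompute the percentage from the summed counts; <code>merge_coverage</code> is a hypothetical name and not part of the EpiTALE suite.<br />

```python
def merge_coverage(lines_a, lines_b):
    """Merge two sets of Bismark coverage lines by summing counts per position.

    Each line holds: chromosome, start, end, percentage, count_methylated,
    count_unmethylated (tab-separated). The percentage is recomputed.
    """
    counts = {}
    for line in list(lines_a) + list(lines_b):
        chrom, start, end, _percent, meth, unmeth = line.split("\t")
        key = (chrom, int(start), int(end))
        m, u = counts.get(key, (0, 0))
        counts[key] = (m + int(meth), u + int(unmeth))
    merged = []
    for (chrom, start, end), (m, u) in sorted(counts.items()):
        percent = 100.0 * m / (m + u) if m + u else 0.0
        merged.append(f"{chrom}\t{start}\t{end}\t{percent:g}\t{m}\t{u}")
    return merged
```

Summing counts rather than averaging percentages weights each position by its read depth in both files.<br />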
<br />
If you experience problems using '''BismarkMerge2Files''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
<br />
''BismarkMerge2Files'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar bismerger<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bismark file 1 (Methylation information in Bismark format file 1, type = cov.gz,cov)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bf2</font></td><br />
<td>Bismark file 2 (Methylation information in Bismark format file 2, type = cov.gz,cov)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar bismerger b=&lt;Bismark_file_1&gt; bf2=&lt;Bismark_file_2&gt;<br />
<br />
<br />
=== BismarkConvertToPromoter ===<br />
<br />
'''BismarkConvertToPromoter''' converts the Bismark output file to promoter coordinates.<br />
<br />
The input of '''BismarkConvertToPromoter''' is <br />
1. a Bismark coverage output file, which contains tab-separated columns: <br />
<code>chromosome, start_position, end_position, methylation_percentage, count_methylated, count_unmethylated</code> and <br />
2. the promoter sequences in FastA format with headers like:<br />
<code>> id chromosomeName:start-end:strand</code><br />
e.g.<br />
<code>> Os01g01010.1 Chr1:2602-3102:+</code>.<br />
<br />
If you experience problems using '''BismarkConvertToPromoter''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''BismarkConvertToPromoter'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar bis2prom<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bismark file (Methylation information in Bismark format, type = cov.gz,cov)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Promoter fasta file (Promoter fastA file, type = fa,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar bis2prom b=&lt;Bismark_file&gt; p=&lt;Promoter_fasta_file&gt;<br />
<br />
<br />
=== Chromatin pileup ===<br />
<br />
'''Chromatin pileup''' takes as input a BAM file of mapped reads from a DNase-seq or ATAC-seq experiment, <br />
computes a coverage pileup of the 5' ends of the mapped reads, <br />
and outputs a simple tab-separated file with the columns <br />
<code>chromosome, position,</code> and <code>pileup value</code> (number of reads with a 5' end at this position).<br />
<br />
If you experience problems using '''Chromatin pileup''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Chromatin pileup'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar pileup<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>BAM file (Mapped reads from DNase-seq or ATAC-seq experiment, type = bam)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar pileup b=&lt;BAM_file&gt;<br />
<br />
<br />
=== NormalizePileupOutput ===<br />
<br />
'''NormalizePileupOutput''' normalizes the pileup output file, which contains the coverage by 5' ends of ATAC-seq or DNase-seq reads at each position. The coverage is normalized relative to the mean of a sliding window of 10000 bp.<br />
<br />
The input of '''NormalizePileupOutput''' is a pileup output file from the '''Chromatin pileup''' tool.<br />
<br />
If you experience problems using '''NormalizePileupOutput''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''NormalizePileupOutput'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar normpileup<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Pileup output file (Pileup output file., type = tsv.gz,tsv,txt)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar normpileup p=&lt;Pileup_output_file&gt;<br />
<br />
<br />
=== PileupConvertToPromoter ===<br />
<br />
'''PileupConvertToPromoter''' converts the pileup output file to promoter coordinates.<br />
<br />
The input of '''PileupConvertToPromoter''' is <br />
1. a normalized pileup output file from the '''NormalizePileupOutput''' tool and <br />
2. the promoter sequences in FastA format with headers like:<br />
<code>> id chromosomeName:start-end:strand</code><br />
e.g.<br />
<code>> Os01g01010.1 Chr1:2602-3102:+</code>.<br />
<br />
If you experience problems using '''PileupConvertToPromoter''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''PileupConvertToPromoter'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar pile2prom<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Normalized pileup output file (Normalized pileup output file., type = tsv.gz,tsv)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Promoter fasta file (Promoter fastA file, type = fa,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar pile2prom n=&lt;Normalized_pileup_output_file&gt; p=&lt;Promoter_fasta_file&gt;<br />
<br />
<br />
=== NarrowPeakConvertToPromoter ===<br />
<br />
'''NarrowPeakConvertToPromoter''' converts a narrowPeak file containing peaks of chromatin accessibility to promoter coordinates.<br />
<br />
The input of '''NarrowPeakConvertToPromoter''' is <br />
1. a narrowPeak file and <br />
2. the promoter sequences in FastA format with headers like:<br />
<code>> id chromosomeName:start-end:strand</code><br />
e.g.<br />
<code>> Os01g01010.1 Chr1:2602-3102:+</code>.<br />
<br />
If you experience problems using '''NarrowPeakConvertToPromoter''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''NarrowPeakConvertToPromoter'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar peak2Prom<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>NarrowPeak file (Peak-calling output in narrowPeak format., type = narrowPeak,narrowPeak.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Promoter fasta file (Promoter fastA file, type = fa,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar peak2Prom n=&lt;NarrowPeak_file&gt; p=&lt;Promoter_fasta_file&gt;<br />
<br />
<br />
=== EpiTALE prediction ===<br />
<br />
'''EpiTALE''' predicts TALE target boxes using a novel model learned from quantitative data based on the RVD sequence of a TALE and optionally considers the methylation state of the target box during prediction, as DNA methylation affects the binding specificity of RVDs. <br />
Additionally, EpiTALE can optionally annotate its predictions with the chromatin accessibility of the predicted target sites, using the output of the '''NormalizePileupOutput''' tool and the results of peak calling on DNase-seq or ATAC-seq data.<br />
<br />
As input, '''EpiTALE''' requires<br />
<br />
1. a set of sequences that are scanned for putative TALE target boxes. These sequences could be promoters of genes but also complete genomic sequences (FastA format). <br />
<br />
2. For computing p-values, EpiTALE additionally needs a background set of sequences, which is by default generated as a sub-sample of the original input data.<br />
<br />
3. The prediction threshold may be defined either by means of a p-value or by an approximate number of expected sites. The latter is also converted to a p-value internally, so the defined number of expected sites is, in general, not met exactly.<br />
<br />
4. TALEs are specified by a FastA file containing their RVD sequences, where individual RVDs are separated by dashes (-). This is the same format also output by the ''TALE Analysis'' tool of [http://www.jstacs.de/index.php/AnnoTALE AnnoTALE].<br />
<br />
5. It can be specified if both strands or only one of the strands are scanned where, in the former case, a penalty may be assigned to predictions on the reverse strand. While this penalty may be reasonable when scanning promoters, it should usually be set to <code>0</code> in case of genome-wide predictions.<br />
<br />
6. As optional input, '''EpiTALE''' considers methylation during prediction if Bismark output is provided. Using the [https://www.bioinformatics.babraham.ac.uk/projects/bismark/ Bismark methylation extractor] with parameters <code>--bedGraph --CX -p</code>, you can generate a coverage file, which contains the tab-separated columns: <br />
<code>chromosome, start_position, end_position, methylation_percentage, count_methylated, count_unmethylated</code> (file.cov.gz). <br />
You can alternatively use the tool '''Bed2Bismark''', which converts data in BedMethyl format to Bismark format. <br />
<br />
7.<br />
(i) The chromatin accessibility of the input sequences can optionally be provided in narrowPeak format. Such a file can be obtained by mapping ATAC-seq or DNase-seq data to the corresponding genome and then performing peak calling, e.g., with [https://github.com/mahmoudibrahim/JAMM JAMM]. In case of promoter sequences as input, you should run the tool '''NarrowPeakConvertToPromoter''' to convert the narrowPeak file to promoter positions. <br />
(ii) Additionally, you can calculate a coverage pileup of 5' ends of mapped reads with '''Chromatin pileup''' and normalize it with '''NormalizePileupOutput'''. In case of promoter sequences as input, you should run the tool '''PileupConvertToPromoter''' to convert to promoter coordinates. <br />
<br />
8.<br />
(i) In case of '''genomic search''' the parameter ''calculate coverage area'' should be <code>surround target site</code> and you can set the number of positions before target site with <code>coverage before value</code> (default: 300) and the positions after target site <code>coverage after value</code> (default: 200). <br />
(ii) In case of '''promoter search''', the parameter ''calculate coverage area'' may be set to <code>on complete sequence</code> or <code>surround target site</code>. The number of positions before and after the binding site in the peak profile can be set by <code>Peak before value</code> (default: 300) and <code>Peak after value</code> (default: 50).<br />
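<br />
The TALE input described in item 4 above (FastA with dash-separated RVDs) can be read with a few lines of code. The following is a minimal Python sketch, not part of EpiTALE; the function name <code>parse_tale_fasta</code> and the TALE names in the example are hypothetical and only serve as illustration.<br />

```python
def parse_tale_fasta(text):
    """Parse FastA-formatted TALEs whose RVDs are separated by dashes.

    Returns a dict mapping each FastA header (without '>') to a list of RVDs.
    """
    tales = {}
    name = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip empty lines
        if line.startswith(">"):
            name = line[1:].strip()
            tales[name] = []
        elif name is not None:
            # RVDs are separated by dashes; ignore empty fragments
            tales[name].extend(rvd for rvd in line.split("-") if rvd)
    return tales


# Hypothetical example input (names and RVD sequences are made up)
example = """>TalExample1
NI-HD-NG-NN-HD
>TalExample2
NN-NS-N*-HD
"""
tales = parse_tale_fasta(example)
```

Such a check can be useful for verifying that a TALE file is well-formed before passing it to the <code>TALEs</code> parameter.<br />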
<br />
In case of '''genomic search''', you can filter predictions of TALE target boxes by the presence of differentially expressed regions in a defined vicinity around a predicted target box with the tool '''DerTALE''' of the [http://www.jstacs.de/index.php/AnnoTALE AnnoTALE suite].<br />
<br />
If you experience problems using '''EpiTALE''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
''EpiTALE prediction'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar epitale<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Sequences (The sequences (e.g., a genome) to scan for binding sites, type = fa,fas,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Background sample (The sequences for determining the prediction threshold. Either a sub-sample of the input sequences or a dedicated background data set., range={sub-sample, background sequences}, default = sub-sample)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;sub-sample&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;background sequences&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bs</font></td><br />
<td>Background sequences (The sequences (e.g., a genome) for determining the prediction threshold, type = fa,fas,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>Threshold specification (The way of defining the prediction threshold. Either by explicitly defining a significance level or by specifying the number of expected sites, range={significance level, number of sites}, default = significance level)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;significance level&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sl</font></td><br />
<td>Significance level (The significance level for determining the prediction threshold, valid range = [0.0, 0.01], default = 1.0E-4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;number of sites&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Number of sites (The number of expected binding sites for determining the prediction threshold, valid range = [1, 1000000], default = 10000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">TALEs</font></td><br />
<td>TALEs (The RVD sequences of the TALE, separated by dashes, in FastA format, type = fasta,fas,fa)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Strand</font></td><br />
<td>Strand (Prediction target sites on both strands, or the forward or reverse strand, range={both strands, forward strand, reverse strand}, default = both strands)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;both strands&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>Reverse penalty (Penalty for predictions on the reverse strand, valid range = [0.0, 1.7976931348623157E308], default = 0.01)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;forward strand&quot;</b></td></tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;reverse strand&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bf</font></td><br />
<td>Bismark file (The bedGraph output of bismark (file.cov.gz) containing <chromosome> <start position> <end position> <methylation percentage> <count methylated> <count unmethylated>, type = cov,cov.gz, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">nf</font></td><br />
<td>NarrowPeak file (The output of a peak caller (all.peaks.narrowPeak), type = narrowPeak,narrowPeak.gz, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">npo</font></td><br />
<td>Normalized pileup output (The normalized output of pileup with values larger than zero (file.txt) containing <chromosome> <position> <coverage>, type = tsv,tsv.gz, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>Calculate coverage area (Calculate coverage area surround target site, or on complete sequence, range={surround target site, on complete sequence}, default = surround target site, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;surround target site&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">cbv</font></td><br />
<td>Coverage before value (Number of positions before target site in coverage profile, valid range = [1, 500], default = 300, OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">cav</font></td><br />
<td>Coverage after value (Number of positions after target site in coverage profile, valid range = [1, 500], default = 200, OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;on complete sequence&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Peak before value (Number of positions before target site in peak profile, valid range = [1, 500], default = 300, OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pav</font></td><br />
<td>Peak after value (Number of positions after target site in peak profile, valid range = [1, 500], default = 50, OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar epitale s=&lt;Sequences&gt; TALEs=&lt;TALEs&gt;</div>Grauhttps://www.jstacs.de/index.php?title=EpiTALE&diff=1136EpiTALE2021-05-10T22:43:57Z<p>Grau: </p>
<hr />
<div>[[File:EpiTALE_256.png|130px|left]] EpiTALE predicts binding sites of transcription activator-like effectors (TALEs) in promoteromes or genomes. EpiTALE not only considers the DNA sequence of putative binding sites but also epigenetic determinants of TALE binding, namely DNA methylation and chromatin accessibility. The prediction is based on the same basic model as [[PrediTALE]] but with specific parameters for methylated cytosines reflecting the binding preferences of RVDs.<br />
<br />
Here, we provide a suite of tools including the EpiTALE program itself but also auxiliary tools for converting methylation data and chromatin accessibility data to the required formats, and for converting genomic coordinates to promoter-wise coordinates for promoterome-wide predictions.<br />
<br />
Genome-wide predictions of EpiTALE may further be combined with evidence from RNA-seq data using the DerTALE tool of [[AnnoTALE]].<br />
<br />
The EpiTALE suite is provided in a version with a graphical user interface and in a command line version, which may serve the needs of specific user groups, both using the identical code base.<br />
<br />
In the following, we describe how to obtain the EpiTALE suite and how to use its individual tools. While parameters are described in terms of command line arguments, the same parameters are available in the version with graphical user interface.<br />
<br />
== Download ==<br />
<br />
=== GUI version ===<br />
<br />
* [http://www.jstacs.de/downloads/EpiTALE-0.1.jar Runnable Jar]: requires Java >= 8 including JavaFX installed, may be run under Linux, Windows and macOS.<br />
* [http://www.jstacs.de/downloads/EpiTALE-0.1.app.zip macOS app]: ZIP archive containing a macOS app including EpiTALE and all required Java modules. For running this app, it might be required to explicitly give it running permissions in "System Preferences" -> "Security & Privacy" -> "General", which should list EpiTALE after the first (possibly unsuccessful) starting attempt.<br />
* [http://www.jstacs.de/downloads/EpiTALE-0.1-win.zip Windows program]: ZIP archive containing the EpiTALE Jar, all required Java modules, and a Windows batch file. For starting EpiTALE, double-click EpiTALE.bat.<br />
<br />
== Tools ==<br />
<br />
=== Bed2Bismark ===<br />
<br />
'''Bed2Bismark''' converts methylation information in bedMethyl format to Bismark format.<br />
<br />
The input of '''Bed2Bismark''' is a file in bedMethyl format.<br />
<br />
If you experience problems using '''Bed2Bismark''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Bed2Bismark'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar bed2bismark<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>BedMethyl file (Methylation information in bedMethyl format, type = bed.gz,bed)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar bed2bismark b=&lt;BedMethyl_file&gt;<br />
<br />
<br />
=== BismarkMerge2Files ===<br />
<br />
'''BismarkMerge2Files''' merges files generated by the [https://www.bioinformatics.babraham.ac.uk/projects/bismark/ Bismark methylation extractor] with parameters <code>--bedGraph --CX -p</code>.<br />
The output contains a coverage file, which contains the tab-separated columns:<br />
<code>chromosome, start_position, end_position, methylation_percentage, count_methylated, count_unmethylated</code>.<br />
<br />
The input of '''BismarkMerge2Files''' consists of two Bismark coverage files.<br />
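<br />
To illustrate what merging two coverage files could look like, the following Python sketch sums the methylated and unmethylated counts per position and recomputes the methylation percentage. That counts are summed is an assumption made for this example; the actual merging rule of '''BismarkMerge2Files''' is not specified here, and the function name <code>merge_bismark_coverage</code> is hypothetical.<br />

```python
def merge_bismark_coverage(lines_a, lines_b):
    """Merge two Bismark coverage files given as iterables of lines.

    Assumption for this sketch: counts at identical positions are summed
    and the methylation percentage is recomputed from the summed counts.
    """
    counts = {}
    for line in list(lines_a) + list(lines_b):
        line = line.rstrip("\n")
        if not line:
            continue
        # Columns: chromosome, start, end, percentage, methylated, unmethylated
        chrom, start, end, _pct, meth, unmeth = line.split("\t")
        key = (chrom, int(start), int(end))
        m, u = counts.get(key, (0, 0))
        counts[key] = (m + int(meth), u + int(unmeth))

    merged = []
    for (chrom, start, end), (m, u) in sorted(counts.items()):
        total = m + u
        pct = 100.0 * m / total if total else 0.0
        merged.append(f"{chrom}\t{start}\t{end}\t{pct:.4f}\t{m}\t{u}")
    return merged


# Made-up example records
file_a = ["Chr1\t10\t10\t50.0\t1\t1"]
file_b = ["Chr1\t10\t10\t100.0\t2\t0", "Chr1\t20\t20\t0.0\t0\t3"]
merged = merge_bismark_coverage(file_a, file_b)
```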
<br />
If you experience problems using '''BismarkMerge2Files''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
<br />
''BismarkMerge2Files'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar bismerger<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bismark file 1 (Methylation information in bismark format file 1, type = cov.gz,cov)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bf2</font></td><br />
<td>Bismark file 2 (Methylation information in bismark format file 2, type = cov.gz,cov)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar bismerger b=&lt;Bismark_file_1&gt; bf2=&lt;Bismark_file_2&gt;<br />
<br />
<br />
=== BismarkConvertToPromoter ===<br />
<br />
'''BismarkConvertToPromoter''' converts the Bismark output file to promoter coordinates.<br />
<br />
The input of '''BismarkConvertToPromoter''' is <br />
1. a Bismark coverage output file, which contains tab-separated columns: <br />
<code>chromosome, start_position, end_position, methylation_percentage, count_methylated, count_unmethylated</code> and <br />
2. the promoter sequences in FastA format with headers like:<br />
<code>> id chromosomeName:start-end:strand</code><br />
e.g.<br />
<code>> Os01g01010.1 Chr1:2602-3102:+</code>.<br />
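<br />
The header format above carries everything needed to map a genomic position into a promoter. The following Python sketch parses such a header and computes a promoter-relative offset. It assumes inclusive start and end coordinates and, for promoters on the minus strand, counts the offset from the end coordinate; both are assumptions of this example and not a description of the conversion tools' internals. The function names are hypothetical.<br />

```python
import re

# "> id chromosomeName:start-end:strand"
HEADER = re.compile(r"^>\s*(\S+)\s+(\S+):(\d+)-(\d+):([+-])\s*$")


def parse_promoter_header(header):
    """Split a promoter FastA header into id, chromosome, start, end, strand."""
    match = HEADER.match(header)
    if match is None:
        raise ValueError(f"unexpected header: {header!r}")
    gene_id, chrom, start, end, strand = match.groups()
    return gene_id, chrom, int(start), int(end), strand


def to_promoter_position(genomic_pos, start, end, strand):
    """Return the offset of a genomic position within the promoter.

    Returns None if the position lies outside the promoter.
    Assumption: start/end are inclusive; on '-' the offset counts from 'end'.
    """
    if not start <= genomic_pos <= end:
        return None
    return genomic_pos - start if strand == "+" else end - genomic_pos


parsed = parse_promoter_header("> Os01g01010.1 Chr1:2602-3102:+")
```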
<br />
If you experience problems using '''BismarkConvertToPromoter''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''BismarkConvertToPromoter'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar bis2prom<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bismark file (Methylation information in bismark format, type = cov.gz,cov)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Promoter fasta file (Promoter fastA file, type = fa,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar bis2prom b=&lt;Bismark_file&gt; p=&lt;Promoter_fasta_file&gt;<br />
<br />
<br />
=== Chromatin pileup ===<br />
<br />
'''Chromatin pileup''' takes as input a BAM file of mapped reads from a DNase-seq or ATAC-seq experiment <br />
and computes a coverage pileup of 5' ends of mapped reads, <br />
and outputs a simple tab-separated file with columns: <br />
<code>chromosome, position,</code> and <code>pileup value</code> (number of reads with a 5' end at this position).<br />
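<br />
The idea of a 5' end pileup can be sketched in Python as follows. To keep the example self-contained, it does not read a BAM file but works on already extracted read records <code>(chromosome, start, end, is_reverse)</code>; this record layout and the function names are assumptions of this example. For reads mapped to the reverse strand, the 5' end is taken to be the rightmost mapped position.<br />

```python
from collections import Counter


def five_prime_pileup(reads):
    """Count, per (chromosome, position), the reads whose 5' end maps there.

    reads: iterable of (chromosome, start, end, is_reverse) with inclusive
    coordinates. Forward reads have their 5' end at 'start', reverse reads
    at 'end'.
    """
    pileup = Counter()
    for chrom, start, end, is_reverse in reads:
        five_prime = end if is_reverse else start
        pileup[(chrom, five_prime)] += 1
    return pileup


def format_pileup(pileup):
    """Render the pileup as tab-separated lines: chromosome, position, value."""
    return [f"{chrom}\t{pos}\t{n}" for (chrom, pos), n in sorted(pileup.items())]


# Made-up example reads
reads = [
    ("Chr1", 100, 150, False),
    ("Chr1", 100, 149, False),
    ("Chr1", 60, 100, True),
    ("Chr1", 200, 250, True),
]
lines = format_pileup(five_prime_pileup(reads))
```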
<br />
If you experience problems using '''Chromatin pileup''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Chromatin pileup'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar pileup<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>BAM file (Mapped reads from DNase-seq or ATAC-seq experiment, type = bam)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar pileup b=&lt;BAM_file&gt;<br />
<br />
<br />
=== NormalizePileupOutput ===<br />
<br />
'''NormalizePileupOutput''' normalizes the pileup output file, which contains the coverage by 5' ends of ATAC-seq or DNase-seq reads at each position. It normalizes the coverage relative to the mean of a 10000 bp sliding window.<br />
<br />
The input of '''NormalizePileupOutput''' is a pileup output file from the '''Chromatin pileup''' tool.<br />
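<br />
Normalization relative to a sliding-window mean can be sketched in Python as follows. This example assumes that the window is centered on the current position, that positions without an entry have coverage zero, and that window parts beyond the chromosome ends also count as zero; how '''NormalizePileupOutput''' handles these details may differ. The function name is hypothetical.<br />

```python
def normalize_pileup(coverage, window=10000):
    """Divide each coverage value by the mean coverage of a centered window.

    coverage: dict mapping position -> pileup value for ONE chromosome;
    positions that are absent are treated as zero coverage (assumption).
    The window spans [pos - window // 2, pos + window // 2).
    """
    half = window // 2
    positions = sorted(coverage)
    normalized = {}
    lo = hi = 0   # positions[lo:hi] are the entries inside the current window
    running = 0   # sum of coverage values inside the current window
    for pos in positions:
        # extend the window to the right
        while hi < len(positions) and positions[hi] < pos + half:
            running += coverage[positions[hi]]
            hi += 1
        # drop entries that fell out on the left
        while lo < hi and positions[lo] < pos - half:
            running -= coverage[positions[lo]]
            lo += 1
        # coverage / (running / window), written to avoid a division by zero
        normalized[pos] = coverage[pos] * window / running if running else 0.0
    return normalized


# Made-up example: two nearby positions and one isolated position
result = normalize_pileup({100: 2, 200: 2, 50000: 4})
```

Keeping a running sum avoids recomputing the window mean from scratch at every position.<br />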
<br />
If you experience problems using '''NormalizePileupOutput''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''NormalizePileupOutput'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar normpileup<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Pileup output file (Pileup output file., type = tsv.gz,tsv,txt)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar normpileup p=&lt;Pileup_output_file&gt;<br />
<br />
<br />
=== PileupConvertToPromoter ===<br />
<br />
'''PileupConvertToPromoter''' converts the pileup output file to promoter coordinates.<br />
<br />
The input of '''PileupConvertToPromoter''' is <br />
1. a normalized pileup output file from the '''NormalizePileupOutput''' tool and <br />
2. the promoter sequences in FastA format with headers like:<br />
<code>> id chromosomeName:start-end:strand</code><br />
e.g.<br />
<code>> Os01g01010.1 Chr1:2602-3102:+</code>.<br />
<br />
If you experience problems using '''PileupConvertToPromoter''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''PileupConvertToPromoter'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar pile2prom<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Normalized pileup output file (Normalized pileup output file., type = tsv.gz,tsv)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Promoter fasta file (Promoter fastA file, type = fa,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar pile2prom n=&lt;Normalized_pileup_output_file&gt; p=&lt;Promoter_fasta_file&gt;<br />
<br />
<br />
=== NarrowPeakConvertToPromoter ===<br />
<br />
'''NarrowPeakConvertToPromoter''' converts a narrowPeak file containing peaks of chromatin accessibility to promoter coordinates.<br />
<br />
The input of '''NarrowPeakConvertToPromoter''' is <br />
1. a narrowPeak file and <br />
2. the promoter sequences in FastA format with headers like:<br />
<code>> id chromosomeName:start-end:strand</code><br />
e.g.<br />
<code>> Os01g01010.1 Chr1:2602-3102:+</code>.<br />
<br />
If you experience problems using '''NarrowPeakConvertToPromoter''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''NarrowPeakConvertToPromoter'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar peak2Prom<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>NarrowPeak file (Peak-calling output in narrowPeak format., type = narrowPeak,narrowPeak.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Promoter fasta file (Promoter fastA file, type = fa,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar peak2Prom n=&lt;NarrowPeak_file&gt; p=&lt;Promoter_fasta_file&gt;<br />
<br />
<br />
=== EpiTALE prediction ===<br />
<br />
'''EpiTALE''' predicts TALE target boxes using a novel model learned from quantitative data based on the RVD sequence of a TALE and optionally considers the methylation state of the target box during prediction, as DNA methylation affects the binding specificity of RVDs. <br />
Additionally, EpiTALE can optionally annotate its predictions with the chromatin accessibility of the predicted target sites, using the output of the '''NormalizePileupOutput''' tool and the results of peak calling on DNase-seq or ATAC-seq data.<br />
<br />
As input, '''EpiTALE''' requires<br />
<br />
1. a set of sequences that are scanned for putative TALE target boxes. These sequences could be promoters of genes but also complete genomic sequences (FastA format). <br />
<br />
2. For computing p-values, EpiTALE additionally needs a background set of sequences, which is by default generated as a sub-sample of the original input data.<br />
<br />
3. The prediction threshold may be defined either by means of a p-value or by an approximate number of expected sites. The latter is also converted to a p-value internally, so the defined number of expected sites is, in general, not met exactly.<br />
<br />
4. TALEs are specified by a FastA file containing their RVD sequences, where individual RVDs are separated by dashes (-). This is the same format also output by the ''TALE Analysis'' tool of [http://www.jstacs.de/index.php/AnnoTALE AnnoTALE].<br />
<br />
5. It can be specified if both strands or only one of the strands are scanned where, in the former case, a penalty may be assigned to predictions on the reverse strand. While this penalty may be reasonable when scanning promoters, it should usually be set to <code>0</code> in case of genome-wide predictions.<br />
<br />
6. As optional input, '''EpiTALE''' considers methylation during prediction if Bismark output is provided. Using the [https://www.bioinformatics.babraham.ac.uk/projects/bismark/ Bismark methylation extractor] with parameters <code>--bedGraph --CX -p</code>, you can generate a coverage file, which contains the tab-separated columns: <br />
<code>chromosome, start_position, end_position, methylation_percentage, count_methylated, count_unmethylated</code> (file.cov.gz). <br />
You can alternatively use the tool '''Bed2Bismark''', which converts data in BedMethyl format to Bismark format. <br />
<br />
7.<br />
(i) The chromatin accessibility of the input sequences can optionally be provided in narrowPeak format. Such a file can be obtained by mapping ATAC-seq or DNase-seq data to the corresponding genome and then performing peak calling, e.g., with [https://github.com/mahmoudibrahim/JAMM JAMM]. In case of promoter sequences as input, you should run the tool '''NarrowPeakConvertToPromoter''' to convert the narrowPeak file to promoter positions. <br />
(ii) Additionally, you can calculate a coverage pileup of 5' ends of mapped reads with '''Chromatin pileup''' and normalize it with '''NormalizePileupOutput'''. In case of promoter sequences as input, you should run the tool '''PileupConvertToPromoter''' to convert to promoter coordinates. <br />
<br />
8.<br />
(i) In case of '''genomic search''' the parameter ''calculate coverage area'' should be <code>surround target site</code> and you can set the number of positions before target site with <code>coverage before value</code> (default: 300) and the positions after target site <code>coverage after value</code> (default: 200). <br />
(ii) In case of '''promoter search''', the parameter ''calculate coverage area'' may be set to <code>on complete sequence</code> or <code>surround target site</code>. The number of positions before and after the binding site in the peak profile can be set by <code>Peak before value</code> (default: 300) and <code>Peak after value</code> (default: 50).<br />
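<br />
The <code>surround target site</code> setting from item 8 can be pictured as a window around each predicted target box. The following Python sketch computes such a window for a site on the forward strand and clips it to the sequence boundaries. The 0-based, end-exclusive coordinates, the clipping, and the function name <code>coverage_window</code> are assumptions of this example; how EpiTALE orients the window for sites on the reverse strand is not covered.<br />

```python
def coverage_window(site_start, site_end, seq_length, before=300, after=200):
    """Return the (start, end) window around a target site.

    Coordinates are 0-based with an exclusive end (assumption of this sketch).
    'before' and 'after' mirror the documented defaults and valid range
    [1, 500] of 'coverage before value' and 'coverage after value'.
    """
    for value in (before, after):
        if not 1 <= value <= 500:
            raise ValueError("before/after must be within [1, 500]")
    start = max(0, site_start - before)       # clip at the sequence start
    end = min(seq_length, site_end + after)   # clip at the sequence end
    return start, end


# Made-up example: a 20 bp target box at position 1000 of a 5000 bp sequence
window = coverage_window(1000, 1020, 5000)
```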
<br />
In case of '''genomic search''', you can filter predictions of TALE target boxes by the presence of differentially expressed regions in a defined vicinity around a predicted target box with the tool '''DerTALE''' of the [http://www.jstacs.de/index.php/AnnoTALE AnnoTALE suite].<br />
<br />
If you experience problems using '''EpiTALE''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
''EpiTALE prediction'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar epitale<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Sequences (The sequences (e.g., a genome) to scan for binding sites, type = fa,fas,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Background sample (The sequences for determining the prediction threshold. Either a sub-sample of the input sequences or a dedicated background data set., range={sub-sample, background sequences}, default = sub-sample)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;sub-sample&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;background sequences&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bs</font></td><br />
<td>Background sequences (The sequences (e.g., a genome) for determining the prediction threshold, type = fa,fas,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>Threshold specification (The way of defining the prediction threshold. Either by explicitly defining a significance level or by specifying the number of expected sites, range={significance level, number of sites}, default = significance level)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;significance level&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sl</font></td><br />
<td>Significance level (The significance level for determining the prediction threshold, valid range = [0.0, 0.01], default = 1.0E-4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;number of sites&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Number of sites (The number of expected binding sites for determining the prediction threshold, valid range = [1, 1000000], default = 10000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">TALEs</font></td><br />
<td>TALEs (The RVD sequences of the TALE, separated by dashes, in FastA format, type = fasta,fas,fa)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Strand</font></td><br />
<td>Strand (Prediction target sites on both strands, or the forward or reverse strand, range={both strands, forward strand, reverse strand}, default = both strands)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;both strands&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>Reverse penalty (Penalty for predictions on the reverse strand, valid range = [0.0, 1.7976931348623157E308], default = 0.01)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;forward strand&quot;</b></td></tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;reverse strand&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bf</font></td><br />
<td>Bismark file (The bedGraph output of bismark (file.cov.gz) containing &lt;chromosome&gt; &lt;start position&gt; &lt;end position&gt; &lt;methylation percentage&gt; &lt;count methylated&gt; &lt;count unmethylated&gt;, type = cov,cov.gz, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">nf</font></td><br />
<td>NarrowPeak file (The output of a peak caller (all.peaks.narrowPeak), type = narrowPeak,narrowPeak.gz, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">npo</font></td><br />
<td>Normalized pileup output (The normalized output of pileup with values larger than zero (file.txt) containing &lt;chromosome&gt; &lt;position&gt; &lt;coverage&gt;, type = tsv,tsv.gz, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>Calculate coverage area (Calculate coverage area surround target site, or on complete sequence, range={surround target site, on complete sequence}, default = surround target site, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;surround target site&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">cbv</font></td><br />
<td>Coverage before value (Number of positions before target site in coverage profile, valid range = [1, 500], default = 300, OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">cav</font></td><br />
<td>Coverage after value (Number of positions after target site in coverage profile, valid range = [1, 500], default = 200, OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;on complete sequence&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Peak before value (Number of positions before target site in peak profile, valid range = [1, 500], default = 300, OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pav</font></td><br />
<td>Peak after value (Number of positions after target site in peak profile, valid range = [1, 500], default = 50, OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar epitale s=&lt;Sequences&gt; TALEs=&lt;TALEs&gt;</div>Grauhttps://www.jstacs.de/index.php?title=EpiTALE&diff=1135EpiTALE2021-05-10T22:36:27Z<p>Grau: </p>
<hr />
<div>[[File:EpiTALE_256.png|130px|left]] EpiTALE predicts binding sites of transcription activator-like effectors (TALEs) in promoteromes or genomes. EpiTALE not only considers the DNA sequence of putative binding sites but also epigenetic determinants of TALE binding, namely DNA methylation and chromatin accessibility. The prediction is based on the same basic model as [[PrediTALE]] but with specific parameters for methylated cytosines reflecting the binding preferences of RVDs.<br />
<br />
Here, we provide a suite of tools including the EpiTALE program itself but also auxiliary tools for converting methylation data and chromatin accessibility data to the required formats, and for converting genomic coordinates to promoter-wise coordinates for promoterome-wide predictions.<br />
<br />
Genome-wide predictions of EpiTALE may further be combined with evidence from RNA-seq data using the DerTALE tool of [[AnnoTALE]].<br />
<br />
The EpiTALE suite is provided in a version with a graphical user interface and in a command line version, which may serve the needs of specific user groups, both using the identical code base.<br />
<br />
In the following, we describe how to obtain the EpiTALE suite and how to use its individual tools. While parameters are described in terms of command line arguments, the same parameters are available in the version with graphical user interface.<br />
<br />
<br />
== Tools ==<br />
<br />
=== Bed2Bismark ===<br />
<br />
'''Bed2Bismark''' converts methylation information in bedMethyl format to Bismark format.<br />
<br />
The input of '''Bed2Bismark''' is a file in bedMethyl format.<br />
<br />
If you experience problems using '''Bed2Bismark''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Bed2Bismark'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar bed2bismark<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>BedMethyl file (Methylation information in bedMethyl format, type = bed.gz,bed)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar bed2bismark b=&lt;BedMethyl_file&gt;<br />
<br />
<br />
=== BismarkMerge2Files ===<br />
<br />
'''BismarkMerge2Files''' merges files generated by [https://www.bioinformatics.babraham.ac.uk/projects/bismark/ Bismark methylation extractor] with parameters <code>--bedGraph --CX -p</code>.<br />
The output is a coverage file with the tab-separated columns:<br />
<code>chromosome, start_position, end_position, methylation_percentage, count_methylated, count_unmethylated</code>.<br />
<br />
As input, '''BismarkMerge2Files''' expects two Bismark coverage files.<br />
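<br />
The coverage format described above can be illustrated with a small Python sketch. It is not the implementation of '''BismarkMerge2Files''': the function name <code>merge_coverage</code> is hypothetical, and the merging rule (summing the methylated and unmethylated counts per position and recomputing the percentage) is an assumption about what a merge of two coverage files might reasonably do.<br />
<br />
```python
from collections import defaultdict


def merge_coverage(lines_a, lines_b):
    """Merge two Bismark-style coverage inputs (illustrative assumption only).

    Each line holds six tab-separated columns:
    chromosome, start, end, methylation percentage, count methylated, count unmethylated.
    Counts at identical positions are summed and the percentage is recomputed.
    """
    counts = defaultdict(lambda: [0, 0])
    for line in list(lines_a) + list(lines_b):
        line = line.strip()
        if not line:
            continue
        chrom, start, end, _pct, meth, unmeth = line.split("\t")
        key = (chrom, int(start), int(end))
        counts[key][0] += int(meth)
        counts[key][1] += int(unmeth)

    merged = []
    for (chrom, start, end), (meth, unmeth) in sorted(counts.items()):
        total = meth + unmeth
        # Methylation percentage = 100 * methylated / (methylated + unmethylated)
        pct = 100.0 * meth / total if total else 0.0
        merged.append((chrom, start, end, pct, meth, unmeth))
    return merged
```
<br />
Positions that occur in only one of the two inputs are passed through unchanged in this sketch.<br />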
<br />
If you experience problems using '''BismarkMerge2Files''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
<br />
''BismarkMerge2Files'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar bismerger<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bismark file 1 (Methylation information in bismark format file 1, type = cov.gz,cov)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bf2</font></td><br />
<td>Bismark file 2 (Methylation information in bismark format file 2, type = cov.gz,cov)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar bismerger b=&lt;Bismark_file_1&gt; bf2=&lt;Bismark_file_2&gt;<br />
<br />
<br />
=== BismarkConvertToPromoter ===<br />
<br />
'''BismarkConvertToPromoter''' converts the Bismark output file to promoter coordinates.<br />
<br />
The input of '''BismarkConvertToPromoter''' is <br />
1. a Bismark coverage output file, which contains tab-separated columns: <br />
<code>chromosome, start_position, end_position, methylation_percentage, count_methylated, count_unmethylated</code> and <br />
2. the promoter sequences in FastA format with headers like:<br />
<code>> id chromosomeName:start-end:strand</code><br />
e.g.<br />
<code>> Os01g01010.1 Chr1:2602-3102:+</code>.<br />
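<br />
As an illustration of the header convention above, the following Python sketch parses such a header and maps a genomic position to an offset within the promoter. The names <code>Promoter</code>, <code>parse_header</code> and <code>to_promoter_offset</code> are hypothetical, and the coordinate handling (start and end treated as inclusive, offsets counted from the promoter end on the minus strand) is an assumption, not the documented behaviour of the tool.<br />
<br />
```python
import re
from typing import NamedTuple, Optional


class Promoter(NamedTuple):
    seq_id: str
    chrom: str
    start: int
    end: int
    strand: str


# Matches headers such as "> Os01g01010.1 Chr1:2602-3102:+"
HEADER = re.compile(r"^>\s*(\S+)\s+(\S+):(\d+)-(\d+):([+-])\s*$")


def parse_header(header: str) -> Promoter:
    """Parse a FastA header of the form '> id chromosomeName:start-end:strand'."""
    match = HEADER.match(header)
    if match is None:
        raise ValueError(f"unexpected promoter header: {header!r}")
    seq_id, chrom, start, end, strand = match.groups()
    return Promoter(seq_id, chrom, int(start), int(end), strand)


def to_promoter_offset(prom: Promoter, chrom: str, pos: int) -> Optional[int]:
    """Return the 0-based offset of a genomic position within the promoter.

    Assumption: start and end are inclusive, and minus-strand promoters are
    read from their end coordinate. Returns None for positions outside.
    """
    if chrom != prom.chrom or not (prom.start <= pos <= prom.end):
        return None
    return pos - prom.start if prom.strand == "+" else prom.end - pos
```
<br />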
<br />
If you experience problems using '''BismarkConvertToPromoter''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''BismarkConvertToPromoter'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar bis2prom<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bismark file (Methylation information in bismark format, type = cov.gz,cov)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Promoter fasta file (Promoter fastA file, type = fa,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar bis2prom b=&lt;Bismark_file&gt; p=&lt;Promoter_fasta_file&gt;<br />
<br />
<br />
=== Chromatin pileup ===<br />
<br />
'''Chromatin pileup''' takes as input a BAM file of mapped reads from a DNase-seq or ATAC-seq experiment, <br />
computes a coverage pileup of 5' ends of mapped reads, <br />
and outputs a simple tab-separated file with the columns <br />
<code>chromosome, position,</code> and <code>pileup value</code> (number of reads with a 5' end at this position).<br />
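<br />
The idea of a 5' end pileup can be sketched in a few lines of Python. This is only an illustration under stated assumptions: reads are given as plain tuples instead of being read from a BAM file, coordinates are taken as inclusive, and the function names <code>five_prime_pileup</code> and <code>to_tsv</code> are hypothetical.<br />
<br />
```python
from collections import Counter


def five_prime_pileup(reads):
    """Count 5' ends of mapped reads per position.

    reads: iterable of (chromosome, start, end, strand) with inclusive
    coordinates. For forward-strand reads the 5' end is the start, for
    reverse-strand reads it is the end of the alignment.
    """
    pileup = Counter()
    for chrom, start, end, strand in reads:
        five_prime = start if strand == "+" else end
        pileup[(chrom, five_prime)] += 1
    return pileup


def to_tsv(pileup):
    """Format the pileup as tab-separated lines: chromosome, position, pileup value."""
    return "\n".join(
        f"{chrom}\t{pos}\t{count}" for (chrom, pos), count in sorted(pileup.items())
    )
```
<br />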
<br />
If you experience problems using '''Chromatin pileup''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Chromatin pileup'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar pileup<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>BAM file (Mapped reads from DNase-seq or ATAC-seq experiment, type = bam)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar pileup b=&lt;BAM_file&gt;<br />
<br />
<br />
=== NormalizePileupOutput ===<br />
<br />
'''NormalizePileupOutput''' normalizes the pileup output file, which contains the coverage with 5' ends of ATAC-seq or DNase-seq reads at each position. The coverage is normalized relative to the mean of a 10000 bp sliding window.<br />
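<br />
A sliding-window normalization of this kind may be sketched as follows. The details are assumptions and not taken from the tool: the window is centred on the position (spanning about <code>window</code> positions), it is clipped at the sequence ends, positions without reads count as zero in the mean, and only positions with coverage larger than zero are reported. The function name <code>normalize_pileup</code> is hypothetical.<br />
<br />
```python
def normalize_pileup(coverage, length, window=10000):
    """Normalize per-position coverage by the mean of a centred sliding window.

    coverage: dict mapping 1-based positions (<= length) of one chromosome
              to pileup values.
    length:   length of the chromosome or sequence.
    Returns a dict position -> normalized value for positions with coverage > 0.
    """
    # Prefix sums allow each window sum to be computed in constant time.
    prefix = [0] * (length + 1)
    for pos in range(1, length + 1):
        prefix[pos] = prefix[pos - 1] + coverage.get(pos, 0)

    half = window // 2
    normalized = {}
    for pos, value in sorted(coverage.items()):
        if value <= 0:
            continue
        lo = max(1, pos - half)
        hi = min(length, pos + half)
        mean = (prefix[hi] - prefix[lo - 1]) / (hi - lo + 1)
        normalized[pos] = value / mean
    return normalized
```
<br />
Values above 1 then indicate positions with more 5' ends than the local average, values below 1 indicate fewer.<br />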
<br />
The input of '''NormalizePileupOutput''' is a pileup output file from the '''Chromatin pileup''' tool.<br />
<br />
If you experience problems using '''NormalizePileupOutput''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''NormalizePileupOutput'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar normpileup<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Pileup output file (Pileup output file., type = tsv.gz,tsv,txt)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar normpileup p=&lt;Pileup_output_file&gt;<br />
<br />
<br />
=== PileupConvertToPromoter ===<br />
<br />
'''PileupConvertToPromoter''' converts the pileup output file to promoter coordinates.<br />
<br />
The input of '''PileupConvertToPromoter''' is <br />
1. a normalized pileup output file from the '''NormalizePileupOutput''' tool and <br />
2. the promoter sequences in FastA format with headers like:<br />
<code>> id chromosomeName:start-end:strand</code><br />
e.g.<br />
<code>> Os01g01010.1 Chr1:2602-3102:+</code>.<br />
<br />
If you experience problems using '''PileupConvertToPromoter''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''PileupConvertToPromoter'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar pile2prom<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Normalized pileup output file (Normalized pileup output file., type = tsv.gz,tsv)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Promoter fasta file (Promoter fastA file, type = fa,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar pile2prom n=&lt;Normalized_pileup_output_file&gt; p=&lt;Promoter_fasta_file&gt;<br />
<br />
<br />
=== NarrowPeakConvertToPromoter ===<br />
<br />
'''NarrowPeakConvertToPromoter''' converts a narrowPeak file containing peaks of chromatin accessibility to promoter coordinates.<br />
<br />
The input of '''NarrowPeakConvertToPromoter''' is <br />
1. a narrowPeak file and <br />
2. the promoter sequences in FastA format with headers like:<br />
<code>> id chromosomeName:start-end:strand</code><br />
e.g.<br />
<code>> Os01g01010.1 Chr1:2602-3102:+</code>.<br />
<br />
If you experience problems using '''NarrowPeakConvertToPromoter''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''NarrowPeakConvertToPromoter'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar peak2Prom<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>NarrowPeak file (Peak-calling output in narrowPeak format., type = narrowPeak,narrowPeak.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Promoter fasta file (Promoter fastA file, type = fa,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar peak2Prom n=&lt;NarrowPeak_file&gt; p=&lt;Promoter_fasta_file&gt;<br />
<br />
<br />
=== EpiTALE prediction ===<br />
<br />
'''EpiTALE''' predicts TALE target boxes from the RVD sequence of a TALE using a novel model learned from quantitative data. Optionally, it considers the methylation state of the target box during prediction, as DNA methylation affects the binding specificity of RVDs. <br />
Additionally, EpiTALE can annotate its predictions with the chromatin accessibility of predicted target sites, using the output of the '''NormalizePileupOutput''' tool and the results of peak calling on DNase-seq or ATAC-seq data.<br />
<br />
As input, '''EpiTALE''' requires<br />
<br />
1. a set of sequences that are scanned for putative TALE target boxes. These sequences could be promoters of genes but also complete genomic sequences (FastA format). <br />
<br />
2. For computing p-values, EpiTALE additionally needs a background set of sequences, which is by default generated as a sub-sample of the original input data.<br />
<br />
3. The prediction threshold may be defined either by means of a p-value or an approximate number of expected sites. The latter is also converted to a p-value internally, so the defined number of expected sites is, in general, not met exactly.<br />
<br />
4. TALEs are specified by a FastA file containing their RVD sequences, where individual RVDs are separated by dashes (-). This is the same format also output by the ''TALE Analysis'' tool of [http://www.jstacs.de/index.php/AnnoTALE AnnoTALE].<br />
<br />
5. It can be specified if both strands or only one of the strands are scanned where, in the former case, a penalty may be assigned to predictions on the reverse strand. While this penalty may be reasonable when scanning promoters, it should usually be set to <code>0</code> in case of genome-wide predictions.<br />
<br />
6. Optionally, '''EpiTALE''' considers methylation during prediction if Bismark output is provided. Running the [https://www.bioinformatics.babraham.ac.uk/projects/bismark/ Bismark methylation extractor] with parameters <code>--bedGraph --CX -p</code> generates a coverage file (file.cov.gz), which contains the tab-separated columns: <br />
<code>chromosome, start_position, end_position, methylation_percentage, count_methylated, count_unmethylated</code>. <br />
Alternatively, you can use the tool '''Bed2Bismark''', which converts data in bedMethyl format to Bismark format. <br />
<br />
7.<br />
(i) The chromatin accessibility of the input sequences can optionally be provided in narrowPeak format. Such a file may be obtained by mapping ATAC-seq or DNase-seq data to the corresponding genome and then performing peak calling, e.g. with [https://github.com/mahmoudibrahim/JAMM JAMM]. If the input sequences are promoters, you should run the tool '''NarrowPeakConvertToPromoter''' to convert the narrowPeak file to promoter coordinates. <br />
(ii) Additionally, you can calculate a coverage pileup of 5' ends of mapped reads with '''Chromatin pileup''' and normalize it with '''NormalizePileupOutput'''. If the input sequences are promoters, you should run the tool '''PileupConvertToPromoter''' to convert the result to promoter coordinates. <br />
<br />
8.<br />
(i) In case of '''genomic search''', the parameter ''calculate coverage area'' should be <code>surround target site</code>. You can set the number of positions before the target site with <code>coverage before value</code> (default: 300) and the number of positions after the target site with <code>coverage after value</code> (default: 200). <br />
(ii) In case of '''promoter search''', the parameter ''calculate coverage area'' may be set to <code>on complete sequence</code> or <code>surround target site</code>. The number of positions before and after the binding site in the peak profile can be set by <code>Peak before value</code> (default: 300) and <code>Peak after value</code> (default: 50).<br />
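<br />
The TALE input format from item 4 (FastA with RVDs separated by dashes) can be checked before a run with a short Python sketch like the one below. The function name <code>read_tale_rvds</code> and the TALE names used in the usage note are hypothetical; the sketch only splits the RVD sequences and does not validate individual RVDs.<br />
<br />
```python
def read_tale_rvds(fasta_text):
    """Parse FastA text with dash-separated RVD sequences.

    Returns a dict mapping each TALE name (header without '>') to its list
    of RVDs. Sequences that span several lines are concatenated.
    """
    tales = {}
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            tales[name] = []
        elif name is not None:
            # Empty fragments (e.g. from a trailing dash) are skipped.
            tales[name].extend(rvd for rvd in line.split("-") if rvd)
        else:
            raise ValueError("sequence line before first FastA header")
    return tales
```
<br />
For example, an input with the two illustrative entries <code>>TalA</code> / <code>NI-HD-NG-NN</code> and <code>>TalB</code> / <code>HD-N*-NS</code> yields four RVDs for the first and three RVDs for the second TALE.<br />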
<br />
In case of '''genomic search''', you can filter predictions of TALE target boxes by the presence of differentially expressed regions in a defined vicinity around a predicted target box with the tool '''DerTALE''' of the [http://www.jstacs.de/index.php/AnnoTALE AnnoTALE suite].<br />
<br />
If you experience problems using '''EpiTALE''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
''EpiTALE prediction'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar epitale<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Sequences (The sequences (e.g., a genome) to scan for binding sites, type = fa,fas,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Background sample (The sequences for determining the prediction threshold. Either a sub-sample of the input sequences or a dedicated background data set., range={sub-sample, background sequences}, default = sub-sample)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;sub-sample&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;background sequences&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bs</font></td><br />
<td>Background sequences (The sequences (e.g., a genome) for determining the prediction threshold, type = fa,fas,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>Threshold specification (The way of defining the prediction threshold. Either by explicitly defining a significance level or by specifying the number of expected sites, range={significance level, number of sites}, default = significance level)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;significance level&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sl</font></td><br />
<td>Significance level (The significance level for determining the prediction threshold, valid range = [0.0, 0.01], default = 1.0E-4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;number of sites&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Number of sites (The number of expected binding sites for determining the prediction threshold, valid range = [1, 1000000], default = 10000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">TALEs</font></td><br />
<td>TALEs (The RVD sequences of the TALE, separated by dashes, in FastA format, type = fasta,fas,fa)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Strand</font></td><br />
<td>Strand (Prediction target sites on both strands, or the forward or reverse strand, range={both strands, forward strand, reverse strand}, default = both strands)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;both strands&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>Reverse penalty (Penalty for predictions on the reverse strand, valid range = [0.0, 1.7976931348623157E308], default = 0.01)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;forward strand&quot;</b></td></tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;reverse strand&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bf</font></td><br />
<td>Bismark file (The bedGraph output of bismark (file.cov.gz) containing &lt;chromosome&gt; &lt;start position&gt; &lt;end position&gt; &lt;methylation percentage&gt; &lt;count methylated&gt; &lt;count unmethylated&gt;, type = cov,cov.gz, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">nf</font></td><br />
<td>NarrowPeak file (The output of a peak caller (all.peaks.narrowPeak), type = narrowPeak,narrowPeak.gz, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">npo</font></td><br />
<td>Normalized pileup output (The normalized output of pileup with values larger than zero (file.txt) containing &lt;chromosome&gt; &lt;position&gt; &lt;coverage&gt;, type = tsv,tsv.gz, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>Calculate coverage area (Calculate coverage area surround target site, or on complete sequence, range={surround target site, on complete sequence}, default = surround target site, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;surround target site&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">cbv</font></td><br />
<td>Coverage before value (Number of positions before target site in coverage profile, valid range = [1, 500], default = 300, OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">cav</font></td><br />
<td>Coverage after value (Number of positions after target site in coverage profile, valid range = [1, 500], default = 200, OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;on complete sequence&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Peak before value (Number of positions before target site in peak profile, valid range = [1, 500], default = 300, OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pav</font></td><br />
<td>Peak after value (Number of positions after target site in peak profile, valid range = [1, 500], default = 50, OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar epitale s=&lt;Sequences&gt; TALEs=&lt;TALEs&gt;</div>Grauhttps://www.jstacs.de/index.php?title=File:EpiTALE_256.png&diff=1134File:EpiTALE 256.png2021-05-10T22:35:44Z<p>Grau: </p>
<hr />
<div></div>Grauhttps://www.jstacs.de/index.php?title=EpiTALE&diff=1133EpiTALE2021-05-10T22:35:05Z<p>Grau: </p>
<hr />
<div>[[File:EpiTALE.png|130px|left]] EpiTALE predicts binding sites of transcription activator-like effectors (TALEs) in promoteromes or genomes. EpiTALE not only considers the DNA sequence of putative binding sites but also epigenetic determinants of TALE binding, namely DNA methylation and chromatin accessibility. The prediction is based on the same basic model as [[PrediTALE]] but with specific parameters for methylated cytosines reflecting the binding preferences of RVDs.<br />
<br />
Here, we provide a suite of tools including the EpiTALE program itself but also auxiliary tools for converting methylation data and chromatin accessibility data to the required formats, and for converting genomic coordinates to promoter-wise coordinates for promoterome-wide predictions.<br />
<br />
Genome-wide predictions of EpiTALE may further be combined with evidence from RNA-seq data using the DerTALE tool of [[AnnoTALE]].<br />
<br />
The EpiTALE suite is provided in a version with a graphical user interface and in a command line version, which may serve the needs of specific user groups, both using the identical code base.<br />
<br />
In the following, we describe how to obtain the EpiTALE suite and how to use its individual tools. While parameters are described in terms of command line arguments, the same parameters are available in the version with graphical user interface.<br />
<br />
<br />
== Tools ==<br />
<br />
=== Bed2Bismark ===<br />
<br />
'''Bed2Bismark''' converts methylation information in bedMethyl format to Bismark format.<br />
<br />
The input of '''Bed2Bismark''' is a file in bedMethyl format.<br />
<br />
If you experience problems using '''Bed2Bismark''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Bed2Bismark'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar bed2bismark<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>BedMethyl file (Methylation information in bedMethyl format, type = bed.gz,bed)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar bed2bismark b=&lt;BedMethyl_file&gt;<br />
<br />
<br />
=== BismarkMerge2Files ===<br />
<br />
'''BismarkMerge2Files''' merges files generated by [https://www.bioinformatics.babraham.ac.uk/projects/bismark/ Bismark methylation extractor] with parameters <code>--bedGraph --CX -p</code>.<br />
The output is a coverage file with the tab-separated columns:<br />
<code>chromosome, start_position, end_position, methylation_percentage, count_methylated, count_unmethylated</code>.<br />
<br />
The input of '''BismarkMerge2Files''' are two Bismark coverage files.<br />
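<br />
One plausible way to merge two coverage files is sketched below in Python. This is an assumption about the merging rule, not the documented behaviour of '''BismarkMerge2Files''': counts at identical positions are summed and the methylation percentage is recomputed from the summed counts. The function name <code>merge_coverage</code> is hypothetical.<br />
<br />
```python
from collections import defaultdict


def merge_coverage(lines_a, lines_b):
    """Merge two iterables of Bismark coverage lines (illustrative sketch)."""
    counts = defaultdict(lambda: [0, 0])  # (chrom, start, end) -> [methylated, unmethylated]
    for line in list(lines_a) + list(lines_b):
        chrom, start, end, _pct, meth, unmeth = line.rstrip("\n").split("\t")
        key = (chrom, int(start), int(end))
        counts[key][0] += int(meth)
        counts[key][1] += int(unmeth)
    merged = []
    for (chrom, start, end), (meth, unmeth) in sorted(counts.items()):
        total = meth + unmeth
        # Assumed rule: percentage is recomputed from the summed counts.
        pct = 100.0 * meth / total if total else 0.0
        merged.append((chrom, start, end, pct, meth, unmeth))
    return merged


file_1 = ["Chr1\t100\t100\t50.0\t1\t1"]
file_2 = ["Chr1\t100\t100\t100.0\t2\t0", "Chr1\t105\t105\t0.0\t0\t4"]
for row in merge_coverage(file_1, file_2):
    print("\t".join(str(x) for x in row))
```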
<br />
If you experience problems using '''BismarkMerge2Files''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
<br />
''BismarkMerge2Files'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar bismerger<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bismark file 1 (Methylation information in bismark format file 1, type = cov.gz,cov)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bf2</font></td><br />
<td>Bismark file 2 (Methylation information in bismark format file 2, type = cov.gz,cov)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar bismerger b=&lt;Bismark_file_1&gt; bf2=&lt;Bismark_file_2&gt;<br />
<br />
<br />
=== BismarkConvertToPromoter ===<br />
<br />
'''BismarkConvertToPromoter''' converts the Bismark output file to promoter coordinates.<br />
<br />
The input of '''BismarkConvertToPromoter''' is <br />
1. a Bismark coverage output file, which contains tab-separated columns: <br />
<code>chromosome, start_position, end_position, methylation_percentage, count_methylated, count_unmethylated</code> and <br />
2. the promoter sequences in FastA format with headers like:<br />
<code>> id chromosomeName:start-end:strand</code><br />
e.g.<br />
<code>> Os01g01010.1 Chr1:2602-3102:+</code>.<br />
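<br />
The header format above can be parsed, and a genomic position mapped into a promoter, roughly as in the following Python sketch. The exact conventions of '''BismarkConvertToPromoter''' (0- or 1-based offsets, handling of the reverse strand) are not stated here, so the choices below are assumptions, and the names <code>parse_header</code> and <code>to_promoter_position</code> are hypothetical.<br />
<br />
```python
import re

# Header layout taken from the text: "> id chromosomeName:start-end:strand"
HEADER = re.compile(r"^>\s*(\S+)\s+(\S+):(\d+)-(\d+):([+-])$")


def parse_header(header):
    """Split a promoter FastA header into id, chromosome, start, end and strand."""
    match = HEADER.match(header.strip())
    if match is None:
        raise ValueError("unexpected promoter header: " + header)
    seq_id, chrom, start, end, strand = match.groups()
    return seq_id, chrom, int(start), int(end), strand


def to_promoter_position(genomic_pos, start, end, strand):
    """Return the 0-based offset inside the promoter, or None if outside.

    Assumption: offsets on the reverse strand are counted from the promoter end.
    """
    if not start <= genomic_pos <= end:
        return None
    return genomic_pos - start if strand == "+" else end - genomic_pos


seq_id, chrom, start, end, strand = parse_header("> Os01g01010.1 Chr1:2602-3102:+")
print(seq_id, chrom, to_promoter_position(2700, start, end, strand))
```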
<br />
If you experience problems using '''BismarkConvertToPromoter''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''BismarkConvertToPromoter'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar bis2prom<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bismark file (Methylation information in bismark format, type = cov.gz,cov)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Promoter fasta file (Promoter fastA file, type = fa,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar bis2prom b=&lt;Bismark_file&gt; p=&lt;Promoter_fasta_file&gt;<br />
<br />
<br />
=== Chromatin pileup ===<br />
<br />
'''Chromatin pileup''' takes as input a BAM file of mapped reads from a DNase-seq or ATAC-seq experiment, <br />
computes a coverage pileup of 5' ends of mapped reads, <br />
and outputs a simple tab-separated file with columns: <br />
<code>chromosome, position,</code> and <code>pileup value</code> (number of reads with a 5' end at this position).<br />
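<br />
The counting step can be illustrated with the Python sketch below. It does not read BAM files; reads are given as plain tuples, and it assumes that the 5' end of a forward-strand read is its start and that of a reverse-strand read is its end. The function name <code>five_prime_pileup</code> is hypothetical.<br />
<br />
```python
from collections import Counter


def five_prime_pileup(reads):
    """Count 5' ends per position.

    reads: iterable of (chromosome, start, end, strand) with inclusive coordinates.
    Returns sorted (chromosome, position, pileup value) tuples.
    """
    pileup = Counter()
    for chrom, start, end, strand in reads:
        # Assumption: a reverse-strand read has its 5' end at the rightmost position.
        five_prime = start if strand == "+" else end
        pileup[(chrom, five_prime)] += 1
    return sorted((chrom, pos, n) for (chrom, pos), n in pileup.items())


reads = [
    ("Chr1", 100, 150, "+"),
    ("Chr1", 100, 140, "+"),
    ("Chr1", 60, 100, "-"),
    ("Chr1", 200, 250, "-"),
]
for chrom, pos, value in five_prime_pileup(reads):
    print(chrom, pos, value, sep="\t")
```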
<br />
If you experience problems using '''Chromatin pileup''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Chromatin pileup'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar pileup<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>BAM file (Mapped reads from DNase-seq or ATAC-seq experiment, type = bam)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar pileup b=&lt;BAM_file&gt;<br />
<br />
<br />
=== NormalizePileupOutput ===<br />
<br />
'''NormalizePileupOutput''' normalizes the pileup output file, which contains the coverage with 5’ ends of ATAC-seq or DNase-seq reads at each position. It normalizes the coverage relative to the mean of a 10000 bp sliding window.<br />
<br />
The input of '''NormalizePileupOutput''' is a pileup output file from the '''Chromatin pileup''' tool.<br />
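<br />
A sliding-window normalization of this kind might look like the following Python sketch. The details are assumptions rather than the tool's documented behaviour: the window is centred on each position, truncated at the sequence ends, and positions with a window mean of zero are set to zero. The function name <code>normalize_pileup</code> is hypothetical.<br />
<br />
```python
def normalize_pileup(values, window=10000):
    """Divide each pileup value by the mean of a window centred on its position."""
    n = len(values)
    half = window // 2  # the window spans positions i - half .. i + half
    # Prefix sums make every window sum an O(1) lookup.
    prefix = [0]
    for v in values:
        prefix.append(prefix[-1] + v)
    normalized = []
    for i in range(n):
        lo = max(0, i - half)
        hi = min(n, i + half + 1)
        mean = (prefix[hi] - prefix[lo]) / (hi - lo)
        normalized.append(values[i] / mean if mean > 0 else 0.0)
    return normalized


# Small window for demonstration only; the text above describes a 10000 bp window.
print(normalize_pileup([0, 2, 4, 2, 0], window=3))
```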
<br />
If you experience problems using '''NormalizePileupOutput''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''NormalizePileupOutput'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar normpileup<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Pileup output file (Pileup output file., type = tsv.gz,tsv,txt)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar normpileup p=&lt;Pileup_output_file&gt;<br />
<br />
<br />
=== PileupConvertToPromoter ===<br />
<br />
'''PileupConvertToPromoter''' converts the pileup output file to promoter coordinates.<br />
<br />
The input of '''PileupConvertToPromoter''' is <br />
1. a normalized pileup output file from the '''NormalizePileupOutput''' tool and <br />
2. the promoter sequences in FastA format with headers like:<br />
<code>> id chromosomeName:start-end:strand</code><br />
e.g.<br />
<code>> Os01g01010.1 Chr1:2602-3102:+</code>.<br />
<br />
If you experience problems using '''PileupConvertToPromoter''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''PileupConvertToPromoter'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar pile2prom<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Normalized pileup output file (Normalized pileup output file., type = tsv.gz,tsv)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Promoter fasta file (Promoter fastA file, type = fa,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar pile2prom n=&lt;Normalized_pileup_output_file&gt; p=&lt;Promoter_fasta_file&gt;<br />
<br />
<br />
=== NarrowPeakConvertToPromoter ===<br />
<br />
'''NarrowPeakConvertToPromoter''' converts a narrowPeak file containing peaks of chromatin accessibility to promoter coordinates.<br />
<br />
The input of '''NarrowPeakConvertToPromoter''' is <br />
1. a narrowPeak file and <br />
2. the promoter sequences in FastA format with headers like:<br />
<code>> id chromosomeName:start-end:strand</code><br />
e.g.<br />
<code>> Os01g01010.1 Chr1:2602-3102:+</code>.<br />
<br />
If you experience problems using '''NarrowPeakConvertToPromoter''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''NarrowPeakConvertToPromoter'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar peak2Prom<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>NarrowPeak file (Peak-calling output in narrowPeak format., type = narrowPeak,narrowPeak.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Promoter fasta file (Promoter fastA file, type = fa,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar peak2Prom n=&lt;NarrowPeak_file&gt; p=&lt;Promoter_fasta_file&gt;<br />
<br />
<br />
=== EpiTALE prediction ===<br />
<br />
'''EpiTALE''' predicts TALE target boxes using a novel model learned from quantitative data based on the RVD sequence of a TALE and optionally considers the methylation state of the target box during prediction, as DNA methylation affects the binding specificity of RVDs. <br />
Additionally, '''EpiTALE''' optionally annotates its predictions with the chromatin accessibility of predicted target sites, using the output of the '''NormalizePileupOutput''' tool and the results of peak calling on DNase-seq or ATAC-seq data.<br />
<br />
As input, '''EpiTALE''' requires<br />
<br />
1. a set of sequences that are scanned for putative TALE target boxes. These sequences could be promoters of genes but also complete genomic sequences (FastA format). <br />
<br />
2. For computing p-values, EpiTALE additionally needs a background set of sequences, which is by default generated as a sub-sample of the original input data.<br />
<br />
3. The prediction threshold may be defined either by means of a p-value or by an approximate number of expected sites. The latter is also converted to a p-value internally, so the specified number of expected sites is, in general, not met exactly.<br />
<br />
4. TALEs are specified by a FastA file containing their RVD sequences, where individual RVDs are separated by dashes (-). This is the same format also output by the ''TALE Analysis'' tool of [http://www.jstacs.de/index.php/AnnoTALE AnnoTALE].<br />
<br />
5. It can be specified if both strands or only one of the strands are scanned where, in the former case, a penalty may be assigned to predictions on the reverse strand. While this penalty may be reasonable when scanning promoters, it should usually be set to <code>0</code> in case of genome-wide predictions.<br />
<br />
6. Optionally, '''EpiTALE''' considers methylation during prediction if Bismark output is provided. Running the [https://www.bioinformatics.babraham.ac.uk/projects/bismark/ Bismark methylation extractor] with parameters <code>--bedGraph --CX -p</code> generates a coverage file, which contains the tab-separated columns: <br />
<code>chromosome, start_position, end_position, methylation_percentage, count_methylated, count_unmethylated</code> (file.cov.gz). <br />
You can alternatively use the tool '''Bed2Bismark''', which converts data in BedMethyl format to Bismark format. <br />
<br />
7.<br />
(i) The chromatin accessibility of the input sequences can optionally be provided in narrowPeak format. Such a file may be obtained by mapping ATAC-seq or DNase-seq data to the corresponding genome and then performing peak calling, e.g. with [https://github.com/mahmoudibrahim/JAMM JAMM]. In case of promoter sequences as input, you should run the tool '''NarrowPeakConvertToPromoter''' to convert the narrowPeak file to promoter positions. <br />
(ii) Additionally, you can calculate a coverage pileup of 5' ends of mapped reads with '''Chromatin pileup''' and normalize it with '''NormalizePileupOutput'''. In case of promoter sequences as input, you should run the tool '''PileupConvertToPromoter''' to convert the normalized pileup to promoter coordinates. <br />
<br />
8.<br />
(i) In case of '''genomic search''', the parameter ''calculate coverage area'' should be set to <code>surround target site</code>. You can set the number of positions before the target site with <code>coverage before value</code> (default: 300) and the number of positions after the target site with <code>coverage after value</code> (default: 200). <br />
(ii) In case of '''promoter search''', the parameter ''calculate coverage area'' may be set to <code>on complete sequence</code> or <code>surround target site</code>. The number of positions before and after the binding site in the peak profile can be set by <code>Peak before value</code> (default: 300) and <code>Peak after value</code> (default: 50).<br />
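<br />
The TALE input described in item 4 above (FastA with dash-separated RVDs) can be read with a few lines of Python, as in the sketch below. It only illustrates the file format; the function name <code>read_rvd_fasta</code> and the TALE name and RVD sequence in the example are made up for illustration.<br />
<br />
```python
def read_rvd_fasta(lines):
    """Parse FastA lines with dash-separated RVDs into {name: [RVD, ...]}."""
    tales = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            tales[name] = []
        elif name is not None:
            # Individual RVDs are separated by dashes, e.g. NI-HD-NG.
            tales[name].extend(rvd for rvd in line.split("-") if rvd)
    return tales


# Hypothetical TALE name and RVD sequence, for illustration only.
example = [">TalExample", "NI-HD-NG-NN-N*-HG"]
print(read_rvd_fasta(example))
```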
<br />
In case of '''genomic search''', you can filter predictions of TALE target boxes by the presence of differentially expressed regions in a defined vicinity around a predicted target box with the tool '''DerTALE''' of the [http://www.jstacs.de/index.php/AnnoTALE AnnoTALE suite].<br />
<br />
If you experience problems using '''EpiTALE''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
''EpiTALE prediction'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar epitale<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Sequences (The sequences (e.g., a genome) to scan for binding sites, type = fa,fas,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Background sample (The sequences for determining the prediction threshold. Either a sub-sample of the input sequences or a dedicated background data set., range={sub-sample, background sequences}, default = sub-sample)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;sub-sample&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;background sequences&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bs</font></td><br />
<td>Background sequences (The sequences (e.g., a genome) for determining the prediction threshold, type = fa,fas,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>Threshold specification (The way of defining the prediction threshold. Either by explicitly defining a significance level or by specifying the number of expected sites, range={significance level, number of sites}, default = significance level)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;significance level&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sl</font></td><br />
<td>Significance level (The significance level for determining the prediction threshold, valid range = [0.0, 0.01], default = 1.0E-4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;number of sites&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Number of sites (The number of expected binding sites for determining the prediction threshold, valid range = [1, 1000000], default = 10000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">TALEs</font></td><br />
<td>TALEs (The RVD sequences of the TALE, separated by dashes, in FastA format, type = fasta,fas,fa)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Strand</font></td><br />
<td>Strand (Prediction target sites on both strands, or the forward or reverse strand, range={both strands, forward strand, reverse strand}, default = both strands)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;both strands&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>Reverse penalty (Penalty for predictions on the reverse strand, valid range = [0.0, 1.7976931348623157E308], default = 0.01)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;forward strand&quot;</b></td></tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;reverse strand&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bf</font></td><br />
<td>Bismark file (The bedGraph output of bismark (file.cov.gz) containing &lt;chromosome&gt; &lt;start position&gt; &lt;end position&gt; &lt;methylation percentage&gt; &lt;count methylated&gt; &lt;count unmethylated&gt;, type = cov,cov.gz, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">nf</font></td><br />
<td>NarrowPeak file (The output of a peak caller (all.peaks.narrowPeak), type = narrowPeak,narrowPeak.gz, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">npo</font></td><br />
<td>Normalized pileup output (The normalized output of pileup with values larger than zero (file.txt) containing &lt;chromosome&gt; &lt;position&gt; &lt;coverage&gt;, type = tsv,tsv.gz, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>Calculate coverage area (Calculate coverage area surround target site, or on complete sequence, range={surround target site, on complete sequence}, default = surround target site, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;surround target site&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">cbv</font></td><br />
<td>Coverage before value (Number of positions before target site in coverage profile, valid range = [1, 500], default = 300, OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">cav</font></td><br />
<td>Coverage after value (Number of positions after target site in coverage profile, valid range = [1, 500], default = 200, OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;on complete sequence&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Peak before value (Number of positions before target site in peak profile, valid range = [1, 500], default = 300, OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pav</font></td><br />
<td>Peak after value (Number of positions after target site in peak profile, valid range = [1, 500], default = 50, OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar epitale s=&lt;Sequences&gt; TALEs=&lt;TALEs&gt;</div>Grauhttps://www.jstacs.de/index.php?title=EpiTALE&diff=1132EpiTALE2021-05-10T22:34:38Z<p>Grau: </p>
<hr />
<div>EpiTALE predicts binding sites of transcription activator-like effectors (TALEs) in promoteromes or genomes. EpiTALE not only considers the DNA sequence of putative binding sites but also epigenetic determinants of TALE binding, namely DNA methylation and chromatin accessibility. The prediction is based on the same basic model as [[PrediTALE]] but with specific parameters for methylated cytosines reflecting the binding preferences of RVDs.<br />
<br />
Here, we provide a suite of tools including the EpiTALE program itself but also auxiliary tools for converting methylation data and chromatin accessibility data to the required formats, and for converting genomic coordinates to promoter-wise coordinates for promoterome-wide predictions.<br />
<br />
Genome-wide predictions of EpiTALE may further be combined with evidence from RNA-seq data using the DerTALE tool of [[AnnoTALE]].<br />
<br />
The EpiTALE suite is provided in a version with a graphical user interface and in a command line version, which may serve the needs of specific user groups, both using the identical code base.<br />
<br />
In the following, we describe how to obtain the EpiTALE suite and how to use its individual tools. While parameters are described in terms of command line arguments, the same parameters are available in the version with graphical user interface.<br />
<br />
<br />
== Tools ==<br />
<br />
=== Bed2Bismark ===<br />
<br />
'''Bed2Bismark''' converts methylation information in bedMethyl format to Bismark format.<br />
<br />
The input of '''Bed2Bismark''' is a file in bedMethyl format.<br />
<br />
If you experience problems using '''Bed2Bismark''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Bed2Bismark'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar bed2bismark<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>BedMethyl file (Methylation information in bedMethyl format, type = bed.gz,bed)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar bed2bismark b=&lt;BedMethyl_file&gt;<br />
<br />
<br />
=== BismarkMerge2Files ===<br />
<br />
'''BismarkMerge2Files''' merges files generated by [https://www.bioinformatics.babraham.ac.uk/projects/bismark/ Bismark methylation extractor] with parameters <code>--bedGraph --CX -p</code>.<br />
The output is a coverage file with the tab-separated columns:<br />
<code>chromosome, start_position, end_position, methylation_percentage, count_methylated, count_unmethylated</code>.<br />
<br />
The input of '''BismarkMerge2Files''' consists of two Bismark coverage files.<br />
<br />
If you experience problems using '''BismarkMerge2Files''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
<br />
''BismarkMerge2Files'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar bismerger<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bismark file 1 (Methylation information in bismark format file 1, type = cov.gz,cov)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bf2</font></td><br />
<td>Bismark file 2 (Methylation information in bismark format file 2, type = cov.gz,cov)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar bismerger b=&lt;Bismark_file_1&gt; bf2=&lt;Bismark_file_2&gt;<br />
<br />
<br />
=== BismarkConvertToPromoter ===<br />
<br />
'''BismarkConvertToPromoter''' converts the Bismark output file to promoter coordinates.<br />
<br />
The input of '''BismarkConvertToPromoter''' is <br />
1. a Bismark coverage output file, which contains tab-separated columns: <br />
<code>chromosome, start_position, end_position, methylation_percentage, count_methylated, count_unmethylated</code> and <br />
2. the promoter sequences in FastA format with headers like:<br />
<code>> id chromosomeName:start-end:strand</code><br />
e.g.<br />
<code>> Os01g01010.1 Chr1:2602-3102:+</code>.<br />
<br />
If you experience problems using '''BismarkConvertToPromoter''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''BismarkConvertToPromoter'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar bis2prom<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bismark file (Methylation information in bismark format, type = cov.gz,cov)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Promoter fasta file (Promoter fastA file, type = fa,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar bis2prom b=&lt;Bismark_file&gt; p=&lt;Promoter_fasta_file&gt;<br />
<br />
<br />
=== Chromatin pileup ===<br />
<br />
'''Chromatin pileup''' takes as input a BAM file of mapped reads from a DNase-seq or ATAC-seq experiment, <br />
computes a coverage pileup of 5' ends of mapped reads, <br />
and outputs a simple tab-separated file with columns: <br />
<code>chromosome, position,</code> and <code>pileup value</code> (number of reads with a 5' end at this position).<br />
<br />
If you experience problems using '''Chromatin pileup''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Chromatin pileup'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar pileup<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>BAM file (Mapped reads from DNase-seq or ATAC-seq experiment, type = bam)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar pileup b=&lt;BAM_file&gt;<br />
<br />
<br />
=== NormalizePileupOutput ===<br />
<br />
'''NormalizePileupOutput''' normalizes the pileup output file, which contains the coverage with 5’ ends of ATAC-seq or DNase-seq reads at each position. It normalizes the coverage relative to the mean of a 10000 bp sliding window.<br />
<br />
The input of '''NormalizePileupOutput''' is a pileup output file from the '''Chromatin pileup''' tool.<br />
<br />
If you experience problems using '''NormalizePileupOutput''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''NormalizePileupOutput'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar normpileup<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Pileup output file (Pileup output file., type = tsv.gz,tsv,txt)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar normpileup p=&lt;Pileup_output_file&gt;<br />
<br />
<br />
=== PileupConvertToPromoter ===<br />
<br />
'''PileupConvertToPromoter''' converts the pileup output file to promoter coordinates.<br />
<br />
The input of '''PileupConvertToPromoter''' is <br />
1. a normalized pileup output file from the '''NormalizePileupOutput''' tool and <br />
2. the promoter sequences in FastA format with headers like:<br />
<code>> id chromosomeName:start-end:strand</code><br />
e.g.<br />
<code>> Os01g01010.1 Chr1:2602-3102:+</code>.<br />
<br />
If you experience problems using '''PileupConvertToPromoter''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''PileupConvertToPromoter'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar pile2prom<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Normalized pileup output file (Normalized pileup output file., type = tsv.gz,tsv)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Promoter fasta file (Promoter fastA file, type = fa,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar pile2prom n=&lt;Normalized_pileup_output_file&gt; p=&lt;Promoter_fasta_file&gt;<br />
<br />
<br />
=== NarrowPeakConvertToPromoter ===<br />
<br />
'''NarrowPeakConvertToPromoter''' converts a narrowPeak file containing peaks of chromatin accessibility to promoter coordinates.<br />
<br />
The input of '''NarrowPeakConvertToPromoter''' is <br />
1. a narrowPeak file and <br />
2. the promoter sequences in FastA format with headers like:<br />
<code>> id chromosomeName:start-end:strand</code><br />
e.g.<br />
<code>> Os01g01010.1 Chr1:2602-3102:+</code>.<br />
<br />
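The idea of restricting a peak to a promoter can be sketched in Python as follows. This is a hedged illustration, not the tool's implementation: <code>parse_narrowpeak</code> and <code>clip_to_promoter</code> are hypothetical names, only the first three narrowPeak columns (chromosome, start, end) are read, intervals are assumed to be half-open, and strand handling is left out.<br />
<br />
```python
from typing import Optional, Tuple


def parse_narrowpeak(line: str) -> Tuple[str, int, int]:
    """Read chromosome, start and end from one narrowPeak line."""
    fields = line.rstrip("\n").split("\t")
    return fields[0], int(fields[1]), int(fields[2])


def clip_to_promoter(prom_start: int, prom_end: int,
                     peak_start: int, peak_end: int) -> Optional[Tuple[int, int]]:
    """Clip a peak to a promoter and return promoter-relative coordinates.

    Assumption: both intervals are half-open [start, end) on the same
    chromosome. Returns None if the peak does not overlap the promoter.
    """
    lo = max(prom_start, peak_start)
    hi = min(prom_end, peak_end)
    if lo >= hi:
        return None
    return lo - prom_start, hi - prom_start
```

A peak that only partly overlaps a promoter is thereby reduced to the overlapping part, expressed relative to the promoter start.<br />
<br />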
If you experience problems using '''NarrowPeakConvertToPromoter''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''NarrowPeakConvertToPromoter'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar peak2Prom<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>NarrowPeak file (Peak-calling output in narrowPeak format., type = narrowPeak,narrowPeak.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Promoter fasta file (Promoter fastA file, type = fa,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar peak2Prom n=&lt;NarrowPeak_file&gt; p=&lt;Promoter_fasta_file&gt;<br />
<br />
<br />
=== EpiTALE prediction ===<br />
<br />
'''EpiTALE''' predicts TALE target boxes using a novel model learned from quantitative data based on the RVD sequence of a TALE and optionally considers the methylation state of the target box during prediction, as DNA methylation affects the binding specificity of RVDs. <br />
Additionally, EpiTALE optionally annotates its predicted target sites with chromatin accessibility, using the output of the '''NormalizePileupOutput''' tool and the results of peak calling on DNase-seq or ATAC-seq data.<br />
<br />
As input, '''EpiTALE''' requires<br />
<br />
1. a set of sequences that are scanned for putative TALE target boxes. These sequences could be promoters of genes but also complete genomic sequences (FastA format). <br />
<br />
2. For computing p-values, EpiTALE additionally needs a background set of sequences, which is by default generated as a sub-sample of the original input data.<br />
<br />
3. The prediction threshold may be defined either by means of a p-value or an approximate number of expected sites. The latter will also be converted to a p-value internally, and the defined number of expected sites is, in general, not met exactly.<br />
<br />
4. TALEs are specified by a FastA file containing their RVD sequences, where individual RVDs are separated by dashes (-). This is the same format also output by the ''TALE Analysis'' tool of [http://www.jstacs.de/index.php/AnnoTALE AnnoTALE].<br />
<br />
5. It can be specified if both strands or only one of the strands are scanned where, in the former case, a penalty may be assigned to predictions on the reverse strand. While this penalty may be reasonable when scanning promoters, it should usually be set to <code>0</code> in case of genome-wide predictions.<br />
<br />
6. Optionally, '''EpiTALE''' considers methylation during prediction if Bismark output is provided. Running the [https://www.bioinformatics.babraham.ac.uk/projects/bismark/ Bismark methylation extractor] with parameters <code>--bedGraph --CX -p</code> generates a coverage file with the tab-separated columns: <br />
<code>chromosome, start_position, end_position, methylation_percentage, count_methylated, count_unmethylated</code> (file.cov.gz). <br />
You can alternatively use the tool '''Bed2Bismark''', which converts data in BedMethyl format to Bismark format. <br />
<br />
7.<br />
(i) The chromatin accessibility of the input sequences can optionally be provided in narrowPeak format. Such a file can be obtained by mapping ATAC-seq or DNase-seq data to the corresponding genome and then performing peak calling, e.g. with [https://github.com/mahmoudibrahim/JAMM JAMM]. In case of promoter sequences as input, you should run the tool '''NarrowPeakConvertToPromoter''' to convert the narrowPeak file to promoter positions. <br />
(ii) Additionally, you can calculate a coverage pileup of 5' ends of mapped reads with '''Chromatin pileup''' and normalize it with '''NormalizePileupOutput'''. In case of promoter sequences as input, you should run the tool '''PileupConvertToPromoter''' to convert to promoter coordinates. <br />
<br />
8.<br />
(i) In case of '''genomic search''', the parameter ''calculate coverage area'' should be <code>surround target site</code>; you can set the number of positions before the target site with <code>coverage before value</code> (default: 300) and the number of positions after the target site with <code>coverage after value</code> (default: 200). <br />
(ii) In case of '''promoter search''', the parameter ''calculate coverage area'' may be set to <code>on complete sequence</code> or <code>surround target site</code>. The number of positions before and after the binding site in the peak profile can be set with <code>Peak before value</code> (default: 300) and <code>Peak after value</code> (default: 50).<br />
<br />
In case of '''genomic search''', you can filter predictions of TALE target boxes by the presence of differentially expressed regions in a defined vicinity around a predicted target box with the tool '''DerTALE''' of the [http://www.jstacs.de/index.php/AnnoTALE AnnoTALE suite].<br />
<br />
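The TALE input format from item 4 (FastA with dash-separated RVDs) can be read with a few lines of Python. This is a minimal sketch for checking input files before a run, not part of EpiTALE: the function name <code>read_rvd_fasta</code> and the TALE name <code>TalExample</code> are hypothetical.<br />
<br />
```python
from typing import Dict, List


def read_rvd_fasta(text: str) -> Dict[str, List[str]]:
    """Parse FastA text whose sequence lines are dash-separated RVDs.

    Returns a mapping from TALE name (header without '>') to its list of
    RVDs. Sequence lines that continue over several lines are concatenated.
    """
    tales: Dict[str, List[str]] = {}
    name = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            tales[name] = []
        elif name is not None:
            # Skip empty tokens caused by leading or trailing dashes.
            tales[name].extend(rvd for rvd in line.split("-") if rvd)
    return tales


# Hypothetical TALE with six repeats, for illustration only.
example = ">TalExample\nNI-HD-NG-NN-HD-NG\n"
print(read_rvd_fasta(example))
```

A file in this format is what the <code>TALEs</code> parameter expects according to the parameter table.<br />
<br />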
If you experience problems using '''EpiTALE''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
''EpiTALE prediction'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar epitale<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Sequences (The sequences (e.g., a genome) to scan for binding sites, type = fa,fas,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Background sample (The sequences for determining the prediction threshold. Either a sub-sample of the input sequences or a dedicated background data set., range={sub-sample, background sequences}, default = sub-sample)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;sub-sample&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;background sequences&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bs</font></td><br />
<td>Background sequences (The sequences (e.g., a genome) for determining the prediction threshold, type = fa,fas,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>Threshold specification (The way of defining the prediction threshold. Either by explicitly defining a significance level or by specifying the number of expected sites, range={significance level, number of sites}, default = significance level)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;significance level&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sl</font></td><br />
<td>Significance level (The significance level for determining the prediction threshold, valid range = [0.0, 0.01], default = 1.0E-4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;number of sites&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Number of sites (The number of expected binding sites for determining the prediction threshold, valid range = [1, 1000000], default = 10000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">TALEs</font></td><br />
<td>TALEs (The RVD sequences of the TALE, separated by dashes, in FastA format, type = fasta,fas,fa)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Strand</font></td><br />
<td>Strand (Prediction target sites on both strands, or the forward or reverse strand, range={both strands, forward strand, reverse strand}, default = both strands)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;both strands&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>Reverse penalty (Penalty for predictions on the reverse strand, valid range = [0.0, 1.7976931348623157E308], default = 0.01)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;forward strand&quot;</b></td></tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;reverse strand&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bf</font></td><br />
<td>Bismark file (The bedGraph output of bismark (file.cov.gz) containing <chromosome> <start position> <end position> <methylation percentage> <count methylated> <count unmethylated>, type = cov,cov.gz, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">nf</font></td><br />
<td>NarrowPeak file (The output of a peak caller (all.peaks.narrowPeak), type = narrowPeak,narrowPeak.gz, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">npo</font></td><br />
<td>Normalized pileup output (The normalized output of pileup with values larger than zero (file.txt) containing <chromosome> <position> <coverage>, type = tsv,tsv.gz, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>Calculate coverage area (Calculate coverage area surround target site, or on complete sequence, range={surround target site, on complete sequence}, default = surround target site, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;surround target site&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">cbv</font></td><br />
<td>Coverage before value (Number of positions before target site in coverage profile, valid range = [1, 500], default = 300, OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">cav</font></td><br />
<td>Coverage after value (Number of positions after target site in coverage profile, valid range = [1, 500], default = 200, OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;on complete sequence&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Peak before value (Number of positions before target site in peak profile, valid range = [1, 500], default = 300, OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pav</font></td><br />
<td>Peak after value (Number of positions after target site in peak profile, valid range = [1, 500], default = 50, OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar epitale s=&lt;Sequences&gt; TALEs=&lt;TALEs&gt;</div>Grauhttps://www.jstacs.de/index.php?title=EpiTALE&diff=1131EpiTALE2021-05-10T22:34:16Z<p>Grau: </p>
<hr />
<div>EpiTALE predicts binding sites of transcription activator-like effectors (TALEs) in promoteromes or genomes. EpiTALE not only considers the DNA sequence of putative binding sites but also epigenetic determinants of TALE binding, namely DNA methylation and chromatin accessibility. The prediction is based on the same basic model as [PrediTALE] but with specific parameters for methylated cytosines reflecting the binding preferences of RVDs.<br />
<br />
Here, we provide a suite of tools including the EpiTALE program itself but also auxiliary tools for converting methylation data and chromatin accessibility data to the required formats, and for converting genomic coordinates to promoter-wise coordinates for promoterome-wide predictions.<br />
<br />
Genome-wide predictions of EpiTALE may further be combined with evidence from RNA-seq data using the DerTALE tool of [AnnoTALE].<br />
<br />
The EpiTALE suite is provided in a version with a graphical user interface and in a command line version, which may serve the needs of specific user groups, both using the identical code base.<br />
<br />
In the following, we describe how to obtain the EpiTALE suite and how to use its individual tools. While parameters are described in terms of command line arguments, the same parameters are available in the version with graphical user interface.<br />
<br />
<br />
== Tools ==<br />
<br />
=== Bed2Bismark ===<br />
<br />
'''Bed2Bismark''' converts methylation information in bedMethyl format to Bismark format.<br />
<br />
The input of '''Bed2Bismark''' is a file in bedMethyl format.<br />
<br />
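The conversion can be pictured with the following Python sketch. It is an assumption-laden illustration rather than the actual '''Bed2Bismark''' code: it assumes ENCODE-style bedMethyl lines with 11 tab-separated columns (coverage in column 10, methylation percentage in column 11, 0-based start) and produces the six Bismark coverage columns with 1-based positions. The function name <code>bedmethyl_to_bismark</code> is hypothetical.<br />
<br />
```python
def bedmethyl_to_bismark(line: str) -> str:
    """Convert one bedMethyl line to one Bismark coverage line.

    Assumptions: ENCODE-style bedMethyl (column 10 = read coverage,
    column 11 = percent methylated, 0-based half-open coordinates) and
    1-based inclusive coordinates in the Bismark coverage output.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, start0, end0 = fields[0], int(fields[1]), int(fields[2])
    coverage, percent = int(fields[9]), float(fields[10])
    # Recover the read counts from coverage and percentage.
    methylated = round(coverage * percent / 100.0)
    unmethylated = coverage - methylated
    columns = (chrom, start0 + 1, end0, percent, methylated, unmethylated)
    return "\t".join(str(value) for value in columns)
```

Since the counts are recovered by rounding, they may differ slightly from the original counts if the percentage in the input was itself rounded.<br />
<br />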
If you experience problems using '''Bed2Bismark''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Bed2Bismark'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar bed2bismark<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>BedMethyl file (Methylation information in bedMethyl format, type = bed.gz,bed)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar bed2bismark b=&lt;BedMethyl_file&gt;<br />
<br />
<br />
=== BismarkMerge2Files ===<br />
<br />
'''BismarkMerge2Files''' merges files generated by the [https://www.bioinformatics.babraham.ac.uk/projects/bismark/ Bismark methylation extractor] with parameters <code>--bedGraph --CX -p</code>.<br />
The output is a coverage file with the tab-separated columns:<br />
<code>chromosome, start_position, end_position, methylation_percentage, count_methylated, count_unmethylated</code>.<br />
<br />
The input of '''BismarkMerge2Files''' consists of two Bismark coverage files.<br />
<br />
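One plausible way to merge two such coverage files is to sum the methylated and unmethylated counts per position and to recompute the percentage. The Python sketch below illustrates this reading; whether '''BismarkMerge2Files''' merges in exactly this way is an assumption, and the name <code>merge_coverage</code> is hypothetical.<br />
<br />
```python
from typing import Dict, Iterable, List, Tuple

Key = Tuple[str, int, int]  # (chromosome, start_position, end_position)


def merge_coverage(*files: Iterable[str]) -> Dict[Key, Tuple[float, int, int]]:
    """Merge Bismark coverage lines by summing counts per position.

    Each input is an iterable of tab-separated lines with the columns
    chromosome, start, end, percentage, count_methylated, count_unmethylated.
    The percentage is recomputed from the summed counts (assumption).
    """
    counts: Dict[Key, List[int]] = {}
    for lines in files:
        for line in lines:
            chrom, start, end, _pct, meth, unmeth = line.rstrip("\n").split("\t")
            entry = counts.setdefault((chrom, int(start), int(end)), [0, 0])
            entry[0] += int(meth)
            entry[1] += int(unmeth)
    merged: Dict[Key, Tuple[float, int, int]] = {}
    for key, (meth, unmeth) in counts.items():
        total = meth + unmeth
        percentage = 100.0 * meth / total if total else 0.0
        merged[key] = (percentage, meth, unmeth)
    return merged
```

Positions present in only one of the two files are kept unchanged in this sketch.<br />
<br />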
If you experience problems using '''BismarkMerge2Files''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
<br />
''BismarkMerge2Files'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar bismerger<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bismark file 1 (Methylation information in bismark format file 1, type = cov.gz,cov)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bf2</font></td><br />
<td>Bismark file 2 (Methylation information in bismark format file 2, type = cov.gz,cov)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar bismerger b=&lt;Bismark_file_1&gt; bf2=&lt;Bismark_file_2&gt;<br />
<br />
<br />
=== BismarkConvertToPromoter ===<br />
<br />
'''BismarkConvertToPromoter''' converts the Bismark output file to promoter coordinates.<br />
<br />
The input of '''BismarkConvertToPromoter''' is <br />
1. a Bismark coverage output file, which contains tab-separated columns: <br />
<code>chromosome, start_position, end_position, methylation_percentage, count_methylated, count_unmethylated</code> and <br />
2. the promoter sequences in FastA format with headers like:<br />
<code>> id chromosomeName:start-end:strand</code><br />
e.g.<br />
<code>> Os01g01010.1 Chr1:2602-3102:+</code>.<br />
<br />
If you experience problems using '''BismarkConvertToPromoter''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''BismarkConvertToPromoter'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar bis2prom<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bismark file (Methylation information in bismark format, type = cov.gz,cov)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Promoter fasta file (Promoter fastA file, type = fa,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar bis2prom b=&lt;Bismark_file&gt; p=&lt;Promoter_fasta_file&gt;<br />
<br />
<br />
=== Chromatin pileup ===<br />
<br />
'''Chromatin pileup''' takes as input a BAM file of mapped reads from a DNase-seq or ATAC-seq experiment, <br />
computes a coverage pileup of 5' ends of mapped reads, <br />
and outputs a simple tab-separated file with columns: <br />
<code>chromosome, position,</code> and <code>pileup value</code> (number of reads with a 5' end at this position).<br />
<br />
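The notion of a 5' end pileup can be illustrated with the following Python sketch. It works on plain tuples instead of a BAM file, so it is not a replacement for the tool: the read representation <code>(chromosome, start, end, strand)</code> with 1-based inclusive coordinates and the names <code>five_prime_pileup</code> and <code>to_tsv</code> are assumptions made for this example.<br />
<br />
```python
from collections import Counter
from typing import Iterable, List, Tuple

Read = Tuple[str, int, int, str]  # (chromosome, start, end, strand)


def five_prime_pileup(reads: Iterable[Read]) -> Counter:
    """Count the reads whose 5' end falls on each position.

    Assumption: coordinates are 1-based and inclusive, so the 5' end of a
    forward-strand read is its start and that of a reverse-strand read
    is its end.
    """
    pileup: Counter = Counter()
    for chrom, start, end, strand in reads:
        five_prime = start if strand == "+" else end
        pileup[(chrom, five_prime)] += 1
    return pileup


def to_tsv(pileup: Counter) -> List[str]:
    """Format the pileup as 'chromosome<TAB>position<TAB>pileup value' lines."""
    return ["\t".join(map(str, (chrom, pos, n)))
            for (chrom, pos), n in sorted(pileup.items())]
```

Only positions with at least one 5' end appear in the output of this sketch.<br />
<br />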
If you experience problems using '''Chromatin pileup''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Chromatin pileup'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar pileup<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>BAM file (Mapped reads from DNase-seq or ATAC-seq experiment, type = bam)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar pileup b=&lt;BAM_file&gt;<br />
<br />
<br />
=== NormalizePileupOutput ===<br />
<br />
'''NormalizePileupOutput''' normalizes the pileup output file, which contains the coverage by 5' ends of ATAC-seq or DNase-seq reads at each position. It normalizes the coverage relative to the mean of a 10000 bp sliding window.<br />
<br />
The input of '''NormalizePileupOutput''' is a pileup output file from the '''Chromatin pileup''' tool.<br />
<br />
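Normalization relative to a sliding-window mean can be sketched in Python as follows. The details are assumptions and not taken from the tool: the window is centered on each position and clipped at the sequence ends, positions without reads count as zero, positions are 0-based, and the name <code>normalize_pileup</code> is hypothetical.<br />
<br />
```python
from typing import Dict


def normalize_pileup(coverage: Dict[int, float], length: int,
                     window: int = 10000) -> Dict[int, float]:
    """Divide each pileup value by the mean coverage of its surrounding window.

    coverage maps 0-based positions (all < length) on one chromosome to
    pileup values. Assumption: the window is centered on the position and
    clipped at the chromosome boundaries; missing positions count as zero.
    """
    half = window // 2
    # Prefix sums allow each window sum to be computed in constant time.
    prefix = [0.0] * (length + 1)
    for pos in range(length):
        prefix[pos + 1] = prefix[pos] + coverage.get(pos, 0.0)
    normalized: Dict[int, float] = {}
    for pos, value in coverage.items():
        lo = max(0, pos - half)
        hi = min(length, pos + half + 1)
        mean = (prefix[hi] - prefix[lo]) / (hi - lo)
        if mean > 0:
            normalized[pos] = value / mean
    return normalized
```

Values above 1 then mark positions with more 5' ends than the local average.<br />
<br />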
If you experience problems using '''NormalizePileupOutput''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''NormalizePileupOutput'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar normpileup<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Pileup output file (Pileup output file., type = tsv.gz,tsv,txt)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar normpileup p=&lt;Pileup_output_file&gt;<br />
<br />
<br />
=== PileupConvertToPromoter ===<br />
<br />
'''PileupConvertToPromoter''' converts the pileup output file to promoter coordinates.<br />
<br />
The input of '''PileupConvertToPromoter''' is <br />
1. a normalized pileup output file from the '''NormalizePileupOutput''' tool and <br />
2. the promoter sequences in FastA format with headers like:<br />
<code>> id chromosomeName:start-end:strand</code><br />
e.g.<br />
<code>> Os01g01010.1 Chr1:2602-3102:+</code>.<br />
<br />
If you experience problems using '''PileupConvertToPromoter''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''PileupConvertToPromoter'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar pile2prom<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Normalized pileup output file (Normalized pileup output file., type = tsv.gz,tsv)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Promoter fasta file (Promoter fastA file, type = fa,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar pile2prom n=&lt;Normalized_pileup_output_file&gt; p=&lt;Promoter_fasta_file&gt;<br />
<br />
<br />
=== NarrowPeakConvertToPromoter ===<br />
<br />
'''NarrowPeakConvertToPromoter''' converts a narrowPeak file containing peaks of chromatin accessibility to promoter coordinates.<br />
<br />
The input of '''NarrowPeakConvertToPromoter''' is <br />
1. a narrowPeak file and <br />
2. the promoter sequences in FastA format with headers like:<br />
<code>> id chromosomeName:start-end:strand</code><br />
e.g.<br />
<code>> Os01g01010.1 Chr1:2602-3102:+</code>.<br />
<br />
If you experience problems using '''NarrowPeakConvertToPromoter''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''NarrowPeakConvertToPromoter'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar peak2Prom<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>NarrowPeak file (Peak-calling output in narrowPeak format., type = narrowPeak,narrowPeak.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Promoter fasta file (Promoter fastA file, type = fa,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar peak2Prom n=&lt;NarrowPeak_file&gt; p=&lt;Promoter_fasta_file&gt;<br />
<br />
<br />
=== EpiTALE prediction ===<br />
<br />
'''EpiTALE''' predicts TALE target boxes using a novel model learned from quantitative data based on the RVD sequence of a TALE and optionally considers the methylation state of the target box during prediction, as DNA methylation affects the binding specificity of RVDs. <br />
Additionally, EpiTALE optionally annotates its predicted target sites with chromatin accessibility, using the output of the '''NormalizePileupOutput''' tool and the results of peak calling on DNase-seq or ATAC-seq data.<br />
<br />
As input, '''EpiTALE''' requires<br />
<br />
1. a set of sequences that are scanned for putative TALE target boxes. These sequences could be promoters of genes but also complete genomic sequences (FastA format). <br />
<br />
2. For computing p-values, EpiTALE additionally needs a background set of sequences, which is by default generated as a sub-sample of the original input data.<br />
<br />
3. The prediction threshold may be defined either by means of a p-value or an approximate number of expected sites. The latter will also be converted to a p-value internally, and the defined number of expected sites is, in general, not met exactly.<br />
<br />
4. TALEs are specified by a FastA file containing their RVD sequences, where individual RVDs are separated by dashes (-). This is the same format also output by the ''TALE Analysis'' tool of [http://www.jstacs.de/index.php/AnnoTALE AnnoTALE].<br />
<br />
5. It can be specified if both strands or only one of the strands are scanned where, in the former case, a penalty may be assigned to predictions on the reverse strand. While this penalty may be reasonable when scanning promoters, it should usually be set to <code>0</code> in case of genome-wide predictions.<br />
<br />
6. Optionally, '''EpiTALE''' considers methylation during prediction if Bismark output is provided. Running the [https://www.bioinformatics.babraham.ac.uk/projects/bismark/ Bismark methylation extractor] with parameters <code>--bedGraph --CX -p</code> generates a coverage file with the tab-separated columns: <br />
<code>chromosome, start_position, end_position, methylation_percentage, count_methylated, count_unmethylated</code> (file.cov.gz). <br />
You can alternatively use the tool '''Bed2Bismark''', which converts data in BedMethyl format to Bismark format. <br />
<br />
7.<br />
(i) The chromatin accessibility of the input sequences can optionally be provided in narrowPeak format. Such a file can be obtained by mapping ATAC-seq or DNase-seq data to the corresponding genome and then performing peak calling, e.g. with [https://github.com/mahmoudibrahim/JAMM JAMM]. In case of promoter sequences as input, you should run the tool '''NarrowPeakConvertToPromoter''' to convert the narrowPeak file to promoter positions. <br />
(ii) Additionally, you can calculate a coverage pileup of 5' ends of mapped reads with '''Chromatin pileup''' and normalize it with '''NormalizePileupOutput'''. In case of promoter sequences as input, you should run the tool '''PileupConvertToPromoter''' to convert to promoter coordinates. <br />
<br />
8.<br />
(i) In case of '''genomic search''', the parameter ''calculate coverage area'' should be <code>surround target site</code>; you can set the number of positions before the target site with <code>coverage before value</code> (default: 300) and the number of positions after the target site with <code>coverage after value</code> (default: 200). <br />
(ii) In case of '''promoter search''', the parameter ''calculate coverage area'' may be set to <code>on complete sequence</code> or <code>surround target site</code>. The number of positions before and after the binding site in the peak profile can be set with <code>Peak before value</code> (default: 300) and <code>Peak after value</code> (default: 50).<br />
<br />
In case of '''genomic search''', you can filter predictions of TALE target boxes by the presence of differentially expressed regions in a defined vicinity around a predicted target box with the tool '''DerTALE''' of the [http://www.jstacs.de/index.php/AnnoTALE AnnoTALE suite].<br />
<br />
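The effect of <code>coverage before value</code> and <code>coverage after value</code> in item 8 can be pictured as a window around each predicted target site. The Python sketch below is an interpretation, not EpiTALE's implementation: it assumes half-open 0-based site coordinates, that "before" means upstream relative to the strand of the site, and that the window is clipped to the sequence; the name <code>coverage_window</code> is hypothetical.<br />
<br />
```python
from typing import Tuple


def coverage_window(site_start: int, site_end: int, strand: str, seq_len: int,
                    before: int = 300, after: int = 200) -> Tuple[int, int]:
    """Return the half-open window [lo, hi) around a target site.

    Assumption: 'before' extends upstream and 'after' downstream relative
    to the strand of the site; the defaults mirror the documented defaults
    of 'coverage before value' (300) and 'coverage after value' (200).
    """
    if strand == "+":
        lo, hi = site_start - before, site_end + after
    else:
        lo, hi = site_start - after, site_end + before
    # Clip the window to the boundaries of the scanned sequence.
    return max(0, lo), min(seq_len, hi)
```

With the defaults, a forward-strand site at positions 1000 to 1020 would be annotated using the window from 700 to 1220.<br />
<br />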
If you experience problems using '''EpiTALE''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
''EpiTALE prediction'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar epitale<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Sequences (The sequences (e.g., a genome) to scan for binding sites, type = fa,fas,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Background sample (The sequences for determining the prediction threshold. Either a sub-sample of the input sequences or a dedicated background data set., range={sub-sample, background sequences}, default = sub-sample)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;sub-sample&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;background sequences&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bs</font></td><br />
<td>Background sequences (The sequences (e.g., a genome) for determining the prediction threshold, type = fa,fas,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>Threshold specification (The way of defining the prediction threshold. Either by explicitly defining a significance level or by specifying the number of expected sites, range={significance level, number of sites}, default = significance level)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;significance level&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sl</font></td><br />
<td>Significance level (The significance level for determining the prediction threshold, valid range = [0.0, 0.01], default = 1.0E-4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;number of sites&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Number of sites (The number of expected binding sites for determining the prediction threshold, valid range = [1, 1000000], default = 10000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">TALEs</font></td><br />
<td>TALEs (The RVD sequences of the TALE, separated by dashes, in FastA format, type = fasta,fas,fa)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Strand</font></td><br />
<td>Strand (Predict target sites on both strands, or only on the forward or reverse strand, range={both strands, forward strand, reverse strand}, default = both strands)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;both strands&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>Reverse penalty (Penalty for predictions on the reverse strand, valid range = [0.0, 1.7976931348623157E308], default = 0.01)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;forward strand&quot;</b></td></tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;reverse strand&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bf</font></td><br />
<td>Bismark file (The bedGraph output of Bismark (file.cov.gz) containing &lt;chromosome&gt; &lt;start position&gt; &lt;end position&gt; &lt;methylation percentage&gt; &lt;count methylated&gt; &lt;count unmethylated&gt;, type = cov,cov.gz, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">nf</font></td><br />
<td>NarrowPeak file (The output of a peak caller (all.peaks.narrowPeak), type = narrowPeak,narrowPeak.gz, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">npo</font></td><br />
<td>Normalized pileup output (The normalized output of pileup with values larger than zero (file.txt) containing &lt;chromosome&gt; &lt;position&gt; &lt;coverage&gt;, type = tsv,tsv.gz, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>Calculate coverage area (Calculate the coverage area surrounding the target site or on the complete sequence, range={surround target site, on complete sequence}, default = surround target site, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;surround target site&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">cbv</font></td><br />
<td>Coverage before value (Number of positions before target site in coverage profile, valid range = [1, 500], default = 300, OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">cav</font></td><br />
<td>Coverage after value (Number of positions after target site in coverage profile, valid range = [1, 500], default = 200, OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;on complete sequence&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Peak before value (Number of positions before target site in peak profile, valid range = [1, 500], default = 300, OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pav</font></td><br />
<td>Peak after value (Number of positions after target site in peak profile, valid range = [1, 500], default = 50, OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar epitale s=&lt;Sequences&gt; TALEs=&lt;TALEs&gt;</div>Grauhttps://www.jstacs.de/index.php?title=EpiTALE&diff=1130EpiTALE2021-05-10T19:55:48Z<p>Grau: </p>
<hr />
<div>== Tools ==<br />
<br />
=== Bed2Bismark ===<br />
<br />
'''Bed2Bismark''' converts methylation information in bedMethyl format to Bismark format.<br />
<br />
The input of '''Bed2Bismark''' is a file in bedMethyl format.<br />
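<br />
The conversion can be pictured roughly as follows. This is a minimal sketch and not the implementation of '''Bed2Bismark''': it assumes the ENCODE bedMethyl column layout (read coverage in column 10, methylation percentage in column 11), 0-based start positions in bedMethyl, and 1-based positions in the Bismark coverage file. The function name <code>bedmethyl_to_bismark</code> is hypothetical.<br />

```python
def bedmethyl_to_bismark(line):
    """Convert one bedMethyl line to one Bismark coverage line (sketch)."""
    fields = line.rstrip("\n").split("\t")
    chrom = fields[0]
    start0 = int(fields[1])        # assumption: 0-based start in bedMethyl
    coverage = int(fields[9])      # assumption: column 10 holds the read coverage
    percent = float(fields[10])    # assumption: column 11 holds percent methylated
    methylated = round(coverage * percent / 100.0)
    unmethylated = coverage - methylated
    pos = start0 + 1               # assumption: 1-based position in Bismark coverage
    columns = (chrom, pos, pos, percent, methylated, unmethylated)
    return "\t".join(str(value) for value in columns)
```

The counts of methylated and unmethylated reads are reconstructed from coverage and percentage, so they are only as exact as the rounding of the percentage in the input file.<br />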
<br />
If you experience problems using '''Bed2Bismark''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Bed2Bismark'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar bed2bismark<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>BedMethyl file (Methylation information in bedMethyl format, type = bed.gz,bed)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar bed2bismark b=&lt;BedMethyl_file&gt;<br />
<br />
<br />
=== BismarkMerge2Files ===<br />
<br />
'''BismarkMerge2Files''' merges files generated by [https://www.bioinformatics.babraham.ac.uk/projects/bismark/ Bismark methylation extractor] with parameters <code>--bedGraph --CX -p</code>.<br />
The output is a coverage file with the tab-separated columns:<br />
<code>chromosome, start_position, end_position, methylation_percentage, count_methylated, count_unmethylated</code>.<br />
<br />
The input of '''BismarkMerge2Files''' consists of two Bismark coverage files.<br />
<br />
If you experience problems using '''BismarkMerge2Files''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
<br />
''BismarkMerge2Files'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar bismerger<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bismark file 1 (Methylation information in Bismark format, file 1, type = cov.gz,cov)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bf2</font></td><br />
<td>Bismark file 2 (Methylation information in Bismark format, file 2, type = cov.gz,cov)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar bismerger b=&lt;Bismark_file_1&gt; bf2=&lt;Bismark_file_2&gt;<br />
<br />
<br />
=== BismarkConvertToPromoter ===<br />
<br />
'''BismarkConvertToPromoter''' converts the Bismark output file to promoter coordinates.<br />
<br />
The input of '''BismarkConvertToPromoter''' is <br />
1. a Bismark coverage output file, which contains tab-separated columns: <br />
<code>chromosome, start_position, end_position, methylation_percentage, count_methylated, count_unmethylated</code> and <br />
2. the promoter sequences in FastA format with headers like:<br />
<code>> id chromosomeName:start-end:strand</code><br />
e.g.<br />
<code>> Os01g01010.1 Chr1:2602-3102:+</code>.<br />
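<br />
To illustrate the expected header layout, the following minimal sketch parses such a promoter header into its parts. It is not part of EpiTALE; the function name <code>parse_promoter_header</code> and the regular expression are assumptions derived only from the example header above.<br />

```python
import re

# Assumed layout: "> id chromosomeName:start-end:strand"
HEADER = re.compile(r"^>\s*(\S+)\s+([^:\s]+):(\d+)-(\d+):([+-])\s*$")


def parse_promoter_header(header):
    """Split a promoter FastA header into id, chromosome, start, end, strand."""
    match = HEADER.match(header)
    if match is None:
        raise ValueError(f"unexpected promoter header: {header!r}")
    seq_id, chrom, start, end, strand = match.groups()
    return seq_id, chrom, int(start), int(end), strand
```

Checking all headers of a promoter file this way before running the conversion tools may help to spot malformed entries early.<br />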
<br />
If you experience problems using '''BismarkConvertToPromoter''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''BismarkConvertToPromoter'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar bis2prom<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bismark file (Methylation information in Bismark format, type = cov.gz,cov)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Promoter fasta file (Promoter fastA file, type = fa,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar bis2prom b=&lt;Bismark_file&gt; p=&lt;Promoter_fasta_file&gt;<br />
<br />
<br />
=== Chromatin pileup ===<br />
<br />
'''Chromatin pileup''' takes as input a BAM file of mapped reads from a DNase-seq or ATAC-seq experiment, <br />
computes a coverage pileup of 5' ends of mapped reads, <br />
and outputs a simple tab-separated file with columns: <br />
<code>chromosome, position,</code> and <code>pileup value</code> (number of reads with a 5' end at this position).<br />
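<br />
The idea of a 5' end pileup can be sketched as follows. This is an illustration under simplifying assumptions, not the code of '''Chromatin pileup''': reads are given as plain tuples instead of BAM records, coordinates are taken as 1-based and inclusive, and the 5' end of a reverse-strand read is taken to be its rightmost mapped position. The function name <code>five_prime_pileup</code> is hypothetical.<br />

```python
from collections import Counter


def five_prime_pileup(reads):
    """Count 5' ends per position.

    reads: iterable of (chromosome, start, end, strand) tuples,
    assumed 1-based and inclusive.
    """
    pileup = Counter()
    for chrom, start, end, strand in reads:
        # Forward reads start at their leftmost base, reverse reads at their rightmost.
        five_prime = start if strand == "+" else end
        pileup[(chrom, five_prime)] += 1
    return pileup
```

Each key of the resulting counter corresponds to one line of the tab-separated output (chromosome, position) and each value to the pileup value.<br />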
<br />
If you experience problems using '''Chromatin pileup''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Chromatin pileup'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar pileup<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>BAM file (Mapped reads from DNase-seq or ATAC-seq experiment, type = bam)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar pileup b=&lt;BAM_file&gt;<br />
<br />
<br />
=== NormalizePileupOutput ===<br />
<br />
'''NormalizePileupOutput''' normalizes the pileup output file, which contains the coverage with 5’ ends of ATAC-seq or DNase-seq reads at each position. It normalizes the coverage relative to the mean of a 10000 bp sliding window.<br />
<br />
The input of '''NormalizePileupOutput''' is a pileup output file from the '''Chromatin pileup''' tool.<br />
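<br />
A normalization relative to a sliding-window mean could look like the sketch below. It is only an illustration: it assumes a dense list of per-position counts for a single chromosome, a window roughly centered on the current position and truncated at the sequence ends, and a normalized value of 0 where the window mean is 0. How '''NormalizePileupOutput''' handles these details is not specified here, and the function name <code>normalize_pileup</code> is hypothetical.<br />

```python
def normalize_pileup(coverage, window=10000):
    """Divide each count by the mean count in a window around its position."""
    n = len(coverage)
    # Prefix sums give the sum over any window in constant time.
    prefix = [0]
    for value in coverage:
        prefix.append(prefix[-1] + value)
    half = window // 2
    normalized = []
    for i in range(n):
        lo = max(0, i - half)   # window is truncated at the sequence start
        hi = min(n, i + half)   # ... and at the sequence end (exclusive bound)
        mean = (prefix[hi] - prefix[lo]) / (hi - lo)
        normalized.append(coverage[i] / mean if mean > 0 else 0.0)
    return normalized
```

With this definition, a value of 1.0 means that a position is covered exactly as well as its surrounding window on average.<br />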
<br />
If you experience problems using '''NormalizePileupOutput''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''NormalizePileupOutput'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar normpileup<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Pileup output file (Pileup output file., type = tsv.gz,tsv,txt)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar normpileup p=&lt;Pileup_output_file&gt;<br />
<br />
<br />
=== PileupConvertToPromoter ===<br />
<br />
'''PileupConvertToPromoter''' converts the pileup output file to promoter coordinates.<br />
<br />
The input of '''PileupConvertToPromoter''' is <br />
1. a normalized pileup output file from the '''NormalizePileupOutput''' tool and <br />
2. the promoter sequences in FastA format with headers like:<br />
<code>> id chromosomeName:start-end:strand</code><br />
e.g.<br />
<code>> Os01g01010.1 Chr1:2602-3102:+</code>.<br />
<br />
If you experience problems using '''PileupConvertToPromoter''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''PileupConvertToPromoter'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar pile2prom<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Normalized pileup output file (Normalized pileup output file., type = tsv.gz,tsv)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Promoter fasta file (Promoter fastA file, type = fa,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar pile2prom n=&lt;Normalized_pileup_output_file&gt; p=&lt;Promoter_fasta_file&gt;<br />
<br />
<br />
=== NarrowPeakConvertToPromoter ===<br />
<br />
'''NarrowPeakConvertToPromoter''' converts a narrowPeak file containing peaks of chromatin accessibility to promoter coordinates.<br />
<br />
The input of '''NarrowPeakConvertToPromoter''' is <br />
1. a narrowPeak file and <br />
2. the promoter sequences in FastA format with headers like:<br />
<code>> id chromosomeName:start-end:strand</code><br />
e.g.<br />
<code>> Os01g01010.1 Chr1:2602-3102:+</code>.<br />
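<br />
Conceptually, converting a peak to promoter coordinates means clipping it to the promoter interval and shifting it to the promoter start. The sketch below shows this for a single peak and a single promoter on the same chromosome. It is an assumption-laden illustration, not the code of '''NarrowPeakConvertToPromoter''': both intervals are treated as half-open in one common coordinate system, strand orientation is ignored, and the function name <code>clip_peak_to_promoter</code> is hypothetical.<br />

```python
def clip_peak_to_promoter(peak_start, peak_end, prom_start, prom_end):
    """Return the peak interval relative to the promoter start, or None.

    All intervals are assumed half-open [start, end) on the same chromosome.
    """
    start = max(peak_start, prom_start)
    end = min(peak_end, prom_end)
    if start >= end:
        return None  # peak and promoter do not overlap
    return start - prom_start, end - prom_start
```

For the example promoter <code>Chr1:2602-3102</code>, a peak spanning positions 2500 to 2700 would be reported as covering the first 98 promoter positions under these assumptions.<br />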
<br />
If you experience problems using '''NarrowPeakConvertToPromoter''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''NarrowPeakConvertToPromoter'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar peak2Prom<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>NarrowPeak file (Peak-calling output in narrowPeak format., type = narrowPeak,narrowPeak.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Promoter fasta file (Promoter fastA file, type = fa,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar peak2Prom n=&lt;NarrowPeak_file&gt; p=&lt;Promoter_fasta_file&gt;<br />
<br />
<br />
=== EpiTALE prediction ===<br />
<br />
'''EpiTALE''' predicts TALE target boxes from the RVD sequence of a TALE, using a novel model learned from quantitative data. It optionally considers the methylation state of the target box during prediction, as DNA methylation affects the binding specificity of RVDs. <br />
In addition, '''EpiTALE''' can annotate its predictions with the chromatin accessibility of the predicted target sites, using the output of the '''NormalizePileupOutput''' tool and the results of peak calling on DNase-seq or ATAC-seq data.<br />
<br />
As input, '''EpiTALE''' requires<br />
<br />
1. a set of sequences that are scanned for putative TALE target boxes. These sequences could be promoters of genes but also complete genomic sequences (FastA format). <br />
<br />
2. For computing p-values, EpiTALE additionally needs a background set of sequences, which is by default generated as a sub-sample of the original input data.<br />
<br />
3. The prediction threshold may be defined either by means of a p-value or an approximate number of expected sites. The latter will also be converted to a p-value internally, and the defined number of expected sites is, in general, not met exactly.<br />
<br />
4. TALEs are specified by a FastA file containing their RVD sequences, where individual RVDs are separated by dashes (-). This is the same format also output by the ''TALE Analysis'' tool of [http://www.jstacs.de/index.php/AnnoTALE AnnoTALE].<br />
<br />
5. It can be specified if both strands or only one of the strands are scanned where, in the former case, a penalty may be assigned to predictions on the reverse strand. While this penalty may be reasonable when scanning promoters, it should usually be set to <code>0</code> in case of genome-wide predictions.<br />
<br />
6. As optional input, '''EpiTALE''' considers methylation during prediction if Bismark output is provided. Running the [https://www.bioinformatics.babraham.ac.uk/projects/bismark/ Bismark methylation extractor] with parameters <code>--bedGraph --CX -p</code> generates a coverage file, which contains the tab-separated columns: <br />
<code>chromosome, start_position, end_position, methylation_percentage, count_methylated, count_unmethylated</code> (file.cov.gz). <br />
You can alternatively use the tool '''Bed2Bismark''', which converts data in BedMethyl format to Bismark format. <br />
<br />
7.<br />
(i) The chromatin accessibility of the input sequences can optionally be provided in narrowPeak format. Such a file can be obtained by mapping ATAC-seq or DNase-seq data to the corresponding genome and then performing peak calling, e.g. with [https://github.com/mahmoudibrahim/JAMM JAMM]. In case of promoter sequences as input, you should run the tool '''NarrowPeakConvertToPromoter''' to convert the narrowPeak file to promoter positions. <br />
(ii) Additionally, you can calculate a coverage pileup of 5' ends of mapped reads with '''Chromatin pileup''' and normalize it with '''NormalizePileupOutput'''. In case of promoter sequences as input, you should run the tool '''PileupConvertToPromoter''' to convert to promoter coordinates. <br />
<br />
8.<br />
(i) In case of '''genomic search''', the parameter ''calculate coverage area'' should be <code>surround target site</code>, and you can set the number of positions before the target site with <code>coverage before value</code> (default: 300) and the number of positions after the target site with <code>coverage after value</code> (default: 200). <br />
(ii) In case of '''promoter search''', the parameter ''calculate coverage area'' may be set to <code>on complete sequence</code> or <code>surround target site</code>. The number of positions before and after the binding site in the peak profile can be set with <code>Peak before value</code> (default: 300) and <code>Peak after value</code> (default: 50).<br />
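<br />
The TALE input described in item 4 is a FastA file whose sequence lines hold RVDs separated by dashes. As a minimal sketch of that format, the following function reads such a file into a dictionary. It is not part of EpiTALE; the function name <code>read_tale_rvds</code> and the TALE names used in the usage note are hypothetical.<br />

```python
def read_tale_rvds(fasta_text):
    """Map each FastA header to its list of RVDs (sequence lines split at dashes)."""
    tales = {}
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            tales[name] = []
        elif name is not None:
            # A TALE may span several lines; empty pieces from stray dashes are skipped.
            tales[name].extend(rvd for rvd in line.split("-") if rvd)
    return tales
```

For an input with the two illustrative entries <code>>TalA</code> / <code>NI-HD-NG-NN</code> and <code>>TalB</code> / <code>HD-N*-NG</code>, the function returns one list of four RVDs and one list of three RVDs.<br />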
<br />
In case of '''genomic search''', you can filter predictions of TALE target boxes by the presence of differentially expressed regions in a defined vicinity around a predicted target box with the tool '''DerTALE''' of the [http://www.jstacs.de/index.php/AnnoTALE AnnoTALE suite].<br />
<br />
If you experience problems using '''EpiTALE''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
''EpiTALE prediction'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar epitale<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Sequences (The sequences (e.g., a genome) to scan for binding sites, type = fa,fas,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Background sample (The sequences for determining the prediction threshold. Either a sub-sample of the input sequences or a dedicated background data set., range={sub-sample, background sequences}, default = sub-sample)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;sub-sample&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;background sequences&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bs</font></td><br />
<td>Background sequences (The sequences (e.g., a genome) for determining the prediction threshold, type = fa,fas,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>Threshold specification (The way of defining the prediction threshold. Either by explicitly defining a significance level or by specifying the number of expected sites, range={significance level, number of sites}, default = significance level)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;significance level&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sl</font></td><br />
<td>Significance level (The significance level for determining the prediction threshold, valid range = [0.0, 0.01], default = 1.0E-4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;number of sites&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Number of sites (The number of expected binding sites for determining the prediction threshold, valid range = [1, 1000000], default = 10000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">TALEs</font></td><br />
<td>TALEs (The RVD sequences of the TALE, separated by dashes, in FastA format, type = fasta,fas,fa)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Strand</font></td><br />
<td>Strand (Predict target sites on both strands, or only on the forward or reverse strand, range={both strands, forward strand, reverse strand}, default = both strands)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;both strands&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>Reverse penalty (Penalty for predictions on the reverse strand, valid range = [0.0, 1.7976931348623157E308], default = 0.01)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;forward strand&quot;</b></td></tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;reverse strand&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bf</font></td><br />
<td>Bismark file (The bedGraph output of Bismark (file.cov.gz) containing &lt;chromosome&gt; &lt;start position&gt; &lt;end position&gt; &lt;methylation percentage&gt; &lt;count methylated&gt; &lt;count unmethylated&gt;, type = cov,cov.gz, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">nf</font></td><br />
<td>NarrowPeak file (The output of a peak caller (all.peaks.narrowPeak), type = narrowPeak,narrowPeak.gz, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">npo</font></td><br />
<td>Normalized pileup output (The normalized output of pileup with values larger than zero (file.txt) containing &lt;chromosome&gt; &lt;position&gt; &lt;coverage&gt;, type = tsv,tsv.gz, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>Calculate coverage area (Calculate the coverage area surrounding the target site or on the complete sequence, range={surround target site, on complete sequence}, default = surround target site, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;surround target site&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">cbv</font></td><br />
<td>Coverage before value (Number of positions before target site in coverage profile, valid range = [1, 500], default = 300, OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">cav</font></td><br />
<td>Coverage after value (Number of positions after target site in coverage profile, valid range = [1, 500], default = 200, OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;on complete sequence&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Peak before value (Number of positions before target site in peak profile, valid range = [1, 500], default = 300, OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pav</font></td><br />
<td>Peak after value (Number of positions after target site in peak profile, valid range = [1, 500], default = 50, OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar epitale s=&lt;Sequences&gt; TALEs=&lt;TALEs&gt;</div>Grauhttps://www.jstacs.de/index.php?title=EpiTALE&diff=1129EpiTALE2021-05-10T08:05:43Z<p>Grau: Created page with "== Tools == === Bed2Bismark === '''Bed2Bismark''' Converts methylation information in bedMethyl format to bismark format. The input of '''Bed2Bismark''' is a file in bedMet..."</p>
<hr />
<div>== Tools ==<br />
<br />
=== Bed2Bismark ===<br />
<br />
'''Bed2Bismark''' converts methylation information in bedMethyl format to Bismark format.<br />
<br />
The input of '''Bed2Bismark''' is a file in bedMethyl format.<br />
<br />
If you experience problems using '''Bed2Bismark''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Bed2Bismark'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar bed2bismark<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>BedMethyl file (Methylation information in bedMethyl format, type = bed.gz,bed)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar bed2bismark b=&lt;BedMethyl_file&gt;<br />
<br />
<br />
=== BismarkMerge2Files ===<br />
<br />
'''BismarkMerge2Files''' merges files generated by [https://www.bioinformatics.babraham.ac.uk/projects/bismark/ Bismark methylation extractor] with parameters <code>--bedGraph --CX -p</code>.<br />
The output is a coverage file with the tab-separated columns:<br />
<code>chromosome, start_position, end_position, methylation_percentage, count_methylated, count_unmethylated</code>.<br />
<br />
The input of '''BismarkMerge2Files''' consists of two Bismark coverage files.<br />
<br />
If you experience problems using '''BismarkMerge2Files''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
<br />
''BismarkMerge2Files'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar bismerger<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bismark file 1 (Methylation information in Bismark format, file 1, type = cov.gz,cov)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bf2</font></td><br />
<td>Bismark file 2 (Methylation information in Bismark format, file 2, type = cov.gz,cov)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar bismerger b=&lt;Bismark_file_1&gt; bf2=&lt;Bismark_file_2&gt;<br />
<br />
<br />
=== BismarkConvertToPromoter ===<br />
<br />
'''BismarkConvertToPromoter''' converts a Bismark coverage file to promoter positions for use in a promoter search.<br />
<br />
The input of '''BismarkConvertToPromoter''' is <br />
1. a bismark coverage output file, which contains tab-separated columns: <br />
<code>chromosome, start_position, end_position, methylation_percentage, count_methylated, count_unmethylated</code> and <br />
2. the promoter sequences in FastA format with headers like:<br />
<code>> id chromosomeName:start-end:strand</code><br />
e.g.<br />
<code>> Os01g01010.1 Chr1:2602-3102:+</code>.<br />
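<br />
The header format can be parsed as in the following Python sketch, which also maps a genomic position to an offset within the promoter. The 1-based inclusive coordinates and the handling of the reverse strand are assumptions made for illustration, and both function names are hypothetical; the tool itself may use different conventions.<br />

```python
import re

# Matches headers like "> Os01g01010.1 Chr1:2602-3102:+"
HEADER = re.compile(r"^>\s*(\S+)\s+(\S+):(\d+)-(\d+):([+-])\s*$")


def parse_promoter_header(header):
    """Split a promoter FastA header into id, chromosome, start, end, strand."""
    match = HEADER.match(header)
    if match is None:
        raise ValueError(f"unexpected header: {header!r}")
    gene_id, chrom, start, end, strand = match.groups()
    return gene_id, chrom, int(start), int(end), strand


def to_promoter_position(genomic_pos, start, end, strand):
    """Return the 0-based offset of genomic_pos within the promoter.

    Assumption: start and end are 1-based and inclusive, and offsets on
    the reverse strand are counted from the end coordinate.
    Returns None if the position lies outside the promoter.
    """
    if not start <= genomic_pos <= end:
        return None
    return genomic_pos - start if strand == "+" else end - genomic_pos


gene_id, chrom, start, end, strand = parse_promoter_header(
    "> Os01g01010.1 Chr1:2602-3102:+"
)
print(gene_id, chrom, to_promoter_position(2700, start, end, strand))
```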
<br />
If you experience problems using '''BismarkConvertToPromoter''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''BismarkConvertToPromoter'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar bis2prom<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bismark file (Methylation information in bismark format, type = cov.gz,cov)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Promoter fasta file (Promoter fastA file, type = fa,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar bis2prom b=&lt;Bismark_file&gt; p=&lt;Promoter_fasta_file&gt;<br />
<br />
<br />
=== Chromatin pileup ===<br />
<br />
'''Chromatin pileup''' takes as input a BAM file of mapped reads from a DNase-seq or ATAC-seq experiment, <br />
computes a coverage pileup of the 5' ends of the mapped reads, <br />
and outputs a simple tab-separated file with the columns: <br />
<code>chromosome, position,</code> and <code>pileup value</code> (number of reads with a 5' end at this position).<br />
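<br />
The idea of a 5' end pileup can be sketched in Python as follows. The sketch works on already parsed reads given as <code>(chromosome, start, end, strand)</code> tuples with 1-based inclusive coordinates, which is an assumption made to keep the example free of BAM parsing; the function name <code>five_prime_pileup</code> is hypothetical.<br />

```python
from collections import Counter


def five_prime_pileup(reads):
    """Count the 5' ends of mapped reads per position.

    reads: iterable of (chromosome, start, end, strand) tuples with
    1-based inclusive alignment coordinates (an assumption of this sketch).
    For forward-strand reads the 5' end is the alignment start,
    for reverse-strand reads it is the alignment end.
    """
    pileup = Counter()
    for chrom, start, end, strand in reads:
        five_prime = start if strand == "+" else end
        pileup[(chrom, five_prime)] += 1
    return pileup


# Toy reads, invented for illustration only.
reads = [
    ("Chr1", 100, 135, "+"),
    ("Chr1", 100, 150, "+"),
    ("Chr1", 80, 115, "-"),
    ("Chr2", 5, 40, "-"),
]
pileup = five_prime_pileup(reads)

# Tab-separated output: chromosome, position, pileup value
for (chrom, pos), value in sorted(pileup.items()):
    print(f"{chrom}\t{pos}\t{value}")
```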
<br />
If you experience problems using '''Chromatin pileup''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Chromatin pileup'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar pileup<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>BAM file (Mapped reads from DNase-seq or ATAC-seq experiment, type = bam)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar pileup b=&lt;BAM_file&gt;<br />
<br />
<br />
=== NormalizePileupOutput ===<br />
<br />
'''NormalizePileupOutput''' normalizes the pileup output file, which contains the coverage by 5' ends of ATAC-seq or DNase-seq reads at each position. The coverage is normalized relative to the mean of a 10000 bp sliding window.<br />
<br />
The input of '''NormalizePileupOutput''' is a pileup output file from the '''Chromatin pileup''' tool.<br />
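<br />
A sliding-window normalization of this kind might look like the following Python sketch. The placement of the window around each position, the truncation at sequence ends, and the treatment of windows with zero mean are assumptions, so the values need not match those of the tool; the function name <code>normalize_pileup</code> is hypothetical.<br />

```python
def normalize_pileup(values, window=10000):
    """Divide each pileup value by the mean of a surrounding window.

    values: pileup values for consecutive positions of one chromosome.
    Assumptions of this sketch: the window starts window // 2 positions
    before the current position, is truncated at the sequence ends, and
    positions with a window mean of zero are reported as 0.0.
    """
    n = len(values)
    half = window // 2

    # Prefix sums allow computing each window sum in constant time.
    prefix = [0]
    for value in values:
        prefix.append(prefix[-1] + value)

    normalized = []
    for i in range(n):
        lo = max(0, i - half)
        hi = min(n, lo + window)
        mean = (prefix[hi] - prefix[lo]) / (hi - lo)
        normalized.append(values[i] / mean if mean > 0 else 0.0)
    return normalized


# Toy example with a small window, invented for illustration only.
print(normalize_pileup([0, 2, 4, 2, 0, 0, 8, 0], window=4))
```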
<br />
If you experience problems using '''NormalizePileupOutput''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''NormalizePileupOutput'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar normpileup<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Pileup output file (Pileup output file., type = tsv.gz,tsv,txt)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar normpileup p=&lt;Pileup_output_file&gt;<br />
<br />
<br />
=== PileupConvertToPromoter ===<br />
<br />
'''PileupConvertToPromoter''' converts a normalized pileup output file to promoter positions for use in a promoter search.<br />
<br />
The input of '''PileupConvertToPromoter''' is <br />
1. a normalized pileup output file from the '''NormalizePileupOutput''' tool and <br />
2. the promoter sequences in FastA format with headers like:<br />
<code>> id chromosomeName:start-end:strand</code><br />
e.g.<br />
<code>> Os01g01010.1 Chr1:2602-3102:+</code>.<br />
<br />
If you experience problems using '''PileupConvertToPromoter''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''PileupConvertToPromoter'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar pile2prom<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Normalized pileup output file (Normalized pileup output file., type = tsv.gz,tsv)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Promoter fasta file (Promoter fastA file, type = fa,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar pile2prom n=&lt;Normalized_pileup_output_file&gt; p=&lt;Promoter_fasta_file&gt;<br />
<br />
<br />
=== NarrowPeakConvertToPromoter ===<br />
<br />
'''NarrowPeakConvertToPromoter''' converts a narrowPeak file to promoter positions for use in a promoter search.<br />
<br />
The input of '''NarrowPeakConvertToPromoter''' is <br />
1. a narrowPeak file and <br />
2. the promoter sequences in FastA format with headers like:<br />
<code>> id chromosomeName:start-end:strand</code><br />
e.g.<br />
<code>> Os01g01010.1 Chr1:2602-3102:+</code>.<br />
<br />
If you experience problems using '''NarrowPeakConvertToPromoter''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''NarrowPeakConvertToPromoter'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar peak2Prom<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>NarrowPeak file (Peak-calling output in narrowPeak format., type = narrowPeak,narrowPeak.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Promoter fasta file (Promoter fastA file, type = fa,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar peak2Prom n=&lt;NarrowPeak_file&gt; p=&lt;Promoter_fasta_file&gt;<br />
<br />
<br />
=== EpiTALE prediction ===<br />
<br />
'''EpiTALE''' predicts TALE target boxes from the RVD sequence of a TALE, using a novel model learned from quantitative data. Optionally, it considers the methylation state of the target box during prediction, as DNA methylation affects the binding specificity of RVDs. <br />
In addition, '''EpiTALE''' can optionally annotate its predictions with the chromatin accessibility of the predicted target sites, using the output of the '''NormalizePileupOutput''' tool and the results of peak calling on DNase-seq or ATAC-seq data.<br />
<br />
As input, '''EpiTALE''' requires<br />
<br />
1. a set of sequences that are scanned for putative TALE target boxes. These sequences could be promoters of genes but also complete genomic sequences (FastA format). <br />
<br />
2. For computing p-values, EpiTALE additionally needs a background set of sequences, which is by default generated as a sub-sample of the original input data.<br />
<br />
3. The prediction threshold may be defined either by means of a p-value or by an approximate number of expected sites. The latter is also converted to a p-value internally, and the defined number of expected sites is, in general, not met exactly.<br />
<br />
4. TALEs are specified by a FastA file containing their RVD sequences, where individual RVDs are separated by dashes (-). This is the same format also output by the ''TALE Analysis'' tool of [http://www.jstacs.de/index.php/AnnoTALE AnnoTALE].<br />
<br />
5. It can be specified if both strands or only one of the strands are scanned where, in the former case, a penalty may be assigned to predictions on the reverse strand. While this penalty may be reasonable when scanning promoters, it should usually be set to <code>0</code> in case of genome-wide predictions.<br />
<br />
6. Optionally, '''EpiTALE''' considers methylation during prediction if Bismark output is provided. Running the [https://www.bioinformatics.babraham.ac.uk/projects/bismark/ Bismark methylation extractor] with parameters <code>--bedGraph --CX -p</code> generates a coverage file with the tab-separated columns: <br />
<code>chromosome, start_position, end_position, methylation_percentage, count_methylated, count_unmethylated</code> (file.cov.gz). <br />
You can alternatively use the tool '''Bed2Bismark''', which converts data in BedMethyl format to bismark format. <br />
<br />
7.<br />
(i) The chromatin accessibility of the input sequences can optionally be provided in narrowPeak format. Such a file can be obtained by mapping ATAC-seq or DNase-seq data to the corresponding genome and then performing peak calling, e.g. with [https://github.com/mahmoudibrahim/JAMM JAMM]. If the input consists of promoter sequences, run the tool '''NarrowPeakConvertToPromoter''' to convert the narrowPeak file to promoter positions. <br />
(ii) Additionally, you can calculate a coverage pileup of 5' ends of mapped reads with '''Chromatin pileup''' and normalize it with '''NormalizePileupOutput'''. If the input consists of promoter sequences, run the tool '''PileupConvertToPromoter''' to convert the result to promoter positions. <br />
<br />
8.<br />
(i) In case of '''genomic search''', the parameter ''calculate coverage area'' should be <code>surround target site</code>. You can set the number of positions before the target site with <code>coverage before value</code> (default: 300) and the number of positions after the target site with <code>coverage after value</code> (default: 200). <br />
(ii) In case of '''promoter search''', the parameter ''calculate coverage area'' may be set to <code>on complete sequence</code> or <code>surround target site</code>. The number of positions before and after the binding site in the peak profile can be set with <code>Peak before value</code> (default: 300) and <code>Peak after value</code> (default: 50).<br />
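<br />
The TALE input format described in item 4 (FastA with RVDs separated by dashes) can be read as in the following Python sketch. The function name <code>read_rvd_fasta</code> and the TALE name <code>TalExample</code> with its RVD sequence are hypothetical and serve only as an illustration.<br />

```python
def read_rvd_fasta(lines):
    """Read TALE RVD sequences from FastA lines.

    Returns a dict mapping each TALE name (header without '>') to the
    list of its RVDs, which are separated by dashes in the file.
    """
    tales = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            tales[name] = []
        elif name is not None:
            tales[name].extend(rvd for rvd in line.split("-") if rvd)
    return tales


# Hypothetical TALE, invented for illustration only.
example = [">TalExample", "NI-HD-NG-NN-HD-N*-NS"]
for name, rvds in read_rvd_fasta(example).items():
    print(name, len(rvds), rvds)
```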
<br />
In case of '''genomic search''', you can filter predictions of TALE target boxes by the presence of differentially expressed regions in a defined vicinity around a predicted target box with the tool '''DerTALE''' of the [http://www.jstacs.de/index.php/AnnoTALE AnnoTALE suite].<br />
<br />
If you experience problems using '''EpiTALE''', please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
''EpiTALE prediction'' may be called with<br />
<br />
java -jar EpiTALEcli-0.1.jar epitale<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Sequences (The sequences (e.g., a genome) to scan for binding sites, type = fa,fas,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Background sample (The sequences for determining the prediction threshold. Either a sub-sample of the input sequences or a dedicated background data set., range={sub-sample, background sequences}, default = sub-sample)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;sub-sample&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;background sequences&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bs</font></td><br />
<td>Background sequences (The sequences (e.g., a genome) for determining the prediction threshold, type = fa,fas,fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>Threshold specification (The way of defining the prediction threshold. Either by explicitly defining a significance level or by specifying the number of expected sites, range={significance level, number of sites}, default = significance level)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;significance level&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sl</font></td><br />
<td>Significance level (The significance level for determining the prediction threshold, valid range = [0.0, 0.01], default = 1.0E-4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;number of sites&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Number of sites (The number of expected binding sites for determining the prediction threshold, valid range = [1, 1000000], default = 10000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">TALEs</font></td><br />
<td>TALEs (The RVD sequences of the TALE, separated by dashes, in FastA format, type = fasta,fas,fa)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Strand</font></td><br />
<td>Strand (Prediction target sites on both strands, or the forward or reverse strand, range={both strands, forward strand, reverse strand}, default = both strands)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;both strands&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>Reverse penalty (Penalty for predictions on the reverse strand, valid range = [0.0, 1.7976931348623157E308], default = 0.01)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;forward strand&quot;</b></td></tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;reverse strand&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bf</font></td><br />
<td>Bismark file (The bedGraph output of bismark (file.cov.gz) containing &lt;chromosome&gt; &lt;start position&gt; &lt;end position&gt; &lt;methylation percentage&gt; &lt;count methylated&gt; &lt;count unmethylated&gt;, type = cov,cov.gz, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">nf</font></td><br />
<td>NarrowPeak file (The output of a peak caller (all.peaks.narrowPeak), type = narrowPeak,narrowPeak.gz, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">npo</font></td><br />
<td>Normalized pileup output (The normalized output of pileup with values larger than zero (file.txt) containing &lt;chromosome&gt; &lt;position&gt; &lt;coverage&gt;, type = tsv,tsv.gz, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>Calculate coverage area (Calculate coverage area surround target site, or on complete sequence, range={surround target site, on complete sequence}, default = surround target site, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;surround target site&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">cbv</font></td><br />
<td>Coverage before value (Number of positions before target site in coverage profile, valid range = [1, 500], default = 300, OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">cav</font></td><br />
<td>Coverage after value (Number of positions after target site in coverage profile, valid range = [1, 500], default = 200, OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;on complete sequence&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Peak before value (Number of positions before target site in peak profile, valid range = [1, 500], default = 300, OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pav</font></td><br />
<td>Peak after value (Number of positions after target site in peak profile, valid range = [1, 500], default = 50, OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar EpiTALEcli-0.1.jar epitale s=&lt;Sequences&gt; TALEs=&lt;TALEs&gt;</div>Grauhttps://www.jstacs.de/index.php?title=AnnoTALE&diff=1128AnnoTALE2021-05-09T07:21:56Z<p>Grau: /* AnnoTALE with GUI */</p>
<hr />
<div>[[File:AnnoTALE.png|130px|left]]<br />
Transcription activator-like effectors (TALEs) are virulence factors of plant-pathogenic Xanthomonas spp. that function as gene activators inside plant host cells.<br />
<br />
AnnoTALE is a suite of applications for identifying and analysing TALEs in Xanthomonas genomes, for clustering TALEs into classes by their RVD sequences, for assigning novel TALEs to existing classes, for proposing TALE names using a unified nomenclature, and for predicting targets of individual TALEs and TALE classes.<br />
<br />
AnnoTALE is available as a JavaFX-based stand-alone application with graphical user interface for interactive analysis sessions. <br />
In addition, we provide a command line application that may be integrated into other pipelines. <br />
Both use identical code for the actual analysis, ensuring consistent results between both versions.<br />
<br />
<br />
<br />
If you use AnnoTALE, please cite:<br />
<br />
Jan Grau, Maik Reschke, Annett Erkes, Jana Streubel, Richard D. Morgan, Geoffrey G. Wilson, Ralf Koebnik and Jens Boch. [http://www.nature.com/articles/srep21077 AnnoTALE: bioinformatics tools for identification, annotation, and nomenclature of TALEs from ''Xanthomonas'' genomic sequences]. Scientific Reports 6:21077, DOI: 10.1038/srep21077, 2016.<br />
<br />
<br />
For evolution-related studies using the comparative features of AnnoTALE, please also cite:<br />
<br />
Annett Erkes, Maik Reschke, Jens Boch, and Jan Grau. [https://doi.org/10.1093/gbe/evx108 Evolution of transcription activator-like effectors in Xanthomonas oryzae]. Genome Biology and Evolution, 9(6):1599–1615, 2017.<br />
<br />
<br />
If you use PrediTALE for predicting TALE targets, please also cite:<br />
<br />
Annett Erkes, Stefanie Mücke, Maik Reschke, Jens Boch, and Jan Grau. [https://doi.org/10.1371/journal.pcbi.1007206 PrediTALE: A novel model learned from quantitative data allows for new perspectives on TALE targeting]. PLOS Computational Biology, 15(7):1–31, 2019.<br />
<br />
<br />
'''Important:''' If you would like to use the unified nomenclature of AnnoTALE in one of your publications including new TALEs or sequenced genomes, please contact us (grau@informatik.uni-halle.de) to organize the inclusion of your TALEs into the official class definition of AnnoTALE and to create stable TALE names that are unique to your TALEs.<br />
<br />
<br />
== AnnoTALE with GUI ==<br />
<br />
[[File:AnnoTALEscreenshot.jpg]]<br />
<br />
AnnoTALE is based on the implementation of JavaFX in Java >=8.<br />
<br />
We provide AnnoTALE as a runnable JAR file for those with a current version of Java 8 (at least update 45) on their machine.<br />
<br />
For the user's convenience, we also provide pre-packaged versions of AnnoTALE for Mac OS X and Windows, which include Java in the required version. Each is available in two versions with different memory requirements (2GB and 6GB). As long as the main memory (RAM) of your machine is sufficient, we recommend using the 6GB version of AnnoTALE.<br />
<br />
<br />
=== Download ===<br />
<br />
''AnnoTALE is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.''<br />
<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.4.1.jar Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-2GB.dmg 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-6GB.dmg 6GB version]<br />
* Windows installer of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-2GB.exe 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-6GB.exe 6GB version, 64bit Java]<br />
<br />
<br />
=== Source code ===<br />
<br />
The AnnoTALE source code is available from [https://github.com/Jstacs/Jstacs/tree/master/projects/xanthogenomes github].<br />
<br />
<br />
=== User Guide ===<br />
<br />
We provide an [http://www.jstacs.de/downloads/AnnoTALE-UserGuide-1.0.pdf AnnoTALE User Guide] in PDF format, including a detailed description of all AnnoTALE tools and installation instructions.<br />
<br />
== AnnoTALE command line application ==<br />
<br />
The AnnoTALE command line application is available as a [http://www.jstacs.de/downloads/AnnoTALEcli-1.4.1.jar runnable Jar]. For running the program and a quick help, type<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar<br />
<br />
For larger analyses, it might be necessary to increase the memory allocated by the JavaVM using the <code>-Xms</code> and <code>-Xmx</code> parameters, for instance<br />
java -Xms512M -Xmx6G -jar AnnoTALEcli-1.4.1.jar<br />
<br />
There is no separate User Guide for the AnnoTALE command line application, but the [http://www.jstacs.de/downloads/AnnoTALE-UserGuide-1.0.pdf User Guide for the GUI version] describes all AnnoTALE tools, their parameters, and their outputs, which are identical in the CLI version.<br />
<br />
You obtain a list of all AnnoTALE tools by calling<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar<br />
<br />
Output:<br />
<br />
Available tools:<br />
<br />
predict - TALE Prediction<br />
analyze - TALE Analysis<br />
build - TALE Class Builder<br />
loadAndView - Load and View TALE Classes<br />
assign - TALE Class Assignment<br />
rename - Rename TALEs in File<br />
targets - Predict and Intersect Targets<br />
presence - TALE Class Presence<br />
repdiff - TALE Repeat Differences<br />
preditale - PrediTALE<br />
dertale - DerTALE<br />
<br />
Syntax: java -jar AnnoTALEcli-1.4.1.jar <toolname> [<parameter=value> ...]<br />
<br />
Further info about the tools is given with<br />
java -jar AnnoTALEcli-1.4.1.jar <toolname> info<br />
<br />
Tool parameters are listed with<br />
java -jar AnnoTALEcli-1.4.1.jar <toolname><br />
<br />
You get a list of input parameters by calling AnnoTALEcli-1.4.1.jar with the corresponding tool name, e.g.,<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar predict<br />
<br />
Output:<br />
<br />
At least one parameter has not been set (correctly):<br />
<br />
Parameters of tool "TALE Prediction" (predict):<br />
g - Genome (The input Xanthomonas genome in FastA or Genbank format) = null<br />
s - Strain (The name of the strain, will be used for annotated TALEs, OPTIONAL) = null<br />
outdir - The output directory, defaults to the current working directory (.) = .<br />
<br />
You get a description of each tool by calling AnnoTALEcli-1.4.1.jar with the corresponding tool name and keyword "info", e.g.,<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar predict info<br />
<br />
Output:<br />
A detailed description of all tools is available in the AnnoTALE User Guide (http://www.jstacs.de/downloads/AnnoTALE-UserGuide-1.0.pdf).<br />
<br />
*TALE Prediction* predicts transcription activator-like effector (TALE) genes in an input sequence, typically a 'Xanthomonas' genome.<br />
<br />
'TALE Prediction' is based in HMMer nucleotide HMM models that describe N-terminus, repeat region, and C-terminus of TALEs.<br />
<br />
The input 'Genome' may be provided in FastA or Genbank format. <br />
Optionally, you may provide a strain name that will be used in the temporary TALE names and names of output files.<br />
<br />
Regardless of the input format, 'TALE Prediction' generates output in Genbank format containing the annotations of TALE genes. If the original input has already been a Genbank file, TALE annotations are added to the existing ones.<br />
In addition, 'TALE Prediction' generates annotations in GFF format, and also outputs the DNA and AS sequences of the predicted TALEs in FastA format.<br />
<br />
'TALE Prediction' tries hard to make the CDS annotation a proper gene model, starting from a start codon and ending with a Stop. If either start or stop codon are located within the originally predicted region that is homologous to TALE genes, this original hit region is still reported as mRNA.<br />
Putative pseudo genes, e.g., with premature stop codons, are marked accordingly.<br />
<br />
The TALE DNA sequences output of 'TALE Prediction' may serve as input of the 'TALE Analysis', 'TALE Class Builder', and 'TALE Class Assignment' tools.<br />
<br />
If you experience problems using 'TALE Prediction', please contact us.<br />
<br />
=== Standard pipeline ===<br />
<br />
Assuming that your current working directory contains the AnnoTALEcli Jar file, a genome of interest (of a hypothetical 'Xoo' strain PXO999 with accession CP1234567) in a FastA file "genome.fa", all rice promoters in a FastA file "Rice-promoters.fa", and a directory "out" designated to hold all output files, a typical AnnoTALE pipeline could look like<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar predict g=genome.fa outdir=out<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar analyze t=out/TALE_DNA_sequences.fasta outdir=out<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar loadAndView outdir=out<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar assign c=out/Class_builder_download.xml t=out/TALE_DNA_parts.fasta s="Xoo PXO999" a="CP1234567" outdir=out<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar rename r=out/TALE_names_\(Xoo_PXO999\).tsv i=out/Genbank__TALE_predictions.gb outdir=out<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar targets i=Rice-promoters.fa p="TALEs in class builder" c=out/Augmented_class_builder_\(Xoo_PXO999\).xml outdir=out<br />
<br />
Afterwards, you find the output files of all those tools in the directory "out". The output files and directories are named in analogy to the names in the AnnoTALE GUI version (see [http://www.jstacs.de/downloads/AnnoTALE-UserGuide-1.0.pdf User Guide for the GUI version]).<br />
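<br />
When the pipeline is driven from a script, the command lines can be assembled programmatically. The following Python sketch only builds and prints the first three commands of the pipeline above; the helper <code>annotale_command</code> is hypothetical, and the Jar name, tool names, and parameters are those used in the examples on this page.<br />

```python
import shlex

JAR = "AnnoTALEcli-1.4.1.jar"


def annotale_command(tool, max_heap=None, **params):
    """Build the argument list for one AnnoTALE command line call.

    max_heap is passed to the JavaVM as -Xmx (e.g. "6G"); all further
    keyword arguments are turned into parameter=value pairs.
    """
    cmd = ["java"]
    if max_heap:
        cmd.append(f"-Xmx{max_heap}")
    cmd += ["-jar", JAR, tool]
    cmd += [f"{key}={value}" for key, value in params.items()]
    return cmd


steps = [
    annotale_command("predict", g="genome.fa", outdir="out"),
    annotale_command("analyze", t="out/TALE_DNA_sequences.fasta", outdir="out"),
    annotale_command("loadAndView", outdir="out"),
]

# Print the commands; they are not executed by this sketch.
for step in steps:
    print(shlex.join(step))
```

Each argument list could then be passed to <code>subprocess.run(step, check=True)</code> so that the script stops at the first failing tool.<br />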
<br />
==Version history==<br />
<br />
===AnnoTALE===<br />
'''Version 1.4.1'''<br />
* first version to use the updated Class Builder including a large number of recently sequenced strains<br />
* minor changes to the output of the 'Load and View TALE Classes' tool, now including the accessions in the TALE sequence output<br />
* changes to the Class Builder format to account for the increased size of class hierarchy, which previously resulted in unnecessarily large files<br />
* 32bit/1GB Windows version no longer included<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.4.1.jar Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-2GB.dmg 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-6GB.dmg 6GB version]<br />
* Windows installer of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-2GB.exe 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-6GB.exe 6GB version, 64bit Java]<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.4.1.jar AnnoTALE 1.4.1 command line application]<br />
<br />
<br />
'''Version 1.4:'''<br />
* first version containing [[PrediTALE]] and DerTALE tools for target site prediction<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.4.jar Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.4-2GB.dmg 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.4-6GB.dmg 6GB version]<br />
* Windows installer of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.4-2GB.exe 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.4-6GB.exe 6GB version, 64bit Java]; in addition, we provide a [http://www.jstacs.de/downloads/AnnoTALE-1.4-1GB.exe 1GB version with 32bit Java] for earlier and 32bit versions of Windows. Please use this version only if absolutely necessary, as some tools may not work due to memory restrictions.<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.4.jar AnnoTALE 1.4 command line application]<br />
<br />
<br />
'''Version 1.3:'''<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.3.jar AnnoTALE 1.3 Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE 1.3 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.3-2GB.dmg AnnoTALE 1.3 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.3-6GB.dmg AnnoTALE 1.3 6GB version]<br />
* Windows installer of AnnoTALE 1.3 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.3-2GB.exe AnnoTALE 1.3 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.3-6GB.exe AnnoTALE 1.3 6GB version, 64bit Java]; [http://www.jstacs.de/downloads/AnnoTALE-1.3-1GB.exe AnnoTALE 1.3 1GB version with 32bit Java]<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.3.jar AnnoTALE 1.3 command line application]<br />
<br />
Changes:<br />
* modified format of Class Builder files allowing for faster download using the "Load and View TALE Classes" tool; old Class Builder files can still be loaded<br />
* "TALE Class Presence" now also outputs a phylogenetic tree of strains based on TALEome similarities<br />
<br />
<br />
'''Version 1.2:'''<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.2.jar AnnoTALE 1.2 Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE 1.2 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.2-2GB.dmg AnnoTALE 1.2 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.2-6GB.dmg AnnoTALE 1.2 6GB version]<br />
* Windows installer of AnnoTALE 1.2 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.2-2GB.exe AnnoTALE 1.2 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.2-6GB.exe AnnoTALE 1.2 6GB version, 64bit Java]; [http://www.jstacs.de/downloads/AnnoTALE-1.2-1GB.exe AnnoTALE 1.2 1GB version with 32bit Java]<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.2.jar AnnoTALE 1.2 command line application]<br />
<br />
Changes:<br />
* Results and loaded files may now be renamed in the GUI by clicking on the corresponding name in the "Data" panel<br />
* Minor bugfixes and improvements of the GUI (Protocol may be erased, columns in "Data" panel renamed for clarity, consistency of paths in the open/save dialogs under Linux)<br />
* Two new tools: "TALE Class Presence" and "TALE Repeat Differences"<br />
<br />
'''Version 1.1:'''<br />
<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.1.jar AnnoTALE 1.1 Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE 1.1 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.1-2GB.dmg AnnoTALE 1.1 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.1-6GB.dmg AnnoTALE 1.1 6GB version]<br />
* Windows installer of AnnoTALE 1.1 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.1-2GB.exe AnnoTALE 1.1 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.1-6GB.exe AnnoTALE 1.1 6GB version, 64bit Java]; [http://www.jstacs.de/downloads/AnnoTALE-1.1-1GB.exe AnnoTALE 1.1 1GB version with 32bit Java]<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.1.jar AnnoTALE 1.1 command line application]<br />
<br />
Changes:<br />
* Additional output for the "Load and View TALE Classes" tool<br />
* "TALE Class Builder" and "TALE Class Assignment" now also accept RVD sequences (separated by dashes) as input. However, this is not recommended and some features (e.g., highlighting of aberrant repeats) will not be available. Only complete TALE DNA sequences will be accepted for inclusion into the official Class Builder.<br />
* The internal help pages now link to the PDF User Guide<br />
<br />
'''Version 1.0:'''<br />
<br />
''Initial AnnoTALE release''<br />
<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.0.jar AnnoTALE 1.0 Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.0-2GB.dmg AnnoTALE 1.0 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.0-6GB.dmg AnnoTALE 1.0 6GB version]<br />
* Windows installer of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.0-2GB.exe AnnoTALE 1.0 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.0-6GB.exe AnnoTALE 1.0 6GB version, 64bit Java]; [http://www.jstacs.de/downloads/AnnoTALE-1.0-1GB.exe AnnoTALE 1.0 1GB version with 32bit Java]<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.0.jar AnnoTALE 1.0 command line application]<br />
<br />
=== Class Builders ===<br />
<br />
* [http://www.jstacs.de/downloads/class_definitions_09_05_2021.xml.gz Version 09/05/2021]: used for "Download current definition" in "Load and View TALE Classes" within AnnoTALE version 1.4.1 and later<br />
* [http://www.jstacs.de/downloads/class_definitions_10_10_2020.xml.gz Version 10/10/2020]: compatible with AnnoTALE version 1.4.1 and later<br />
* [http://www.jstacs.de/downloads/class_definitions_20_06_2019.xml.gz Version 20/06/2019]: compatible with AnnoTALE version 1.4.1 and later<br />
* [http://www.jstacs.de/downloads/class_definitions_29_09_2018.xml.gz Version 29/09/2018]: used for "Download current definition" in "Load and View TALE Classes" within AnnoTALE version 1.3 and later<br />
* [http://www.jstacs.de/downloads/class_definitions_09_03_2017.xml Version 09/03/2017]: used for "Download current definition" in "Load and View TALE Classes" within AnnoTALE version 1.2 and earlier<br />
* [http://www.jstacs.de/downloads/class_definitions_11_03_2016.xml Version 03/11/2016]<br />
* [http://www.jstacs.de/downloads/class_definitions_29_01_2016.xml Version 01/29/2016]<br />
* [http://www.jstacs.de/downloads/class_definitions_19_10.xml Version 10/19/2015]: used in the AnnoTALE publication (Grau ''et al.'', Sci Rep, 2016)</div>Grauhttps://www.jstacs.de/index.php?title=AnnoTALE&diff=1127AnnoTALE2021-05-09T07:20:57Z<p>Grau: </p>
<hr />
<div>[[File:AnnoTALE.png|130px|left]]<br />
Transcription activator-like effectors (TALEs) are virulence factors of plant-pathogenic Xanthomonas spp. that function as gene activators inside plant host cells.<br />
<br />
AnnoTALE is a suite of applications for identifying and analysing TALEs in Xanthomonas genomes, for clustering TALEs into classes by their RVD sequences, for assigning novel TALEs to existing classes, for proposing TALE names using a unified nomenclature, and for predicting targets of individual TALEs and TALE classes.<br />
<br />
AnnoTALE is available as a JavaFX-based stand-alone application with graphical user interface for interactive analysis sessions. <br />
In addition, we provide a command line application that may be integrated into other pipelines. <br />
Both use identical code for the actual analysis, ensuring consistent results between both versions.<br />
<br />
<br />
<br />
If you use AnnoTALE, please cite:<br />
<br />
Jan Grau, Maik Reschke, Annett Erkes, Jana Streubel, Richard D. Morgan, Geoffrey G. Wilson, Ralf Koebnik and Jens Boch. [http://www.nature.com/articles/srep21077 AnnoTALE: bioinformatics tools for identification, annotation, and nomenclature of TALEs from ''Xanthomonas'' genomic sequences]. Scientific Reports 6:21077, DOI: 10.1038/srep21077, 2016.<br />
<br />
<br />
For evolution-related studies using the comparative features of AnnoTALE, please also cite:<br />
<br />
Annett Erkes, Maik Reschke, Jens Boch, and Jan Grau. [https://doi.org/10.1093/gbe/evx108 Evolution of transcription activator-like effectors in Xanthomonas oryzae]. Genome Biology and Evolution, 9(6):1599–1615, 2017.<br />
<br />
<br />
If you use PrediTALE for predicting TALE targets, please also cite:<br />
<br />
Annett Erkes, Stefanie Mücke, Maik Reschke, Jens Boch, and Jan Grau. [https://doi.org/10.1371/journal.pcbi.1007206 PrediTALE: A novel model learned from quantitative data allows for new perspectives on TALE targeting]. PLOS Computational Biology, 15(7):1–31, 2019.<br />
<br />
<br />
'''Important:''' If you would like to use the unified nomenclature of AnnoTALE in one of your publications including new TALEs or sequenced genomes, please contact us (grau@informatik.uni-halle.de) to organize the inclusion of your TALEs into the official class definition of AnnoTALE and to create stable TALE names that are unique to your TALEs.<br />
<br />
<br />
== AnnoTALE with GUI ==<br />
<br />
[[File:AnnoTALEscreenshot.jpg]]<br />
<br />
AnnoTALE is based on the implementation of JavaFX that ships with Java 8.<br />
<br />
We provide AnnoTALE as a runnable JAR file for those with a current version of Java 8 (at least update 45) on their machine.<br />
<br />
For your convenience, we also provide pre-packaged versions of AnnoTALE for Mac OS X and Windows, which include Java in the required version. Each of these is available in two variants with different memory requirements (2GB and 6GB). As long as the main memory (RAM) of your machine is sufficient, we recommend using the 6GB version of AnnoTALE.<br />
<br />
<br />
=== Download ===<br />
<br />
''AnnoTALE is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.''<br />
<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.4.1.jar Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-2GB.dmg 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-6GB.dmg 6GB version]<br />
* Windows installer of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-2GB.exe 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-6GB.exe 6GB version, 64bit Java]<br />
<br />
<br />
=== Source code ===<br />
<br />
The AnnoTALE source code is available from [https://github.com/Jstacs/Jstacs/tree/master/projects/xanthogenomes GitHub].<br />
<br />
<br />
=== User Guide ===<br />
<br />
We provide an [http://www.jstacs.de/downloads/AnnoTALE-UserGuide-1.0.pdf AnnoTALE User Guide] in PDF format, including a detailed description of all AnnoTALE tools and installation instructions.<br />
<br />
<br />
== AnnoTALE command line application ==<br />
<br />
The AnnoTALE command line application is available as a [http://www.jstacs.de/downloads/AnnoTALEcli-1.4.1.jar runnable Jar]. For running the program and a quick help, type<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar<br />
<br />
For larger analyses, it might be necessary to increase the memory allocated by the Java VM using the <code>-Xms</code> and <code>-Xmx</code> parameters, for instance<br />
java -Xms512M -Xmx6G -jar AnnoTALEcli-1.4.1.jar<br />
<br />
There is no separate User Guide for the AnnoTALE command line application. However, the [http://www.jstacs.de/downloads/AnnoTALE-UserGuide-1.0.pdf User Guide for the GUI version] describes all AnnoTALE tools, their parameters, and their outputs, which are identical in the CLI version.<br />
<br />
You obtain a list of all AnnoTALE tools by calling<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar<br />
<br />
Output:<br />
<br />
Available tools:<br />
<br />
predict - TALE Prediction<br />
analyze - TALE Analysis<br />
build - TALE Class Builder<br />
loadAndView - Load and View TALE Classes<br />
assign - TALE Class Assignment<br />
rename - Rename TALEs in File<br />
targets - Predict and Intersect Targets<br />
presence - TALE Class Presence<br />
repdiff - TALE Repeat Differences<br />
preditale - PrediTALE<br />
dertale - DerTALE<br />
<br />
Syntax: java -jar AnnoTALEcli-1.4.1.jar <toolname> [<parameter=value> ...]<br />
<br />
Further info about the tools is given with<br />
java -jar AnnoTALEcli-1.4.1.jar <toolname> info<br />
<br />
Tool parameters are listed with<br />
java -jar AnnoTALEcli-1.4.1.jar <toolname><br />
<br />
You get a list of input parameters by calling AnnoTALEcli-1.4.1.jar with the corresponding tool name, e.g.,<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar predict<br />
<br />
Output:<br />
<br />
At least one parameter has not been set (correctly):<br />
<br />
Parameters of tool "TALE Prediction" (predict):<br />
g - Genome (The input Xanthomonas genome in FastA or Genbank format) = null<br />
s - Strain (The name of the strain, will be used for annotated TALEs, OPTIONAL) = null<br />
outdir - The output directory, defaults to the current working directory (.) = .<br />
<br />
You get a description of each tool by calling AnnoTALEcli-1.4.1.jar with the corresponding tool name and keyword "info", e.g.,<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar predict info<br />
<br />
Output:<br />
A detailed description of all tools is available in the AnnoTALE User Guide (http://www.jstacs.de/downloads/AnnoTALE-UserGuide-1.0.pdf).<br />
<br />
*TALE Prediction* predicts transcription activator-like effector (TALE) genes in an input sequence, typically a 'Xanthomonas' genome.<br />
<br />
'TALE Prediction' is based in HMMer nucleotide HMM models that describe N-terminus, repeat region, and C-terminus of TALEs.<br />
<br />
The input 'Genome' may be provided in FastA or Genbank format. <br />
Optionally, you may provide a strain name that will be used in the temporary TALE names and names of output files.<br />
<br />
Regardless of the input format, 'TALE Prediction' generates output in Genbank format containing the annotations of TALE genes. If the original input has already been a Genbank file, TALE annotations are added to the existing ones.<br />
In addition, 'TALE Prediction' generates annotations in GFF format, and also outputs the DNA and AS sequences of the predicted TALEs in FastA format.<br />
<br />
'TALE Prediction' tries hard to make the CDS annotation a proper gene model, starting from a start codon and ending with a Stop. If either start or stop codon are located within the originally predicted region that is homologous to TALE genes, this original hit region is still reported as mRNA.<br />
Putative pseudo genes, e.g., with premature stop codons, are marked accordingly.<br />
<br />
The TALE DNA sequences output of 'TALE Prediction' may serve as input of the 'TALE Analysis', 'TALE Class Builder', and 'TALE Class Assignment' tools.<br />
<br />
If you experience problems using 'TALE Prediction', please contact us.<br />
<br />
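The per-tool help calls shown above can be collected in one pass. The following is a minimal sketch, assuming a POSIX shell; the function name <code>dump_tool_info</code>, the <code>TOOLS</code> and <code>JAR</code> variables, and the default directory <code>tool-info</code> are hypothetical names chosen for this example and are not part of AnnoTALE. The tool names are those listed under "Available tools".<br />
<br />
```shell
# Sketch only: TOOLS, JAR and dump_tool_info() are hypothetical names for this
# example. The tool names are copied from the "Available tools" listing.
TOOLS="predict analyze build loadAndView assign rename targets presence repdiff preditale dertale"

# Write the "info" text of every tool to <dir>/<tool>.txt
# (dir defaults to the hypothetical directory "tool-info").
dump_tool_info() {
    dir="${1:-tool-info}"
    mkdir -p "$dir" || return 1
    # $TOOLS is left unquoted on purpose so the shell splits it into words.
    for tool in $TOOLS; do
        java -jar "${JAR:-AnnoTALEcli-1.4.1.jar}" "$tool" info > "$dir/$tool.txt"
    done
}
```
<br />
Calling <code>dump_tool_info docs</code> would then leave one text file per tool in the directory "docs". The exit status of the individual calls is deliberately not checked, since the sketch makes no assumption about the exit codes of the command line application.<br />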
=== Standard pipeline ===<br />
<br />
Assuming that your current working directory contains the AnnoTALEcli Jar file, a genome of interest (of a hypothetical 'Xoo' strain PXO999 with accession CP1234567) in a FastA file "genome.fa", all rice promoters in a FastA file "Rice-promoters.fa", and a directory "out" designated to hold all output files, a typical AnnoTALE pipeline could look like<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar predict g=genome.fa outdir=out<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar analyze t=out/TALE_DNA_sequences.fasta outdir=out<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar loadAndView outdir=out<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar assign c=out/Class_builder_download.xml t=out/TALE_DNA_parts.fasta s="Xoo PXO999" a="CP1234567" outdir=out<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar rename r=out/TALE_names_\(Xoo_PXO999\).tsv i=out/Genbank__TALE_predictions.gb outdir=out<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar targets i=Rice-promoters.fa p="TALEs in class builder" c=out/Augmented_class_builder_\(Xoo_PXO999\).xml outdir=out<br />
<br />
Afterwards, you will find the output files of all these tools in the directory "out". The output files and directories are named in analogy to the names in the AnnoTALE GUI version (see [http://www.jstacs.de/downloads/AnnoTALE-UserGuide-1.0.pdf User Guide for the GUI version]).<br />
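<br />
The six calls of the standard pipeline can also be wrapped in a small script. This is a minimal sketch under the same assumptions as above (file names "genome.fa" and "Rice-promoters.fa", strain "Xoo PXO999", accession "CP1234567", output directory "out"); the helper <code>run</code>, the function <code>annotale_pipeline</code>, and the variables <code>JAR</code>, <code>OUT</code> and <code>DRY_RUN</code> are hypothetical names introduced for this example and are not part of AnnoTALE.<br />
<br />
```shell
#!/bin/sh
# Sketch only: JAR, OUT, DRY_RUN, run() and annotale_pipeline() are
# hypothetical names chosen for this example, not part of AnnoTALE.
JAR="${JAR:-AnnoTALEcli-1.4.1.jar}"
OUT="${OUT:-out}"

# Run one AnnoTALE tool; with DRY_RUN=1, only print the command line.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "java -jar $JAR $*"
    else
        java -jar "$JAR" "$@"
    fi
}

# The six steps of the standard pipeline; '&&' stops at the first failure.
annotale_pipeline() {
    run predict g=genome.fa outdir="$OUT" &&
    run analyze t="$OUT/TALE_DNA_sequences.fasta" outdir="$OUT" &&
    run loadAndView outdir="$OUT" &&
    run assign c="$OUT/Class_builder_download.xml" t="$OUT/TALE_DNA_parts.fasta" \
        s="Xoo PXO999" a="CP1234567" outdir="$OUT" &&
    run rename r="$OUT/TALE_names_(Xoo_PXO999).tsv" \
        i="$OUT/Genbank__TALE_predictions.gb" outdir="$OUT" &&
    run targets i=Rice-promoters.fa p="TALEs in class builder" \
        c="$OUT/Augmented_class_builder_(Xoo_PXO999).xml" outdir="$OUT"
}
```
<br />
After loading these definitions, <code>DRY_RUN=1 annotale_pipeline</code> only prints the six command lines (without shell quoting), which is useful for checking paths before a long run; <code>annotale_pipeline</code> on its own executes them. Quoting the file names avoids the backslash escapes for the parentheses used in the individual commands.<br />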
<br />
==Version history==<br />
<br />
===AnnoTALE===<br />
'''Version 1.4.1'''<br />
* first version to use the updated Class Builder, which includes a large number of recently sequenced strains<br />
* minor changes to the output of the 'Load and View TALE Classes' tool, now including the accessions in the TALE sequence output<br />
* changes to the Class Builder format to account for the increased size of class hierarchy, which previously resulted in unnecessarily large files<br />
* 32bit/1GB Windows version no longer included<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.4.1.jar Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-2GB.dmg 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-6GB.dmg 6GB version]<br />
* Windows installer of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-2GB.exe 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-6GB.exe 6GB version, 64bit Java]<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.4.1.jar AnnoTALE 1.4.1 command line application]<br />
<br />
<br />
'''Version 1.4:'''<br />
* first version containing [[PrediTALE]] and DerTALE tools for target site prediction<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.4.jar Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.4-2GB.dmg 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.4-6GB.dmg 6GB version]<br />
* Windows installer of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.4-2GB.exe 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.4-6GB.exe 6GB version, 64bit Java]; in addition, we provide a [http://www.jstacs.de/downloads/AnnoTALE-1.4-1GB.exe 1GB version with 32bit Java] for earlier and 32bit versions of Windows. Please use this version only if absolutely necessary, as some tools may not work due to memory restrictions.<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.4.jar AnnoTALE 1.4 command line application]<br />
<br />
<br />
'''Version 1.3:'''<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.3.jar AnnoTALE 1.3 Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE 1.3 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.3-2GB.dmg AnnoTALE 1.3 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.3-6GB.dmg AnnoTALE 1.3 6GB version]<br />
* Windows installer of AnnoTALE 1.3 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.3-2GB.exe AnnoTALE 1.3 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.3-6GB.exe AnnoTALE 1.3 6GB version, 64bit Java]; [http://www.jstacs.de/downloads/AnnoTALE-1.3-1GB.exe AnnoTALE 1.3 1GB version with 32bit Java]<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.3.jar AnnoTALE 1.3 command line application]<br />
<br />
Changes:<br />
* modified format of Class Builder files allowing for faster download using the "Load and View TALE Classes" tool; old Class Builder files can still be loaded<br />
* "TALE Class Presence" now also outputs a phylogenetic tree of strains based on TALEome similarities<br />
<br />
<br />
'''Version 1.2:'''<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.2.jar AnnoTALE 1.2 Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE 1.2 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.2-2GB.dmg AnnoTALE 1.2 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.2-6GB.dmg AnnoTALE 1.2 6GB version]<br />
* Windows installer of AnnoTALE 1.2 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.2-2GB.exe AnnoTALE 1.2 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.2-6GB.exe AnnoTALE 1.2 6GB version, 64bit Java]; [http://www.jstacs.de/downloads/AnnoTALE-1.2-1GB.exe AnnoTALE 1.2 1GB version with 32bit Java]<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.2.jar AnnoTALE 1.2 command line application]<br />
<br />
Changes:<br />
* Results and loaded files may now be renamed in the GUI by clicking on the corresponding name in the "Data" panel<br />
* Minor bugfixes and improvements of the GUI (Protocol may be erased, columns in "Data" panel renamed for clarity, consistency of paths in the open/save dialogs under Linux)<br />
* Two new tools: "TALE Class Presence" and "TALE Repeat Differences"<br />
<br />
'''Version 1.1:'''<br />
<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.1.jar AnnoTALE 1.1 Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE 1.1 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.1-2GB.dmg AnnoTALE 1.1 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.1-6GB.dmg AnnoTALE 1.1 6GB version]<br />
* Windows installer of AnnoTALE 1.1 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.1-2GB.exe AnnoTALE 1.1 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.1-6GB.exe AnnoTALE 1.1 6GB version, 64bit Java]; [http://www.jstacs.de/downloads/AnnoTALE-1.1-1GB.exe AnnoTALE 1.1 1GB version with 32bit Java]<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.1.jar AnnoTALE 1.1 command line application]<br />
<br />
Changes:<br />
* Additional output for the "Load and View TALE Classes" tool<br />
* "TALE Class Builder" and "TALE Class Assignment" now also accept RVD sequences (separated by dashes) as input. However, this is not recommended and some features (e.g., highlighting of aberrant repeats) will not be available. Only complete TALE DNA sequences will be accepted for inclusion into the official Class Builder.<br />
* The internal help pages now link to the PDF User Guide<br />
<br />
'''Version 1.0:'''<br />
<br />
''Initial AnnoTALE release''<br />
<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.0.jar AnnoTALE 1.0 Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.0-2GB.dmg AnnoTALE 1.0 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.0-6GB.dmg AnnoTALE 1.0 6GB version]<br />
* Windows installer of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.0-2GB.exe AnnoTALE 1.0 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.0-6GB.exe AnnoTALE 1.0 6GB version, 64bit Java]; [http://www.jstacs.de/downloads/AnnoTALE-1.0-1GB.exe AnnoTALE 1.0 1GB version with 32bit Java]<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.0.jar AnnoTALE 1.0 command line application]<br />
<br />
=== Class Builders ===<br />
<br />
* [http://www.jstacs.de/downloads/class_definitions_09_05_2021.xml.gz Version 09/05/2021]: used for "Download current definition" in "Load and View TALE Classes" within AnnoTALE version 1.4.1 and later<br />
* [http://www.jstacs.de/downloads/class_definitions_10_10_2020.xml.gz Version 10/10/2020]: compatible with AnnoTALE version 1.4.1 and later<br />
* [http://www.jstacs.de/downloads/class_definitions_20_06_2019.xml.gz Version 20/06/2019]: compatible with AnnoTALE version 1.4.1 and later<br />
* [http://www.jstacs.de/downloads/class_definitions_29_09_2018.xml.gz Version 29/09/2018]: used for "Download current definition" in "Load and View TALE Classes" within AnnoTALE version 1.3 and later<br />
* [http://www.jstacs.de/downloads/class_definitions_09_03_2017.xml Version 09/03/2017]: used for "Download current definition" in "Load and View TALE Classes" within AnnoTALE version 1.2 and earlier<br />
* [http://www.jstacs.de/downloads/class_definitions_11_03_2016.xml Version 03/11/2016]<br />
* [http://www.jstacs.de/downloads/class_definitions_29_01_2016.xml Version 01/29/2016]<br />
* [http://www.jstacs.de/downloads/class_definitions_19_10.xml Version 10/19/2015]: used in the AnnoTALE publication (Grau ''et al.'', Sci Rep, 2016)</div>Grauhttps://www.jstacs.de/index.php?title=AnnoTALE&diff=1126AnnoTALE2021-05-09T07:15:05Z<p>Grau: /* Class Builders */</p>
<hr />
<div>[[File:AnnoTALE.png|130px|left]]<br />
Transcription activator-like effectors (TALEs) are virulence factors of plant-pathogenic Xanthomonas spp. that function as gene activators inside plant host cells.<br />
<br />
AnnoTALE is a suite of applications for identifying and analysing TALEs in Xanthomonas genomes, for clustering TALEs into classes by their RVD sequences, for assigning novel TALEs to existing classes, for proposing TALE names using a unified nomenclature, and for predicting targets of individual TALEs and TALE classes.<br />
<br />
AnnoTALE is available as a JavaFX-based stand-alone application with graphical user interface for interactive analysis sessions. <br />
In addition, we provide a command line application that may be integrated into other pipelines. <br />
Both use identical code for the actual analysis, ensuring consistent results between both versions.<br />
<br />
<br />
<br />
If you use AnnoTALE, please cite:<br />
<br />
Jan Grau, Maik Reschke, Annett Erkes, Jana Streubel, Richard D. Morgan, Geoffrey G. Wilson, Ralf Koebnik and Jens Boch. [http://www.nature.com/articles/srep21077 AnnoTALE: bioinformatics tools for identification, annotation, and nomenclature of TALEs from ''Xanthomonas'' genomic sequences]. Scientific Reports 6:21077, DOI: 10.1038/srep21077, 2016.<br />
<br />
<br />
<br />
'''Important:''' If you would like to use the unified nomenclature of AnnoTALE in one of your publications including new TALEs or sequenced genomes, please contact us (grau@informatik.uni-halle.de) to organize the inclusion of your TALEs into the official class definition of AnnoTALE and to create stable TALE names that are unique to your TALEs.<br />
<br />
<br />
== AnnoTALE with GUI ==<br />
<br />
[[File:AnnoTALEscreenshot.jpg]]<br />
<br />
AnnoTALE is based on the implementation of JavaFX that ships with Java 8.<br />
<br />
We provide AnnoTALE as a runnable JAR file for those with a current version of Java 8 (at least update 45) on their machine.<br />
<br />
For your convenience, we also provide pre-packaged versions of AnnoTALE for Mac OS X and Windows, which include Java in the required version. Each of these is available in two variants with different memory requirements (2GB and 6GB). As long as the main memory (RAM) of your machine is sufficient, we recommend using the 6GB version of AnnoTALE.<br />
<br />
<br />
=== Download ===<br />
<br />
''AnnoTALE is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.''<br />
<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.4.1.jar Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-2GB.dmg 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-6GB.dmg 6GB version]<br />
* Windows installer of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-2GB.exe 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-6GB.exe 6GB version, 64bit Java]<br />
<br />
<br />
=== Source code ===<br />
<br />
The AnnoTALE source code is available from [https://github.com/Jstacs/Jstacs/tree/master/projects/xanthogenomes GitHub].<br />
<br />
<br />
=== User Guide ===<br />
<br />
We provide an [http://www.jstacs.de/downloads/AnnoTALE-UserGuide-1.0.pdf AnnoTALE User Guide] in PDF format, including a detailed description of all AnnoTALE tools and installation instructions.<br />
<br />
<br />
== AnnoTALE command line application ==<br />
<br />
The AnnoTALE command line application is available as a [http://www.jstacs.de/downloads/AnnoTALEcli-1.4.1.jar runnable Jar]. For running the program and a quick help, type<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar<br />
<br />
For larger analyses, it might be necessary to increase the memory allocated by the Java VM using the <code>-Xms</code> and <code>-Xmx</code> parameters, for instance<br />
java -Xms512M -Xmx6G -jar AnnoTALEcli-1.4.1.jar<br />
<br />
There is no separate User Guide for the AnnoTALE command line application. However, the [http://www.jstacs.de/downloads/AnnoTALE-UserGuide-1.0.pdf User Guide for the GUI version] describes all AnnoTALE tools, their parameters, and their outputs, which are identical in the CLI version.<br />
<br />
You obtain a list of all AnnoTALE tools by calling<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar<br />
<br />
Output:<br />
<br />
Available tools:<br />
<br />
predict - TALE Prediction<br />
analyze - TALE Analysis<br />
build - TALE Class Builder<br />
loadAndView - Load and View TALE Classes<br />
assign - TALE Class Assignment<br />
rename - Rename TALEs in File<br />
targets - Predict and Intersect Targets<br />
presence - TALE Class Presence<br />
repdiff - TALE Repeat Differences<br />
preditale - PrediTALE<br />
dertale - DerTALE<br />
<br />
Syntax: java -jar AnnoTALEcli-1.4.1.jar <toolname> [<parameter=value> ...]<br />
<br />
Further info about the tools is given with<br />
java -jar AnnoTALEcli-1.4.1.jar <toolname> info<br />
<br />
Tool parameters are listed with<br />
java -jar AnnoTALEcli-1.4.1.jar <toolname><br />
<br />
You get a list of input parameters by calling AnnoTALEcli-1.4.1.jar with the corresponding tool name, e.g.,<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar predict<br />
<br />
Output:<br />
<br />
At least one parameter has not been set (correctly):<br />
<br />
Parameters of tool "TALE Prediction" (predict):<br />
g - Genome (The input Xanthomonas genome in FastA or Genbank format) = null<br />
s - Strain (The name of the strain, will be used for annotated TALEs, OPTIONAL) = null<br />
outdir - The output directory, defaults to the current working directory (.) = .<br />
<br />
You get a description of each tool by calling AnnoTALEcli-1.4.1.jar with the corresponding tool name and keyword "info", e.g.,<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar predict info<br />
<br />
Output:<br />
A detailed description of all tools is available in the AnnoTALE User Guide (http://www.jstacs.de/downloads/AnnoTALE-UserGuide-1.0.pdf).<br />
<br />
*TALE Prediction* predicts transcription activator-like effector (TALE) genes in an input sequence, typically a 'Xanthomonas' genome.<br />
<br />
'TALE Prediction' is based in HMMer nucleotide HMM models that describe N-terminus, repeat region, and C-terminus of TALEs.<br />
<br />
The input 'Genome' may be provided in FastA or Genbank format. <br />
Optionally, you may provide a strain name that will be used in the temporary TALE names and names of output files.<br />
<br />
Regardless of the input format, 'TALE Prediction' generates output in Genbank format containing the annotations of TALE genes. If the original input has already been a Genbank file, TALE annotations are added to the existing ones.<br />
In addition, 'TALE Prediction' generates annotations in GFF format, and also outputs the DNA and AS sequences of the predicted TALEs in FastA format.<br />
<br />
'TALE Prediction' tries hard to make the CDS annotation a proper gene model, starting from a start codon and ending with a Stop. If either start or stop codon are located within the originally predicted region that is homologous to TALE genes, this original hit region is still reported as mRNA.<br />
Putative pseudo genes, e.g., with premature stop codons, are marked accordingly.<br />
<br />
The TALE DNA sequences output of 'TALE Prediction' may serve as input of the 'TALE Analysis', 'TALE Class Builder', and 'TALE Class Assignment' tools.<br />
<br />
If you experience problems using 'TALE Prediction', please contact us.<br />
<br />
=== Standard pipeline ===<br />
<br />
Assuming that your current working directory contains the AnnoTALEcli Jar file, a genome of interest (of a hypothetical 'Xoo' strain PXO999 with accession CP1234567) in a FastA file "genome.fa", all rice promoters in a FastA file "Rice-promoters.fa", and a directory "out" designated to hold all output files, a typical AnnoTALE pipeline could look like<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar predict g=genome.fa outdir=out<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar analyze t=out/TALE_DNA_sequences.fasta outdir=out<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar loadAndView outdir=out<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar assign c=out/Class_builder_download.xml t=out/TALE_DNA_parts.fasta s="Xoo PXO999" a="CP1234567" outdir=out<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar rename r=out/TALE_names_\(Xoo_PXO999\).tsv i=out/Genbank__TALE_predictions.gb outdir=out<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar targets i=Rice-promoters.fa p="TALEs in class builder" c=out/Augmented_class_builder_\(Xoo_PXO999\).xml outdir=out<br />
<br />
Afterwards, you will find the output files of all these tools in the directory "out". The output files and directories are named analogously to those in the AnnoTALE GUI version (see [http://www.jstacs.de/downloads/AnnoTALE-UserGuide-1.0.pdf User Guide for the GUI version]).<br />
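<br />
The pipeline steps above can also be wrapped in a small shell function that stops at the first failing step. The following is only a sketch: the helper names <code>annotale</code> and <code>run_pipeline</code>, the variables <code>ANNOTALE_JAR</code>, <code>ANNOTALE_OUT</code> and <code>DRY_RUN</code>, and the rule that spaces in the strain name become underscores in output file names are assumptions made for illustration and are not part of AnnoTALE itself.<br />

```shell
#!/bin/sh
# Sketch of the standard pipeline as one function; file names follow the
# commands listed above. Helper names and variables are hypothetical.

JAR="${ANNOTALE_JAR:-AnnoTALEcli-1.4.1.jar}"  # location of the CLI jar
OUT="${ANNOTALE_OUT:-out}"                    # existing output directory

# Run one AnnoTALE tool, or only print the command if DRY_RUN=1.
# The printed form drops shell quoting and is meant for inspection only.
annotale() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "java -jar $JAR $*"
  else
    java -jar "$JAR" "$@"
  fi
}

# usage: run_pipeline <genome.fa> <strain name> <accession> <promoters.fa>
run_pipeline() {
  genome="$1"; strain="$2"; accession="$3"; promoters="$4"
  # Assumption: output names contain the strain name with spaces replaced
  # by underscores, as in "TALE_names_(Xoo_PXO999).tsv" above.
  tag=$(printf '%s' "$strain" | tr ' ' '_')

  annotale predict g="$genome" outdir="$OUT" || return 1
  annotale analyze t="$OUT/TALE_DNA_sequences.fasta" outdir="$OUT" || return 1
  annotale loadAndView outdir="$OUT" || return 1
  annotale assign c="$OUT/Class_builder_download.xml" \
    t="$OUT/TALE_DNA_parts.fasta" s="$strain" a="$accession" \
    outdir="$OUT" || return 1
  annotale rename r="$OUT/TALE_names_($tag).tsv" \
    i="$OUT/Genbank__TALE_predictions.gb" outdir="$OUT" || return 1
  annotale targets i="$promoters" p="TALEs in class builder" \
    c="$OUT/Augmented_class_builder_($tag).xml" outdir="$OUT" || return 1
}
```

With <code>DRY_RUN=1</code> set, calling <code>run_pipeline genome.fa "Xoo PXO999" CP1234567 Rice-promoters.fa</code> only prints the six commands, which is a cheap way to check paths before starting a long run.<br />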
<br />
==Version history==<br />
<br />
===AnnoTALE===<br />
'''Version 1.4.1:'''<br />
* first version to use the updated Class Builder including a large number of recently sequenced strains<br />
* minor changes to the output of the 'Load and View TALE Classes' tool, now including the accessions in the TALE sequence output<br />
* changes to the Class Builder format to account for the increased size of class hierarchy, which previously resulted in unnecessarily large files<br />
* 32bit/1GB Windows version no longer included<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.4.1.jar Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-2GB.dmg 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-6GB.dmg 6GB version]<br />
* Windows installer of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-2GB.exe 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-6GB.exe 6GB version, 64bit Java]<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.4.1.jar AnnoTALE 1.4.1 command line application]<br />
<br />
<br />
'''Version 1.4:'''<br />
* first version containing [[PrediTALE]] and DerTALE tools for target site prediction<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.4.jar Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.4-2GB.dmg 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.4-6GB.dmg 6GB version]<br />
* Windows installer of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.4-2GB.exe 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.4-6GB.exe 6GB version, 64bit Java]; in addition, we provide a [http://www.jstacs.de/downloads/AnnoTALE-1.4-1GB.exe 1GB version with 32bit Java] for earlier and 32bit versions of Windows. Please use this version only if absolutely necessary, as some tools may not work due to memory restrictions.<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.4.jar AnnoTALE 1.4 command line application]<br />
<br />
<br />
'''Version 1.3:'''<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.3.jar AnnoTALE 1.3 Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE 1.3 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.3-2GB.dmg AnnoTALE 1.3 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.3-6GB.dmg AnnoTALE 1.3 6GB version]<br />
* Windows installer of AnnoTALE 1.3 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.3-2GB.exe AnnoTALE 1.3 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.3-6GB.exe AnnoTALE 1.3 6GB version, 64bit Java]; [http://www.jstacs.de/downloads/AnnoTALE-1.3-1GB.exe AnnoTALE 1.3 1GB version with 32bit Java]<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.3.jar AnnoTALE 1.3 command line application]<br />
<br />
Changes:<br />
* modified format of Class Builder files allowing for faster download using the "Load and View TALE Classes" tool; old Class Builder files can still be loaded<br />
* "TALE Class Presence" now also outputs a phylogenetic tree of strains based on TALEome similarities<br />
<br />
<br />
'''Version 1.2:'''<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.2.jar AnnoTALE 1.2 Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE 1.2 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.2-2GB.dmg AnnoTALE 1.2 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.2-6GB.dmg AnnoTALE 1.2 6GB version]<br />
* Windows installer of AnnoTALE 1.2 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.2-2GB.exe AnnoTALE 1.2 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.2-6GB.exe AnnoTALE 1.2 6GB version, 64bit Java]; [http://www.jstacs.de/downloads/AnnoTALE-1.2-1GB.exe AnnoTALE 1.2 1GB version with 32bit Java]<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.2.jar AnnoTALE 1.2 command line application]<br />
<br />
Changes:<br />
* Results and loaded files may now be renamed in the GUI by clicking on the corresponding name in the "Data" panel<br />
* Minor bugfixes and improvements of the GUI (Protocol may be erased, columns in "Data" panel renamed for clarity, consistency of paths in the open/save dialogs under Linux)<br />
* Two new tools: "TALE Class Presence" and "TALE Repeat Differences"<br />
<br />
'''Version 1.1:'''<br />
<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.1.jar AnnoTALE 1.1 Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE 1.1 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.1-2GB.dmg AnnoTALE 1.1 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.1-6GB.dmg AnnoTALE 1.1 6GB version]<br />
* Windows installer of AnnoTALE 1.1 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.1-2GB.exe AnnoTALE 1.1 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.1-6GB.exe AnnoTALE 1.1 6GB version, 64bit Java]; [http://www.jstacs.de/downloads/AnnoTALE-1.1-1GB.exe AnnoTALE 1.1 1GB version with 32bit Java]<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.1.jar AnnoTALE 1.1 command line application]<br />
<br />
Changes:<br />
* Additional output for the "Load and View TALE Classes" tool<br />
* "TALE Class Builder" and "TALE Class Assignment" now also accept RVD sequences (separated by dashes) as input. However, this is not recommended and some features (e.g., highlighting of aberrant repeats) will not be available. Only complete TALE DNA sequences will be accepted for inclusion into the official Class Builder.<br />
* The internal help pages now link to the PDF User Guide<br />
<br />
'''Version 1.0:'''<br />
<br />
''Initial AnnoTALE release''<br />
<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.0.jar AnnoTALE 1.0 Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.0-2GB.dmg AnnoTALE 1.0 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.0-6GB.dmg AnnoTALE 1.0 6GB version]<br />
* Windows installer of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.0-2GB.exe AnnoTALE 1.0 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.0-6GB.exe AnnoTALE 1.0 6GB version, 64bit Java]; [http://www.jstacs.de/downloads/AnnoTALE-1.0-1GB.exe AnnoTALE 1.0 1GB version with 32bit Java]<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.0.jar AnnoTALE 1.0 command line application]<br />
<br />
=== Class Builders ===<br />
<br />
* [http://www.jstacs.de/downloads/class_definitions_09_05_2021.xml.gz Version 09/05/2021]: used for "Download current definition" in "Load and View TALE Classes" within AnnoTALE version 1.4.1 and later<br />
* [http://www.jstacs.de/downloads/class_definitions_10_10_2020.xml.gz Version 10/10/2020]: compatible with AnnoTALE version 1.4.1 and later<br />
* [http://www.jstacs.de/downloads/class_definitions_20_06_2019.xml.gz Version 20/06/2019]: compatible with AnnoTALE version 1.4.1 and later<br />
* [http://www.jstacs.de/downloads/class_definitions_29_09_2018.xml.gz Version 29/09/2018]: used for "Download current definition" in "Load and View TALE Classes" within AnnoTALE version 1.3 and later<br />
* [http://www.jstacs.de/downloads/class_definitions_09_03_2017.xml Version 09/03/2017]: used for "Download current definition" in "Load and View TALE Classes" within AnnoTALE version 1.2 and earlier<br />
* [http://www.jstacs.de/downloads/class_definitions_11_03_2016.xml Version 11/03/2016]<br />
* [http://www.jstacs.de/downloads/class_definitions_29_01_2016.xml Version 29/01/2016]<br />
* [http://www.jstacs.de/downloads/class_definitions_19_10.xml Version 19/10/2015]: used in the AnnoTALE publication (Grau ''et al.'', Sci Rep, 2016)</div>Grauhttps://www.jstacs.de/index.php?title=AnnoTALE&diff=1125AnnoTALE2021-05-08T22:48:01Z<p>Grau: /* Class Builders */</p>
<hr />
<div>[[File:AnnoTALE.png|130px|left]]<br />
Transcription activator-like effectors (TALEs) are virulence factors of plant-pathogenic Xanthomonas spp. that function as gene activators inside plant host cells.<br />
<br />
AnnoTALE is a suite of applications for identifying and analysing TALEs in Xanthomonas genomes, for clustering TALEs into classes by their RVD sequences, for assigning novel TALEs to existing classes, for proposing TALE names using a unified nomenclature, and for predicting targets of individual TALEs and TALE classes.<br />
<br />
AnnoTALE is available as a JavaFX-based stand-alone application with graphical user interface for interactive analysis sessions. <br />
In addition, we provide a command line application that may be integrated into other pipelines. <br />
Both use identical code for the actual analysis, ensuring consistent results between both versions.<br />
<br />
<br />
<br />
If you use AnnoTALE, please cite:<br />
<br />
Jan Grau, Maik Reschke, Annett Erkes, Jana Streubel, Richard D. Morgan, Geoffrey G. Wilson, Ralf Koebnik and Jens Boch. [http://www.nature.com/articles/srep21077 AnnoTALE: bioinformatics tools for identification, annotation, and nomenclature of TALEs from ''Xanthomonas'' genomic sequences]. Scientific Reports 6:21077, DOI: 10.1038/srep21077, 2016.<br />
<br />
<br />
<br />
'''Important:''' If you would like to use the unified nomenclature of AnnoTALE in one of your publications including new TALEs or sequenced genomes, please contact us (grau@informatik.uni-halle.de) to organize the inclusion of your TALEs into the official class definition of AnnoTALE and to create stable TALE names that are unique to your TALEs.<br />
<br />
<br />
== AnnoTALE with GUI ==<br />
<br />
[[File:AnnoTALEscreenshot.jpg]]<br />
<br />
AnnoTALE is based on the implementation of JavaFX in Java 8.<br />
<br />
We provide AnnoTALE as a runnable JAR file for those with a current version of Java 8 (at least update 45) on their machine.<br />
<br />
For users' convenience, we also provide pre-packaged versions of AnnoTALE for Mac OS X and Windows, which include Java in the required version. Each of these is available in two variants with different memory requirements (2GB and 6GB). As long as the main memory (RAM) of your machine is sufficient, we recommend using the 6GB version of AnnoTALE.<br />
<br />
<br />
=== Download ===<br />
<br />
''AnnoTALE is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.''<br />
<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.4.1.jar Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-2GB.dmg 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-6GB.dmg 6GB version]<br />
* Windows installer of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-2GB.exe 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-6GB.exe 6GB version, 64bit Java]<br />
<br />
<br />
=== Source code ===<br />
<br />
The AnnoTALE source code is available from [https://github.com/Jstacs/Jstacs/tree/master/projects/xanthogenomes github].<br />
<br />
<br />
=== User Guide ===<br />
<br />
We provide an [http://www.jstacs.de/downloads/AnnoTALE-UserGuide-1.0.pdf AnnoTALE User Guide] in PDF format, including a detailed description of all AnnoTALE tools and installation instructions.<br />
<br />
<br />
== AnnoTALE command line application ==<br />
<br />
The AnnoTALE command line application is available as a [http://www.jstacs.de/downloads/AnnoTALEcli-1.4.1.jar runnable Jar]. For running the program and a quick help, type<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar<br />
<br />
For larger analyses, it might be necessary to increase the memory allocated by the Java VM using the <code>-Xms</code> and <code>-Xmx</code> parameters, for instance<br />
java -Xms512M -Xmx6G -jar AnnoTALEcli-1.4.1.jar<br />
<br />
There is no separate User Guide for the AnnoTALE command line application, but the [http://www.jstacs.de/downloads/AnnoTALE-UserGuide-1.0.pdf User Guide for the GUI version] describes all AnnoTALE tools, their parameters, and their outputs, which are identical in the CLI version.<br />
<br />
You obtain a list of all AnnoTALE tools by calling<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar<br />
<br />
Output:<br />
<br />
Available tools:<br />
<br />
predict - TALE Prediction<br />
analyze - TALE Analysis<br />
build - TALE Class Builder<br />
loadAndView - Load and View TALE Classes<br />
assign - TALE Class Assignment<br />
rename - Rename TALEs in File<br />
targets - Predict and Intersect Targets<br />
presence - TALE Class Presence<br />
repdiff - TALE Repeat Differences<br />
preditale - PrediTALE<br />
dertale - DerTALE<br />
<br />
Syntax: java -jar AnnoTALEcli-1.4.1.jar <toolname> [<parameter=value> ...]<br />
<br />
Further info about the tools is given with<br />
java -jar AnnoTALEcli-1.4.1.jar <toolname> info<br />
<br />
Tool parameters are listed with<br />
java -jar AnnoTALEcli-1.4.1.jar <toolname><br />
<br />
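The tool listing above lends itself to simple scripting, for example to save the "info" text of every tool to a file. The sketch below assumes that the listing keeps the "name - description" layout shown above; the function names <code>list_tools</code> and <code>save_tool_help</code> are hypothetical, and because it is not documented here whether the listing goes to standard output or standard error, both streams are captured.<br />

```shell
#!/bin/sh
# Hypothetical helper: extract tool names from the listing shown above.
# Reads the listing on standard input and prints one tool name per line;
# only lines of the form "<name> - <description>" are kept.
list_tools() {
  awk '$2 == "-" { print $1 }'
}

# usage: save_tool_help <directory>
# Writes the "info" text of each tool to <directory>/<tool>.txt.
save_tool_help() {
  mkdir -p "$1" || return 1
  java -jar AnnoTALEcli-1.4.1.jar 2>&1 | list_tools | while read -r tool; do
    # stdin is redirected so that java cannot consume the remaining names
    java -jar AnnoTALEcli-1.4.1.jar "$tool" info > "$1/$tool.txt" 2>&1 < /dev/null
  done
}
```

Calling <code>save_tool_help help</code> would then create one text file per tool in a directory "help".<br />
<br />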
You get a list of input parameters by calling AnnoTALEcli-1.4.1.jar with the corresponding tool name, e.g.,<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar predict<br />
<br />
Output:<br />
<br />
At least one parameter has not been set (correctly):<br />
<br />
Parameters of tool "TALE Prediction" (predict):<br />
g - Genome (The input Xanthomonas genome in FastA or Genbank format) = null<br />
s - Strain (The name of the strain, will be used for annotated TALEs, OPTIONAL) = null<br />
outdir - The output directory, defaults to the current working directory (.) = .<br />
<br />
You get a description of each tool by calling AnnoTALEcli-1.4.1.jar with the corresponding tool name and keyword "info", e.g.,<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar predict info<br />
<br />
Output:<br />
A detailed description of all tools is available in the AnnoTALE User Guide (http://www.jstacs.de/downloads/AnnoTALE-UserGuide-1.0.pdf).<br />
<br />
*TALE Prediction* predicts transcription activator-like effector (TALE) genes in an input sequence, typically a 'Xanthomonas' genome.<br />
<br />
'TALE Prediction' is based in HMMer nucleotide HMM models that describe N-terminus, repeat region, and C-terminus of TALEs.<br />
<br />
The input 'Genome' may be provided in FastA or Genbank format. <br />
Optionally, you may provide a strain name that will be used in the temporary TALE names and names of output files.<br />
<br />
Regardless of the input format, 'TALE Prediction' generates output in Genbank format containing the annotations of TALE genes. If the original input has already been a Genbank file, TALE annotations are added to the existing ones.<br />
In addition, 'TALE Prediction' generates annotations in GFF format, and also outputs the DNA and AS sequences of the predicted TALEs in FastA format.<br />
<br />
'TALE Prediction' tries hard to make the CDS annotation a proper gene model, starting from a start codon and ending with a Stop. If either start or stop codon are located within the originally predicted region that is homologous to TALE genes, this original hit region is still reported as mRNA.<br />
Putative pseudo genes, e.g., with premature stop codons, are marked accordingly.<br />
<br />
The TALE DNA sequences output of 'TALE Prediction' may serve as input of the 'TALE Analysis', 'TALE Class Builder', and 'TALE Class Assignment' tools.<br />
<br />
If you experience problems using 'TALE Prediction', please contact us.<br />
<br />
=== Standard pipeline ===<br />
<br />
Assuming that your current working directory contains the AnnoTALEcli Jar file, a genome of interest (of a hypothetical 'Xoo' strain PXO999 with accession CP1234567) in a FastA file "genome.fa", all rice promoters in a FastA file "Rice-promoters.fa", and a directory "out" designated to hold all output files, a typical AnnoTALE pipeline could look like:<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar predict g=genome.fa outdir=out<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar analyze t=out/TALE_DNA_sequences.fasta outdir=out<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar loadAndView outdir=out<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar assign c=out/Class_builder_download.xml t=out/TALE_DNA_parts.fasta s="Xoo PXO999" a="CP1234567" outdir=out<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar rename r=out/TALE_names_\(Xoo_PXO999\).tsv i=out/Genbank__TALE_predictions.gb outdir=out<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar targets i=Rice-promoters.fa p="TALEs in class builder" c=out/Augmented_class_builder_\(Xoo_PXO999\).xml outdir=out<br />
<br />
Afterwards, you will find the output files of all these tools in the directory "out". The output files and directories are named analogously to those in the AnnoTALE GUI version (see [http://www.jstacs.de/downloads/AnnoTALE-UserGuide-1.0.pdf User Guide for the GUI version]).<br />
<br />
==Version history==<br />
<br />
===AnnoTALE===<br />
'''Version 1.4.1:'''<br />
* first version to use the updated Class Builder including a large number of recently sequenced strains<br />
* minor changes to the output of the 'Load and View TALE Classes' tool, now including the accessions in the TALE sequence output<br />
* changes to the Class Builder format to account for the increased size of class hierarchy, which previously resulted in unnecessarily large files<br />
* 32bit/1GB Windows version no longer included<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.4.1.jar Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-2GB.dmg 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-6GB.dmg 6GB version]<br />
* Windows installer of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-2GB.exe 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-6GB.exe 6GB version, 64bit Java]<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.4.1.jar AnnoTALE 1.4.1 command line application]<br />
<br />
<br />
'''Version 1.4:'''<br />
* first version containing [[PrediTALE]] and DerTALE tools for target site prediction<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.4.jar Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.4-2GB.dmg 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.4-6GB.dmg 6GB version]<br />
* Windows installer of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.4-2GB.exe 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.4-6GB.exe 6GB version, 64bit Java]; in addition, we provide a [http://www.jstacs.de/downloads/AnnoTALE-1.4-1GB.exe 1GB version with 32bit Java] for earlier and 32bit versions of Windows. Please use this version only if absolutely necessary, as some tools may not work due to memory restrictions.<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.4.jar AnnoTALE 1.4 command line application]<br />
<br />
<br />
'''Version 1.3:'''<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.3.jar AnnoTALE 1.3 Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE 1.3 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.3-2GB.dmg AnnoTALE 1.3 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.3-6GB.dmg AnnoTALE 1.3 6GB version]<br />
* Windows installer of AnnoTALE 1.3 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.3-2GB.exe AnnoTALE 1.3 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.3-6GB.exe AnnoTALE 1.3 6GB version, 64bit Java]; [http://www.jstacs.de/downloads/AnnoTALE-1.3-1GB.exe AnnoTALE 1.3 1GB version with 32bit Java]<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.3.jar AnnoTALE 1.3 command line application]<br />
<br />
Changes:<br />
* modified format of Class Builder files allowing for faster download using the "Load and View TALE Classes" tool; old Class Builder files can still be loaded<br />
* "TALE Class Presence" now also outputs a phylogenetic tree of strains based on TALEome similarities<br />
<br />
<br />
'''Version 1.2:'''<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.2.jar AnnoTALE 1.2 Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE 1.2 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.2-2GB.dmg AnnoTALE 1.2 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.2-6GB.dmg AnnoTALE 1.2 6GB version]<br />
* Windows installer of AnnoTALE 1.2 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.2-2GB.exe AnnoTALE 1.2 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.2-6GB.exe AnnoTALE 1.2 6GB version, 64bit Java]; [http://www.jstacs.de/downloads/AnnoTALE-1.2-1GB.exe AnnoTALE 1.2 1GB version with 32bit Java]<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.2.jar AnnoTALE 1.2 command line application]<br />
<br />
Changes:<br />
* Results and loaded files may now be renamed in the GUI by clicking on the corresponding name in the "Data" panel<br />
* Minor bugfixes and improvements of the GUI (Protocol may be erased, columns in "Data" panel renamed for clarity, consistency of paths in the open/save dialogs under Linux)<br />
* Two new tools: "TALE Class Presence" and "TALE Repeat Differences"<br />
<br />
'''Version 1.1:'''<br />
<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.1.jar AnnoTALE 1.1 Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE 1.1 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.1-2GB.dmg AnnoTALE 1.1 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.1-6GB.dmg AnnoTALE 1.1 6GB version]<br />
* Windows installer of AnnoTALE 1.1 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.1-2GB.exe AnnoTALE 1.1 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.1-6GB.exe AnnoTALE 1.1 6GB version, 64bit Java]; [http://www.jstacs.de/downloads/AnnoTALE-1.1-1GB.exe AnnoTALE 1.1 1GB version with 32bit Java]<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.1.jar AnnoTALE 1.1 command line application]<br />
<br />
Changes:<br />
* Additional output for the "Load and View TALE Classes" tool<br />
* "TALE Class Builder" and "TALE Class Assignment" now also accept RVD sequences (separated by dashes) as input. However, this is not recommended and some features (e.g., highlighting of aberrant repeats) will not be available. Only complete TALE DNA sequences will be accepted for inclusion into the official Class Builder.<br />
* The internal help pages now link to the PDF User Guide<br />
<br />
'''Version 1.0:'''<br />
<br />
''Initial AnnoTALE release''<br />
<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.0.jar AnnoTALE 1.0 Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.0-2GB.dmg AnnoTALE 1.0 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.0-6GB.dmg AnnoTALE 1.0 6GB version]<br />
* Windows installer of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.0-2GB.exe AnnoTALE 1.0 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.0-6GB.exe AnnoTALE 1.0 6GB version, 64bit Java]; [http://www.jstacs.de/downloads/AnnoTALE-1.0-1GB.exe AnnoTALE 1.0 1GB version with 32bit Java]<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.0.jar AnnoTALE 1.0 command line application]<br />
<br />
=== Class Builders ===<br />
<br />
* [http://www.jstacs.de/downloads/class_definitions_10_10_2020.xml.gz Version 10/10/2020]: used for "Download current definition" in "Load and View TALE Classes" within AnnoTALE version 1.4.1 and later<br />
* [http://www.jstacs.de/downloads/class_definitions_20_06_2019.xml.gz Version 20/06/2019]: compatible with AnnoTALE version 1.4.1 and later<br />
* [http://www.jstacs.de/downloads/class_definitions_29_09_2018.xml.gz Version 29/09/2018]: used for "Download current definition" in "Load and View TALE Classes" within AnnoTALE version 1.3 and later<br />
* [http://www.jstacs.de/downloads/class_definitions_09_03_2017.xml Version 09/03/2017]: used for "Download current definition" in "Load and View TALE Classes" within AnnoTALE version 1.2 and earlier<br />
* [http://www.jstacs.de/downloads/class_definitions_11_03_2016.xml Version 11/03/2016]<br />
* [http://www.jstacs.de/downloads/class_definitions_29_01_2016.xml Version 29/01/2016]<br />
* [http://www.jstacs.de/downloads/class_definitions_19_10.xml Version 19/10/2015]: used in the AnnoTALE publication (Grau ''et al.'', Sci Rep, 2016)</div>Grauhttps://www.jstacs.de/index.php?title=AnnoTALE&diff=1121AnnoTALE2020-11-03T23:45:50Z<p>Grau: /* Class Builders */</p>
<hr />
<div>[[File:AnnoTALE.png|130px|left]]<br />
Transcription activator-like effectors (TALEs) are virulence factors of plant-pathogenic Xanthomonas spp. that function as gene activators inside plant host cells.<br />
<br />
AnnoTALE is a suite of applications for identifying and analysing TALEs in Xanthomonas genomes, for clustering TALEs into classes by their RVD sequences, for assigning novel TALEs to existing classes, for proposing TALE names using a unified nomenclature, and for predicting targets of individual TALEs and TALE classes.<br />
<br />
AnnoTALE is available as a JavaFX-based stand-alone application with graphical user interface for interactive analysis sessions. <br />
In addition, we provide a command line application that may be integrated into other pipelines. <br />
Both use identical code for the actual analysis, ensuring consistent results between both versions.<br />
<br />
<br />
<br />
If you use AnnoTALE, please cite:<br />
<br />
Jan Grau, Maik Reschke, Annett Erkes, Jana Streubel, Richard D. Morgan, Geoffrey G. Wilson, Ralf Koebnik and Jens Boch. [http://www.nature.com/articles/srep21077 AnnoTALE: bioinformatics tools for identification, annotation, and nomenclature of TALEs from ''Xanthomonas'' genomic sequences]. Scientific Reports 6:21077, DOI: 10.1038/srep21077, 2016.<br />
<br />
<br />
<br />
'''Important:''' If you would like to use the unified nomenclature of AnnoTALE in one of your publications including new TALEs or sequenced genomes, please contact us (grau@informatik.uni-halle.de) to organize the inclusion of your TALEs into the official class definition of AnnoTALE and to create stable TALE names that are unique to your TALEs.<br />
<br />
<br />
== AnnoTALE with GUI ==<br />
<br />
[[File:AnnoTALEscreenshot.jpg]]<br />
<br />
AnnoTALE is based on the implementation of JavaFX in Java 8.<br />
<br />
We provide AnnoTALE as a runnable JAR file for those with a current version of Java 8 (at least update 45) on their machine.<br />
<br />
For users' convenience, we also provide pre-packaged versions of AnnoTALE for Mac OS X and Windows, which include Java in the required version. Each of these is available in two variants with different memory requirements (2GB and 6GB). As long as the main memory (RAM) of your machine is sufficient, we recommend using the 6GB version of AnnoTALE.<br />
<br />
<br />
=== Download ===<br />
<br />
''AnnoTALE is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.''<br />
<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.4.1.jar Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-2GB.dmg 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-6GB.dmg 6GB version]<br />
* Windows installer of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-2GB.exe 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-6GB.exe 6GB version, 64bit Java]<br />
<br />
<br />
=== Source code ===<br />
<br />
The AnnoTALE source code is available from [https://github.com/Jstacs/Jstacs/tree/master/projects/xanthogenomes github].<br />
<br />
<br />
=== User Guide ===<br />
<br />
We provide an [http://www.jstacs.de/downloads/AnnoTALE-UserGuide-1.0.pdf AnnoTALE User Guide] in PDF format, including a detailed description of all AnnoTALE tools and installation instructions.<br />
<br />
<br />
== AnnoTALE command line application ==<br />
<br />
The AnnoTALE command line application is available as a [http://www.jstacs.de/downloads/AnnoTALEcli-1.4.1.jar runnable Jar]. For running the program and a quick help, type<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar<br />
<br />
For larger analyses, it might be necessary to increase the memory allocated by the Java VM using the <code>-Xms</code> and <code>-Xmx</code> parameters, for instance<br />
java -Xms512M -Xmx6G -jar AnnoTALEcli-1.4.1.jar<br />
<br />
There is no separate User Guide for the AnnoTALE command line application, but the [http://www.jstacs.de/downloads/AnnoTALE-UserGuide-1.0.pdf User Guide for the GUI version] describes all AnnoTALE tools, their parameters, and their outputs, which are identical in the CLI version.<br />
<br />
You obtain a list of all AnnoTALE tools by calling<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar<br />
<br />
Output:<br />
<br />
Available tools:<br />
<br />
predict - TALE Prediction<br />
analyze - TALE Analysis<br />
build - TALE Class Builder<br />
loadAndView - Load and View TALE Classes<br />
assign - TALE Class Assignment<br />
rename - Rename TALEs in File<br />
targets - Predict and Intersect Targets<br />
presence - TALE Class Presence<br />
repdiff - TALE Repeat Differences<br />
preditale - PrediTALE<br />
dertale - DerTALE<br />
<br />
Syntax: java -jar AnnoTALEcli-1.4.1.jar <toolname> [<parameter=value> ...]<br />
<br />
Further info about the tools is given with<br />
java -jar AnnoTALEcli-1.4.1.jar <toolname> info<br />
<br />
Tool parameters are listed with<br />
java -jar AnnoTALEcli-1.4.1.jar <toolname><br />
<br />
You get a list of input parameters by calling AnnoTALEcli-1.4.1.jar with the corresponding tool name, e.g.,<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar predict<br />
<br />
Output:<br />
<br />
At least one parameter has not been set (correctly):<br />
<br />
Parameters of tool "TALE Prediction" (predict):<br />
g - Genome (The input Xanthomonas genome in FastA or Genbank format) = null<br />
s - Strain (The name of the strain, will be used for annotated TALEs, OPTIONAL) = null<br />
outdir - The output directory, defaults to the current working directory (.) = .<br />
<br />
You get a description of each tool by calling AnnoTALEcli-1.4.1.jar with the corresponding tool name and keyword "info", e.g.,<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar predict info<br />
<br />
Output:<br />
A detailed description of all tools is available in the AnnoTALE User Guide (http://www.jstacs.de/downloads/AnnoTALE-UserGuide-1.0.pdf).<br />
<br />
*TALE Prediction* predicts transcription activator-like effector (TALE) genes in an input sequence, typically a 'Xanthomonas' genome.<br />
<br />
'TALE Prediction' is based in HMMer nucleotide HMM models that describe N-terminus, repeat region, and C-terminus of TALEs.<br />
<br />
The input 'Genome' may be provided in FastA or Genbank format. <br />
Optionally, you may provide a strain name that will be used in the temporary TALE names and names of output files.<br />
<br />
Regardless of the input format, 'TALE Prediction' generates output in Genbank format containing the annotations of TALE genes. If the original input has already been a Genbank file, TALE annotations are added to the existing ones.<br />
In addition, 'TALE Prediction' generates annotations in GFF format, and also outputs the DNA and AS sequences of the predicted TALEs in FastA format.<br />
<br />
'TALE Prediction' tries hard to make the CDS annotation a proper gene model, starting from a start codon and ending with a Stop. If either start or stop codon are located within the originally predicted region that is homologous to TALE genes, this original hit region is still reported as mRNA.<br />
Putative pseudo genes, e.g., with premature stop codons, are marked accordingly.<br />
<br />
The TALE DNA sequences output of 'TALE Prediction' may serve as input of the 'TALE Analysis', 'TALE Class Builder', and 'TALE Class Assignment' tools.<br />
<br />
If you experience problems using 'TALE Prediction', please contact us.<br />
<br />
=== Standard pipeline ===<br />
<br />
Assuming that your current working directory contains the AnnoTALEcli Jar file, a genome of interest (of a hypothetical 'Xoo' strain PXO999 with accession CP1234567) in a FastA file "genome.fa", all rice promoters in a FastA file "Rice-promoters.fa", and a directory "out" designated to hold all output files, a typical AnnoTALE pipeline could look like<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar predict g=genome.fa outdir=out<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar analyze t=out/TALE_DNA_sequences.fasta outdir=out<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar loadAndView outdir=out<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar assign c=out/Class_builder_download.xml t=out/TALE_DNA_parts.fasta s="Xoo PXO999" a="CP1234567" outdir=out<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar rename r=out/TALE_names_\(Xoo_PXO999\).tsv i=out/Genbank__TALE_predictions.gb outdir=out<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar targets i=Rice-promoters.fa p="TALEs in class builder" c=out/Augmented_class_builder_\(Xoo_PXO999\).xml outdir=out<br />
<br />
Afterwards, the output files of all these tools are located in the directory "out". The output files and directories are named in analogy to the names in the AnnoTALE GUI version (see [http://www.jstacs.de/downloads/AnnoTALE-UserGuide-1.0.pdf User Guide for the GUI version]).<br />
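
The pipeline above can also be assembled programmatically, for instance when several strains are processed in a loop. The following is a minimal sketch in Python, not part of AnnoTALE itself. The helper names <code>annotale_command</code> and <code>standard_pipeline</code> are hypothetical, and the output file names (including the strain name with spaces replaced by underscores) are assumptions inferred from the example calls above. The sketch only builds and prints the commands; it does not run them.

```python
import shlex

# Name of the command line Jar, taken from the example calls above.
JAR = "AnnoTALEcli-1.4.1.jar"


def annotale_command(tool, params):
    """Build the argument list for one call: tool name plus key=value pairs."""
    return ["java", "-jar", JAR, tool] + [f"{key}={value}" for key, value in params]


def standard_pipeline(genome, promoters, strain, accession, outdir):
    """Return the six calls of the standard pipeline as argument lists.

    Assumption: output file names follow the pattern seen in the example,
    with spaces in the strain name replaced by underscores.
    """
    tag = strain.replace(" ", "_")
    return [
        annotale_command("predict", [("g", genome), ("outdir", outdir)]),
        annotale_command("analyze", [("t", f"{outdir}/TALE_DNA_sequences.fasta"),
                                     ("outdir", outdir)]),
        annotale_command("loadAndView", [("outdir", outdir)]),
        annotale_command("assign", [("c", f"{outdir}/Class_builder_download.xml"),
                                    ("t", f"{outdir}/TALE_DNA_parts.fasta"),
                                    ("s", strain), ("a", accession),
                                    ("outdir", outdir)]),
        annotale_command("rename", [("r", f"{outdir}/TALE_names_({tag}).tsv"),
                                    ("i", f"{outdir}/Genbank__TALE_predictions.gb"),
                                    ("outdir", outdir)]),
        annotale_command("targets", [("i", promoters),
                                     ("p", "TALEs in class builder"),
                                     ("c", f"{outdir}/Augmented_class_builder_({tag}).xml"),
                                     ("outdir", outdir)]),
    ]


if __name__ == "__main__":
    # Dry run: print shell-quoted commands instead of executing them.
    for cmd in standard_pipeline("genome.fa", "Rice-promoters.fa",
                                 "Xoo PXO999", "CP1234567", "out"):
        print(shlex.join(cmd))
```

To actually execute the calls, each argument list could be passed to <code>subprocess.run(cmd, check=True)</code>; passing a list avoids the shell escaping of parentheses that is needed in the hand-written commands.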
<br />
==Version history==<br />
<br />
===AnnoTALE===<br />
'''Version 1.4.1'''<br />
* first version to use the updated Class Builder including a large number of recently sequenced strains<br />
* minor changes to the output of the 'Load and View TALE Classes' tool, now including the accessions in the TALE sequence output<br />
* changes to the Class Builder format to account for the increased size of class hierarchy, which previously resulted in unnecessarily large files<br />
* 32bit/1GB Windows version no longer included<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.4.1.jar Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-2GB.dmg 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-6GB.dmg 6GB version]<br />
* Windows installer of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-2GB.exe 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-6GB.exe 6GB version, 64bit Java]<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.4.1.jar AnnoTALE 1.4.1 command line application]<br />
<br />
<br />
'''Version 1.4:'''<br />
* first version containing [[PrediTALE]] and DerTALE tools for target site prediction<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.4.jar Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.4-2GB.dmg 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.4-6GB.dmg 6GB version]<br />
* Windows installer of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.4-2GB.exe 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.4-6GB.exe 6GB version, 64bit Java]; in addition, we provide a [http://www.jstacs.de/downloads/AnnoTALE-1.4-1GB.exe 1GB version with 32bit Java] for earlier and 32bit versions of Windows. Please use this version only if absolutely necessary, as some tools may not work due to memory restrictions.<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.4.jar AnnoTALE 1.4 command line application]<br />
<br />
<br />
'''Version 1.3:'''<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.3.jar AnnoTALE 1.3 Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE 1.3 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.3-2GB.dmg AnnoTALE 1.3 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.3-6GB.dmg AnnoTALE 1.3 6GB version]<br />
* Windows installer of AnnoTALE 1.3 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.3-2GB.exe AnnoTALE 1.3 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.3-6GB.exe AnnoTALE 1.3 6GB version, 64bit Java]; [http://www.jstacs.de/downloads/AnnoTALE-1.3-1GB.exe AnnoTALE 1.3 1GB version with 32bit Java]<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.3.jar AnnoTALE 1.3 command line application]<br />
<br />
Changes:<br />
* modified format of Class Builder files allowing for faster download using the "Load and View TALE Classes" tool; old Class Builder files can still be loaded<br />
* "TALE Class Presence" now also outputs a phylogenetic tree of strains based on TALEome similarities<br />
<br />
<br />
'''Version 1.2:'''<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.2.jar AnnoTALE 1.2 Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE 1.2 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.2-2GB.dmg AnnoTALE 1.2 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.2-6GB.dmg AnnoTALE 1.2 6GB version]<br />
* Windows installer of AnnoTALE 1.2 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.2-2GB.exe AnnoTALE 1.2 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.2-6GB.exe AnnoTALE 1.2 6GB version, 64bit Java]; [http://www.jstacs.de/downloads/AnnoTALE-1.2-1GB.exe AnnoTALE 1.2 1GB version with 32bit Java]<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.2.jar AnnoTALE 1.2 command line application]<br />
<br />
Changes:<br />
* Results and loaded files may now be renamed in the GUI by clicking on the corresponding name in the "Data" panel<br />
* Minor bugfixes and improvements of the GUI (Protocol may be erased, columns in "Data" panel renamed for clarity, consistency of paths in the open/save dialogs under Linux)<br />
* Two new tools: "TALE Class Presence" and "TALE Repeat differences"<br />
<br />
'''Version 1.1:'''<br />
<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.1.jar AnnoTALE 1.1 Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE 1.1 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.1-2GB.dmg AnnoTALE 1.1 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.1-6GB.dmg AnnoTALE 1.1 6GB version]<br />
* Windows installer of AnnoTALE 1.1 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.1-2GB.exe AnnoTALE 1.1 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.1-6GB.exe AnnoTALE 1.1 6GB version, 64bit Java]; [http://www.jstacs.de/downloads/AnnoTALE-1.1-1GB.exe AnnoTALE 1.1 1GB version with 32bit Java]<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.1.jar AnnoTALE 1.1 command line application]<br />
<br />
Changes:<br />
* Additional output for the "Load and View TALE Classes" tool<br />
* "TALE Class Builder" and "TALE Class Assignment" now also accept RVD sequences (separated by dashes) as input. However, this is not recommended and some features (e.g., highlighting of aberrant repeats) will not be available. Only complete TALE DNA sequences will be accepted for inclusion into the official Class Builder.<br />
* The internal help pages now link to the PDF User Guide<br />
<br />
'''Version 1.0:'''<br />
<br />
''Initial AnnoTALE release''<br />
<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.0.jar AnnoTALE 1.0 Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.0-2GB.dmg AnnoTALE 1.0 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.0-6GB.dmg AnnoTALE 1.0 6GB version]<br />
* Windows installer of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.0-2GB.exe AnnoTALE 1.0 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.0-6GB.exe AnnoTALE 1.0 6GB version, 64bit Java]; [http://www.jstacs.de/downloads/AnnoTALE-1.0-1GB.exe AnnoTALE 1.0 1GB version with 32bit Java]<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.0.jar AnnoTALE 1.0 command line application]<br />
<br />
=== Class Builders ===<br />
<br />
* [http://www.jstacs.de/downloads/class_definitions_10_10_2020.xml.gz Version 10/10/2020]: used for "Download current definition" in "Load and View TALE Classes" within AnnoTALE version 1.4.1 and later<br />
* [http://www.jstacs.de/downloads/class_definitions_20_06_2019.xml.gz Version 20/06/2019]: compatible with AnnoTALE version 1.4.1 and later<br />
* [http://www.jstacs.de/downloads/class_definitions_current.xml.gz Version 29/09/2018]: used for "Download current definition" in "Load and View TALE Classes" within AnnoTALE version 1.3 and later<br />
* [http://www.jstacs.de/downloads/class_definitions_current.xml Version 09/03/2017]: used for "Download current definition" in "Load and View TALE Classes" within AnnoTALE version 1.2 and earlier<br />
* [http://www.jstacs.de/downloads/class_definitions_11_03_2016.xml Version 03/11/2016]<br />
* [http://www.jstacs.de/downloads/class_definitions_29_01_2016.xml Version 01/29/2016]<br />
* [http://www.jstacs.de/downloads/class_definitions_19_10.xml Version 10/19/2015]: used in the AnnoTALE publication (Grau ''et al.'', Sci Rep, 2016)</div>Grauhttps://www.jstacs.de/index.php?title=Catchitt&diff=1119Catchitt2020-10-13T19:43:14Z<p>Grau: /* Version history */</p>
<hr />
<div>Catchitt is a collection of tools for predicting cell type-specific binding regions of transcription factors (TFs) based on binding motifs and chromatin accessibility assays.<br />
The initial implementation of this methodology has been one of the winning approaches of the ENCODE-DREAM challenge ([https://www.synapse.org/#!Synapse:syn6131484/wiki/402026]) and is described in a preprint (https://www.biorxiv.org/content/early/2017/12/06/230011 doi: 10.1101/230011) and a recent [https://doi.org/10.1186/s13059-018-1614-y paper].<br />
The implementation in Catchitt has been streamlined and slightly simplified to make its application more straight-forward. Specifically, we reduced the set of chromatin accessibility features to the most important ones, we simplified the sampling strategy of initial negative examples in the training step, and we omitted quantile normalization of chromatin accessibility features.<br />
<br />
== Catchitt tools ==<br />
<br />
Catchitt comprises five tools for the individual steps of the pipeline (see below). The tool "labels" computes labels for genomic regions from "conservative" (i.e., IDR-thresholded) and "relaxed" ChIP-seq peaks.<br />
The tool "access" computes chromatin accessibility features from DNase-seq or ATAC-seq data, either based on fold-enrichment tracks in Bigwig format (e.g., MACS output) or based on SAM/BAM files of mapped reads.<br />
The tool "motif" computes motif-based features from genomic sequence and PWMs in Jaspar or HOCOMOCO format, or motif models from [[Dimont]], including [[Slim]] models.<br />
The tool "itrain" performs iterative training of a series of classifiers based on labels, chromatin accessibility features, and motif features.<br />
The tool "predict" predicts binding probabilities of genomic regions based on trained classifiers and feature files. The feature files may either be measured on the training cell type (e.g., other chromosomes, "within cell type" case) or on a different cell type.<br />
<br />
== Downloads ==<br />
<br />
We provide Catchitt as a pre-compiled JAR file and also publish its source code under GPL 3. For compiling Catchitt from source files, Jstacs (v. 2.3 and later) and the corresponding external libraries are required.<br />
<br />
''Catchitt is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.''<br />
<br />
* [http://www.jstacs.de/downloads/Catchitt-0.1.3.jar JAR download]<br />
* the source code of Catchitt is available from [https://github.com/Jstacs/Jstacs github] in package projects.encodedream.<br />
* [http://www.jstacs.de/downloads/motifs.tgz motifs] used in the ENCODE-DREAM challenge<br />
<br />
== Citation ==<br />
<br />
If you use Catchitt in your research, please cite<br />
<br />
J. Keilwagen, S. Posch, and J. Grau. [https://doi.org/10.1186/s13059-018-1614-y Accurate prediction of cell type-specific transcription factor binding]. ''Genome Biology'', 20(1):9, 2019.<br />
<br />
== Usage ==<br />
<br />
In the following, <code>Catchitt.jar</code> stands for the Catchitt binary in its current version, which currently is 0.1.3. Every occurrence of <code>Catchitt.jar</code> needs to be replaced by <code>Catchitt-0.1.3.jar</code> when running the code examples with the current Catchitt binary version.<br />
<br />
<br />
Catchitt can be started by calling<br />
<br />
java -jar Catchitt.jar<br />
<br />
on the command line. This lists the names of the available tools with a short description:<br />
<br />
Available tools:<br />
<br />
access - Chromatin accessibility<br />
methyl - Methylation levels<br />
motif - Motif scores<br />
labels - Derive labels<br />
itrain - Iterative Training<br />
predict - Prediction<br />
<br />
Syntax: java -jar Catchitt.jar <toolname> [<parameter=value> ...]<br />
<br />
Further info about the tools is given with<br />
java -jar Catchitt.jar <toolname> info<br />
<br />
Tool parameters are listed with<br />
java -jar Catchitt.jar <toolname><br />
<br />
== Tools ==<br />
<br />
=== Derive labels ===<br />
<br />
''Derive labels'' computes labels for genomic regions based on ChIP-seq peak files. The input ChIP-seq peak files must be provided in narrowPeak format and may come in 'conservative', i.e., IDR-thresholded, and 'relaxed' flavors. In case only a single peak file is available, both of the corresponding parameters may be set to this one peak file. The parameter for the bin width defines the resolution of the genomic regions that are assigned a label, while the parameter for the region width defines the size of the regions considered. If, for instance, the bin width is set to 50 and the region width to 100, regions of 100 bp shifted by 50 bp along the genome are labeled. The labels assigned may be 'S' (summit) if the current bin contains the annotated summit of a conservative peak, 'B' (bound) if the current region overlaps a conservative peak by at least half the region width, 'A' (ambiguous) if the current region overlaps a relaxed peak by at least 1 bp, or 'U' (unbound) if it overlaps with none of the peaks. The output is provided as a gzipped file 'Labels.tsv.gz' with columns chromosome, start position, and label. This output file together with a protocol of the tool run is saved to the specified output directory.<br />
<br />
''Derive labels'' may be called with<br />
<br />
java -jar Catchitt.jar labels<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>Conservative peaks (NarrowPeak file containing the conservative peaks)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>Relaxed peaks (NarrowPeak file containing the relaxed peaks)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>FAI of genome (FastA index file of the genome)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bin width (The width of the genomic bins considered, valid range = [1, 10000], default = 50)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rw</font></td><br />
<td>Region width (The width of the genomic regions considered for overlaps, valid range = [1, 10000], default = 50)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar Catchitt.jar labels c=conservative.narrowPeak r=relaxed.narrowPeak f=hg19.fa.fai b=50 rw=200 outdir=labels<br />
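
The labeling rules described above can be illustrated with a small Python sketch. This is not the Catchitt implementation. The function names are hypothetical, and three points are assumptions: the region is centred on the bin, the labels are tested in the order S, B, A, U, and a partial overlap with a conservative peak that is too short for 'B' is treated as 'A'. Peaks are given as half-open intervals with an absolute summit position (in narrowPeak files, the summit is stored as an offset from the peak start).

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length of the overlap of two half-open intervals [start, end)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def label_bin(bin_start, bin_width, region_width, conservative, relaxed):
    """Return 'S', 'B', 'A' or 'U' for the bin starting at bin_start.

    conservative: list of (start, end, summit), summit as absolute position
    relaxed:      list of (start, end)
    Assumption: the region of width region_width is centred on the bin.
    """
    bin_end = bin_start + bin_width
    centre = bin_start + bin_width // 2
    region_start = centre - region_width // 2
    region_end = region_start + region_width

    # 'S': the bin itself contains the summit of a conservative peak.
    if any(bin_start <= summit < bin_end for _, _, summit in conservative):
        return "S"
    # 'B': the region overlaps a conservative peak by at least half its width.
    if any(overlap(region_start, region_end, s, e) >= region_width / 2
           for s, e, _ in conservative):
        return "B"
    # 'A': any remaining overlap of at least 1 bp with a peak
    # (assumption: short overlaps with conservative peaks count as well).
    all_peaks = [(s, e) for s, e, _ in conservative] + list(relaxed)
    if any(overlap(region_start, region_end, s, e) >= 1 for s, e in all_peaks):
        return "A"
    # 'U': no overlap with any peak.
    return "U"


if __name__ == "__main__":
    conservative = [(1000, 1200, 1100)]
    relaxed = [(900, 1300)]
    for start in (850, 950, 1050, 1100, 2000):
        print(start, label_bin(start, 50, 100, conservative, relaxed))
```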
<br />
<br />
=== Chromatin accessibility ===<br />
<br />
''Chromatin accessibility'' computes several chromatin accessibility features from DNase-seq or ATAC-seq data provided as fold-enrichment tracks or SAM/BAM files of mapped reads. Features are computed with a certain resolution defined by the bin width parameter. If this parameter is set to 50, for instance, features are computed for non-overlapping 50 bp bins along the genome. If input data are provided as SAM/BAM file, coverage information is extracted and normalized locally in a similar fashion as proposed for the MACS peak caller. Output is provided as a gzipped file 'Chromatin_accessibility.tsv.gz' with columns chromosome, start position of the bin, minimum coverage and median coverage in the current bin, minimum coverage in 1000 bp regions before and after the current bin, maximum coverage in 1000 bp regions before and after the current bin, the number of steps in the coverage profile, and the number of monotonically increasing and decreasing steps in the coverage profile of the current bin. This output file together with a protocol of the tool run is saved to the specified output directory.<br />
<br />
''Chromatin accessibility'' may be called with<br />
<br />
java -jar Catchitt.jar access<br />
<br />
and has the following parameters<br />
<br />
<br />
<table border=0 cellpadding=10 align="center"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>Data source (The format of the input file containing the coverage information, range={BAM/SAM, Bigwig}, default = BAM/SAM)<table border=0 cellpadding=10 align="center"><br />
<tr><td colspan=3>Parameters for selection &quot;BAM/SAM&quot;:</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>Input SAM/BAM (The input file containing the mapped DNase-seq/ATAC-seq reads)</td><br />
<td>FILE</td><br />
</tr><br />
<tr><td colspan=3>Parameters for selection &quot;Bigwig&quot;:</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>Input Bigwig (The input file containing the mapped DNase-seq/ATAC-seq reads)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>FastA index (The genome index)</td><br />
<td>FILE</td><br />
</tr><br />
</table></td><td></td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bin width (The width of the genomic bins considered)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar Catchitt.jar access d="Bigwig" i=fold_enrich.bw f=hg19.fa.fai b=50 outdir=dnase<br />
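
To make the per-bin features more concrete, the following Python sketch computes three of them (minimum coverage, median coverage, and number of steps) for non-overlapping bins of a per-base coverage profile. It is an illustration only and not the Catchitt code. The function name is hypothetical, a "step" is assumed to be a position at which the coverage value changes within the bin, and the flanking 1000 bp features as well as the local normalization are omitted.

```python
from statistics import median


def bin_features(coverage, bin_width):
    """Return (start, min, median, steps) for each complete bin.

    coverage: list with one coverage value per base of a chromosome
    Assumption: a step is a change of the coverage value between two
    neighbouring positions inside the bin.
    """
    features = []
    for start in range(0, len(coverage) - bin_width + 1, bin_width):
        window = coverage[start:start + bin_width]
        steps = sum(1 for a, b in zip(window, window[1:]) if a != b)
        features.append((start, min(window), median(window), steps))
    return features


if __name__ == "__main__":
    profile = [1, 1, 2, 2, 3, 3, 3, 1]
    for row in bin_features(profile, 4):
        print(*row, sep="\t")
```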
<br />
<br />
=== Methylation levels ===<br />
''Methylation levels'' may be called with<br />
<br />
java -jar Catchitt.jar methyl<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>Input Bed.gz (The bedMethyl file (gzipped) containing the methylation levels, mime = bed.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>FastA index (The genome index, mime = fai)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bin width (The width of the genomic bins considered)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar Catchitt.jar methyl i=Input_Bed.gz f=hg19.fa.fai b=50<br />
<br />
<br />
=== Motif scores ===<br />
<br />
''Motif scores'' computes features based on motif scores of a given motif model scanning sub-sequences along the genome. Motif scores are aggregated in bins of the specified width as maximum score and log of the average exponential score (i.e., average log-likelihood in case of statistical models). The motif model may be provided as PWMs in HOCOMOCO or PFMs in Jaspar format, or as [[Dimont]] motif models in XML format. For more complex motif models like Slim models, the current implementation uses several indexes to speed up the scanning process. However, computation of these indexes is rather memory-consuming and often not reasonable for simple PWM models. Hence, a low-memory variant of the tool is available, which is typically only slightly slower for PWM models but substantially slower for Slim models. Output is provided as a gzipped file 'Motif_scores.tsv.gz' containing columns chromosome, start position, maximum and average score. This output file together with a protocol of the tool run is saved to the specified output directory.<br />
<br />
<br />
''Motif scores'' may be called with<br />
<br />
java -jar Catchitt.jar motif<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Motif model (The motif model in Dimont, HOCOMOCO, or Jaspar format, range={Dimont, HOCOMOCO, Jaspar}, default = Dimont)<table border=0 cellpadding=10 align="center"><br />
<tr><td colspan=3>Parameters for selection &quot;Dimont&quot;:</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>Dimont motif (Dimont motif model description)</td><br />
<td>FILE</td><br />
</tr><br />
<tr><td colspan=3>Parameters for selection &quot;HOCOMOCO&quot;:</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">h</font></td><br />
<td>HOCOMOCO PWM (PWM from the HOCOMOCO database)</td><br />
<td>FILE</td><br />
</tr><br />
<tr><td colspan=3>Parameters for selection &quot;Jaspar&quot;:</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">j</font></td><br />
<td>Jaspar PFM (PFM in Jaspar format)</td><br />
<td>FILE</td><br />
</tr><br />
</table></td><td></td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>Genome (Genome as FastA file)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>FAI of genome (FastA index file of the genome)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bin width (The width of the genomic bins considered)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">l</font></td><br />
<td>Low-memory mode (Use slower mode with a smaller memory footprint, default = true)</td><br />
<td>BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example'''<br />
<br />
java -jar Catchitt.jar motif m=HOCOMOCO h=motif.pwm g=hg19.fa f=hg19.fa.fai b=50 outdir=motifs<br />
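
The two aggregated features, maximum score and log of the average exponential score, can be sketched in Python as follows. This is an illustration under stated assumptions rather than the Catchitt implementation: the function names are hypothetical, a score is assigned to the bin that contains the start position of its sub-sequence, and strand handling is ignored. The log of the average exponential score is computed in a numerically stable way by subtracting the maximum before exponentiation.

```python
import math


def log_mean_exp(scores):
    """Log of the average of exp(score), computed stably."""
    top = max(scores)
    return top + math.log(sum(math.exp(s - top) for s in scores) / len(scores))


def aggregate_bins(scores, bin_width):
    """Return (start, maximum score, log-mean-exp score) per bin.

    scores[i] is the motif score of the sub-sequence starting at position i.
    Assumption: a score belongs to the bin containing its start position.
    """
    features = []
    for start in range(0, len(scores), bin_width):
        window = scores[start:start + bin_width]
        features.append((start, max(window), log_mean_exp(window)))
    return features


if __name__ == "__main__":
    example = [0.0, 0.0, math.log(3.0), math.log(1.0)]
    for row in aggregate_bins(example, 2):
        print(*row, sep="\t")
```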
<br />
=== Iterative Training ===<br />
<br />
''Iterative Training'' performs an iterative training with the specified number of iterations to obtain a series of classifiers that may be used for predictions in the same cell type or in other cell types based on a corresponding set of feature files. The tool requires as input labels for the training chromosomes, a chromatin accessibility feature file and a set of motif feature files. From the labels, an initial set of training regions is extracted containing all positive examples labeled as 'S' (summit) and a sub-sample of negative examples of regions labeled as 'U' (unbound). During the iterations, the initial negative examples are complemented with additional negatives obtaining large binding probabilities, i.e., putative false positive predictions. As these additional negative examples are derived from predictions of the current set of classifiers, the number of bins used for aggregation needs to be specified and should be identical to those used for predictions later. Training chromosomes and chromosomes used for predictions in the iterative training may be specified, as well as the percentile of the scores of positive regions (i.e., summit or bound regions) that should be used to identify putative false positives. The specified bin width must be identical to the bin width specified when computing the corresponding feature files. Feature vectors for training regions may span several adjacent bins as specified by the number of bins parameter. Output is an XML file Classifiers.xml containing the set of trained classifiers. This output file together with a protocol of the tool run is saved to the specified output directory.<br />
<br />
''Iterative Training'' may be called with<br />
<br />
java -jar Catchitt.jar itrain<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>Accessibility (File containing accessibility features)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Motif (File containing motif features), MAY BE USED MULTIPLE TIMES</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">l</font></td><br />
<td>Labels (File containing the labels)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>FAI of genome (FastA index file of the genome)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bin width (The width of the genomic bins, valid range = [1, 1000], default = 50)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Number of bins (The number of adjacent bins, valid range = [1, 20], default = 5)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">abb</font></td><br />
<td>Aggregation: bins before (The number of bins before the current one considered in the aggregation, valid range = [1, 20], default = 1)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">aba</font></td><br />
<td>Aggregation: bins after (The number of bins after the current one considered in the aggregation, valid range = [1, 20], default = 4)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>Iterations (The number of iterations of the iterative training, valid range = [1, 20], default = 5)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>Training chromosomes (Training chromosomes, separated by commas, OPTIONAL)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">itc</font></td><br />
<td>Iterative training chromosomes (Chromosomes with predictions in iterative training, separated by commas, OPTIONAL)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Percentile (Percentile of the prediction scores of positives used as threshold in iterative training, valid range = [0.0, 1.0], default = 0.01)</td><br />
<td>DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example'''<br />
<br />
java -jar Catchitt.jar itrain a=dnase/Chromatin_accessibility.tsv.gz m=motif1/Motif_scores.tsv.gz m=motif2/Motif_scores.tsv.gz l=labels/Labels.tsv.gz f=hg19.fa.fai b=50 n=5 abb=1 aba=4 i=5 t="chr1,chr2,chr3" itc="chr1,chr2" p=0.01 outdir=cls<br />
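
The selection of additional negative examples in each iteration can be sketched as follows. This Python snippet is a simplified illustration and not the Catchitt implementation. The function names and the data layout (dictionaries keyed by chromosome and start position) are hypothetical, and the percentile is computed with a simple rank-based rule, which is an assumption. Unbound regions that score above the chosen percentile of the positive regions are reported as putative false positives.

```python
def score_threshold(positive_scores, percentile):
    """Score at the given percentile (0.0 to 1.0) of the positive regions."""
    if not positive_scores:
        raise ValueError("no positive regions available")
    ranked = sorted(positive_scores)
    index = min(len(ranked) - 1, int(percentile * len(ranked)))
    return ranked[index]


def additional_negatives(predictions, labels, percentile=0.01):
    """Return unbound regions that score above the positive threshold.

    predictions: dict mapping (chromosome, start) to binding probability
    labels:      dict mapping (chromosome, start) to 'S', 'B', 'A' or 'U'
    """
    positives = [p for key, p in predictions.items()
                 if labels.get(key) in ("S", "B")]
    threshold = score_threshold(positives, percentile)
    # Only 'U' regions qualify; ambiguous regions are never used as negatives.
    return sorted(key for key, p in predictions.items()
                  if labels.get(key) == "U" and p > threshold)


if __name__ == "__main__":
    predictions = {("chr1", 0): 0.9, ("chr1", 50): 0.8, ("chr1", 100): 0.85,
                   ("chr1", 150): 0.1, ("chr1", 200): 0.5}
    labels = {("chr1", 0): "S", ("chr1", 50): "B", ("chr1", 100): "U",
              ("chr1", 150): "U", ("chr1", 200): "A"}
    print(additional_negatives(predictions, labels))
```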
<br />
=== Prediction ===<br />
<br />
''Prediction'' predicts binding probabilities of genomic regions as specified during training of the set of classifiers in iterative training. As input, Prediction requires a set of trained classifiers in XML format and the same (type of) feature files as used in training (motif files must be specified in the same order!). In addition, the chromosomes for which predictions are made may be specified, and the number of bins used for aggregation may be specified to deviate from those used during training. If these bin numbers are not specified, those from the training run are used. Finally, it is possible to restrict the number of classifiers considered to the first n ones. Output is provided as a gzipped file 'Predictions.tsv.gz' with columns chromosome, start position, binding probability. This output file together with a protocol of the tool run is saved to the specified output directory.<br />
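
For downstream analyses, the prediction file can be read with a few lines of Python. The following sketch assumes that 'Predictions.tsv.gz' is tab-separated with the three columns named above; whether the file contains a header line is not stated here, so lines whose numeric columns cannot be parsed are skipped. The function name <code>read_predictions</code> and the optional probability cutoff are hypothetical.

```python
import csv
import gzip


def read_predictions(path, min_probability=0.0):
    """Read (chromosome, start, probability) tuples from a gzipped TSV file.

    Assumption: three tab-separated columns; unparsable lines (e.g., a
    possible header) are skipped.
    """
    rows = []
    with gzip.open(path, "rt", newline="") as handle:
        for record in csv.reader(handle, delimiter="\t"):
            if len(record) < 3:
                continue
            try:
                start = int(record[1])
                probability = float(record[2])
            except ValueError:
                continue
            if probability >= min_probability:
                rows.append((record[0], start, probability))
    return rows


if __name__ == "__main__":
    import sys
    # Usage: python read_predictions.py Predictions.tsv.gz
    for chrom, start, prob in read_predictions(sys.argv[1], 0.5):
        print(chrom, start, prob, sep="\t")
```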
<br />
''Prediction'' may be called with<br />
<br />
java -jar Catchitt.jar predict<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>Classifiers (The classifiers trained by iterative training)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>Accessibility (File containing accessibility features)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Motif (File containing motif features) MAY BE USED MULTIPLE TIMES</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>FAI of genome (FastA index file of the genome)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Prediction chromosomes (Prediction chromosomes, separated by commas, OPTIONAL)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">abb</font></td><br />
<td>Aggregation: bins before (Number of bins before the current one considered for aggregation., OPTIONAL)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">aba</font></td><br />
<td>Aggregation: bins after (Number of bins after the current one considered for aggregation., OPTIONAL)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Number of classifiers (Use only the first k classifiers for predictions., OPTIONAL)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example'''<br />
<br />
java -jar Catchitt.jar predict c=cls/Classifiers.xml a=dnase/Chromatin_accessibility.tsv.gz m=motif1/Motif_scores.tsv.gz m=motif2/Motif_scores.tsv.gz f=hg19.fa.fai p="chr8,chr21" abb=1 aba=4 n=3 outdir=predict<br />
<br />
== Standard pipeline ==<br />
<br />
The standard Catchitt pipeline would comprise the following steps<br />
<br />
* for a training cell type, collect ChIP-seq peak files (preferably ''conservative'' and ''relaxed'' peaks) in narrowPeak format and derive labels for genomic regions (''Derive labels'')<br />
* for the same cell type, collect chromatin accessibility data (DNase-seq or ATAC-seq) as fold-enrichment tracks or mapping files, and derive chromatin accessibility features from those data (''Chromatin accessibility'')<br />
* collect or learn (e.g., using [[Dimont]]) a set of motif models for the transcription factor of interest, and scan the genome using these motif models (''Motif scores'')<br />
* perform iterative training given the labels and feature files (''Iterative Training'')<br />
* predict binding probabilities of genomic regions in the same cell type or in other cell types. In the latter case, additional chromatin accessibility data for these target cell types need to be collected and features need to be derived as in step 2. (''Prediction'')<br />
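
The five steps can be strung together in a small shell script. The following is only a dry-run sketch, not an official wrapper: it prints the Catchitt calls instead of executing them, all input file names (conservative.narrowPeak, relaxed.narrowPeak, fold_enrich.bw, motif.pwm, hg19.fa) are placeholders, and the parameter values are borrowed from the examples on this page.

```shell
#!/bin/sh
# Dry-run sketch of the standard pipeline: prints each Catchitt call instead of running it.
# All input file names below are placeholders (assumptions), not files shipped with Catchitt.
JAR=Catchitt.jar
FAI=hg19.fa.fai

run() {
    # Remove the "echo" to actually execute the command.
    echo java -jar "$JAR" "$@"
}

pipeline() {
    # Step 1: labels from conservative and relaxed ChIP-seq peaks
    run labels c=conservative.narrowPeak r=relaxed.narrowPeak f=$FAI b=50 rw=200 outdir=labels
    # Step 2: chromatin accessibility features
    run access d=Bigwig i=fold_enrich.bw f=$FAI b=50 outdir=dnase
    # Step 3: motif scores along the genome
    run motif m=HOCOMOCO h=motif.pwm g=hg19.fa f=$FAI b=50 outdir=motif1
    # Step 4: iterative training
    run itrain a=dnase/Chromatin_accessibility.tsv.gz m=motif1/Motif_scores.tsv.gz \
        l=labels/Labels.tsv.gz f=$FAI b=50 t=chr1,chr2,chr3 itc=chr1,chr2 outdir=cls
    # Step 5: prediction
    run predict c=cls/Classifiers.xml a=dnase/Chromatin_accessibility.tsv.gz \
        m=motif1/Motif_scores.tsv.gz f=$FAI p=chr8 outdir=predict
}

pipeline
```

Keeping the calls in one function makes it easy to see that each step consumes the output directory of an earlier one.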
<br />
<br />
== Tutorial using ENCODE data ==<br />
<br />
We describe a typical Catchitt pipeline using public ENCODE data for the transcription factor CTCF in two cell lines.<br />
This tutorial uses real-world data on the whole ENCODE GRCh38 human genome version, illustrating different DNase-seq input formats and different motif sources. Please note that this realistic scenario also comes at the expense of real-world runtimes of the individual Catchitt steps.<br />
<br />
For best performance, we would further recommend<br />
* to use multiple motifs from different sources, including motifs derived from DNase-seq (available in our [http://www.jstacs.de/downloads/motifs.tgz motif collection] of the ENCODE-DREAM challenge in directory de-novo/DNase-peaks)<br />
* to use replicate information for DNase data, for instance using the [https://github.com/kundajelab/atac_dnase_pipelines pipeline of the Kundaje lab]<br />
<br />
In this tutorial, we concentrate on the Catchitt pipeline and illustrate its usage based on readily available data.<br />
<br />
=== Obtaining training and test data ===<br />
<br />
First, we need the GRCh38 genome version used by ENCODE. This genome is available as a gzipped FastA file from [https://www.encodeproject.org ENCODE] at<br />
https://www.encodeproject.org/files/GRCh38_no_alt_analysis_set_GCA_000001405.15/@@download/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.gz<br />
<br />
After download, the genome needs to be gunzipped and indexed using the [http://www.htslib.org samtools] faidx command:<br />
<br />
gunzip GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.gz<br />
samtools faidx GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta<br />
<br />
In the following, we assume that genome FastA and index are in the base directory.<br />
<br />
In addition, we need the DNase-seq data. We consider two cell lines ("astrocyte of the spinal cord" and "fibroblast of villous mesenchyme"). The corresponding DNase-seq data are available from [https://www.encodeproject.org ENCODE] under accessions ENCSR000ENB and ENCSR000EOR, respectively.<br />
Here, we first consider the Bigwig files of the first replicate for each cell line, which can be downloaded from the following URLs:<br />
<br />
https://www.encodeproject.org/files/ENCFF901UBX/@@download/ENCFF901UBX.bigWig<br />
https://www.encodeproject.org/files/ENCFF652HJH/@@download/ENCFF652HJH.bigWig<br />
<br />
For obtaining labels for CTCF binding, we further need ChIP-seq peaks. Here, we consider the ChIP-seq experiment with accession ENCSR000DSU for the astrocytes, which will become our training data in the following.<br />
The corresponding "conservative" and "relaxed" peak files for astrocytes are available from<br />
https://www.encodeproject.org/files/ENCFF183YLB/@@download/ENCFF183YLB.bed.gz<br />
https://www.encodeproject.org/files/ENCFF600CYD/@@download/ENCFF600CYD.bed.gz<br />
<br />
Again, the peak files need to be gunzipped for the following steps.<br />
<br />
Finally, we need a motif model for CTCF, which we download from [http://hocomoco11.autosome.ru HOCOMOCO] in this case<br />
http://hocomoco11.autosome.ru/final_bundle/hocomoco11/full/HUMAN/mono/pwm/CTCF_HUMAN.H11MO.0.A.pwm<br />
<br />
We organize all these files (and the Catchitt JAR) in the following directory structure<br />
<br />
.:<br />
Catchitt.jar<br />
GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta<br />
GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai<br />
<br />
./astrocytes:<br />
ENCFF183YLB.bed<br />
ENCFF600CYD.bed<br />
ENCFF901UBX.bigWig<br />
<br />
./fibroblasts:<br />
ENCFF652HJH.bigWig<br />
<br />
./motifs/CTCF/:<br />
CTCF_HUMAN.H11MO.0.A.pwm<br />
<br />
=== Deriving labels ===<br />
<br />
As we use supervised training of model parameters, we need labels for the genomic regions, qualifying these as bound (B) or unbound (U). Besides, we have additional labels for bound regions at the peak summit (S) and ambiguous regions (A) that are (partly) covered by relaxed but not by conservative peaks.<br />
<br />
For training purposes, we need to derive labels from the astrocyte ChIP-seq peaks by calling<br />
 java -jar Catchitt.jar labels c=astrocytes/ENCFF183YLB.bed \<br />
 r=astrocytes/ENCFF600CYD.bed \<br />
 f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai \<br />
 b=50 rw=200 outdir=astrocytes/labels<br />
Here, we use a bin width of 50 bp (i.e., we resolve any feature or binding event with 50 bp resolution) and a region width of 200 bp as used in ENCODE-DREAM. A detailed description of the partitioning of the genome into non-overlapping bins and the logic behind the regions for which predictions are made may be found in the [https://doi.org/10.1186/s13059-018-1614-y Catchitt paper].<br />
The result is a file astrocytes/labels/Labels.tsv.gz with the following format<br />
chr1 0 U<br />
chr1 50 U<br />
chr1 100 U<br />
chr1 150 U<br />
chr1 200 U<br />
chr1 250 U<br />
where the columns contain chromosome, bin starting position, and corresponding label, and are separated by tabs.<br />
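
A quick way to inspect such a label file is to count how often each label occurs. The following is a minimal sketch that works on a tiny stand-in file generated on the fly (its labels are made up for illustration); for real data, LABELS would point to astrocytes/labels/Labels.tsv.gz.

```shell
#!/bin/sh
# Count how often each label occurs in a Labels.tsv.gz-style file.
# A tiny stand-in file with made-up labels is created here for illustration.
tmp=$(mktemp -d)
LABELS="$tmp/Labels.tsv.gz"
printf 'chr1\t0\tU\nchr1\t50\tU\nchr1\t100\tA\nchr1\t150\tB\nchr1\t200\tS\nchr1\t250\tB\n' | gzip -c > "$LABELS"

count_labels() {
    # Column 3 holds the label; print "label<TAB>count", sorted by label.
    gzip -dc "$1" | awk -F '\t' '{ n[$3]++ } END { for (l in n) printf "%s\t%d\n", l, n[l] }' | sort
}

counts=$(count_labels "$LABELS")
echo "$counts"
```

On real data, a very small number of S and B bins relative to U bins is expected, since most of the genome is not bound by a given factor.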
<br />
=== Preparing DNase data from bigwig format ===<br />
<br />
We further derive DNase-seq features from the bigwig file that we downloaded in the first step. Again, we specify a bin width of 50 bp.<br />
<br />
 java -jar Catchitt.jar access d="Bigwig" i=astrocytes/ENCFF901UBX.bigWig f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai \<br />
 b=50 outdir=astrocytes/access<br />
The result is a file astrocytes/access/Chromatin_accessibility.tsv.gz with the following format<br />
<br />
chr1 1033400 0.03954650089144707 0.05627769976854324 0.009126120246946812 0.030420400202274323 0.06692489981651306 1.03125 3.0 1.0 0.0<br />
chr1 1033450 0.030420400202274323 0.03650449961423874 0.009126120246946812 0.030420400202274323 0.045630600303411484 1.03125 2.0 0.0 0.0<br />
chr1 1033500 0.024336300790309906 0.03346240147948265 0.009126120246946812 0.030420400202274323 0.045630600303411484 1.03125 2.0 1.0 0.0<br />
chr1 1033550 0.01825219951570034 0.024336300790309906 0.009126120246946812 0.024336300790309906 0.060840800404548645 1.03125 2.0 0.0 1.0<br />
<br />
where the first two columns, again, correspond to chromosome and starting position, and the remaining columns are<br />
* minimum DNase value in bin,<br />
* median DNase value in bin,<br />
* minimum in 1000 bp after bin start,<br />
* minimum in 1000 bp before bin start,<br />
* maximum in 1000 bp after bin start,<br />
* maximum in 1000 bp before bin start,<br />
* the number of steps in the bin profile,<br />
* the length of the longest monotonically increasing range in the bin,<br />
* the length of the longest monotonically decreasing range in the bin.<br />
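
Given this layout, every line should carry 2 + 9 = 11 columns. The sketch below checks this on two of the example lines shown above and reports the bin with the largest median value (fourth column). The function name check_access is hypothetical, and the tab separation is an assumption based on the .tsv file extension; awk's default field splitting is used so that both tabs and spaces would be accepted.

```shell
#!/bin/sh
# Sanity-check an accessibility feature file: each line should have 11 columns.
# The two input lines are copied from the example output above.
tmp=$(mktemp -d)
tr ' ' '\t' > "$tmp/access.tsv" <<'EOF'
chr1 1033400 0.03954650089144707 0.05627769976854324 0.009126120246946812 0.030420400202274323 0.06692489981651306 1.03125 3.0 1.0 0.0
chr1 1033450 0.030420400202274323 0.03650449961423874 0.009126120246946812 0.030420400202274323 0.045630600303411484 1.03125 2.0 0.0 0.0
EOF
gzip "$tmp/access.tsv"
ACCESS="$tmp/access.tsv.gz"

check_access() {
    # Counts lines with an unexpected number of columns and tracks the largest median (column 4).
    gzip -dc "$1" | awk '
        NF != 11 { bad++ }
        NF == 11 && (best == "" || $4 + 0 > max) { max = $4 + 0; best = $1 ":" $2 }
        END { printf "malformed=%d top_median_bin=%s\n", bad, best }'
}

summary=$(check_access "$ACCESS")
echo "$summary"
```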
<br />
=== Preparing motif scores ===<br />
<br />
We also compute motif scores along the genome for the PWM we downloaded from HOCOMOCO:<br />
<br />
 java -jar Catchitt.jar motif m="HOCOMOCO" h=motifs/CTCF/CTCF_HUMAN.H11MO.0.A.pwm g=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta \<br />
 f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai b=50 outdir=motifs/CTCF threads=3<br />
The result is a file motifs/CTCF/Motif_scores.tsv.gz with the following format<br />
<br />
chr1 46950 -4.996643 -4.9543528358429105<br />
chr1 47000 -5.984124 -5.451674735652041<br />
chr1 47050 -0.8633305 -0.4596223585537509<br />
chr1 47100 -4.9379983 -4.813470561120627<br />
<br />
where the first two columns, again, correspond to chromosome and starting position, and the remaining two columns are<br />
* the maximum motif score within the bin,<br />
* the logarithm of the sum of the exponentials of the individual scores within the bin; for scores that are log-likelihoods, this is proportional to the log-likelihood of the complete sequence.<br />
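
To make the two aggregates concrete, the following sketch computes the maximum and a log-sum-exp over the per-position scores of one bin. It assumes that the second feature is a log-sum-exp, which is consistent with the example values above (the second feature value is never smaller than the maximum); the scores used here are made up, and aggregate_bin is a hypothetical helper, not part of Catchitt.

```shell
#!/bin/sh
# Aggregate per-position motif scores of one bin into (maximum, log-sum-exp).
# The scores below are made-up illustration values, not Catchitt output.
aggregate_bin() {
    # Reads one score per line from stdin.
    awk '
        { s[NR] = $1 + 0; if (NR == 1 || s[NR] > max) max = s[NR] }
        END {
            # Subtract the maximum before exponentiating for numerical stability.
            for (i = 1; i <= NR; i++) sum += exp(s[i] - max)
            printf "%.4f\t%.4f\n", max, max + log(sum)
        }'
}

result=$(printf '%s\n' -5.0 -7.5 -9.0 -6.2 | aggregate_bin)
echo "$result"
```

Because every term of the sum is positive, the second value can only be equal to or larger than the maximum, and it grows when several positions in the bin score well.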
<br />
=== Iterative training ===<br />
<br />
With all the feature files prepared, we may now run the iterative training procedure. Here, we use all autosomes except chr1, chr8, and chr21 for training, use five of those chromosomes also for generating new negative examples in each of the iterations, and use 8 computation threads for the numeric optimization of model parameters.<br />
''At this stage, it is critical that all feature files have been generated from the same reference. This way, we may sweep in parallel over all feature files that, at each line, represent the identical genomic location. Otherwise, the iterative training will throw an error stating that the chromosomes do not match at a certain line of the input files.''<br />
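
Before starting a long training run, it may be worth checking this property directly. The sketch below compares the first two columns (chromosome and start position) of two gzipped files line by line. It uses small stand-in files with made-up values; same_bins is a hypothetical helper, and in the tutorial one would pass, for instance, astrocytes/access/Chromatin_accessibility.tsv.gz and motifs/CTCF/Motif_scores.tsv.gz.

```shell
#!/bin/sh
# Check that two gzipped feature files list the same (chromosome, start) pairs line by line.
# The stand-in files below contain made-up values for illustration.
tmp=$(mktemp -d)
printf 'chr1\t0\t0.1\nchr1\t50\t0.2\nchr2\t0\t0.3\n' | gzip -c > "$tmp/a.tsv.gz"
printf 'chr1\t0\t-4.9\t-4.8\nchr1\t50\t-5.9\t-5.4\nchr2\t0\t-0.8\t-0.4\n' | gzip -c > "$tmp/b.tsv.gz"
printf 'chr1\t0\t-4.9\t-4.8\nchr1\t100\t-5.9\t-5.4\nchr2\t0\t-0.8\t-0.4\n' | gzip -c > "$tmp/c.tsv.gz"

same_bins() {
    # Compare only the first two columns; report the first mismatching line, if any.
    gzip -dc "$1" | cut -f1,2 > "$tmp/left.txt"
    gzip -dc "$2" | cut -f1,2 > "$tmp/right.txt"
    if cmp -s "$tmp/left.txt" "$tmp/right.txt"; then
        echo "OK"
    else
        paste "$tmp/left.txt" "$tmp/right.txt" |
            awk -F '\t' '$1 != $3 || $2 != $4 { print "mismatch at line " NR; exit }'
    fi
}

same_bins "$tmp/a.tsv.gz" "$tmp/b.tsv.gz"
same_bins "$tmp/a.tsv.gz" "$tmp/c.tsv.gz"
```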
<br />
We start iterative training by calling<br />
 java -jar Catchitt.jar itrain a=astrocytes/access/Chromatin_accessibility.tsv.gz m=motifs/CTCF/Motif_scores.tsv.gz \<br />
 l=astrocytes/labels/Labels.tsv.gz f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai \<br />
 b=50 t='chr2,chr3,chr4,chr5,chr6,chr7,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr22' \<br />
 itc='chr10,chr11,chr12,chr13,chr14' outdir=astrocytes/itrain threads=8<br />
which results in a file astrocytes/itrain/Classifiers.xml containing the trained classifiers.<br />
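
Typing the chromosome list for parameter "t" by hand is error-prone. As a sketch, the list can be derived from the first column of the FastA index instead. The index content below is a shortened stand-in with made-up lengths and offsets, and the choice of excluded chromosomes (chr1, chr8, chr21) simply mirrors the call above; training_chromosomes is a hypothetical helper.

```shell
#!/bin/sh
# Build a comma-separated chromosome list for the "t" parameter from a FastA index.
# The .fai content is a shortened stand-in; only the sequence names in column 1 matter here.
tmp=$(mktemp -d)
printf 'chr1\t1000\t6\t60\t61\nchr2\t900\t1030\t60\t61\nchr3\t800\t1952\t60\t61\nchr8\t700\t2772\t60\t61\nchr21\t600\t3490\t60\t61\nchrX\t500\t4106\t60\t61\nchrUn_KI270302v1\t100\t4640\t60\t61\n' > "$tmp/genome.fa.fai"

training_chromosomes() {
    # Keep numbered autosomes only, drop the excluded ones, and join with commas.
    cut -f1 "$1" | grep -E '^chr[0-9]+$' | grep -v -x -E 'chr(1|8|21)' | paste -s -d , -
}

t=$(training_chromosomes "$tmp/genome.fa.fai")
echo "t='$t'"
```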
<br />
=== Predicting binding in new cell types ===<br />
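Using the trained classifiers from the previous step and the DNase features for fibroblasts, we may now predict binding in the fibroblast cell type. This assumes that the "access" tool has also been run on fibroblasts/ENCFF652HJH.bigWig with outdir=fibroblasts/access, analogously to the astrocyte call above. In the example, we generate predictions only for chromosome 8, which could be extended to other chromosomes using parameter "p":<br />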
 java -jar Catchitt.jar predict c=astrocytes/itrain/Classifiers.xml a=fibroblasts/access/Chromatin_accessibility.tsv.gz \<br />
 m=motifs/CTCF/Motif_scores.tsv.gz f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai \<br />
 p="chr8" outdir=fibroblasts/predict<br />
This finally results in a file fibroblasts/predict/Predictions.tsv.gz containing the predicted binding probabilities per region.<br />
This file has three columns, corresponding to chromosome, starting position, and binding probability:<br />
<br />
chr8 265850 0.9866555574053496<br />
chr8 265900 0.9865107771922306<br />
chr8 265950 0.9864837006927715<br />
chr8 266000 0.8041139249973046<br />
chr8 266050 0.19870629729482686<br />
chr8 266100 0.1302269536110939<br />
chr8 266150 0.09693322015563202<br />
<br />
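
For downstream tools, it is often convenient to turn such predictions into intervals. The sketch below keeps all lines with a probability of at least a cutoff and prints them as BED-like intervals. The cutoff of 0.5 is an arbitrary assumption, each line is treated as a 50 bp bin starting at the given position (the bin width used throughout this tutorial), and high_confidence_bins is a hypothetical helper. The input lines are taken from the example output above.

```shell
#!/bin/sh
# Convert predictions into BED-like intervals for bins above a probability cutoff.
# Cutoff and bin width are assumptions to adapt to the actual run.
tmp=$(mktemp -d)
tr ' ' '\t' > "$tmp/Predictions.tsv" <<'EOF'
chr8 265850 0.9866555574053496
chr8 265900 0.9865107771922306
chr8 265950 0.9864837006927715
chr8 266000 0.8041139249973046
chr8 266050 0.19870629729482686
chr8 266100 0.1302269536110939
EOF
gzip "$tmp/Predictions.tsv"
PRED="$tmp/Predictions.tsv.gz"

high_confidence_bins() {
    # $1: predictions file, $2: probability cutoff, $3: bin width in bp
    gzip -dc "$1" | awk -v cutoff="$2" -v w="$3" '
        BEGIN { OFS = "\t" }
        $3 + 0 >= cutoff + 0 { print $1, $2, $2 + w, $3 }'
}

bed=$(high_confidence_bins "$PRED" 0.5 50)
echo "$bed"
```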
<br />
=== Using DNase-seq BAM files and multiple motifs ===<br />
<br />
Instead of bigwig files, the "access" tool of Catchitt also accepts BAM files of mapped DNase-seq (or ATAC-seq) data. Internally, this tool counts 5' ends of reads, and performs local normalization of read depth and average smoothing.<br />
Here, we download the BAM files corresponding to the previous bigwig files from ENCODE<br />
https://www.encodeproject.org/files/ENCFF384CCQ/@@download/ENCFF384CCQ.bam<br />
https://www.encodeproject.org/files/ENCFF368XNE/@@download/ENCFF368XNE.bam<br />
<br />
and sort them into the directory structure.<br />
<br />
In addition, we use four motifs from the ''used-for-all-TFs'' directory of our [http://www.jstacs.de/downloads/motifs.tgz motif collection].<br />
<br />
Afterwards, the directory structure should look like<br />
<br />
.:<br />
Catchitt.jar<br />
GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta<br />
GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai<br />
<br />
./astrocytes:<br />
ENCFF183YLB.bed<br />
ENCFF600CYD.bed<br />
ENCFF901UBX.bigWig<br />
ENCFF384CCQ.bam<br />
<br />
./fibroblasts:<br />
ENCFF652HJH.bigWig<br />
ENCFF368XNE.bam<br />
<br />
./motifs/CTCF/:<br />
CTCF_HUMAN.H11MO.0.A.pwm<br />
<br />
./motifs/CTCF_Slim:<br />
Ctcf_H1hesc_shift20_bdeu_order-20_comp1-model-1.xml<br />
<br />
./motifs/JUND_Slim:<br />
Jund_K562_shift20_bdeu_order-20_comp1-model-1.xml<br />
<br />
./motifs/MAX_Slim:<br />
Max_K562_shift20_bdeu_order-20_comp1-model-1.xml<br />
<br />
./motifs/SP1:<br />
ENCSR000BHK_SP1-human_1_hg19-model-2.xml<br />
<br />
<br />
Now, we first compute the DNase-seq features from the BAM files using the "access" tool:<br />
<br />
java -jar Catchitt.jar access i=astrocytes/ENCFF384CCQ.bam b=50 outdir=astrocytes/access_bam/<br />
java -jar Catchitt.jar access i=fibroblasts/ENCFF368XNE.bam b=50 outdir=fibroblasts/access_bam/<br />
<br />
We also compute the motif-based features from the additional motif files. For the PWM model of SP1, we switch the input format to Dimont XMLs but still use the low-memory version of "motif" that we also used for the HOCOMOCO PWM:<br />
<br />
 java -jar Catchitt.jar motif d=motifs/SP1/ENCSR000BHK_SP1-human_1_hg19-model-2.xml \<br />
 g=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai \<br />
 b=50 outdir=motifs/SP1 threads=3<br />
<br />
The remaining motif models are [[Slim]] models, which are substantially more complex than PWMs. While scans for these models could be accomplished by the low-memory version of "motif" as well, this would require substantial runtime. Hence, we switch off the low-memory option in this case, which, in turn, requires increasing the memory reserved by Java:<br />
<br />
 java -Xms512M -Xmx64G -jar Catchitt.jar motif d=motifs/CTCF_Slim/Ctcf_H1hesc_shift20_bdeu_order-20_comp1-model-1.xml \<br />
 g=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai \<br />
 b=50 outdir=motifs/CTCF_Slim l=false threads=3<br />
 java -Xms512M -Xmx64G -jar Catchitt.jar motif d=motifs/JUND_Slim/Jund_K562_shift20_bdeu_order-20_comp1-model-1.xml \<br />
 g=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai \<br />
 b=50 outdir=motifs/JUND_Slim l=false threads=3<br />
 java -Xms512M -Xmx64G -jar Catchitt.jar motif d=motifs/MAX_Slim/Max_K562_shift20_bdeu_order-20_comp1-model-1.xml \<br />
 g=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai \<br />
 b=50 outdir=motifs/MAX_Slim l=false threads=3<br />
<br />
Finally, we start the iterative training using the new feature files:<br />
 java -jar Catchitt.jar itrain a=astrocytes/access_bam/Chromatin_accessibility.tsv.gz \<br />
 m=motifs/CTCF/Motif_scores.tsv.gz m=motifs/CTCF_Slim/Motif_scores.tsv.gz m=motifs/JUND_Slim/Motif_scores.tsv.gz \<br />
 m=motifs/MAX_Slim/Motif_scores.tsv.gz m=motifs/SP1/Motif_scores.tsv.gz l=astrocytes/labels/Labels.tsv.gz \<br />
 f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai b=50 \<br />
 t='chr2,chr3,chr4,chr5,chr6,chr7,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr22' \<br />
 itc='chr10,chr11,chr12,chr13,chr14' outdir=astrocytes/itrain_bam_5motifs threads=8<br />
Please note that we used the parameter "m" multiple times to specify the different motif-based feature files.<br />
<br />
It is important to specify these motifs in the same order when calling "predict" afterwards, i.e.<br />
 java -jar Catchitt.jar predict c=astrocytes/itrain_bam_5motifs/Classifiers.xml a=fibroblasts/access_bam/Chromatin_accessibility.tsv.gz \<br />
 m=motifs/CTCF/Motif_scores.tsv.gz m=motifs/CTCF_Slim/Motif_scores.tsv.gz m=motifs/JUND_Slim/Motif_scores.tsv.gz \<br />
 m=motifs/MAX_Slim/Motif_scores.tsv.gz m=motifs/SP1/Motif_scores.tsv.gz \<br />
 f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai p="chr8" outdir=fibroblasts/predict_bam_5motifs<br />
<br />
The predictions based on the BAM files and the five motifs are then available from the file fibroblasts/predict_bam_5motifs/Predictions.tsv.gz in the format explained previously.<br />
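
Since the order of the "m" parameters must agree between "itrain" and "predict", one option is to generate them from a single shared list. The following is a minimal sketch of that idea; motif_args and MOTIF_DIRS are hypothetical names, the directories are those used in this tutorial, and the commands are only printed, not executed.

```shell
#!/bin/sh
# Generate the m= arguments for "itrain" and "predict" from one shared list,
# so that the motif order cannot diverge between the two calls.
MOTIF_DIRS="motifs/CTCF motifs/CTCF_Slim motifs/JUND_Slim motifs/MAX_Slim motifs/SP1"

motif_args() {
    args=""
    for d in $MOTIF_DIRS; do
        args="$args m=$d/Motif_scores.tsv.gz"
    done
    # Strip the leading space.
    echo "${args# }"
}

M_ARGS=$(motif_args)
FAI=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai

# Print both calls (dry run); the remaining parameters follow the calls shown above.
echo "java -jar Catchitt.jar itrain a=astrocytes/access_bam/Chromatin_accessibility.tsv.gz $M_ARGS l=astrocytes/labels/Labels.tsv.gz f=$FAI b=50 outdir=astrocytes/itrain_bam_5motifs"
echo "java -jar Catchitt.jar predict c=astrocytes/itrain_bam_5motifs/Classifiers.xml a=fibroblasts/access_bam/Chromatin_accessibility.tsv.gz $M_ARGS f=$FAI p=chr8 outdir=fibroblasts/predict_bam_5motifs"
```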
<br />
== Version history ==<br />
<br />
* Catchitt v0.1.3: Bugfix to load Catchitt classifiers learned with older Catchitt versions<br />
<br />
* [http://www.jstacs.de/downloads/Catchitt_0.1.2.jar Catchitt v0.1.2]: Bugfixes, new experimental tools for handling methylation levels<br />
<br />
* [http://www.jstacs.de/downloads/Catchitt_0.1.1.jar Catchitt v0.1.1]: Bugfixes for border cases; reduced debugging output<br />
<br />
* Catchitt v0.1: [http://www.jstacs.de/downloads/Catchitt_0.1.jar Initial release]</div>Grauhttps://www.jstacs.de/index.php?title=Catchitt&diff=1118Catchitt2020-10-13T19:41:24Z<p>Grau: /* Downloads */</p>
<hr />
<div>Catchitt is a collection of tools for predicting cell type-specific binding regions of transcription factors (TFs) based on binding motifs and chromatin accessibility assays.<br />
The initial implementation of this methodology has been one of the winning approaches of the ENCODE-DREAM challenge ([https://www.synapse.org/#!Synapse:syn6131484/wiki/402026]) and is described in a [https://www.biorxiv.org/content/early/2017/12/06/230011 preprint] (doi: 10.1101/230011) and a recent [https://doi.org/10.1186/s13059-018-1614-y paper].<br />
The implementation in Catchitt has been streamlined and slightly simplified to make its application more straightforward. Specifically, we reduced the set of chromatin accessibility features to the most important ones, we simplified the sampling strategy of initial negative examples in the training step, and we omitted quantile normalization of chromatin accessibility features.<br />
<br />
== Catchitt tools ==<br />
<br />
Catchitt comprises five tools for the individual steps of the pipeline (see below). The tool "labels" computes labels for genomic regions from "conservative" (i.e., IDR-thresholded) and "relaxed" ChIP-seq peaks.<br />
The tool "access" computes chromatin accessibility features from DNase-seq or ATAC-seq data, either based on fold-enrichment tracks in Bigwig format (e.g., MACS output) or based on SAM/BAM files of mapped reads.<br />
The tool "motif" computes motif-based features from genomic sequence and PWMs in Jaspar or HOCOMOCO format, or motif models from [[Dimont]], including [[Slim]] models.<br />
The tool "itrain" performs iterative training of a series of classifiers based on labels, chromatin accessibility features, and motif features.<br />
The tool "predict" predicts binding probabilities of genomic regions based on trained classifiers and feature files. The feature files may either be measured on the training cell type (e.g., other chromosomes, "within cell type" case) or on a different cell type.<br />
<br />
== Downloads ==<br />
<br />
We provide Catchitt as a pre-compiled JAR file and also publish its source code under GPL 3. For compiling Catchitt from source files, Jstacs (v. 2.3 and later) and the corresponding external libraries are required.<br />
<br />
''Catchitt is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.''<br />
<br />
* [http://www.jstacs.de/downloads/Catchitt-0.1.3.jar JAR download]<br />
* the source code of Catchitt is available from [https://github.com/Jstacs/Jstacs github] in package projects.encodedream.<br />
* [http://www.jstacs.de/downloads/motifs.tgz motifs] used in the ENCODE-DREAM challenge<br />
<br />
== Citation ==<br />
<br />
If you use Catchitt in your research, please cite<br />
<br />
J. Keilwagen, S. Posch, and J. Grau. [https://doi.org/10.1186/s13059-018-1614-y Accurate prediction of cell type-specific transcription factor binding]. ''Genome Biology'', 20(1):9, 2019.<br />
<br />
== Usage ==<br />
<br />
In the following, <code>Catchitt.jar</code> stands for the current Catchitt binary, which is version 0.1.3 at the time of writing. Every occurrence of <code>Catchitt.jar</code> therefore needs to be replaced by <code>Catchitt-0.1.3.jar</code> when running code examples with the current Catchitt binary version.<br />
<br />
<br />
Catchitt can be started by calling<br />
<br />
java -jar Catchitt.jar<br />
<br />
on the command line. This lists the names of the available tools with a short description:<br />
<br />
Available tools:<br />
<br />
access - Chromatin accessibility<br />
methyl - Methylation levels<br />
motif - Motif scores<br />
labels - Derive labels<br />
itrain - Iterative Training<br />
predict - Prediction<br />
<br />
Syntax: java -jar Catchitt.jar <toolname> [<parameter=value> ...]<br />
<br />
Further info about the tools is given with<br />
java -jar Catchitt.jar <toolname> info<br />
<br />
Tool parameters are listed with<br />
java -jar Catchitt.jar <toolname><br />
<br />
== Tools ==<br />
<br />
=== Derive labels ===<br />
<br />
''Derive labels'' computes labels for genomic regions based on ChIP-seq peak files. The input ChIP-seq peak files must be provided in narrowPeak format and may come in 'conservative', i.e., IDR-thresholded, and 'relaxed' flavors. In case only a single peak file is available, both of the corresponding parameters may be set to this one peak file. The parameter for the bin width defines the resolution of the genomic regions that are assigned a label, while the parameter for the region width defines the size of the regions considered. If, for instance, the bin width is set to 50 and the region width to 100, regions of 100 bp shifted by 50 bp along the genome are labeled. The labels assigned may be 'S' (summit) if the current bin contains the annotated summit of a conservative peak, 'B' (bound) if the current region overlaps a conservative peak by at least half the region width, 'A' (ambiguous) if the current region overlaps a relaxed peak by at least 1 bp, or 'U' (unbound) if it overlaps with none of the peaks. The output is provided as a gzipped file 'Labels.tsv.gz' with columns chromosome, start position, and label. This output file together with a protocol of the tool run is saved to the specified output directory.<br />
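
The labeling rules can be illustrated for a single region and a single pair of peaks. The following sketch is not Catchitt's implementation: label_region is a hypothetical helper, coordinates are treated as half-open intervals, the summit is given as an absolute position, the rules are applied in the order S, B, A, U, and the placement of the region relative to the bin is simply passed in by the caller. All of these are assumptions made for illustration.

```shell
#!/bin/sh
# Illustrative labeling of one region against one conservative and one relaxed peak.
# Assumptions: half-open intervals [start, end), absolute summit position,
# rule priority S > B > A > U. This is a sketch, not Catchitt's implementation.
label_region() {
    # Args: bin_start bin_end region_start region_end cons_start cons_end cons_summit relaxed_start relaxed_end
    awk -v bs="$1" -v be="$2" -v rs="$3" -v re="$4" \
        -v cs="$5" -v ce="$6" -v sm="$7" -v xs="$8" -v xe="$9" '
        function overlap(a1, a2, b1, b2,   lo, hi) {
            # Length of the intersection of [a1, a2) and [b1, b2), or 0 if disjoint.
            lo = (a1 > b1) ? a1 : b1
            hi = (a2 < b2) ? a2 : b2
            return (hi > lo) ? hi - lo : 0
        }
        BEGIN {
            if (sm >= bs && sm < be) print "S"                                # summit inside the bin
            else if (overlap(rs, re, cs, ce) >= (re - rs) / 2) print "B"      # at least half the region width
            else if (overlap(rs, re, xs, xe) >= 1) print "A"                  # at least 1 bp of a relaxed peak
            else print "U"
        }'
}

# Bin 100-150 inside region 50-250, with varying peak positions.
label_region 100 150 50 250 120 400 130 100 420
label_region 100 150 50 250 140 400 300 100 420
label_region 100 150 50 250 200 400 300 180 420
label_region 100 150 50 250 300 400 350 280 420
```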
<br />
''Derive labels'' may be called with<br />
<br />
java -jar Catchitt.jar labels<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>Conservative peaks (NarrowPeak file containing the conservative peaks)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>Relaxed peaks (NarrowPeak file containing the relaxed peaks)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>FAI of genome (FastA index file of the genome)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bin width (The width of the genomic bins considered, valid range = [1, 10000], default = 50)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rw</font></td><br />
<td>Region width (The width of the genomic regions considered for overlaps, valid range = [1, 10000], default = 50)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar Catchitt.jar labels c=conservative.narrowPeak r=relaxed.narrowPeak f=hg19.fa.fai b=50 rw=200 outdir=labels<br />
<br />
<br />
=== Chromatin accessibility ===<br />
<br />
''Chromatin accessibility'' computes several chromatin accessibility features from DNase-seq or ATAC-seq data provided as fold-enrichment tracks or SAM/BAM files of mapped reads. Features are computed with a certain resolution defined by the bin width parameter. If this parameter is set to 50, for instance, features are computed for non-overlapping 50 bp bins along the genome. If input data are provided as a SAM/BAM file, coverage information is extracted and normalized locally in a similar fashion as proposed for the MACS peak caller. Output is provided as a gzipped file 'Chromatin_accessibility.tsv.gz' with columns chromosome, start position of the bin, minimum coverage and median coverage in the current bin, minimum coverage in 1000 bp regions before and after the current bin, maximum coverage in 1000 bp regions before and after the current bin, the number of steps in the coverage profile, and the number of monotonically increasing and decreasing steps in the coverage profile of the current bin. This output file together with a protocol of the tool run is saved to the specified output directory.<br />
<br />
''Chromatin accessibility'' may be called with<br />
<br />
java -jar Catchitt.jar access<br />
<br />
and has the following parameters<br />
<br />
<br />
<table border=0 cellpadding=10 align="center"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>Data source (The format of the input file containing the coverage information, range={BAM/SAM, Bigwig}, default = BAM/SAM)<table border=0 cellpadding=10 align="center"><br />
<tr><td colspan=3>Parameters for selection &quot;BAM/SAM&quot;:</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>Input SAM/BAM (The input file containing the mapped DNase-seq/ATAC-seq reads)</td><br />
<td>FILE</td><br />
</tr><br />
<tr><td colspan=3>Parameters for selection &quot;Bigwig&quot;:</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>Input Bigwig (The input file containing the mapped DNase-seq/ATAC-seq reads)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>FastA index (The genome index)</td><br />
<td>FILE</td><br />
</tr><br />
</table></td><td></td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bin width (The width of the genomic bins considered)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar Catchitt.jar access d="Bigwig" i=fold_enrich.bw f=hg19.fa.fai b=50 outdir=dnase<br />
<br />
<br />
=== Methylation levels ===<br />
''Methylation levels'' may be called with<br />
<br />
java -jar Catchitt.jar methyl<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>Input Bed.gz (The bedMethyl file (gzipped) containing the methylation levels, mime = bed.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>FastA index (The genome index, mime = fai)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bin width (The width of the genomic bins considered)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar Catchitt.jar methyl i=Input_Bed.gz f=hg19.fa.fai b=50<br />
<br />
<br />
=== Motif scores ===<br />
<br />
''Motif scores'' computes features based on motif scores of a given motif model scanning sub-sequences along the genome. Motif scores are aggregated in bins of the specified width as maximum score and log of the average exponential score (i.e., average log-likelihood in case of statistical models). The motif model may be provided as PWMs in HOCOMOCO or PFMs in Jaspar format, or as [[Dimont]] motif models in XML format. For more complex motif models like Slim models, the current implementation uses several indexes to speed up the scanning process. However, computation of these indexes is rather memory-consuming and often not reasonable for simple PWM models. Hence, a low-memory variant of the tool is available, which is typically only slightly slower for PWM models but substantially slower for Slim models. Output is provided as a gzipped file 'Motif_scores.tsv.gz' containing columns chromosome, start position, maximum and average score. This output file together with a protocol of the tool run is saved to the specified output directory.<br />
<br />
<br />
''Motif scores'' may be called with<br />
<br />
java -jar Catchitt.jar motif<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Motif model (The motif model in Dimont, HOCOMOCO, or Jaspar format, range={Dimont, HOCOMOCO, Jaspar}, default = Dimont)<table border=0 cellpadding=10 align="center"><br />
<tr><td colspan=3>Parameters for selection &quot;Dimont&quot;:</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>Dimont motif (Dimont motif model description)</td><br />
<td>FILE</td><br />
</tr><br />
<tr><td colspan=3>Parameters for selection &quot;HOCOMOCO&quot;:</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">h</font></td><br />
<td>HOCOMOCO PWM (PWM from the HOCOMOCO database)</td><br />
<td>FILE</td><br />
</tr><br />
<tr><td colspan=3>Parameters for selection &quot;Jaspar&quot;:</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">j</font></td><br />
<td>Jaspar PFM (PFM in Jaspar format)</td><br />
<td>FILE</td><br />
</tr><br />
</table></td><td></td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>Genome (Genome as FastA file)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>FAI of genome (FastA index file of the genome)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bin width (The width of the genomic bins considered)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">l</font></td><br />
<td>Low-memory mode (Use slower mode with a smaller memory footprint, default = true)</td><br />
<td>BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example'''<br />
<br />
java -jar Catchitt.jar motif m=HOCOMOCO h=motif.pwm g=hg19.fa f=hg19.fa.fai b=50 outdir=motifs<br />
<br />
=== Iterative Training ===<br />
<br />
''Iterative Training'' performs an iterative training with the specified number of iterations to obtain a series of classifiers that may be used for predictions in the same cell type or in other cell types based on a corresponding set of feature files. The tool requires as input labels for the training chromosomes, a chromatin accessibility feature file, and a set of motif feature files. From the labels, an initial set of training regions is extracted containing all positive examples labeled as 'S' (summit) and a sub-sample of negative examples of regions labeled as 'U' (unbound). During the iterations, the initial negative examples are complemented with additional negatives that obtain large binding probabilities, i.e., putative false positive predictions. As these additional negative examples are derived from predictions of the current set of classifiers, the number of bins used for aggregation needs to be specified and should be identical to those used for predictions later. Training chromosomes and chromosomes used for predictions in the iterative training may be specified, as well as the percentile of the scores of positive examples (i.e., summit or bound regions) that should be used to identify putative false positives. The specified bin width must be identical to the bin width specified when computing the corresponding feature files. Feature vectors for training regions may span several adjacent bins as specified by the number of bins parameter. Output is an XML file Classifiers.xml containing the set of trained classifiers. This output file together with a protocol of the tool run is saved to the specified output directory.<br />
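<br />
The role of the percentile parameter may be easier to see in a small Python sketch. It only mirrors the description above loosely and is not taken from the Catchitt source: the function name <code>select_false_positives</code>, the nearest-rank percentile rule, and the "at least the threshold" comparison are all assumptions made for illustration.<br />
<br />
```python
def select_false_positives(positive_scores, negative_scores, percentile=0.01):
    """Return the indices of negatives that score at least the threshold.

    The threshold is the given percentile (a fraction in [0, 1]) of the
    prediction scores of the positive examples. Catchitt's exact rule may
    differ; this is an illustrative assumption.
    """
    ranked = sorted(positive_scores)
    # Nearest-rank style index into the sorted positive scores.
    index = min(int(percentile * len(ranked)), len(ranked) - 1)
    threshold = ranked[index]
    return [i for i, score in enumerate(negative_scores) if score >= threshold]


# Hypothetical prediction scores of positive and negative regions.
positives = [0.9, 0.8, 0.7, 0.6]
negatives = [0.1, 0.85, 0.8, 0.5]
print(select_false_positives(positives, negatives, percentile=0.5))
```
<br />
In this reading, a small percentile yields a low threshold and hence many additional negatives, while a large percentile keeps only negatives that score as high as the best positives.<br />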
<br />
''Iterative Training'' may be called with<br />
<br />
java -jar Catchitt.jar itrain<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>Accessibility (File containing accessibility features)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Motif (File containing motif features), MAY BE USED MULTIPLE TIMES</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">l</font></td><br />
<td>Labels (File containing the labels)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>FAI of genome (FastA index file of the genome)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bin width (The width of the genomic bins, valid range = [1, 1000], default = 50)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Number of bins (The number of adjacent bins, valid range = [1, 20], default = 5)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">abb</font></td><br />
<td>Aggregation: bins before (The number of bins before the current one considered in the aggregation, valid range = [1, 20], default = 1)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">aba</font></td><br />
<td>Aggregation: bins after (The number of bins after the current one considered in the aggregation, valid range = [1, 20], default = 4)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>Iterations (The number of iterations of the iterative training, valid range = [1, 20], default = 5)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>Training chromosomes (Training chromosomes, separated by commas, OPTIONAL)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">itc</font></td><br />
<td>Iterative training chromosomes (Chromosomes with predictions in iterative training, separated by commas, OPTIONAL)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Percentile (Percentile of the prediction scores of positives used as threshold in iterative training, valid range = [0.0, 1.0], default = 0.01)</td><br />
<td>DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example'''<br />
<br />
java -jar Catchitt.jar itrain a=dnase/Chromatin_accessibility.tsv.gz m=motif1/Motif_scores.tsv.gz m=motif2/Motif_scores.tsv.gz l=labels/Labels.tsv.gz f=hg19.fa.fai b=50 n=5 abb=1 aba=4 i=5 t="chr1,chr2,chr3" itc="chr1,chr2" p=0.01 outdir=cls<br />
<br />
=== Prediction ===<br />
<br />
''Prediction'' predicts binding probabilities of genomic regions as specified during training of the set of classifiers in iterative training. As input, Prediction requires a set of trained classifiers in XML format, the same (type of) feature files as used in training (motif files must be specified in the same order!). In addition, the chromosomes for which predictions are made may be specified, and the number of bins used for aggregation may be specified to deviate from those used during training. If these bin numbers are not specified, those from the training run are used. Finally, it is possible to restrict the number of classifiers considered to the first n ones. Output is provided as a gzipped file 'Predictions.tsv.gz' with columns chromosome, start position, binding probability. This output file together with a protocol of the tool run is saved to the specified output directory.<br />
<br />
''Prediction'' may be called with<br />
<br />
java -jar Catchitt.jar predict<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>Classifiers (The classifiers trained by iterative training)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>Accessibility (File containing accessibility features)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Motif (File containing motif features) MAY BE USED MULTIPLE TIMES</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>FAI of genome (FastA index file of the genome)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Prediction chromosomes (Prediction chromosomes, separated by commas, OPTIONAL)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">abb</font></td><br />
<td>Aggregation: bins before (Number of bins before the current one considered for aggregation., OPTIONAL)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">aba</font></td><br />
<td>Aggregation: bins after (Number of bins after the current one considered for aggregation., OPTIONAL)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Number of classifiers (Use only the first k classifiers for predictions., OPTIONAL)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example'''<br />
<br />
java -jar Catchitt.jar predict c=cls/Classifiers.xml a=dnase/Chromatin_accessibility.tsv.gz m=motif1/Motif_scores.tsv.gz m=motif2/Motif_scores.tsv.gz f=hg19.fa.fai p="chr8,chr21" abb=1 aba=4 n=3 outdir=predict<br />
<br />
== Standard pipeline ==<br />
<br />
The standard Catchitt pipeline would comprise the following steps<br />
<br />
* for a training cell type, collect ChIP-seq peak files (preferably ''conservative'' and ''relaxed'' peaks) in narrowPeak format and derive labels for genomic regions (''Derive labels'')<br />
* for the same cell type, collect chromatin accessibility data (DNase-seq or ATAC-seq) as fold-enrichment tracks or mapping files, and derive chromatin accessibility features from those data (''Chromatin accessibility'')<br />
* collect or learn (e.g., using [[Dimont]]) a set of motif models for the transcription factor of interest, and scan the genome using these motif models (''Motif scores'')<br />
* perform iterative training given the labels and feature files (''Iterative Training'')<br />
* predict binding probabilities of genomic regions in the same cell type or in other cell types. In the latter case, additional chromatin accessibility data for these target cell types need to be collected and features need to be derived as in step 2. (''Prediction'')<br />
<br />
<br />
== Tutorial using ENCODE data ==<br />
<br />
We describe a typical Catchitt pipeline using public ENCODE data for the transcription factor CTCF in two cell lines.<br />
This tutorial uses real-world data on the whole ENCODE GRCh38 human genome version, illustrating different DNase-seq input formats and different motif sources. Please note that this realistic scenario also comes at the expense of real-world runtimes of the individual Catchitt steps.<br />
<br />
For best performance, we would further recommend<br />
* to use multiple motifs from different sources, including motifs derived from DNase-seq (available in our [http://www.jstacs.de/downloads/motifs.tgz motif collection] of the ENCODE-DREAM challenge in directory de-novo/DNase-peaks)<br />
* to use replicate information for DNase data, for instance using the [https://github.com/kundajelab/atac_dnase_pipelines pipeline of the Kundaje lab]<br />
<br />
In this tutorial, we concentrate on the Catchitt pipeline and illustrate its usage based on readily available data.<br />
<br />
=== Obtaining training and test data ===<br />
<br />
First, we need the GRCh38 genome version used by ENCODE. This genome is available as a gzipped FastA file from [https://www.encodeproject.org ENCODE] at<br />
https://www.encodeproject.org/files/GRCh38_no_alt_analysis_set_GCA_000001405.15/@@download/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.gz<br />
<br />
After download, the genome needs to be gunzipped and indexed using the [http://www.htslib.org samtools] faidx command:<br />
<br />
gunzip GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.gz<br />
samtools faidx GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta<br />
<br />
In the following, we assume that genome FastA and index are in the base directory.<br />
<br />
In addition, we need the DNase-seq data. We consider two cell lines ("astrocyte of the spinal cord" and "fibroblast of villous mesenchyme"). The corresponding DNase-seq data are available from [https://www.encodeproject.org ENCODE] under accessions ENCSR000ENB and ENCSR000EOR, respectively.<br />
Here, we first consider the Bigwig files of the first replicate for each cell line, which can be downloaded from the following URLs:<br />
<br />
https://www.encodeproject.org/files/ENCFF901UBX/@@download/ENCFF901UBX.bigWig<br />
https://www.encodeproject.org/files/ENCFF652HJH/@@download/ENCFF652HJH.bigWig<br />
<br />
For obtaining labels for CTCF binding, we further need ChIP-seq peaks. Here, we consider the ChIP-seq experiment with accession ENCSR000DSU for the astrocytes, which will become our training data in the following.<br />
The corresponding "conservative" and "relaxed" peak files for astrocytes are available from<br />
https://www.encodeproject.org/files/ENCFF183YLB/@@download/ENCFF183YLB.bed.gz<br />
https://www.encodeproject.org/files/ENCFF600CYD/@@download/ENCFF600CYD.bed.gz<br />
<br />
Again, the peak files need to be gunzipped for the following steps.<br />
<br />
Finally, we need a motif model for CTCF, which we download from [http://hocomoco11.autosome.ru HOCOMOCO] in this case<br />
http://hocomoco11.autosome.ru/final_bundle/hocomoco11/full/HUMAN/mono/pwm/CTCF_HUMAN.H11MO.0.A.pwm<br />
<br />
We organize all these files (and the Catchitt JAR) in the following directory structure<br />
<br />
.:<br />
Catchitt.jar<br />
GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta<br />
GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai<br />
<br />
./astrocytes:<br />
ENCFF183YLB.bed<br />
ENCFF600CYD.bed<br />
ENCFF901UBX.bigWig<br />
<br />
./fibroblasts:<br />
ENCFF652HJH.bigWig<br />
<br />
./motifs/CTCF/:<br />
CTCF_HUMAN.H11MO.0.A.pwm<br />
<br />
=== Deriving labels ===<br />
<br />
As we use supervised training of model parameters, we need labels for the genomic regions, qualifying these as bound (B) or unbound (U). Besides, we have additional labels for bound regions at the peak summit (S) and ambiguous regions (A) that are (partly) covered by relaxed but not by conservative peaks.<br />
<br />
For training purposes, we need to derive labels from the astrocyte ChIP-seq peaks by calling<br />
java -jar Catchitt.jar labels c=astrocytes/ENCFF183YLB.bed\<br />
r=astrocytes/ENCFF600CYD.bed\<br />
f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\<br />
b=50 rw=200 outdir=astrocytes/labels<br />
Here, we use a bin width of 50 bp (i.e., we resolve any feature or binding event with 50 bp resolution) and a region width of 200 bp as used in ENCODE-DREAM. A detailed description of the partitioning of the genome into non-overlapping bins and the logic behind the regions for which predictions are made may be found in the [https://doi.org/10.1186/s13059-018-1614-y Catchitt paper].<br />
The result is a file astrocytes/labels/Labels.tsv.gz with the following format<br />
chr1 0 U<br />
chr1 50 U<br />
chr1 100 U<br />
chr1 150 U<br />
chr1 200 U<br />
chr1 250 U<br />
where the columns contain chromosome, bin starting position, and corresponding label, and are separated by tabs.<br />
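<br />
As a quick sanity check of the derived labels, the label counts can be tallied with a few lines of Python. This is a minimal sketch, assuming the file has no header line and exactly the three tab-separated columns shown above; the function name <code>count_labels</code> is made up for illustration.<br />
<br />
```python
import gzip
from collections import Counter


def count_labels(path):
    """Count the labels (S, B, A, U) in a gzipped, tab-separated labels file."""
    counts = Counter()
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue  # skip empty lines, if any
            chromosome, start, label = line.split("\t")
            counts[label] += 1
    return counts
```
<br />
Calling <code>count_labels("astrocytes/labels/Labels.tsv.gz")</code> would then return the number of bins per label; a file without any 'S' labels would leave the iterative training without the positive examples it extracts.<br />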
<br />
=== Preparing DNase data from bigwig format ===<br />
<br />
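We further derive DNase-seq features from the bigwig file that we downloaded in the first step. Again, we specify a bin width of 50 bp. The call below processes the astrocyte data; the fibroblast file fibroblasts/ENCFF652HJH.bigWig needs to be processed analogously (with outdir=fibroblasts/access), because the resulting features are required for the predictions later.<br />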
<br />
java -jar Catchitt.jar access d="Bigwig" i=astrocytes/ENCFF901UBX.bigWig f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\<br />
b=50 outdir=astrocytes/access<br />
The result is a file astrocytes/access/Chromatin_accessibility.tsv.gz with the following format<br />
<br />
chr1 1033400 0.03954650089144707 0.05627769976854324 0.009126120246946812 0.030420400202274323 0.06692489981651306 1.03125 3.0 1.0 0.0<br />
chr1 1033450 0.030420400202274323 0.03650449961423874 0.009126120246946812 0.030420400202274323 0.045630600303411484 1.03125 2.0 0.0 0.0<br />
chr1 1033500 0.024336300790309906 0.03346240147948265 0.009126120246946812 0.030420400202274323 0.045630600303411484 1.03125 2.0 1.0 0.0<br />
chr1 1033550 0.01825219951570034 0.024336300790309906 0.009126120246946812 0.024336300790309906 0.060840800404548645 1.03125 2.0 0.0 1.0<br />
<br />
where the first two columns, again, correspond to chromosome and starting position, and the remaining columns are<br />
* minimum DNase value in bin,<br />
* median DNase value in bin,<br />
* minimum in 1000 bp after bin start,<br />
* minimum in 1000 bp before bin start,<br />
* maximum in 1000 bp after bin start,<br />
* maximum in 1000 bp before bin start,<br />
* the number of steps in the bin profile,<br />
* the length of the longest monotonically increasing range in the bin,<br />
* the length of the longest monotonically decreasing range in the bin.<br />
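<br />
The column layout listed above can be made explicit when reading the file in Python. The following is a sketch only: the field names in <code>AccessibilityRow</code> are our own shorthand for the columns described in the list, not identifiers used by Catchitt, and columns are split on any whitespace.<br />
<br />
```python
from typing import NamedTuple


class AccessibilityRow(NamedTuple):
    chromosome: str
    start: int
    bin_min: float           # minimum DNase value in bin
    bin_median: float        # median DNase value in bin
    min_after: float         # minimum in 1000 bp after bin start
    min_before: float        # minimum in 1000 bp before bin start
    max_after: float         # maximum in 1000 bp after bin start
    max_before: float        # maximum in 1000 bp before bin start
    steps: float             # number of steps in the bin profile
    longest_increase: float  # longest monotonically increasing range
    longest_decrease: float  # longest monotonically decreasing range


def parse_accessibility_line(line):
    """Parse one line of Chromatin_accessibility.tsv.gz into a named tuple."""
    fields = line.split()
    if len(fields) != 11:
        raise ValueError(f"expected 11 columns, found {len(fields)}")
    return AccessibilityRow(fields[0], int(fields[1]), *map(float, fields[2:]))
```
<br />
Named fields avoid off-by-one mistakes when selecting individual features, for instance when plotting the median DNase value along a chromosome.<br />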
<br />
=== Preparing motif scores ===<br />
<br />
We also compute motif scores along the genome for the PWM we downloaded from HOCOMOCO:<br />
<br />
java -jar Catchitt.jar motif m="HOCOMOCO" h=motifs/CTCF/CTCF_HUMAN.H11MO.0.A.pwm g=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta\<br />
f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai b=50 outdir=motifs/CTCF threads=3<br />
The result is a file motifs/CTCF/Motif_scores.tsv.gz with the following format<br />
<br />
chr1 46950 -4.996643 -4.9543528358429105<br />
chr1 47000 -5.984124 -5.451674735652041<br />
chr1 47050 -0.8633305 -0.4596223585537509<br />
chr1 47100 -4.9379983 -4.813470561120627<br />
<br />
where the first two columns, again, correspond to chromosome and starting position, and the remaining two columns are<br />
* the maximum motif score within the bin,<br />
* the logarithm of the average of the exponentials of the individual scores within the bin; for scores that are log-likelihoods, this is proportional to the log-likelihood of the complete sequence.<br />
<br />
=== Iterative training ===<br />
<br />
With all the feature files prepared, we may now run the iterative training procedure. Here, we use all autosomes except chr1, chr8, and chr21 for training, use five of those chromosomes also for generating new negative examples in each of the iterations, and use 8 computation threads for the numeric optimization of model parameters.<br />
''At this stage, it is critical that all feature files have been generated from the same reference. This way, we may sweep in parallel over all feature files that, at each line, represent the identical genomic location. Otherwise, the iterative training will throw an error stating that the chromosomes do not match at a certain line of the input files.''<br />
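<br />
Whether a set of feature files is aligned line by line can be checked before starting a long training run. The following Python sketch is our own helper and not part of Catchitt; it assumes that every line of each gzipped file starts with the chromosome and the bin start position, as in the formats shown above, and the name <code>first_mismatch</code> is made up.<br />
<br />
```python
import gzip
from itertools import zip_longest


def first_mismatch(paths):
    """Return the 1-based number of the first line at which the files disagree
    in chromosome or start position, or None if all files are aligned."""
    handles = [gzip.open(path, "rt") for path in paths]
    try:
        for number, lines in enumerate(zip_longest(*handles), start=1):
            # A file that ended early contributes None and counts as a mismatch.
            keys = {
                tuple(line.split()[:2]) if line is not None else None
                for line in lines
            }
            if len(keys) > 1:
                return number
        return None
    finally:
        for handle in handles:
            handle.close()
```
<br />
For the files of this tutorial, the helper would be called with the accessibility and motif feature files, e.g., <code>first_mismatch(["astrocytes/access/Chromatin_accessibility.tsv.gz", "motifs/CTCF/Motif_scores.tsv.gz"])</code>.<br />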
<br />
We start iterative training by calling<br />
java -jar Catchitt.jar itrain a=astrocytes/access/Chromatin_accessibility.tsv.gz m=motifs/CTCF/Motif_scores.tsv.gz\<br />
l=astrocytes/labels/Labels.tsv.gz f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\<br />
b=50 t='chr2,chr3,chr4,chr5,chr6,chr7,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr22'\<br />
itc='chr10,chr11,chr12,chr13,chr14' outdir=astrocytes/itrain threads=8<br />
which results in a file astrocytes/itrain/Classifiers.xml containing the trained classifiers.<br />
<br />
=== Predicting binding in new cell types ===<br />
Using the trained classifier from the previous step and the DNase data for fibroblasts prepared before, we may now predict binding in the fibroblast cell type. In the example, we generate predictions only for chromosome 8, which could be extended to other chromosomes using parameter "p":<br />
java -jar Catchitt.jar predict c=astrocytes/itrain/Classifiers.xml a=fibroblasts/access/Chromatin_accessibility.tsv.gz\<br />
m=motifs/CTCF/Motif_scores.tsv.gz f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\<br />
p="chr8" outdir=fibroblasts/predict<br />
This finally results in a file fibroblasts/predict/Predictions.tsv.gz containing the predicted binding probabilities per region.<br />
This file has three columns, corresponding to chromosome, starting position, and binding probability:<br />
<br />
chr8 265850 0.9866555574053496<br />
chr8 265900 0.9865107771922306<br />
chr8 265950 0.9864837006927715<br />
chr8 266000 0.8041139249973046<br />
chr8 266050 0.19870629729482686<br />
chr8 266100 0.1302269536110939<br />
chr8 266150 0.09693322015563202<br />
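<br />
For downstream analyses, the predictions are often reduced to the bins with a high binding probability. A minimal Python sketch, assuming the three-column format shown above; the function name <code>bins_above</code> and the threshold of 0.5 are arbitrary choices for illustration, not recommendations from the Catchitt authors.<br />
<br />
```python
import gzip


def bins_above(path, threshold=0.5):
    """Yield (chromosome, start, probability) for all bins whose predicted
    binding probability is at least `threshold`."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            fields = line.split()
            if len(fields) != 3:
                continue  # skip empty or malformed lines
            probability = float(fields[2])
            if probability >= threshold:
                yield fields[0], int(fields[1]), probability
```
<br />
With the excerpt above, <code>bins_above("fibroblasts/predict/Predictions.tsv.gz")</code> would yield the bins starting at 265850, 265900, 265950, and 266000, but none of the following three bins.<br />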
<br />
<br />
=== Using DNase-seq BAM files and multiple motifs ===<br />
<br />
Instead of bigwig files, the "access" tool of Catchitt also accepts BAM files of mapped DNase-seq (or ATAC-seq) data. Internally, this tool counts 5' ends of reads, and performs local normalization of read depth and average smoothing.<br />
Here, we download the BAM files corresponding to the previous bigwig files from ENCODE<br />
https://www.encodeproject.org/files/ENCFF384CCQ/@@download/ENCFF384CCQ.bam<br />
https://www.encodeproject.org/files/ENCFF368XNE/@@download/ENCFF368XNE.bam<br />
<br />
and sort them into the directory structure.<br />
<br />
In addition, we use four motifs from the ''used-for-all-TFs'' directory of our [http://www.jstacs.de/downloads/motifs.tgz motif collection].<br />
<br />
Afterwards, the directory structure should look like<br />
<br />
.:<br />
Catchitt.jar<br />
GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta<br />
GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai<br />
<br />
./astrocytes:<br />
ENCFF183YLB.bed<br />
ENCFF600CYD.bed<br />
ENCFF901UBX.bigWig<br />
ENCFF384CCQ.bam<br />
<br />
./fibroblasts:<br />
ENCFF652HJH.bigWig<br />
ENCFF368XNE.bam<br />
<br />
./motifs/CTCF/:<br />
CTCF_HUMAN.H11MO.0.A.pwm<br />
<br />
./motifs/CTCF_Slim:<br />
Ctcf_H1hesc_shift20_bdeu_order-20_comp1-model-1.xml<br />
<br />
./motifs/JUND_Slim:<br />
Jund_K562_shift20_bdeu_order-20_comp1-model-1.xml<br />
<br />
./motifs/MAX_Slim:<br />
Max_K562_shift20_bdeu_order-20_comp1-model-1.xml<br />
<br />
./motifs/SP1:<br />
ENCSR000BHK_SP1-human_1_hg19-model-2.xml<br />
<br />
<br />
Now, we first compute the DNase-seq features from the BAM files using the "access" tool:<br />
<br />
java -jar Catchitt.jar access i=astrocytes/ENCFF384CCQ.bam b=50 outdir=astrocytes/access_bam/<br />
java -jar Catchitt.jar access i=fibroblasts/ENCFF368XNE.bam b=50 outdir=fibroblasts/access_bam/<br />
<br />
We also compute the motif-based features from the additional motif files. For the PWM model of SP1, we switch the input format to Dimont XMLs but still use the low-memory version of "motif" that we also used for the HOCOMOCO PWM:<br />
<br />
java -jar Catchitt.jar motif d=motifs/SP1/ENCSR000BHK_SP1-human_1_hg19-model-2.xml\<br />
g=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\<br />
b=50 outdir=motifs/SP1 threads=3<br />
<br />
The remaining motif models are [[Slim]] models, which are substantially more complex than PWMs. While scans for these models could be accomplished by the low-memory version of "motif" as well, this would require substantial runtime. Hence, we switch off the low-memory option in this case, which, in turn, requires increasing the memory reserved by Java:<br />
<br />
java -jar -Xms512M -Xmx64G Catchitt.jar motif d=motifs/CTCF_Slim/Ctcf_H1hesc_shift20_bdeu_order-20_comp1-model-1.xml\<br />
g=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\<br />
b=50 outdir=motifs/CTCF_Slim l=false threads=3<br />
java -jar -Xms512M -Xmx64G Catchitt.jar motif d=motifs/JUND_Slim/Jund_K562_shift20_bdeu_order-20_comp1-model-1.xml\<br />
g=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\<br />
b=50 outdir=motifs/JUND_Slim l=false threads=3<br />
java -jar -Xms512M -Xmx64G Catchitt.jar motif d=motifs/MAX_Slim/Max_K562_shift20_bdeu_order-20_comp1-model-1.xml\<br />
g=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\<br />
b=50 outdir=motifs/MAX_Slim l=false threads=3<br />
<br />
Finally, we start the iterative training using the new feature files:<br />
java -jar Catchitt.jar itrain a=astrocytes/access_bam/Chromatin_accessibility.tsv.gz\<br />
m=motifs/CTCF/Motif_scores.tsv.gz m=motifs/CTCF_Slim/Motif_scores.tsv.gz m=motifs/JUND_Slim/Motif_scores.tsv.gz\<br />
m=motifs/MAX_Slim/Motif_scores.tsv.gz m=motifs/SP1/Motif_scores.tsv.gz l=astrocytes/labels/Labels.tsv.gz\<br />
f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai b=50\<br />
t='chr2,chr3,chr4,chr5,chr6,chr7,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr22'\<br />
itc='chr10,chr11,chr12,chr13,chr14' outdir=astrocytes/itrain_bam_5motifs threads=8<br />
Please note that we used the parameter "m" multiple times to specify the different motif-based feature files.<br />
<br />
It is important to specify these motifs in the same order when calling "predict" afterwards, i.e.,<br />
java -jar Catchitt.jar predict c=astrocytes/itrain_bam_5motifs/Classifiers.xml a=fibroblasts/access_bam/Chromatin_accessibility.tsv.gz\<br />
m=motifs/CTCF/Motif_scores.tsv.gz m=motifs/CTCF_Slim/Motif_scores.tsv.gz m=motifs/JUND_Slim/Motif_scores.tsv.gz\<br />
m=motifs/MAX_Slim/Motif_scores.tsv.gz m=motifs/SP1/Motif_scores.tsv.gz\<br />
f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai p="chr8" outdir=fibroblasts/predict_bam_5motifs<br />
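<br />
One way to guarantee an identical motif order in both calls is to generate the two command lines from a single list. The following Python sketch uses only the tool and parameter names shown above; the helper <code>catchitt_command</code> is hypothetical and merely builds argument lists, which could then be passed to <code>subprocess.run</code>.<br />
<br />
```python
def catchitt_command(tool, motif_files, **params):
    """Build the argument list for one Catchitt call.

    The motif feature files are appended as repeated "m=" parameters in
    exactly the order given in `motif_files`.
    """
    command = ["java", "-jar", "Catchitt.jar", tool]
    command += [f"{key}={value}" for key, value in params.items()]
    command += [f"m={path}" for path in motif_files]
    return command


# A single list defines the motif order for training and prediction.
motifs = [
    "motifs/CTCF/Motif_scores.tsv.gz",
    "motifs/CTCF_Slim/Motif_scores.tsv.gz",
    "motifs/JUND_Slim/Motif_scores.tsv.gz",
    "motifs/MAX_Slim/Motif_scores.tsv.gz",
    "motifs/SP1/Motif_scores.tsv.gz",
]
fai = "GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai"

train = catchitt_command(
    "itrain", motifs,
    a="astrocytes/access_bam/Chromatin_accessibility.tsv.gz",
    l="astrocytes/labels/Labels.tsv.gz",
    f=fai, b=50, outdir="astrocytes/itrain_bam_5motifs",
)
predict = catchitt_command(
    "predict", motifs,
    c="astrocytes/itrain_bam_5motifs/Classifiers.xml",
    a="fibroblasts/access_bam/Chromatin_accessibility.tsv.gz",
    f=fai, p="chr8", outdir="fibroblasts/predict_bam_5motifs",
)
print(" ".join(train))
print(" ".join(predict))
```
<br />
The training chromosomes and thread count are omitted here for brevity and could be added as further keyword arguments.<br />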
<br />
The predictions based on the BAM files and the five motifs are then available from the file fibroblasts/predict_bam_5motifs/Predictions.tsv.gz in the format explained previously.<br />
<br />
== Version history ==<br />
<br />
* Catchitt v0.1.2: Bugfixes, new experimental tools for handling methylation levels<br />
<br />
* [http://www.jstacs.de/downloads/Catchitt_0.1.1.jar Catchitt v0.1.1]: Bugfixes for border cases; reduced debugging output<br />
<br />
* Catchitt v0.1: [http://www.jstacs.de/downloads/Catchitt_0.1.jar Initial release]</div>Grauhttps://www.jstacs.de/index.php?title=Catchitt&diff=1117Catchitt2020-10-13T19:41:10Z<p>Grau: /* Usage */</p>
<hr />
<div>Catchitt is a collection of tools for predicting cell type-specific binding regions of transcription factors (TFs) based on binding motifs and chromatin accessibility assays.<br />
The initial implementation of this methodology has been one of the winning approaches of the ENCODE-DREAM challenge ([https://www.synapse.org/#!Synapse:syn6131484/wiki/402026]) and is described in a preprint (https://www.biorxiv.org/content/early/2017/12/06/230011 doi: 10.1101/230011) and a recent [https://doi.org/10.1186/s13059-018-1614-y paper].<br />
The implementation in Catchitt has been streamlined and slightly simplified to make its application more straightforward. Specifically, we reduced the set of chromatin accessibility features to the most important ones, we simplified the sampling strategy of initial negative examples in the training step, and we omitted quantile normalization of chromatin accessibility features.<br />
<br />
== Catchitt tools ==<br />
<br />
Catchitt comprises five tools for the individual steps of the pipeline (see below). The tool "labels" computes labels for genomic regions from "conservative" (i.e., IDR-thresholded) and "relaxed" ChIP-seq peaks.<br />
The tool "access" computes chromatin accessibility features from DNase-seq or ATAC-seq data, either based on fold-enrichment tracks in Bigwig format (e.g., MACS output) or based on SAM/BAM files of mapped reads.<br />
The tool "motif" computes motif-based features from genomic sequence and PWMs in Jaspar or HOCOMOCO format, or motif models from [[Dimont]], including [[Slim]] models.<br />
The tool "itrain" performs iterative training of a series of classifiers based on labels, chromatin accessibility features, and motif features.<br />
The tool "predict" predicts binding probabilities of genomic regions based on trained classifiers and feature files. The feature files may either be measured on the training cell type (e.g., other chromosomes, "within cell type" case) or on a different cell type.<br />
<br />
== Downloads ==<br />
<br />
We provide Catchitt as a pre-compiled JAR file and also publish its source code under GPL 3. For compiling Catchitt from source files, Jstacs (v. 2.3 and later) and the corresponding external libraries are required.<br />
<br />
''Catchitt is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.''<br />
<br />
* [http://www.jstacs.de/downloads/Catchitt.jar JAR download]<br />
* the source code of Catchitt is available from [https://github.com/Jstacs/Jstacs github] in package projects.encodedream.<br />
* [http://www.jstacs.de/downloads/motifs.tgz motifs] used in the ENCODE-DREAM challenge<br />
<br />
== Citation ==<br />
<br />
If you use Catchitt in your research, please cite<br />
<br />
J. Keilwagen, S. Posch, and J. Grau. [https://doi.org/10.1186/s13059-018-1614-y Accurate prediction of cell type-specific transcription factor binding]. ''Genome Biology'', 20(1):9, 2019.<br />
<br />
== Usage ==<br />
<br />
In the following, <code>Catchitt.jar</code> stands for the Catchitt binary in its current version, which is currently 0.1.3. Every occurrence of <code>Catchitt.jar</code> therefore needs to be replaced by <code>Catchitt-0.1.3.jar</code> when running the code examples with the current Catchitt binary.<br />
<br />
<br />
Catchitt can be started by calling<br />
<br />
java -jar Catchitt.jar<br />
<br />
on the command line. This lists the names of the available tools with a short description:<br />
<br />
Available tools:<br />
<br />
access - Chromatin accessibility<br />
methyl - Methylation levels<br />
motif - Motif scores<br />
labels - Derive labels<br />
itrain - Iterative Training<br />
predict - Prediction<br />
<br />
Syntax: java -jar Catchitt.jar <toolname> [<parameter=value> ...]<br />
<br />
Further info about the tools is given with<br />
java -jar Catchitt.jar <toolname> info<br />
<br />
Tool parameters are listed with<br />
java -jar Catchitt.jar <toolname><br />
<br />
== Tools ==<br />
<br />
=== Derive labels ===<br />
<br />
''Derive labels'' computes labels for genomic regions based on ChIP-seq peak files. The input ChIP-seq peak files must be provided in narrowPeak format and may come in 'conservative', i.e., IDR-thresholded, and 'relaxed' flavors. In case only a single peak file is available, both of the corresponding parameters may be set to this one peak file. The parameter for the bin width defines the resolution of the genomic regions that are assigned a label, while the parameter for the region width defines the size of the regions considered. If, for instance, the bin width is set to 50 and the region width to 100, regions of 100 bp shifted by 50 bp along the genome are labeled. The labels assigned may be 'S' (summit) if the current bin contains the annotated summit of a conservative peak, 'B' (bound) if the current region overlaps a conservative peak by at least half the region width, 'A' (ambiguous) if the current region overlaps a relaxed peak by at least 1 bp, or 'U' (unbound) if it overlaps with none of the peaks. The output is provided as a gzipped file 'Labels.tsv.gz' with columns chromosome, start position, and label. This output file together with a protocol of the tool run is saved to the specified output directory.<br />
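<br />
The overlap rule for label 'B' can be illustrated with a few lines of shell arithmetic. This is only a sketch of the rule as stated above, not the Catchitt implementation; the function names <code>overlap</code> and <code>is_bound</code> and the use of half-open coordinates are assumptions made for illustration.<br />

```shell
# overlap START1 END1 START2 END2
# Number of base pairs shared by two half-open intervals [START, END).
overlap() {
    lo=$(( $1 > $3 ? $1 : $3 ))
    hi=$(( $2 < $4 ? $2 : $4 ))
    echo $(( hi > lo ? hi - lo : 0 ))
}

# is_bound REGION_START REGION_WIDTH PEAK_START PEAK_END
# Succeeds if the region overlaps the peak by at least half the region width.
is_bound() {
    ov=$(overlap "$1" $(( $1 + $2 )) "$3" "$4")
    [ $(( 2 * ov )) -ge "$2" ]
}

# A 200 bp region starting at 100 and a peak spanning 250-400 share 50 bp,
# which is less than half the region width, so the region is not 'B'.
overlap 100 300 250 400    # prints 50
```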
<br />
''Derive labels'' may be called with<br />
<br />
java -jar Catchitt.jar labels<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>Conservative peaks (NarrowPeak file containing the conservative peaks)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>Relaxed peaks (NarrowPeak file containing the relaxed peaks)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>FAI of genome (FastA index file of the genome)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bin width (The width of the genomic bins considered, valid range = [1, 10000], default = 50)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rw</font></td><br />
<td>Region width (The width of the genomic regions considered for overlaps, valid range = [1, 10000], default = 50)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar Catchitt.jar labels c=conservative.narrowPeak r=relaxed.narrowPeak f=hg19.fa.fai b=50 rw=200 outdir=labels<br />
<br />
<br />
=== Chromatin accessibility ===<br />
<br />
''Chromatin accessibility'' computes several chromatin accessibility features from DNase-seq or ATAC-seq data provided as fold-enrichment tracks or SAM/BAM files of mapped reads. Features are computed with a certain resolution defined by the bin width parameter. If this parameter is set to 50, for instance, features are computed for non-overlapping 50 bp bins along the genome. If input data are provided as a SAM/BAM file, coverage information is extracted and normalized locally in a similar fashion as proposed for the MACS peak caller. Output is provided as a gzipped file 'Chromatin_accessibility.tsv.gz' with columns chromosome, start position of the bin, minimum coverage and median coverage in the current bin, minimum coverage in 1000 bp regions before and after the current bin, maximum coverage in 1000 bp regions before and after the current bin, the number of steps in the coverage profile, and the number of monotonically increasing and decreasing steps in the coverage profile of the current bin. This output file together with a protocol of the tool run is saved to the specified output directory.<br />
<br />
''Chromatin accessibility'' may be called with<br />
<br />
java -jar Catchitt.jar access<br />
<br />
and has the following parameters<br />
<br />
<br />
<table border=0 cellpadding=10 align="center"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>Data source (The format of the input file containing the coverage information, range={BAM/SAM, Bigwig}, default = BAM/SAM)<table border=0 cellpadding=10 align="center"><br />
<tr><td colspan=3>Parameters for selection &quot;BAM/SAM&quot;:</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>Input SAM/BAM (The input file containing the mapped DNase-seq/ATAC-seq reads)</td><br />
<td>FILE</td><br />
</tr><br />
<tr><td colspan=3>Parameters for selection &quot;Bigwig&quot;:</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>Input Bigwig (The input file containing the mapped DNase-seq/ATAC-seq reads)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>FastA index (The genome index)</td><br />
<td>FILE</td><br />
</tr><br />
</table></td><td></td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bin width (The width of the genomic bins considered)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar Catchitt.jar access d="Bigwig" i=fold_enrich.bw f=hg19.fa.fai b=50 outdir=dnase<br />
<br />
<br />
=== Methylation levels ===<br />
''Methylation levels'' may be called with<br />
<br />
java -jar Catchitt.jar methyl<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>Input Bed.gz (The bedMethyl file (gzipped) containing the methylation levels, mime = bed.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>FastA index (The genome index, mime = fai)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bin width (The width of the genomic bins considered)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar Catchitt.jar methyl i=Input_Bed.gz f=hg19.fa.fai b=50<br />
<br />
<br />
=== Motif scores ===<br />
<br />
''Motif scores'' computes features based on motif scores of a given motif model scanning sub-sequences along the genome. Motif scores are aggregated in bins of the specified width as maximum score and log of the average exponential score (i.e., average log-likelihood in case of statistical models). The motif model may be provided as PWMs in HOCOMOCO or PFMs in Jaspar format, or as [[Dimont]] motif models in XML format. For more complex motif models like Slim models, the current implementation uses several indexes to speed up the scanning process. However, computation of these indexes is rather memory-consuming and often not reasonable for simple PWM models. Hence, a low-memory variant of the tool is available, which is typically only slightly slower for PWM models but substantially slower for Slim models. Output is provided as a gzipped file 'Motif_scores.tsv.gz' containing columns chromosome, start position, maximum and average score. This output file together with a protocol of the tool run is saved to the specified output directory.<br />
<br />
<br />
''Motif scores'' may be called with<br />
<br />
java -jar Catchitt.jar motif<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Motif model (The motif model in Dimont, HOCOMOCO, or Jaspar format, range={Dimont, HOCOMOCO, Jaspar}, default = Dimont)<table border=0 cellpadding=10 align="center"><br />
<tr><td colspan=3>Parameters for selection &quot;Dimont&quot;:</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>Dimont motif (Dimont motif model description)</td><br />
<td>FILE</td><br />
</tr><br />
<tr><td colspan=3>Parameters for selection &quot;HOCOMOCO&quot;:</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">h</font></td><br />
<td>HOCOMOCO PWM (PWM from the HOCOMOCO database)</td><br />
<td>FILE</td><br />
</tr><br />
<tr><td colspan=3>Parameters for selection &quot;Jaspar&quot;:</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">j</font></td><br />
<td>Jaspar PFM (PFM in Jaspar format)</td><br />
<td>FILE</td><br />
</tr><br />
</table></td><td></td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>Genome (Genome as FastA file)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>FAI of genome (FastA index file of the genome)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bin width (The width of the genomic bins considered)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">l</font></td><br />
<td>Low-memory mode (Use slower mode with a smaller memory footprint, default = true)</td><br />
<td>BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example'''<br />
<br />
java -jar Catchitt.jar motif m=HOCOMOCO h=motif.pwm g=hg19.fa f=hg19.fa.fai b=50 outdir=motifs<br />
<br />
=== Iterative Training ===<br />
<br />
''Iterative Training'' performs an iterative training with the specified number of iterations to obtain a series of classifiers that may be used for predictions in the same cell type or in other cell types based on a corresponding set of feature files. The tool requires as input labels for the training chromosomes, a chromatin accessibility feature file and a set of motif feature files. From the labels, an initial set of training regions is extracted containing all positive examples labeled as 'S' (summit) and a sub-sample of negative examples of regions labeled as 'U' (unbound). During the iterations, the initial negative examples are complemented with additional negatives obtaining large binding probabilities, i.e., putative false positive predictions. As these additional negative examples are derived from predictions of the current set of classifiers, the number of bins used for aggregation needs to be specified and should be identical to those used for predictions later. Training chromosomes and chromosomes used for predictions in the iterative training may be specified, as well as the percentile of the scores of positives (i.e., summit or bound regions) that should be used to identify putative false positives. The specified bin width must be identical to the bin width specified when computing the corresponding feature files. Feature vectors for training regions may span several adjacent bins as specified by the number of bins parameter. Output is an XML file Classifiers.xml containing the set of trained classifiers. This output file together with a protocol of the tool run is saved to the specified output directory.<br />
<br />
''Iterative Training'' may be called with<br />
<br />
java -jar Catchitt.jar itrain<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>Accessibility (File containing accessibility features)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Motif (File containing motif features), MAY BE USED MULTIPLE TIMES</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">l</font></td><br />
<td>Labels (File containing the labels)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>FAI of genome (FastA index file of the genome)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bin width (The width of the genomic bins, valid range = [1, 1000], default = 50)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Number of bins (The number of adjacent bins, valid range = [1, 20], default = 5)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">abb</font></td><br />
<td>Aggregation: bins before (The number of bins before the current one considered in the aggregation, valid range = [1, 20], default = 1)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">aba</font></td><br />
<td>Aggregation: bins after (The number of bins after the current one considered in the aggregation, valid range = [1, 20], default = 4)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>Iterations (The number of iterations of the iterative training, valid range = [1, 20], default = 5)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>Training chromosomes (Training chromosomes, separated by commas, OPTIONAL)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">itc</font></td><br />
<td>Iterative training chromosomes (Chromosomes with predictions in iterative training, separated by commas, OPTIONAL)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Percentile (Percentile of the prediction scores of positives used as threshold in iterative training, valid range = [0.0, 1.0], default = 0.01)</td><br />
<td>DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example'''<br />
<br />
java -jar Catchitt.jar itrain a=dnase/Chromatin_accessibility.tsv.gz m=motif1/Motif_scores.tsv.gz m=motif2/Motif_scores.tsv.gz l=labels/Labels.tsv.gz f=hg19.fa.fai b=50 n=5 abb=1 aba=4 i=5 t="chr1,chr2,chr3" itc="chr1,chr2" p=0.01 outdir=cls<br />
<br />
=== Prediction ===<br />
<br />
''Prediction'' predicts binding probabilities of genomic regions as specified during training of the set of classifiers in iterative training. As input, Prediction requires a set of trained classifiers in XML format and the same (type of) feature files as used in training (motif files must be specified in the same order!). In addition, the chromosomes for which predictions are made may be specified, and the number of bins used for aggregation may be specified to deviate from those used during training. If these bin numbers are not specified, those from the training run are used. Finally, it is possible to restrict the number of classifiers considered to the first n ones. Output is provided as a gzipped file 'Predictions.tsv.gz' with columns chromosome, start position, binding probability. This output file together with a protocol of the tool run is saved to the specified output directory.<br />
<br />
''Prediction'' may be called with<br />
<br />
java -jar Catchitt.jar predict<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>Classifiers (The classifiers trained by iterative training)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>Accessibility (File containing accessibility features)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Motif (File containing motif features) MAY BE USED MULTIPLE TIMES</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>FAI of genome (FastA index file of the genome)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Prediction chromosomes (Prediction chromosomes, separated by commas, OPTIONAL)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">abb</font></td><br />
<td>Aggregation: bins before (Number of bins before the current one considered for aggregation., OPTIONAL)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">aba</font></td><br />
<td>Aggregation: bins after (Number of bins after the current one considered for aggregation., OPTIONAL)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Number of classifiers (Use only the first k classifiers for predictions., OPTIONAL)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example'''<br />
<br />
java -jar Catchitt.jar predict c=cls/Classifiers.xml a=dnase/Chromatin_accessibility.tsv.gz m=motif1/Motif_scores.tsv.gz m=motif2/Motif_scores.tsv.gz f=hg19.fa.fai p="chr8,chr21" abb=1 aba=4 n=3 outdir=predict<br />
<br />
== Standard pipeline ==<br />
<br />
The standard Catchitt pipeline would comprise the following steps<br />
<br />
* for a training cell type, collect ChIP-seq peak files (preferably ''conservative'' and ''relaxed'' peaks) in narrowPeak format and derive labels for genomic regions (''Derive labels'')<br />
* for the same cell type, collect chromatin accessibility data (DNase-seq or ATAC-seq) as fold-enrichment tracks or mapping files, and derive chromatin accessibility features from those data (''Chromatin accessibility'')<br />
* collect or learn (e.g., using [[Dimont]]) a set of motif models for the transcription factor of interest, and scan the genome using these motif models (''Motif scores'')<br />
* perform iterative training given the labels and feature files (''Iterative Training'')<br />
* predict binding probabilities of genomic regions in the same cell type or in other cell types. In the latter case, additional chromatin accessibility data for these target cell types need to be collected and features need to be derived as in step 2. (''Prediction'')<br />
<br />
<br />
== Tutorial using ENCODE data ==<br />
<br />
We describe a typical Catchitt pipeline using public ENCODE data for the transcription factor CTCF in two cell lines.<br />
This tutorial uses real-world data for the complete human genome in the GRCh38 version used by ENCODE, illustrating different DNase-seq input formats and different motif sources. Please note that this realistic scenario also comes with real-world runtimes of the individual Catchitt steps.<br />
<br />
For best performance, we would further recommend<br />
* to use multiple motifs from different sources, including motifs derived from DNase-seq (available in our [http://www.jstacs.de/downloads/motifs.tgz motif collection] of the ENCODE-DREAM challenge in directory de-novo/DNase-peaks)<br />
* to use replicate information for DNase data, for instance using the [https://github.com/kundajelab/atac_dnase_pipelines pipeline of the Kundaje lab]<br />
<br />
In this tutorial, we concentrate on the Catchitt pipeline and illustrate its usage based on readily available data.<br />
<br />
=== Obtaining training and test data ===<br />
<br />
First, we need the GRCh38 genome version used by ENCODE. This genome is available as a gzipped FastA file from [https://www.encodeproject.org ENCODE] at<br />
https://www.encodeproject.org/files/GRCh38_no_alt_analysis_set_GCA_000001405.15/@@download/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.gz<br />
<br />
After download, the genome needs to be gunzipped and indexed using the [http://www.htslib.org samtools] faidx command:<br />
<br />
gunzip GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.gz<br />
samtools faidx GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta<br />
<br />
In the following, we assume that genome FastA and index are in the base directory.<br />
<br />
In addition, we need the DNase-seq data. We consider two cell lines ("astrocyte of the spinal cord" and "fibroblast of villous mesenchyme"). The corresponding DNase-seq data are available from [https://www.encodeproject.org ENCODE] under accessions ENCSR000ENB and ENCSR000EOR, respectively.<br />
Here, we first consider the Bigwig files of the first replicate for each cell line, which can be downloaded from the following URLs:<br />
<br />
https://www.encodeproject.org/files/ENCFF901UBX/@@download/ENCFF901UBX.bigWig<br />
https://www.encodeproject.org/files/ENCFF652HJH/@@download/ENCFF652HJH.bigWig<br />
<br />
For obtaining labels for CTCF binding, we further need ChIP-seq peaks. Here, we consider the ChIP-seq experiment with accession ENCSR000DSU for the astrocytes, which will serve as our training data in the following.<br />
The corresponding "conservative" and "relaxed" peak files for astrocytes are available from<br />
https://www.encodeproject.org/files/ENCFF183YLB/@@download/ENCFF183YLB.bed.gz<br />
https://www.encodeproject.org/files/ENCFF600CYD/@@download/ENCFF600CYD.bed.gz<br />
<br />
Again, the peak files need to be gunzipped for the following steps.<br />
<br />
Finally, we need a motif model for CTCF, which we download from [http://hocomoco11.autosome.ru HOCOMOCO] in this case<br />
http://hocomoco11.autosome.ru/final_bundle/hocomoco11/full/HUMAN/mono/pwm/CTCF_HUMAN.H11MO.0.A.pwm<br />
<br />
We organize all these files (and the Catchitt JAR) in the following directory structure<br />
<br />
.:<br />
Catchitt.jar<br />
GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta<br />
GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai<br />
<br />
./astrocytes:<br />
ENCFF183YLB.bed<br />
ENCFF600CYD.bed<br />
ENCFF901UBX.bigWig<br />
<br />
./fibroblasts:<br />
ENCFF652HJH.bigWig<br />
<br />
./motifs/CTCF/:<br />
CTCF_HUMAN.H11MO.0.A.pwm<br />
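<br />
The ENCODE downloads above all follow the same URL pattern, so fetching them into this directory structure can be scripted. The following is a minimal sketch, assuming that <code>curl</code> is installed; the helper names <code>encode_url</code> and <code>fetch_encode</code> are hypothetical and not part of Catchitt or ENCODE.<br />

```shell
# encode_url ACCESSION EXTENSION
# Build the download URL for an ENCODE file, following the pattern of the
# links listed above.
encode_url() {
    printf 'https://www.encodeproject.org/files/%s/@@download/%s.%s\n' "$1" "$1" "$2"
}

# fetch_encode ACCESSION EXTENSION DIRECTORY
# Download one ENCODE file into DIRECTORY (created if missing).
fetch_encode() {
    mkdir -p "$3" && curl -L -o "$3/$1.$2" "$(encode_url "$1" "$2")"
}

# Usage (not executed here):
#   fetch_encode ENCFF901UBX bigWig astrocytes
#   fetch_encode ENCFF652HJH bigWig fibroblasts
#   fetch_encode ENCFF183YLB bed.gz astrocytes && gunzip astrocytes/ENCFF183YLB.bed.gz
#   fetch_encode ENCFF600CYD bed.gz astrocytes && gunzip astrocytes/ENCFF600CYD.bed.gz
```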
<br />
=== Deriving labels ===<br />
<br />
As we use supervised training of model parameters, we need labels for the genomic regions, qualifying these as bound (B) or unbound (U). Besides, we have additional labels for bound regions at the peak summit (S) and ambiguous regions (A) that are (partly) covered by relaxed but not by conservative peaks.<br />
<br />
For training purposes, we need to derive labels from the astrocyte ChIP-seq peaks by calling<br />
java -jar Catchitt.jar labels c=astrocytes/ENCFF183YLB.bed\<br />
r=astrocytes/ENCFF600CYD.bed\<br />
f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\<br />
b=50 rw=200 outdir=astrocytes/labels<br />
Here, we use a bin width of 50 bp (i.e., we resolve any feature or binding event with 50 bp resolution) and a region width of 200 bp as used in ENCODE-DREAM. A detailed description of the partitioning of the genome into non-overlapping bins and the logic behind the regions for which predictions are made may be found in the [https://doi.org/10.1186/s13059-018-1614-y Catchitt paper].<br />
The result is a file astrocytes/labels/Labels.tsv.gz with the following format<br />
chr1 0 U<br />
chr1 50 U<br />
chr1 100 U<br />
chr1 150 U<br />
chr1 200 U<br />
chr1 250 U<br />
where the columns contain chromosome, bin starting position, and corresponding label, and are separated by tabs.<br />
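<br />
As a quick sanity check, the number of bins per label may be counted directly from the gzipped output. This is a minimal sketch using standard command line tools; the function name <code>count_labels</code> is hypothetical.<br />

```shell
# count_labels LABELS_TSV_GZ
# Print each label (third column) with the number of bins carrying it,
# tab-separated and sorted by label.
count_labels() {
    gzip -dc "$1" |
        awk -F '\t' '{ n[$3]++ } END { for (l in n) printf "%s\t%d\n", l, n[l] }' |
        sort
}

# Usage (not executed here):
#   count_labels astrocytes/labels/Labels.tsv.gz
```

If no bins are labeled 'S' or 'B', the peak files and the FastA index most likely do not refer to the same genome version.<br />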
<br />
=== Preparing DNase data from bigwig format ===<br />
<br />
We further derive DNase-seq features from the bigwig file that we downloaded in the first step. Again, we specify a bin width of 50 bp.<br />
<br />
java -jar Catchitt.jar access d="Bigwig" i=astrocytes/ENCFF901UBX.bigWig f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\<br />
b=50 outdir=astrocytes/access<br />
The result is a file astrocytes/access/Chromatin_accessibility.tsv.gz with the following format<br />
<br />
chr1 1033400 0.03954650089144707 0.05627769976854324 0.009126120246946812 0.030420400202274323 0.06692489981651306 1.03125 3.0 1.0 0.0<br />
chr1 1033450 0.030420400202274323 0.03650449961423874 0.009126120246946812 0.030420400202274323 0.045630600303411484 1.03125 2.0 0.0 0.0<br />
chr1 1033500 0.024336300790309906 0.03346240147948265 0.009126120246946812 0.030420400202274323 0.045630600303411484 1.03125 2.0 1.0 0.0<br />
chr1 1033550 0.01825219951570034 0.024336300790309906 0.009126120246946812 0.024336300790309906 0.060840800404548645 1.03125 2.0 0.0 1.0<br />
<br />
where the first two columns, again, correspond to chromosome and starting position, and the remaining columns are<br />
* minimum DNase value in bin,<br />
* median DNase value in bin,<br />
* minimum in 1000 bp after bin start,<br />
* minimum in 1000 bp before bin start,<br />
* maximum in 1000 bp after bin start,<br />
* maximum in 1000 bp before bin start,<br />
* the number of steps in the bin profile,<br />
* the length of the longest monotonically increasing range in the bin,<br />
* the length of the longest monotonically decreasing range in the bin.<br />
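<br />
The prediction step later in this tutorial reads fibroblasts/access/Chromatin_accessibility.tsv.gz, so the same call is needed for the fibroblast bigwig file. The sketch below only prints the two commands for review instead of running them; the helper name <code>access_cmd</code> is hypothetical, and the output directory fibroblasts/access is taken from the prediction call.<br />

```shell
fai=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai

# access_cmd CELL_TYPE_DIRECTORY BIGWIG_FILE
# Print (rather than run) the "access" call for one cell type.
access_cmd() {
    printf 'java -jar Catchitt.jar access d="Bigwig" i=%s/%s f=%s b=50 outdir=%s/access\n' \
        "$1" "$2" "$fai" "$1"
}

access_cmd astrocytes ENCFF901UBX.bigWig
access_cmd fibroblasts ENCFF652HJH.bigWig
```

Piping the printed lines to <code>sh</code> executes them one after the other.<br />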
<br />
=== Preparing motif scores ===<br />
<br />
We also compute motif scores along the genome for the PWM we downloaded from HOCOMOCO:<br />
<br />
java -jar Catchitt.jar motif m="HOCOMOCO" h=motifs/CTCF/CTCF_HUMAN.H11MO.0.A.pwm g=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta\<br />
f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai b=50 outdir=motifs/CTCF threads=3<br />
The result is a file motifs/CTCF/Motif_scores.tsv.gz with the following format<br />
<br />
chr1 46950 -4.996643 -4.9543528358429105<br />
chr1 47000 -5.984124 -5.451674735652041<br />
chr1 47050 -0.8633305 -0.4596223585537509<br />
chr1 47100 -4.9379983 -4.813470561120627<br />
<br />
where the first two columns, again, correspond to chromosome and starting position, and the remaining two columns are<br />
* the maximum motif score within the bin,<br />
* the logarithm of the exponentials of the individual scores within the bin; for scores that are log-likelihoods, this is proportional to the log-likelihood of the complete sequence.<br />
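<br />
To get a first impression of the scan, the bins with the largest maximum motif score may be listed from the output file. This is a minimal sketch using standard command line tools; the function name <code>top_motif_bins</code> and the default of 10 bins are assumptions made for illustration.<br />

```shell
# top_motif_bins MOTIF_SCORES_TSV_GZ [N]
# Print the N bins (default 10) with the largest maximum motif score
# (third column), best bin first.
top_motif_bins() {
    gzip -dc "$1" | LC_ALL=C sort -t "$(printf '\t')" -k3,3nr | head -n "${2:-10}"
}

# Usage (not executed here):
#   top_motif_bins motifs/CTCF/Motif_scores.tsv.gz 20
```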
<br />
=== Iterative training ===<br />
<br />
With all the feature files prepared, we may now run the iterative training procedure. Here, we use all autosomes except chr1, chr8, and chr21 for training, use five of the training chromosomes also for generating new negative examples in each of the iterations, and use 8 computation threads for the numeric optimization of model parameters.<br />
''At this stage, it is critical that all feature files have been generated from the same reference. This way, we may sweep in parallel over all feature files that, at each line, represent the identical genomic location. Otherwise, the iterative training will throw an error stating that the chromosomes do not match at a certain line of the input files.''<br />
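<br />
Before starting a long training run, it may be worth checking that the input files really describe the same bins. The sketch below assumes, as the note above suggests, that all files list the same bins in the same order in their first two columns; the function name <code>same_bins</code> is hypothetical, and the check is not part of Catchitt.<br />

```shell
# same_bins FILE1_TSV_GZ FILE2_TSV_GZ [...]
# Succeeds if all files have identical (chromosome, start) columns,
# compared via a checksum so that large files are never held in memory.
same_bins() {
    first=$(gzip -dc "$1" | cut -f1,2 | cksum)
    for f in "$@"; do
        if [ "$(gzip -dc "$f" | cut -f1,2 | cksum)" != "$first" ]; then
            echo "bins differ from $1: $f" >&2
            return 1
        fi
    done
}

# Usage (not executed here):
#   same_bins astrocytes/labels/Labels.tsv.gz \
#             astrocytes/access/Chromatin_accessibility.tsv.gz \
#             motifs/CTCF/Motif_scores.tsv.gz
```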
<br />
We start iterative training by calling<br />
java -jar Catchitt.jar itrain a=astrocytes/access/Chromatin_accessibility.tsv.gz m=motifs/CTCF/Motif_scores.tsv.gz\<br />
l=astrocytes/labels/Labels.tsv.gz f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\<br />
 b=50 t='chr2,chr3,chr4,chr5,chr6,chr7,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr22'\<br />
itc='chr10,chr11,chr12,chr13,chr14' outdir=astrocytes/itrain threads=8<br />
which results in a file astrocytes/itrain/Classifiers.xml containing the trained classifiers.<br />
<br />
=== Predicting binding in new cell types ===<br />
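Using the trained classifiers from the previous step and DNase features for fibroblasts, we may now predict binding in the fibroblast cell type. The call below expects these features in fibroblasts/access/Chromatin_accessibility.tsv.gz, i.e., the "access" tool needs to be run on fibroblasts/ENCFF652HJH.bigWig in the same way as for astrocytes, with outdir=fibroblasts/access. In the example, we generate predictions only for chromosome 8, which could be extended to other chromosomes using parameter "p":<br />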
java -jar Catchitt.jar predict c=astrocytes/itrain/Classifiers.xml a=fibroblasts/access/Chromatin_accessibility.tsv.gz\<br />
m=motifs/CTCF/Motif_scores.tsv.gz f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\<br />
p="chr8" outdir=fibroblasts/predict<br />
This finally results in a file fibroblasts/predict/Predictions.tsv.gz containing the predicted binding probabilities per region.<br />
This file has three columns, corresponding to chromosome, starting position, and binding probability:<br />
<br />
chr8 265850 0.9866555574053496<br />
chr8 265900 0.9865107771922306<br />
chr8 265950 0.9864837006927715<br />
chr8 266000 0.8041139249973046<br />
chr8 266050 0.19870629729482686<br />
chr8 266100 0.1302269536110939<br />
chr8 266150 0.09693322015563202<br />
<br />
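For viewing the predictions in a genome browser, bins above a probability threshold may be converted to BED-like intervals. This is a minimal sketch and not part of Catchitt; the function name <code>predictions_to_bed</code>, the default threshold of 0.5, and the choice to report each row as an interval of one bin width (50 bp) starting at the given position are assumptions made for illustration.<br />

```shell
# predictions_to_bed PREDICTIONS_TSV_GZ [THRESHOLD] [BIN_WIDTH]
# Print chromosome, start, start + BIN_WIDTH, and probability for all rows
# with a binding probability of at least THRESHOLD.
predictions_to_bed() {
    gzip -dc "$1" |
        awk -F '\t' -v OFS='\t' -v t="${2:-0.5}" -v w="${3:-50}" \
            '$3 + 0 >= t + 0 { print $1, $2, $2 + w, $3 }'
}

# Usage (not executed here):
#   predictions_to_bed fibroblasts/predict/Predictions.tsv.gz 0.8 > fibroblasts/predict/bound.bed
```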
<br />
=== Using DNase-seq BAM files and multiple motifs ===<br />
<br />
Instead of bigwig files, the "access" tool of Catchitt also accepts BAM files of mapped DNase-seq (or ATAC-seq) data. Internally, this tool counts 5' ends of reads, and performs local normalization of read depth and average smoothing.<br />
Here, we download the BAM files corresponding to the previous bigwig files from ENCODE<br />
https://www.encodeproject.org/files/ENCFF384CCQ/@@download/ENCFF384CCQ.bam<br />
https://www.encodeproject.org/files/ENCFF368XNE/@@download/ENCFF368XNE.bam<br />
<br />
and sort them into the directory structure.<br />
<br />
In addition, we use four motifs from the ''used-for-all-TFs'' directory of our [http://www.jstacs.de/downloads/motifs.tgz motif collection].<br />
<br />
Afterwards, the directory structure should look like<br />
<br />
.:<br />
Catchitt.jar<br />
GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta<br />
GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai<br />
<br />
./astrocytes:<br />
ENCFF183YLB.bed<br />
ENCFF600CYD.bed<br />
ENCFF901UBX.bigWig<br />
ENCFF384CCQ.bam<br />
<br />
./fibroblasts:<br />
ENCFF652HJH.bigWig<br />
ENCFF368XNE.bam<br />
<br />
./motifs/CTCF/:<br />
CTCF_HUMAN.H11MO.0.A.pwm<br />
<br />
./motifs/CTCF_Slim:<br />
Ctcf_H1hesc_shift20_bdeu_order-20_comp1-model-1.xml<br />
<br />
./motifs/JUND_Slim:<br />
Jund_K562_shift20_bdeu_order-20_comp1-model-1.xml<br />
<br />
./motifs/MAX_Slim:<br />
Max_K562_shift20_bdeu_order-20_comp1-model-1.xml<br />
<br />
./motifs/SP1:<br />
ENCSR000BHK_SP1-human_1_hg19-model-2.xml<br />
<br />
<br />
Now, we first compute the DNase-seq features from the BAM files using the "access" tool:<br />
<br />
java -jar Catchitt.jar access i=astrocytes/ENCFF384CCQ.bam b=50 outdir=astrocytes/access_bam/<br />
java -jar Catchitt.jar access i=fibroblasts/ENCFF368XNE.bam b=50 outdir=fibroblasts/access_bam/<br />
<br />
We also compute the motif-based features from the additional motif files. For the PWM model of SP1, we switch the input format to Dimont XMLs but still use the low-memory version of "motif" that we also used for the HOCOMOCO PWM:<br />
<br />
java -jar Catchitt.jar motif d=motifs/SP1/ENCSR000BHK_SP1-human_1_hg19-model-2.xml\<br />
g=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\<br />
b=50 outdir=motifs/SP1 threads=3<br />
<br />
The remaining motif models are [[Slim]] models, which are substantially more complex than PWMs. While scans for these models could be accomplished by the low-memory version of "motif" as well, this would require substantial runtime. Hence, we switch off the low-memory option in this case, which, in turn, requires increasing the memory reserved by Java:<br />
<br />
java -jar -Xms512M -Xmx64G Catchitt.jar motif d=motifs/CTCF_Slim/Ctcf_H1hesc_shift20_bdeu_order-20_comp1-model-1.xml\<br />
g=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\<br />
b=50 outdir=motifs/CTCF_Slim l=false threads=3<br />
java -jar -Xms512M -Xmx64G Catchitt.jar motif d=motifs/JUND_Slim/Jund_K562_shift20_bdeu_order-20_comp1-model-1.xml\<br />
g=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\<br />
b=50 outdir=motifs/JUND_Slim l=false threads=3<br />
java -jar -Xms512M -Xmx64G Catchitt.jar motif d=motifs/MAX_Slim/Max_K562_shift20_bdeu_order-20_comp1-model-1.xml\<br />
g=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\<br />
b=50 outdir=motifs/MAX_Slim l=false threads=3<br />
<br />
Finally, we start the iterative training using the new feature files:<br />
java -jar Catchitt.jar itrain a=astrocytes/access_bam/Chromatin_accessibility.tsv.gz\<br />
m=motifs/CTCF/Motif_scores.tsv.gz m=motifs/CTCF_Slim/Motif_scores.tsv.gz m=motifs/JUND_Slim/Motif_scores.tsv.gz\<br />
m=motifs/MAX_Slim/Motif_scores.tsv.gz m=motifs/SP1/Motif_scores.tsv.gz l=astrocytes/labels/Labels.tsv.gz\<br />
f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai b=50\<br />
t='chr2,chr3,chr4,chr5,chr6,chr7,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr22'\<br />
itc='chr10,chr11,chr12,chr13,chr14' outdir=astrocytes/itrain_bam_5motifs threads=8<br />
Please note that we used the parameter "m" multiple times to specify the different motif-based feature files.<br />
<br />
It is important to specify these motifs in the same order when calling the "predict" tool afterwards, i.e.<br />
java -jar Catchitt.jar predict c=astrocytes/itrain_bam_5motifs/Classifiers.xml a=fibroblasts/access_bam/Chromatin_accessibility.tsv.gz\<br />
m=motifs/CTCF/Motif_scores.tsv.gz m=motifs/CTCF_Slim/Motif_scores.tsv.gz m=motifs/JUND_Slim/Motif_scores.tsv.gz\<br />
m=motifs/MAX_Slim/Motif_scores.tsv.gz m=motifs/SP1/Motif_scores.tsv.gz\<br />
f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai p="chr8" outdir=fibroblasts/predict_bam_5motifs<br />
<br />
The predictions based on the BAM files and the five motifs are then available from the file fibroblasts/predict_bam_5motifs/Predictions.tsv.gz in the format explained previously.<br />
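Since the order of the "m" parameters must be identical for "itrain" and "predict", one way to avoid mistakes is to generate both calls from a single list of motif feature files. The following is a minimal sketch of this idea; it only assembles the argument lists (which could then be passed to <code>subprocess.run</code>), the helper names are hypothetical, and all paths and parameter values are those of the example above:<br />

```python
FAI = "GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai"

# Single source of truth for the motif feature files and their order.
MOTIF_FEATURES = [
    "motifs/CTCF/Motif_scores.tsv.gz",
    "motifs/CTCF_Slim/Motif_scores.tsv.gz",
    "motifs/JUND_Slim/Motif_scores.tsv.gz",
    "motifs/MAX_Slim/Motif_scores.tsv.gz",
    "motifs/SP1/Motif_scores.tsv.gz",
]

TRAIN_CHROMS = ("chr2,chr3,chr4,chr5,chr6,chr7,chr9,chr10,chr11,chr12,chr13,"
                "chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr22")
ITER_CHROMS = "chr10,chr11,chr12,chr13,chr14"


def motif_args():
    """Return one m=... argument per motif feature file, in a fixed order."""
    return [f"m={path}" for path in MOTIF_FEATURES]


def itrain_command(outdir="astrocytes/itrain_bam_5motifs"):
    """Argument list for the iterative training call shown above."""
    return (["java", "-jar", "Catchitt.jar", "itrain",
             "a=astrocytes/access_bam/Chromatin_accessibility.tsv.gz"]
            + motif_args()
            + ["l=astrocytes/labels/Labels.tsv.gz", f"f={FAI}", "b=50",
               f"t={TRAIN_CHROMS}", f"itc={ITER_CHROMS}",
               f"outdir={outdir}", "threads=8"])


def predict_command(train_dir="astrocytes/itrain_bam_5motifs",
                    outdir="fibroblasts/predict_bam_5motifs"):
    """Argument list for the prediction call shown above."""
    return (["java", "-jar", "Catchitt.jar", "predict",
             f"c={train_dir}/Classifiers.xml",
             "a=fibroblasts/access_bam/Chromatin_accessibility.tsv.gz"]
            + motif_args()
            + [f"f={FAI}", "p=chr8", f"outdir={outdir}"])


if __name__ == "__main__":
    print(" ".join(itrain_command()))
    print(" ".join(predict_command()))
```

Because both functions draw on <code>MOTIF_FEATURES</code>, adding or reordering motifs changes training and prediction consistently.<br />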
<br />
== Version history ==<br />
<br />
* Catchitt v0.1.2: Bugfixes, new experimental tools for handling methylation levels<br />
<br />
* [http://www.jstacs.de/downloads/Catchitt_0.1.1.jar Catchitt v0.1.1]: Bugfixes for border cases; reduced debugging output<br />
<br />
* Catchitt v0.1: [http://www.jstacs.de/downloads/Catchitt_0.1.jar Initial release]</div>Grauhttps://www.jstacs.de/index.php?title=AnnoTALE&diff=1116AnnoTALE2020-10-10T10:49:55Z<p>Grau: /* Class Builders */</p>
<hr />
<div>[[File:AnnoTALE.png|130px|left]]<br />
Transcription activator-like effectors (TALEs) are virulence factors of plant-pathogenic Xanthomonas spp. that function as gene activators inside plant host cells.<br />
<br />
AnnoTALE is a suite of applications for identifying and analysing TALEs in Xanthomonas genomes, for clustering TALEs into classes by their RVD sequences, for assigning novel TALEs to existing classes, for proposing TALE names using a unified nomenclature, and for predicting targets of individual TALEs and TALE classes.<br />
<br />
AnnoTALE is available as a JavaFX-based stand-alone application with graphical user interface for interactive analysis sessions. <br />
In addition, we provide a command line application that may be integrated into other pipelines. <br />
Both use identical code for the actual analysis, ensuring consistent results between both versions.<br />
<br />
<br />
<br />
If you use AnnoTALE, please cite:<br />
<br />
Jan Grau, Maik Reschke, Annett Erkes, Jana Streubel, Richard D. Morgan, Geoffrey G. Wilson, Ralf Koebnik and Jens Boch. [http://www.nature.com/articles/srep21077 AnnoTALE: bioinformatics tools for identification, annotation, and nomenclature of TALEs from ''Xanthomonas'' genomic sequences]. Scientific Reports 6:21077, DOI: 10.1038/srep21077, 2016.<br />
<br />
<br />
<br />
'''Important:''' If you would like to use the unified nomenclature of AnnoTALE in one of your publications including new TALEs or sequenced genomes, please contact us (grau@informatik.uni-halle.de) to organize the inclusion of your TALEs into the official class definition of AnnoTALE and to create stable TALE names that are unique to your TALEs.<br />
<br />
<br />
== AnnoTALE with GUI ==<br />
<br />
[[File:AnnoTALEscreenshot.jpg]]<br />
<br />
AnnoTALE is based on the very recent implementation of JavaFX in Java 8.<br />
<br />
We provide AnnoTALE as a runnable JAR file for those with a current version of Java 8 (at least update 45) on their machine.<br />
<br />
For users' convenience, we also provide pre-packaged versions of AnnoTALE for Mac OS X and Windows, which include Java in the required version. Each of these is available in two versions with different memory requirements (2GB and 6GB). As long as the main memory (RAM) of your machine is sufficient, we recommend using the 6GB version of AnnoTALE.<br />
<br />
<br />
=== Download ===<br />
<br />
''AnnoTALE is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.''<br />
<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.4.1.jar Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-2GB.dmg 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-6GB.dmg 6GB version]<br />
* Windows installer of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-2GB.exe 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-6GB.exe 6GB version, 64bit Java]<br />
<br />
<br />
=== Source code ===<br />
<br />
The AnnoTALE source code is available from [https://github.com/Jstacs/Jstacs/tree/master/projects/xanthogenomes github].<br />
<br />
<br />
=== User Guide ===<br />
<br />
We provide an [http://www.jstacs.de/downloads/AnnoTALE-UserGuide-1.0.pdf AnnoTALE User Guide] in PDF format, including a detailed description of all AnnoTALE tools and installation instructions.<br />
<br />
<br />
== AnnoTALE command line application ==<br />
<br />
The AnnoTALE command line application is available as a [http://www.jstacs.de/downloads/AnnoTALEcli-1.4.1.jar runnable Jar]. For running the program and a quick help, type<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar<br />
<br />
For larger analyses, it might be necessary to increase the memory allocated by the JavaVM using the <code>-Xms</code> and <code>-Xmx</code> parameters, for instance<br />
java -Xms512M -Xmx6G -jar AnnoTALEcli-1.4.1.jar<br />
<br />
There is no separate User Guide for the AnnoTALE command line application, but the [http://www.jstacs.de/downloads/AnnoTALE-UserGuide-1.0.pdf User Guide for the GUI version] describes all AnnoTALE tools, their parameters and outputs, and those of the CLI version are identical.<br />
<br />
You obtain a list of all AnnoTALE tools by calling<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar<br />
<br />
Output:<br />
<br />
Available tools:<br />
<br />
predict - TALE Prediction<br />
analyze - TALE Analysis<br />
build - TALE Class Builder<br />
loadAndView - Load and View TALE Classes<br />
assign - TALE Class Assignment<br />
rename - Rename TALEs in File<br />
targets - Predict and Intersect Targets<br />
presence - TALE Class Presence<br />
repdiff - TALE Repeat Differences<br />
preditale - PrediTALE<br />
dertale - DerTALE<br />
<br />
Syntax: java -jar AnnoTALEcli-1.4.1.jar <toolname> [<parameter=value> ...]<br />
<br />
Further info about the tools is given with<br />
java -jar AnnoTALEcli-1.4.1.jar <toolname> info<br />
<br />
Tool parameters are listed with<br />
java -jar AnnoTALEcli-1.4.1.jar <toolname><br />
<br />
You get a list of input parameters by calling AnnoTALEcli-1.4.1.jar with the corresponding tool name, e.g.,<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar predict<br />
<br />
Output:<br />
<br />
At least one parameter has not been set (correctly):<br />
<br />
Parameters of tool "TALE Prediction" (predict):<br />
g - Genome (The input Xanthomonas genome in FastA or Genbank format) = null<br />
s - Strain (The name of the strain, will be used for annotated TALEs, OPTIONAL) = null<br />
outdir - The output directory, defaults to the current working directory (.) = .<br />
<br />
You get a description of each tool by calling AnnoTALEcli-1.4.1.jar with the corresponding tool name and keyword "info", e.g.,<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar predict info<br />
<br />
Output:<br />
A detailed description of all tools is available in the AnnoTALE User Guide (http://www.jstacs.de/downloads/AnnoTALE-UserGuide-1.0.pdf).<br />
<br />
*TALE Prediction* predicts transcription activator-like effector (TALE) genes in an input sequence, typically a 'Xanthomonas' genome.<br />
<br />
'TALE Prediction' is based in HMMer nucleotide HMM models that describe N-terminus, repeat region, and C-terminus of TALEs.<br />
<br />
The input 'Genome' may be provided in FastA or Genbank format. <br />
Optionally, you may provide a strain name that will be used in the temporary TALE names and names of output files.<br />
<br />
Regardless of the input format, 'TALE Prediction' generates output in Genbank format containing the annotations of TALE genes. If the original input has already been a Genbank file, TALE annotations are added to the existing ones.<br />
In addition, 'TALE Prediction' generates annotations in GFF format, and also outputs the DNA and AS sequences of the predicted TALEs in FastA format.<br />
<br />
'TALE Prediction' tries hard to make the CDS annotation a proper gene model, starting from a start codon and ending with a Stop. If either start or stop codon are located within the originally predicted region that is homologous to TALE genes, this original hit region is still reported as mRNA.<br />
Putative pseudo genes, e.g., with premature stop codons, are marked accordingly.<br />
<br />
The TALE DNA sequences output of 'TALE Prediction' may serve as input of the 'TALE Analysis', 'TALE Class Builder', and 'TALE Class Assignment' tools.<br />
<br />
If you experience problems using 'TALE Prediction', please contact us.<br />
<br />
=== Standard pipeline ===<br />
<br />
Assuming that your current working directory contains the AnnoTALEcli Jar file, a genome of interest (of a hypothetical 'Xoo' strain PXO999 with accession CP1234567) in a FastA file "genome.fa", all rice promoters in a FastA file "Rice-promoters.fa", and a directory "out" designated to hold all output files, a typical AnnoTALE pipeline could look like<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar predict g=genome.fa outdir=out<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar analyze t=out/TALE_DNA_sequences.fasta outdir=out<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar loadAndView outdir=out<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar assign c=out/Class_builder_download.xml t=out/TALE_DNA_parts.fasta s="Xoo PXO999" a="CP1234567" outdir=out<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar rename r=out/TALE_names_\(Xoo_PXO999\).tsv i=out/Genbank__TALE_predictions.gb outdir=out<br />
<br />
java -jar AnnoTALEcli-1.4.1.jar targets i=Rice-promoters.fa p="TALEs in class builder" c=out/Augmented_class_builder_\(Xoo_PXO999\).xml outdir=out<br />
<br />
Afterwards, you find all output files of all those tools in the directory "out". The output files and directories are named in analogy to the names in the AnnoTALE GUI version (see [http://www.jstacs.de/downloads/AnnoTALE-UserGuide-1.0.pdf User Guide for the GUI version]).<br />
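For integration into other pipelines, the six calls above may also be assembled programmatically. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, and the rule that spaces in the strain name become underscores in output file names is only inferred from the example above (strain "Xoo PXO999" and file "TALE_names_(Xoo_PXO999).tsv"). When the argument lists are passed to <code>subprocess.run</code> without a shell, the parentheses in file names need no escaping:<br />

```python
import shlex

JAR = "AnnoTALEcli-1.4.1.jar"


def annotale(tool, **params):
    """Build the argument list of one CLI call using the <parameter=value> syntax."""
    return ["java", "-jar", JAR, tool] + [f"{key}={value}" for key, value in params.items()]


def standard_pipeline(genome, promoters, strain, accession, outdir="out"):
    """Return the calls of the standard pipeline in the order given above."""
    # ASSUMPTION: spaces in the strain name are replaced by underscores in
    # output file names, as in the example "TALE_names_(Xoo_PXO999).tsv".
    tag = strain.replace(" ", "_")
    return [
        annotale("predict", g=genome, outdir=outdir),
        annotale("analyze", t=f"{outdir}/TALE_DNA_sequences.fasta", outdir=outdir),
        annotale("loadAndView", outdir=outdir),
        annotale("assign", c=f"{outdir}/Class_builder_download.xml",
                 t=f"{outdir}/TALE_DNA_parts.fasta", s=strain, a=accession, outdir=outdir),
        annotale("rename", r=f"{outdir}/TALE_names_({tag}).tsv",
                 i=f"{outdir}/Genbank__TALE_predictions.gb", outdir=outdir),
        annotale("targets", i=promoters, p="TALEs in class builder",
                 c=f"{outdir}/Augmented_class_builder_({tag}).xml", outdir=outdir),
    ]


if __name__ == "__main__":
    # Print shell-quoted commands; each list could instead be run with
    # subprocess.run(command, check=True).
    for command in standard_pipeline("genome.fa", "Rice-promoters.fa", "Xoo PXO999", "CP1234567"):
        print(shlex.join(command))
```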
<br />
==Version history==<br />
<br />
===AnnoTALE===<br />
'''Version 1.4.1'''<br />
* first version to use the updated Class Builder including a large number of recently sequenced strains<br />
* minor changes to the output of the 'Load and View TALE Classes' tool, now including the accessions in the TALE sequence output<br />
* changes to the Class Builder format to account for the increased size of class hierarchy, which previously resulted in unnecessarily large files<br />
* 32bit/1GB Windows version no longer included<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.4.1.jar Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-2GB.dmg 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-6GB.dmg 6GB version]<br />
* Windows installer of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-2GB.exe 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.4.1-6GB.exe 6GB version, 64bit Java]<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.4.1.jar AnnoTALE 1.4.1 command line application]<br />
<br />
<br />
'''Version 1.4:'''<br />
* first version containing [[PrediTALE]] and DerTALE tools for target site prediction<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.4.jar Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.4-2GB.dmg 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.4-6GB.dmg 6GB version]<br />
* Windows installer of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.4-2GB.exe 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.4-6GB.exe 6GB version, 64bit Java]; in addition, we provide a [http://www.jstacs.de/downloads/AnnoTALE-1.4-1GB.exe 1GB version with 32bit Java] for earlier and 32bit versions of Windows. Please use this version only if absolutely necessary, as some tools may not work due to memory restrictions.<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.4.jar AnnoTALE 1.4 command line application]<br />
<br />
<br />
'''Version 1.3:'''<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.3.jar AnnoTALE 1.3 Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE 1.3 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.3-2GB.dmg AnnoTALE 1.3 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.3-6GB.dmg AnnoTALE 1.3 6GB version]<br />
* Windows installer of AnnoTALE 1.3 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.3-2GB.exe AnnoTALE 1.3 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.3-6GB.exe AnnoTALE 1.3 6GB version, 64bit Java]; [http://www.jstacs.de/downloads/AnnoTALE-1.3-1GB.exe AnnoTALE 1.3 1GB version with 32bit Java]<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.3.jar AnnoTALE 1.3 command line application]<br />
<br />
Changes:<br />
* modified format of Class Builder files allowing for faster download using the "Load and View TALE Classes" tool; old Class Builder files can still be loaded<br />
* "TALE Class Presence" now also outputs a phylogenetic tree of strains based on TALEome similarities<br />
<br />
<br />
'''Version 1.2:'''<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.2.jar AnnoTALE 1.2 Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE 1.2 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.2-2GB.dmg AnnoTALE 1.2 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.2-6GB.dmg AnnoTALE 1.2 6GB version]<br />
* Windows installer of AnnoTALE 1.2 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.2-2GB.exe AnnoTALE 1.2 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.2-6GB.exe AnnoTALE 1.2 6GB version, 64bit Java]; [http://www.jstacs.de/downloads/AnnoTALE-1.2-1GB.exe AnnoTALE 1.2 1GB version with 32bit Java]<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.2.jar AnnoTALE 1.2 command line application]<br />
<br />
Changes:<br />
* Results and loaded files may now be renamed in the GUI by clicking on the corresponding name in the "Data" panel<br />
* Minor bugfixes and improvements of the GUI (Protocol may be erased, columns in "Data" panel renamed for clarity, consistency of paths in the open/save dialogs under Linux)<br />
* Two new tools: "TALE Class Presence" and "TALE Repeat differences"<br />
<br />
'''Version 1.1:'''<br />
<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.1.jar AnnoTALE 1.1 Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE 1.1 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.1-2GB.dmg AnnoTALE 1.1 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.1-6GB.dmg AnnoTALE 1.1 6GB version]<br />
* Windows installer of AnnoTALE 1.1 including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.1-2GB.exe AnnoTALE 1.1 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.1-6GB.exe AnnoTALE 1.1 6GB version, 64bit Java]; [http://www.jstacs.de/downloads/AnnoTALE-1.1-1GB.exe AnnoTALE 1.1 1GB version with 32bit Java]<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.1.jar AnnoTALE 1.1 command line application]<br />
<br />
Changes:<br />
* Additional output for the "Load and View TALE Classes" tool<br />
* "TALE Class Builder" and "TALE Class Assignment" now also accept RVD sequences (separated by dashes) as input. However, this is not recommended and some features (e.g., highlighting of aberrant repeats) will not be available. Only complete TALE DNA sequences will be accepted for inclusion into the official Class Builder.<br />
* The internal help pages now link to the PDF User Guide<br />
<br />
'''Version 1.0:'''<br />
<br />
''Initial AnnoTALE release''<br />
<br />
* [http://www.jstacs.de/downloads/AnnoTALE-1.0.jar AnnoTALE 1.0 Runnable Jar] (requires Java 8, update 45 or greater)<br />
* Mac-DMG of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.0-2GB.dmg AnnoTALE 1.0 2GB version], [http://www.jstacs.de/downloads/AnnoTALE-1.0-6GB.dmg AnnoTALE 1.0 6GB version]<br />
* Windows installer of AnnoTALE including Java: [http://www.jstacs.de/downloads/AnnoTALE-1.0-2GB.exe AnnoTALE 1.0 2GB version, 64bit Java], [http://www.jstacs.de/downloads/AnnoTALE-1.0-6GB.exe AnnoTALE 1.0 6GB version, 64bit Java]; [http://www.jstacs.de/downloads/AnnoTALE-1.0-1GB.exe AnnoTALE 1.0 1GB version with 32bit Java]<br />
* [http://www.jstacs.de/downloads/AnnoTALEcli-1.0.jar AnnoTALE 1.0 command line application]<br />
<br />
=== Class Builders ===<br />
<br />
* [http://www.jstacs.de/downloads/class_definitions_10_10_2020.xml.gz Version 10/10/2020]: used for "Download current definition" in "Load and View TALE Classes" within AnnoTALE version 1.4.1 and later<br />
* [http://www.jstacs.de/downloads/class_definitions_20_06_2019.xml.gz Version 20/06/2019]: compatible with AnnoTALE version 1.4.1 and later<br />
* [http://www.jstacs.de/downloads/class_definitions_current.xml.gz Version 29/09/2018]: used for "Download current definition" in "Load and View TALE Classes" within AnnoTALE version 1.3 and later<br />
* [http://www.jstacs.de/downloads/class_definitions_current.xml Version 09/03/2017]: used for "Download current definition" in "Load and View TALE Classes" within AnnoTALE version 1.2 and earlier<br />
* [http://www.jstacs.de/downloads/class_definitions_11_03_2016.xml Version 03/11/2016]<br />
* [http://www.jstacs.de/downloads/class_definitions_29_01_2016.xml Version 01/29/2016]<br />
* [http://www.jstacs.de/downloads/class_definitions_19_10.xml Version 10/19/2015]: used in the AnnoTALE publication (Grau ''et al.'', Sci Rep, 2016)</div>Grauhttps://www.jstacs.de/index.php?title=Catchitt&diff=1115Catchitt2020-10-05T11:58:05Z<p>Grau: /* Version history */</p>
<hr />
<div>Catchitt is a collection of tools for predicting cell type-specific binding regions of transcription factors (TFs) based on binding motifs and chromatin accessibility assays.<br />
The initial implementation of this methodology has been one of the winning approaches of the [https://www.synapse.org/#!Synapse:syn6131484/wiki/402026 ENCODE-DREAM challenge] and is described in a preprint ([https://www.biorxiv.org/content/early/2017/12/06/230011 doi: 10.1101/230011]) and a recent [https://doi.org/10.1186/s13059-018-1614-y paper].<br />
The implementation in Catchitt has been streamlined and slightly simplified to make its application more straight-forward. Specifically, we reduced the set of chromatin accessibility features to the most important ones, we simplified the sampling strategy of initial negative examples in the training step, and we omitted quantile normalization of chromatin accessibility features.<br />
<br />
== Catchitt tools ==<br />
<br />
Catchitt comprises five tools for the individual steps of the pipeline (see below). The tool "labels" computes labels for genomic regions from "conservative" (i.e., IDR-thresholded) and "relaxed" ChIP-seq peaks.<br />
The tool "access" computes chromatin accessibility features from DNase-seq or ATAC-seq data, either based on fold-enrichment tracks in Bigwig format (e.g., MACS output) or based on SAM/BAM files of mapped reads.<br />
The tool "motif" computes motif-based features from genomic sequence and PWMs in Jaspar or HOCOMOCO format, or motif models from [[Dimont]], including [[Slim]] models.<br />
The tool "itrain" performs iterative training of a series of classifiers based on labels, chromatin accessibility features, and motif features.<br />
The tool "predict" predicts binding probabilities of genomic regions based on trained classifiers and feature files. The feature files may either be measured on the training cell type (e.g., other chromosomes, "within cell type" case) or on a different cell type.<br />
<br />
== Downloads ==<br />
<br />
We provide Catchitt as a pre-compiled JAR file and also publish its source code under GPL 3. For compiling Catchitt from source files, Jstacs (v. 2.3 and later) and the corresponding external libraries are required.<br />
<br />
''Catchitt is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.''<br />
<br />
* [http://www.jstacs.de/downloads/Catchitt.jar JAR download]<br />
* the source code of Catchitt is available from [https://github.com/Jstacs/Jstacs github] in package projects.encodedream.<br />
* [http://www.jstacs.de/downloads/motifs.tgz motifs] used in the ENCODE-DREAM challenge<br />
<br />
== Citation ==<br />
<br />
If you use Catchitt in your research, please cite<br />
<br />
J. Keilwagen, S. Posch, and J. Grau. [https://doi.org/10.1186/s13059-018-1614-y Accurate prediction of cell type-specific transcription factor binding]. ''Genome Biology'', 20(1):9, 2019.<br />
<br />
== Usage ==<br />
<br />
Catchitt can be started by calling<br />
<br />
java -jar Catchitt.jar<br />
<br />
on the command line. This lists the names of the available tools with a short description:<br />
<br />
Available tools:<br />
<br />
access - Chromatin accessibility<br />
methyl - Methylation levels<br />
motif - Motif scores<br />
labels - Derive labels<br />
itrain - Iterative Training<br />
predict - Prediction<br />
<br />
Syntax: java -jar Catchitt.jar <toolname> [<parameter=value> ...]<br />
<br />
Further info about the tools is given with<br />
java -jar Catchitt.jar <toolname> info<br />
<br />
Tool parameters are listed with<br />
java -jar Catchitt.jar <toolname><br />
<br />
== Tools ==<br />
<br />
=== Derive labels ===<br />
<br />
''Derive labels'' computes labels for genomic regions based on ChIP-seq peak files. The input ChIP-seq peak files must be provided in narrowPeak format and may come in 'conservative', i.e., IDR-thresholded, and 'relaxed' flavors. If only a single peak file is available, both of the corresponding parameters may be set to this one peak file. The parameter for the bin width defines the resolution of the genomic regions that are assigned a label, while the parameter for the region width defines the size of the regions considered. If, for instance, the bin width is set to 50 and the region width to 100, regions of 100 bp shifted by 50 bp along the genome are labeled. The labels assigned may be 'S' (summit) if the current bin contains the annotated summit of a conservative peak, 'B' (bound) if the current region overlaps a conservative peak by at least half the region width, 'A' (ambiguous) if the current region overlaps a relaxed peak by at least 1 bp, or 'U' (unbound) if it overlaps with none of the peaks. The output is provided as a gzipped file 'Labels.tsv.gz' with columns chromosome, start position, and label. This output file together with a protocol of the tool run is saved to the specified output directory.<br />
<br />
''Derive labels'' may be called with<br />
<br />
java -jar Catchitt.jar labels<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>Conservative peaks (NarrowPeak file containing the conservative peaks)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>Relaxed peaks (NarrowPeak file containing the relaxed peaks)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>FAI of genome (FastA index file of the genome)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bin width (The width of the genomic bins considered, valid range = [1, 10000], default = 50)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rw</font></td><br />
<td>Region width (The width of the genomic regions considered for overlaps, valid range = [1, 10000], default = 50)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar Catchitt.jar labels c=conservative.narrowPeak r=relaxed.narrowPeak f=hg19.fa.fai b=50 rw=200 outdir=labels<br />
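The labeling rules described above may be illustrated by the following minimal sketch. It is not the Catchitt implementation: the function names are hypothetical, intervals are assumed to be half-open, and the region is assumed to be centered on its bin, which is not specified in the description above:<br />

```python
def overlap(a_start, a_end, b_start, b_end):
    """Number of bases shared by two half-open intervals [start, end)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def label_bin(bin_start, bin_width, region_width, conservative, relaxed):
    """Assign 'S', 'B', 'A' or 'U' to one bin.

    conservative: list of (start, end, summit) tuples of conservative peaks
    relaxed:      list of (start, end) tuples of relaxed peaks
    """
    bin_end = bin_start + bin_width
    # ASSUMPTION: the region of width region_width is centered on the bin.
    center = bin_start + bin_width // 2
    region_start = center - region_width // 2
    region_end = region_start + region_width

    # 'S': the bin contains the annotated summit of a conservative peak.
    if any(bin_start <= summit < bin_end for _, _, summit in conservative):
        return "S"
    # 'B': the region overlaps a conservative peak by at least half its width.
    if any(2 * overlap(region_start, region_end, start, end) >= region_width
           for start, end, _ in conservative):
        return "B"
    # 'A': the region overlaps a relaxed peak by at least 1 bp.
    if any(overlap(region_start, region_end, start, end) >= 1 for start, end in relaxed):
        return "A"
    # 'U': no overlap with any peak.
    return "U"


if __name__ == "__main__":
    conservative = [(1000, 1300, 1120)]
    relaxed = [(900, 1400)]
    for start in (500, 900, 1000, 1100):
        print(start, label_bin(start, 50, 100, conservative, relaxed))
```

The rules are checked in the order 'S', 'B', 'A', 'U', so that a bin fulfilling several criteria receives the most specific label.<br />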
<br />
<br />
=== Chromatin accessibility ===<br />
<br />
''Chromatin accessibility'' computes several chromatin accessibility features from DNase-seq or ATAC-seq data provided as fold-enrichment tracks or SAM/BAM files of mapped reads. Features are computed at a resolution defined by the bin width parameter. If this parameter is set to 50, for instance, features are computed for non-overlapping 50 bp bins along the genome. If input data are provided as a SAM/BAM file, coverage information is extracted and normalized locally in a similar fashion as proposed for the MACS peak caller. Output is provided as a gzipped file 'Chromatin_accessibility.tsv.gz' with columns chromosome, start position of the bin, minimum coverage and median coverage in the current bin, minimum coverage in 1000 bp regions before and after the current bin, maximum coverage in 1000 bp regions before and after the current bin, the number of steps in the coverage profile, and the number of monotonically increasing and decreasing steps in the coverage profile of the current bin. This output file together with a protocol of the tool run is saved to the specified output directory.<br />
<br />
''Chromatin accessibility'' may be called with<br />
<br />
java -jar Catchitt.jar access<br />
<br />
and has the following parameters<br />
<br />
<br />
<table border=0 cellpadding=10 align="center"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>Data source (The format of the input file containing the coverage information, range={BAM/SAM, Bigwig}, default = BAM/SAM)<table border=0 cellpadding=10 align="center"><br />
<tr><td colspan=3>Parameters for selection &quot;BAM/SAM&quot;:</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>Input SAM/BAM (The input file containing the mapped DNase-seq/ATAC-seq reads)</td><br />
<td>FILE</td><br />
</tr><br />
<tr><td colspan=3>Parameters for selection &quot;Bigwig&quot;:</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>Input Bigwig (The input file containing the mapped DNase-seq/ATAC-seq reads)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>FastA index (The genome index)</td><br />
<td>FILE</td><br />
</tr><br />
</table></td><td></td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bin width (The width of the genomic bins considered)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar Catchitt.jar access d="Bigwig" i=fold_enrich.bw f=hg19.fa.fai b=50 outdir=dnase<br />
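To make the per-bin features more concrete, the following minimal sketch computes a subset of them from a per-base coverage profile given as a Python list. It is an illustration only and not the Catchitt implementation: the function and key names are hypothetical, the two flanking regions are pooled (the tool may report them differently), a "step" is assumed to be a change of coverage between adjacent positions, and the monotonic-step features are omitted:<br />

```python
from statistics import median


def bin_features(coverage, bin_start, bin_width=50, flank=1000):
    """Compute a subset of the accessibility features for one bin.

    coverage: list of per-base coverage values of one chromosome
    """
    bin_end = bin_start + bin_width
    in_bin = coverage[bin_start:bin_end]
    # Flanking regions of up to `flank` bp before and after the bin.
    before = coverage[max(0, bin_start - flank):bin_start]
    after = coverage[bin_end:bin_end + flank]
    # ASSUMPTION: both flanks are pooled for the minimum and maximum.
    flanks = before + after
    # ASSUMPTION: a step is a change of coverage between adjacent positions.
    steps = sum(1 for left, right in zip(in_bin, in_bin[1:]) if left != right)
    return {
        "min": min(in_bin),
        "median": median(in_bin),
        "flank_min": min(flanks) if flanks else None,
        "flank_max": max(flanks) if flanks else None,
        "steps": steps,
    }


if __name__ == "__main__":
    profile = [0] * 100 + [2] * 25 + [5] * 25 + [1] * 100
    print(bin_features(profile, bin_start=100, bin_width=50, flank=100))
```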
<br />
<br />
=== Methylation levels ===<br />
''Methylation levels'' may be called with<br />
<br />
java -jar Catchitt.jar methyl<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>Input Bed.gz (The bedMethyl file (gzipped) containing the methylation levels, mime = bed.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>FastA index (The genome index, mime = fai)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bin width (The width of the genomic bins considered)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar Catchitt.jar methyl i=Input_Bed.gz f=hg19.fa.fai b=50<br />
<br />
<br />
=== Motif scores ===<br />
<br />
''Motif scores'' computes features based on motif scores of a given motif model scanning sub-sequences along the genome. Motif scores are aggregated in bins of the specified width as maximum score and log of the average exponential score (i.e., average log-likelihood in case of statistical models). The motif model may be provided as PWMs in HOCOMOCO or PFMs in Jaspar format, or as [[Dimont]] motif models in XML format. For more complex motif models like Slim models, the current implementation uses several indexes to speed up the scanning process. However, computing these indexes is rather memory-consuming and often not worthwhile for simple PWM models. Hence, a low-memory variant of the tool is available, which is typically only slightly slower for PWM models but substantially slower for Slim models. Output is provided as a gzipped file 'Motif_scores.tsv.gz' containing columns chromosome, start position, maximum and average score. This output file together with a protocol of the tool run is saved to the specified output directory.<br />
<br />
<br />
''Motif scores'' may be called with<br />
<br />
java -jar Catchitt.jar motif<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Motif model (The motif model in Dimont, HOCOMOCO, or Jaspar format, range={Dimont, HOCOMOCO, Jaspar}, default = Dimont)<table border=0 cellpadding=10 align="center"><br />
<tr><td colspan=3>Parameters for selection &quot;Dimont&quot;:</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>Dimont motif (Dimont motif model description)</td><br />
<td>FILE</td><br />
</tr><br />
<tr><td colspan=3>Parameters for selection &quot;HOCOMOCO&quot;:</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">h</font></td><br />
<td>HOCOMOCO PWM (PWM from the HOCOMOCO database)</td><br />
<td>FILE</td><br />
</tr><br />
<tr><td colspan=3>Parameters for selection &quot;Jaspar&quot;:</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">j</font></td><br />
<td>Jaspar PFM (PFM in Jaspar format)</td><br />
<td>FILE</td><br />
</tr><br />
</table></td><td></td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>Genome (Genome as FastA file)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>FAI of genome (FastA index file of the genome)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bin width (The width of the genomic bins considered)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">l</font></td><br />
<td>Low-memory mode (Use slower mode with a smaller memory footprint, default = true)</td><br />
<td>BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example'''<br />
<br />
java -jar Catchitt.jar motif m=HOCOMOCO h=motif.pwm g=hg19.fa f=hg19.fa.fai b=50 outdir=motifs<br />
<br />
=== Iterative Training ===<br />
<br />
''Iterative Training'' performs an iterative training with the specified number of iterations to obtain a series of classifiers that may be used for predictions in the same cell type or in other cell types based on a corresponding set of feature files. The tool requires as input labels for the training chromosomes, a chromatin accessibility feature file, and a set of motif feature files. From the labels, an initial set of training regions is extracted containing all positive examples labeled as 'S' (summit) and a sub-sample of the negative examples, i.e., regions labeled as 'U' (unbound). During the iterations, the initial negative examples are complemented with additional negatives that obtain large binding probabilities, i.e., putative false positive predictions. As these additional negative examples are derived from predictions of the current set of classifiers, the number of bins used for aggregation needs to be specified and should be identical to that used for predictions later. Training chromosomes and chromosomes used for predictions in the iterative training may be specified, as well as the percentile of the prediction scores of positive regions (i.e., summit or bound regions) that should be used as threshold to identify putative false positives. The specified bin width must be identical to the bin width specified when computing the corresponding feature files. Feature vectors for training regions may span several adjacent bins as specified by the number of bins parameter. Output is an XML file Classifiers.xml containing the set of trained classifiers. This output file together with a protocol of the tool run is saved to the specified output directory.<br />
<br />
''Iterative Training'' may be called with<br />
<br />
java -jar Catchitt.jar itrain<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>Accessibility (File containing accessibility features)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Motif (File containing motif features), MAY BE USED MULTIPLE TIMES</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">l</font></td><br />
<td>Labels (File containing the labels)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>FAI of genome (FastA index file of the genome)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bin width (The width of the genomic bins, valid range = [1, 1000], default = 50)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Number of bins (The number of adjacent bins, valid range = [1, 20], default = 5)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">abb</font></td><br />
<td>Aggregation: bins before (The number of bins before the current one considered in the aggregation, valid range = [1, 20], default = 1)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">aba</font></td><br />
<td>Aggregation: bins after (The number of bins after the current one considered in the aggregation, valid range = [1, 20], default = 4)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>Iterations (The number of iterations of the iterative training, valid range = [1, 20], default = 5)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>Training chromosomes (Training chromosomes, separated by commas, OPTIONAL)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">itc</font></td><br />
<td>Iterative training chromosomes (Chromosomes with predictions in iterative training, separated by commas, OPTIONAL)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Percentile (Percentile of the prediction scores of positives used as threshold in iterative training, valid range = [0.0, 1.0], default = 0.01)</td><br />
<td>DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example'''<br />
<br />
java -jar Catchitt.jar itrain a=dnase/Chromatin_accessibility.tsv.gz m=motif1/Motif_scores.tsv.gz m=motif2/Motif_scores.tsv.gz l=labels/Labels.tsv.gz f=hg19.fa.fai b=50 n=5 abb=1 aba=4 i=5 t="chr1,chr2,chr3" itc="chr1,chr2" p=0.01 outdir=cls<br />
<br />
=== Prediction ===<br />
<br />
''Prediction'' predicts binding probabilities of genomic regions as specified during training of the set of classifiers in iterative training. As input, Prediction requires a set of trained classifiers in XML format and the same (type of) feature files as used in training (motif files must be specified in the same order!). In addition, the chromosomes for which predictions are made may be specified, and the numbers of bins used for aggregation may be specified to deviate from those used during training. If these bin numbers are not specified, those from the training run are used. Finally, it is possible to restrict the classifiers considered to the first n classifiers. Output is provided as a gzipped file 'Predictions.tsv.gz' with columns chromosome, start position, binding probability. This output file together with a protocol of the tool run is saved to the specified output directory.<br />
<br />
''Prediction'' may be called with<br />
<br />
java -jar Catchitt.jar predict<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>Classifiers (The classifiers trained by iterative training)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>Accessibility (File containing accessibility features)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Motif (File containing motif features) MAY BE USED MULTIPLE TIMES</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>FAI of genome (FastA index file of the genome)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Prediction chromosomes (Prediction chromosomes, separated by commas, OPTIONAL)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">abb</font></td><br />
<td>Aggregation: bins before (Number of bins before the current one considered for aggregation, OPTIONAL)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">aba</font></td><br />
<td>Aggregation: bins after (Number of bins after the current one considered for aggregation, OPTIONAL)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Number of classifiers (Use only the first n classifiers for predictions, OPTIONAL)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example'''<br />
<br />
java -jar Catchitt.jar predict c=cls/Classifiers.xml a=dnase/Chromatin_accessibility.tsv.gz m=motif1/Motif_scores.tsv.gz m=motif2/Motif_scores.tsv.gz f=hg19.fa.fai p="chr8,chr21" abb=1 aba=4 n=3 outdir=predict<br />
<br />
== Standard pipeline ==<br />
<br />
The standard Catchitt pipeline would comprise the following steps<br />
<br />
* for a training cell type, collect ChIP-seq peak files (preferably ''conservative'' and ''relaxed'' peaks) in narrowPeak format and derive labels for genomic regions (''Derive labels'')<br />
* for the same cell type, collect chromatin accessibility data (DNase-seq or ATAC-seq) as fold-enrichment tracks or mapping files, and derive chromatin accessibility features from those data (''Chromatin accessibility'')<br />
* collect or learn (e.g., using [[Dimont]]) a set of motif models for the transcription factor of interest, and scan the genome using these motif models (''Motif scores'')<br />
* perform iterative training given the labels and feature files (''Iterative Training'')<br />
* predict binding probabilities of genomic regions in the same cell type or in other cell types. In the latter case, additional chromatin accessibility data for these target cell types need to be collected and features need to be derived as in the ''Chromatin accessibility'' step above. (''Prediction'')<br />
<br />
<br />
== Tutorial using ENCODE data ==<br />
<br />
We describe a typical Catchitt pipeline using public ENCODE data for the transcription factor CTCF in two cell lines.<br />
This tutorial uses real-world data for the complete human genome in the GRCh38 version used by ENCODE, illustrating different DNase-seq input formats and different motif sources. Please note that this realistic scenario also entails real-world runtimes of the individual Catchitt steps.<br />
<br />
For best performance, we would further recommend<br />
* to use multiple motifs from different sources, including motifs derived from DNase-seq (available in our [http://www.jstacs.de/downloads/motifs.tgz motif collection] of the ENCODE-DREAM challenge in directory de-novo/DNase-peaks)<br />
* to use replicate information for DNase data, for instance using the [https://github.com/kundajelab/atac_dnase_pipelines pipeline of the Kundaje lab]<br />
<br />
In this tutorial, we concentrate on the Catchitt pipeline and illustrate its usage based on readily available data.<br />
<br />
=== Obtaining training and test data ===<br />
<br />
First, we need the GRCh38 genome version used by ENCODE. This genome is available as a gzipped FastA file from [https://www.encodeproject.org ENCODE] at<br />
https://www.encodeproject.org/files/GRCh38_no_alt_analysis_set_GCA_000001405.15/@@download/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.gz<br />
<br />
After download, the genome needs to be gunzipped and indexed using the [http://www.htslib.org samtools] faidx command:<br />
<br />
gunzip GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.gz<br />
samtools faidx GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta<br />
<br />
In the following, we assume that genome FastA and index are in the base directory.<br />
<br />
In addition, we need the DNase-seq data. We consider two cell lines ("astrocyte of the spinal cord" and "fibroblast of villous mesenchyme"). The corresponding DNase-seq data are available from [https://www.encodeproject.org ENCODE] under accessions ENCSR000ENB and ENCSR000EOR, respectively.<br />
Here, we first consider the Bigwig files of the first replicate for each cell line, which can be downloaded from the following URLs:<br />
<br />
https://www.encodeproject.org/files/ENCFF901UBX/@@download/ENCFF901UBX.bigWig<br />
https://www.encodeproject.org/files/ENCFF652HJH/@@download/ENCFF652HJH.bigWig<br />
<br />
For obtaining labels for CTCF binding, we further need ChIP-seq peaks. Here, we consider the ChIP-seq experiment with accession ENCSR000DSU for the astrocytes, which will serve as our training data in the following.<br />
The corresponding "conservative" and "relaxed" peak files for astrocytes are available from<br />
https://www.encodeproject.org/files/ENCFF183YLB/@@download/ENCFF183YLB.bed.gz<br />
https://www.encodeproject.org/files/ENCFF600CYD/@@download/ENCFF600CYD.bed.gz<br />
<br />
Again, the peak files need to be gunzipped for the following steps.<br />
<br />
Finally, we need a motif model for CTCF, which we download from [http://hocomoco11.autosome.ru HOCOMOCO] in this case<br />
http://hocomoco11.autosome.ru/final_bundle/hocomoco11/full/HUMAN/mono/pwm/CTCF_HUMAN.H11MO.0.A.pwm<br />
<br />
We organize all these files (and the Catchitt JAR) in the following directory structure<br />
<br />
.:<br />
Catchitt.jar<br />
GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta<br />
GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai<br />
<br />
./astrocytes:<br />
ENCFF183YLB.bed<br />
ENCFF600CYD.bed<br />
ENCFF901UBX.bigWig<br />
<br />
./fibroblasts:<br />
ENCFF652HJH.bigWig<br />
<br />
./motifs/CTCF/:<br />
CTCF_HUMAN.H11MO.0.A.pwm<br />
<br />
=== Deriving labels ===<br />
<br />
As we use supervised training of model parameters, we need labels for the genomic regions, qualifying these as bound (B) or unbound (U). Besides, we have additional labels for bound regions at the peak summit (S) and ambiguous regions (A) that are (partly) covered by relaxed but not by conservative peaks.<br />
<br />
For training purposes, we need to derive labels from the astrocyte ChIP-seq peaks by calling<br />
java -jar Catchitt.jar labels c=astrocytes/ENCFF183YLB.bed\<br />
r=astrocytes/ENCFF600CYD.bed\<br />
f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\<br />
b=50 rw=200 outdir=astrocytes/labels<br />
Here, we use a bin width of 50 bp (i.e., we resolve any feature or binding event with 50 bp resolution) and a region width of 200 bp as used in ENCODE-DREAM. A detailed description of the partitioning of the genome into non-overlapping bins and the logic behind the regions for which predictions are made may be found in the [https://doi.org/10.1186/s13059-018-1614-y Catchitt paper].<br />
The result is a file astrocytes/labels/Labels.tsv.gz with the following format<br />
chr1 0 U<br />
chr1 50 U<br />
chr1 100 U<br />
chr1 150 U<br />
chr1 200 U<br />
chr1 250 U<br />
where the columns contain chromosome, bin starting position, and corresponding label, and are separated by tabs.<br />
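<br />
Before training, it can be useful to check how many bins received each label. The following is a minimal sketch, not part of Catchitt: the function name count_labels is hypothetical, and it only assumes the three-column, tab-separated format shown above.<br />

```shell
# Hypothetical helper (not part of Catchitt): count the bins per label
# in a gzipped, tab-separated label file (chromosome, start, label).
count_labels() {
    # column 3 holds the label; print "label<TAB>count" per distinct label
    gzip -dc "$1" | cut -f3 | sort | uniq -c | awk '{print $2 "\t" $1}'
}
```

With the file from this tutorial, the helper would be called as count_labels astrocytes/labels/Labels.tsv.gz.<br />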
<br />
=== Preparing DNase data from bigwig format ===<br />
<br />
We further derive DNase-seq features from the bigwig file that we downloaded in the first step. Again, we specify a bin width of 50 bp.<br />
<br />
java -jar Catchitt.jar access d="Bigwig" i=astrocytes/ENCFF901UBX.bigWig f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\<br />
b=50 outdir=astrocytes/access<br />
The result is a file astrocytes/access/Chromatin_accessibility.tsv.gz with the following format<br />
<br />
chr1 1033400 0.03954650089144707 0.05627769976854324 0.009126120246946812 0.030420400202274323 0.06692489981651306 1.03125 3.0 1.0 0.0<br />
chr1 1033450 0.030420400202274323 0.03650449961423874 0.009126120246946812 0.030420400202274323 0.045630600303411484 1.03125 2.0 0.0 0.0<br />
chr1 1033500 0.024336300790309906 0.03346240147948265 0.009126120246946812 0.030420400202274323 0.045630600303411484 1.03125 2.0 1.0 0.0<br />
chr1 1033550 0.01825219951570034 0.024336300790309906 0.009126120246946812 0.024336300790309906 0.060840800404548645 1.03125 2.0 0.0 1.0<br />
<br />
where the first two columns, again, correspond to chromosome and starting position, and the remaining columns are<br />
* minimum DNase value in bin,<br />
* median DNase value in bin,<br />
* minimum in 1000 bp after bin start,<br />
* minimum in 1000 bp before bin start,<br />
* maximum in 1000 bp after bin start,<br />
* maximum in 1000 bp before bin start,<br />
* the number of steps in the bin profile,<br />
* the length of the longest monotonically increasing range in the bin,<br />
* the length of the longest monotonically decreasing range in the bin.<br />
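<br />
The prediction step later in this tutorial reads fibroblasts/access/Chromatin_accessibility.tsv.gz, so the same feature extraction is needed for the fibroblast bigwig file as well. The sketch below wraps the call shown above in a small function so that it can be reused for both cell types. The function name access_from_bigwig and the variables CATCHITT and FAI are assumptions made for illustration; the parameters passed to "access" are those documented above.<br />

```shell
# Command prefix for Catchitt; may be overridden, e.g. CATCHITT=echo for a dry run.
# Relies on word splitting of $CATCHITT, i.e. intended for sh/bash.
CATCHITT=${CATCHITT:-java -jar Catchitt.jar}
FAI=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai

# Hypothetical wrapper: $1 = cell type directory, $2 = bigwig file in that directory.
# Features are written to <cell type directory>/access.
access_from_bigwig() {
    $CATCHITT access d=Bigwig i="$1/$2" f="$FAI" b=50 outdir="$1/access"
}
```

For the fibroblast data, the call would be access_from_bigwig fibroblasts ENCFF652HJH.bigWig, which should write fibroblasts/access/Chromatin_accessibility.tsv.gz.<br />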
<br />
=== Preparing motif scores ===<br />
<br />
We also compute motif scores along the genome for the PWM we downloaded from HOCOMOCO:<br />
<br />
java -jar Catchitt.jar motif m="HOCOMOCO" h=motifs/CTCF/CTCF_HUMAN.H11MO.0.A.pwm g=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta\<br />
f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai b=50 outdir=motifs/CTCF threads=3<br />
The result is a file motifs/CTCF/Motif_scores.tsv.gz with the following format<br />
<br />
chr1 46950 -4.996643 -4.9543528358429105<br />
chr1 47000 -5.984124 -5.451674735652041<br />
chr1 47050 -0.8633305 -0.4596223585537509<br />
chr1 47100 -4.9379983 -4.813470561120627<br />
<br />
where the first two columns, again, correspond to chromosome and starting position, and the remaining two columns are<br />
* the maximum motif score within the bin,<br />
* the logarithm of the aggregated exponentials of the individual scores within the bin; for scores that are log-likelihoods, this is proportional to the log-likelihood of the complete sequence.<br />
<br />
=== Iterative training ===<br />
<br />
With all the feature files prepared, we may now run the iterative training procedure. Here, we use all autosomes except chr1, chr8, and chr21 for training, use five of the training chromosomes also for generating new negative examples in each of the iterations, and use 8 computation threads for the numeric optimization of model parameters.<br />
''At this stage, it is critical that all feature files have been generated from the same reference. This way, we may sweep in parallel over all feature files that, at each line, represent the identical genomic location. Otherwise, the iterative training will throw an error stating that the chromosomes do not match at a certain line of the input files.''<br />
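<br />
Since a mismatch is only reported once training is running, it may be worth comparing the feature files beforehand. The following sketch is not part of Catchitt; the function name same_bins is hypothetical, and it assumes that the compared files are expected to list identical chromosome and start columns line by line, as described above.<br />

```shell
# Hypothetical sanity check (not part of Catchitt): succeeds only if two
# gzipped, tab-separated files list the same chromosome/start pairs
# in the same order.
same_bins() {
    # checksum of the first two columns of each file
    a=$(gzip -dc "$1" | cut -f1,2 | cksum)
    b=$(gzip -dc "$2" | cut -f1,2 | cksum)
    [ "$a" = "$b" ]
}
```

An example call would be same_bins astrocytes/access/Chromatin_accessibility.tsv.gz motifs/CTCF/Motif_scores.tsv.gz || echo "bins differ". Both files are decompressed completely, so the check takes a while on whole-genome data.<br />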
<br />
We start iterative training by calling<br />
java -jar Catchitt.jar itrain a=astrocytes/access/Chromatin_accessibility.tsv.gz m=motifs/CTCF/Motif_scores.tsv.gz\<br />
l=astrocytes/labels/Labels.tsv.gz f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\<br />
 b=50 t='chr2,chr3,chr4,chr5,chr6,chr7,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr22'\<br />
itc='chr10,chr11,chr12,chr13,chr14' outdir=astrocytes/itrain threads=8<br />
which results in a file astrocytes/itrain/Classifiers.xml containing the trained classifiers.<br />
<br />
=== Predicting binding in new cell types ===<br />
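Using the trained classifiers from the previous step and chromatin accessibility features for fibroblasts, we may now predict binding in the fibroblast cell type. The accessibility features are expected in fibroblasts/access/Chromatin_accessibility.tsv.gz, i.e., the "access" tool needs to be run on fibroblasts/ENCFF652HJH.bigWig with outdir=fibroblasts/access in the same way as for the astrocyte data. In the example, we generate predictions only for chromosome 8, which could be extended to other chromosomes using parameter "p":<br />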
java -jar Catchitt.jar predict c=astrocytes/itrain/Classifiers.xml a=fibroblasts/access/Chromatin_accessibility.tsv.gz\<br />
m=motifs/CTCF/Motif_scores.tsv.gz f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\<br />
p="chr8" outdir=fibroblasts/predict<br />
This finally results in a file fibroblasts/predict/Predictions.tsv.gz containing the predicted binding probabilities per region.<br />
This file has three columns, corresponding to chromosome, starting position, and binding probability:<br />
<br />
chr8 265850 0.9866555574053496<br />
chr8 265900 0.9865107771922306<br />
chr8 265950 0.9864837006927715<br />
chr8 266000 0.8041139249973046<br />
chr8 266050 0.19870629729482686<br />
chr8 266100 0.1302269536110939<br />
chr8 266150 0.09693322015563202<br />
<br />
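For downstream analyses, the predictions are often reduced to bins with a large binding probability. The following sketch is not part of Catchitt: the function name confident_bins and the threshold of 0.8 used in the example call are arbitrary choices for illustration, and only the three-column format shown above is assumed.<br />

```shell
# Hypothetical helper (not part of Catchitt): print all bins whose predicted
# binding probability is at least the given threshold.
# $1 = gzipped prediction file, $2 = threshold between 0 and 1
confident_bins() {
    # "+ 0" forces a numeric comparison of column 3 against the threshold
    gzip -dc "$1" | awk -F '\t' -v t="$2" '$3 + 0 >= t + 0'
}
```

For the excerpt above, confident_bins fibroblasts/predict/Predictions.tsv.gz 0.8 would keep the four bins from position 265850 to 266000.<br />
<br />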
<br />
=== Using DNase-seq BAM files and multiple motifs ===<br />
<br />
Instead of bigwig files, the "access" tool of Catchitt also accepts BAM files of mapped DNase-seq (or ATAC-seq) data. Internally, this tool counts 5' ends of reads, and performs local normalization of read depth and average smoothing.<br />
Here, we download the BAM files corresponding to the previous bigwig files from ENCODE<br />
https://www.encodeproject.org/files/ENCFF384CCQ/@@download/ENCFF384CCQ.bam<br />
https://www.encodeproject.org/files/ENCFF368XNE/@@download/ENCFF368XNE.bam<br />
<br />
and sort them into the directory structure.<br />
<br />
In addition, we use four motifs from the ''used-for-all-TFs'' directory of our [http://www.jstacs.de/downloads/motifs.tgz motif collection].<br />
<br />
Afterwards, the directory structure should look like<br />
<br />
.:<br />
Catchitt.jar<br />
GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta<br />
GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai<br />
<br />
./astrocytes:<br />
ENCFF183YLB.bed<br />
ENCFF600CYD.bed<br />
ENCFF901UBX.bigWig<br />
ENCFF384CCQ.bam<br />
<br />
./fibroblasts:<br />
ENCFF652HJH.bigWig<br />
ENCFF368XNE.bam<br />
<br />
./motifs/CTCF/:<br />
CTCF_HUMAN.H11MO.0.A.pwm<br />
<br />
./motifs/CTCF_Slim:<br />
Ctcf_H1hesc_shift20_bdeu_order-20_comp1-model-1.xml<br />
<br />
./motifs/JUND_Slim:<br />
Jund_K562_shift20_bdeu_order-20_comp1-model-1.xml<br />
<br />
./motifs/MAX_Slim:<br />
Max_K562_shift20_bdeu_order-20_comp1-model-1.xml<br />
<br />
./motifs/SP1:<br />
ENCSR000BHK_SP1-human_1_hg19-model-2.xml<br />
<br />
<br />
Now, we first compute the DNase-seq features from the BAM files using the "access" tool:<br />
<br />
java -jar Catchitt.jar access i=astrocytes/ENCFF384CCQ.bam b=50 outdir=astrocytes/access_bam/<br />
java -jar Catchitt.jar access i=fibroblasts/ENCFF368XNE.bam b=50 outdir=fibroblasts/access_bam/<br />
<br />
We also compute the motif-based features from the additional motif files. For the PWM model of SP1, we switch the input format to Dimont XMLs but still use the low-memory version of "motif" that we also used for the HOCOMOCO PWM:<br />
<br />
java -jar Catchitt.jar motif d=motifs/SP1/ENCSR000BHK_SP1-human_1_hg19-model-2.xml\<br />
g=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\<br />
b=50 outdir=motifs/SP1 threads=3<br />
<br />
The remaining motif models are [[Slim]] models, which are substantially more complex than PWMs. While scans for these models could be accomplished by the low-memory version of "motif" as well, this would require substantial runtime. Hence, we switch off the low-memory option in this case, which, in turn, requires increasing the memory reserved by Java:<br />
<br />
java -jar -Xms512M -Xmx64G Catchitt.jar motif d=motifs/CTCF_Slim/Ctcf_H1hesc_shift20_bdeu_order-20_comp1-model-1.xml\<br />
g=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\<br />
b=50 outdir=motifs/CTCF_Slim l=false threads=3<br />
java -jar -Xms512M -Xmx64G Catchitt.jar motif d=motifs/JUND_Slim/Jund_K562_shift20_bdeu_order-20_comp1-model-1.xml\<br />
g=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\<br />
b=50 outdir=motifs/JUND_Slim l=false threads=3<br />
 java -jar -Xms512M -Xmx64G Catchitt.jar motif d=motifs/MAX_Slim/Max_K562_shift20_bdeu_order-20_comp1-model-1.xml\<br />
g=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\<br />
b=50 outdir=motifs/MAX_Slim l=false threads=3<br />
<br />
Finally, we start the iterative training using the new feature files:<br />
java -jar Catchitt.jar itrain a=astrocytes/access_bam/Chromatin_accessibility.tsv.gz\<br />
m=motifs/CTCF/Motif_scores.tsv.gz m=motifs/CTCF_Slim/Motif_scores.tsv.gz m=motifs/JUND_Slim/Motif_scores.tsv.gz\<br />
m=motifs/MAX_Slim/Motif_scores.tsv.gz m=motifs/SP1/Motif_scores.tsv.gz l=astrocytes/labels/Labels.tsv.gz\<br />
f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai b=50\<br />
 t='chr2,chr3,chr4,chr5,chr6,chr7,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr22'\<br />
itc='chr10,chr11,chr12,chr13,chr14' outdir=astrocytes/itrain_bam_5motifs threads=8<br />
Please note that we used the parameter "m" multiple times to specify the different motif-based feature files.<br />
<br />
It is important to specify these motifs in the same order when calling "predict" afterwards, i.e.,<br />
java -jar Catchitt.jar predict c=astrocytes/itrain_bam_5motifs/Classifiers.xml a=fibroblasts/access_bam/Chromatin_accessibility.tsv.gz\<br />
m=motifs/CTCF/Motif_scores.tsv.gz m=motifs/CTCF_Slim/Motif_scores.tsv.gz m=motifs/JUND_Slim/Motif_scores.tsv.gz\<br />
m=motifs/MAX_Slim/Motif_scores.tsv.gz m=motifs/SP1/Motif_scores.tsv.gz\<br />
f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai p="chr8" outdir=fibroblasts/predict_bam_5motifs<br />
<br />
The predictions based on the BAM files and the five motifs are then available from the file fibroblasts/predict_bam_5motifs/Predictions.tsv.gz in the format explained previously.<br />
<br />
== Version history ==<br />
<br />
* Catchitt v0.1.2: Bugfixes, new experimental tools for handling methylation levels<br />
<br />
* [http://www.jstacs.de/downloads/Catchitt_0.1.1.jar Catchitt v0.1.1]: Bugfixes for border cases; reduced debugging output<br />
<br />
* Catchitt v0.1: [http://www.jstacs.de/downloads/Catchitt_0.1.jar Initial release]</div>Grauhttps://www.jstacs.de/index.php?title=Catchitt&diff=1114Catchitt2020-10-05T11:55:39Z<p>Grau: /* Tools */</p>
<hr />
<div>Catchitt is a collection of tools for predicting cell type-specific binding regions of transcription factors (TFs) based on binding motifs and chromatin accessibility assays.<br />
The initial implementation of this methodology has been one of the winning approaches of the ENCODE-DREAM challenge ([https://www.synapse.org/#!Synapse:syn6131484/wiki/402026]) and is described in a preprint (https://www.biorxiv.org/content/early/2017/12/06/230011 doi: 10.1101/230011) and a recent [https://doi.org/10.1186/s13059-018-1614-y paper].<br />
The implementation in Catchitt has been streamlined and slightly simplified to make its application more straightforward. Specifically, we reduced the set of chromatin accessibility features to the most important ones, we simplified the sampling strategy of initial negative examples in the training step, and we omitted quantile normalization of chromatin accessibility features.<br />
<br />
== Catchitt tools ==<br />
<br />
Catchitt comprises five tools for the individual steps of the pipeline (see below). The tool "labels" computes labels for genomic regions from "conservative" (i.e., IDR-thresholded) and "relaxed" ChIP-seq peaks.<br />
The tool "access" computes chromatin accessibility features from DNase-seq or ATAC-seq data, either based on fold-enrichment tracks in Bigwig format (e.g., MACS output) or based on SAM/BAM files of mapped reads.<br />
The tool "motif" computes motif-based features from genomic sequence and PWMs in Jaspar or HOCOMOCO format, or motif models from [[Dimont]], including [[Slim]] models.<br />
The tool "itrain" performs iterative training of a series of classifiers based on labels, chromatin accessibility features, and motif features.<br />
The tool "predict" predicts binding probabilities of genomic regions based on trained classifiers and feature files. The feature files may either be measured on the training cell type (e.g., other chromosomes, "within cell type" case) or on a different cell type.<br />
<br />
== Downloads ==<br />
<br />
We provide Catchitt as a pre-compiled JAR file and also publish its source code under GPL 3. To compile Catchitt from source, Jstacs (v2.3 or later) and the corresponding external libraries are required.<br />
<br />
''Catchitt is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.''<br />
<br />
* [http://www.jstacs.de/downloads/Catchitt.jar JAR download]<br />
* The source code of Catchitt is available from [https://github.com/Jstacs/Jstacs GitHub] in package projects.encodedream.<br />
* [http://www.jstacs.de/downloads/motifs.tgz motifs] used in the ENCODE-DREAM challenge<br />
<br />
== Citation ==<br />
<br />
If you use Catchitt in your research, please cite<br />
<br />
J. Keilwagen, S. Posch, and J. Grau. [https://doi.org/10.1186/s13059-018-1614-y Accurate prediction of cell type-specific transcription factor binding]. ''Genome Biology'', 20(1):9, 2019.<br />
<br />
== Usage ==<br />
<br />
Catchitt can be started by calling<br />
<br />
java -jar Catchitt.jar<br />
<br />
on the command line. This lists the names of the available tools with a short description:<br />
<br />
Available tools:<br />
<br />
access - Chromatin accessibility<br />
methyl - Methylation levels<br />
motif - Motif scores<br />
labels - Derive labels<br />
itrain - Iterative Training<br />
predict - Prediction<br />
<br />
Syntax: java -jar Catchitt.jar <toolname> [<parameter=value> ...]<br />
<br />
Further info about the tools is given with<br />
java -jar Catchitt.jar <toolname> info<br />
<br />
Tool parameters are listed with<br />
java -jar Catchitt.jar <toolname><br />
<br />
== Tools ==<br />
<br />
=== Derive labels ===<br />
<br />
''Derive labels'' computes labels for genomic regions based on ChIP-seq peak files. The input ChIP-seq peak files must be provided in narrowPeak format and may come in 'conservative', i.e., IDR-thresholded, and 'relaxed' flavors. If only a single peak file is available, both of the corresponding parameters may be set to this one peak file. The parameter for the bin width defines the resolution of the genomic regions that are assigned a label, while the parameter for the region width defines the size of the regions considered. If, for instance, the bin width is set to 50 and the region width to 100, regions of 100 bp shifted by 50 bp along the genome are labeled. The labels assigned may be 'S' (summit) if the current bin contains the annotated summit of a conservative peak, 'B' (bound) if the current region overlaps a conservative peak by at least half the region width, 'A' (ambiguous) if the current region overlaps a relaxed peak by at least 1 bp, or 'U' (unbound) if it overlaps with none of the peaks. The output is provided as a gzipped file 'Labels.tsv.gz' with columns chromosome, start position, and label. This output file together with a protocol of the tool run is saved to the specified output directory.<br />
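<br />
The labelling rules can be illustrated with a small Python sketch. It is not the Catchitt implementation: the half-open coordinates, the precedence S, B, A, U (taken from the order in which the labels are listed), and the fact that the caller passes the region start explicitly are all assumptions made for illustration.<br />
<br />
```python
from typing import List, Tuple

# (start, end, summit): half-open interval [start, end) with an absolute
# summit position. This peak representation is an assumption of the sketch.
Peak = Tuple[int, int, int]


def overlap(a_start: int, a_end: int, b_start: int, b_end: int) -> int:
    """Number of base pairs shared by two half-open intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def label_region(bin_start: int, bin_width: int,
                 region_start: int, region_width: int,
                 conservative: List[Peak], relaxed: List[Peak]) -> str:
    """Assign 'S', 'B', 'A' or 'U' following the rules in the text."""
    bin_end = bin_start + bin_width
    region_end = region_start + region_width
    # 'S': the bin contains the summit of a conservative peak.
    if any(bin_start <= summit < bin_end for _, _, summit in conservative):
        return "S"
    # 'B': the region overlaps a conservative peak by at least half its width.
    if any(2 * overlap(region_start, region_end, s, e) >= region_width
           for s, e, _ in conservative):
        return "B"
    # 'A': the region overlaps a relaxed peak by at least 1 bp.
    if any(overlap(region_start, region_end, s, e) >= 1 for s, e, _ in relaxed):
        return "A"
    # 'U': no overlap with any peak.
    return "U"
```
<br />
How the region is positioned relative to its bin is deliberately left to the caller, since the text above does not specify it.<br />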
<br />
''Derive labels'' may be called with<br />
<br />
java -jar Catchitt.jar labels<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>Conservative peaks (NarrowPeak file containing the conservative peaks)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>Relaxed peaks (NarrowPeak file containing the relaxed peaks)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>FAI of genome (FastA index file of the genome)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bin width (The width of the genomic bins considered, valid range = [1, 10000], default = 50)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rw</font></td><br />
<td>Region width (The width of the genomic regions considered for overlaps, valid range = [1, 10000], default = 50)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar Catchitt.jar labels c=conservative.narrowPeak r=relaxed.narrowPeak f=hg19.fa.fai b=50 rw=200 outdir=labels<br />
<br />
<br />
=== Chromatin accessibility ===<br />
<br />
''Chromatin accessibility'' computes several chromatin accessibility features from DNase-seq or ATAC-seq data provided as fold-enrichment tracks or SAM/BAM files of mapped reads. Features are computed with a certain resolution defined by the bin width parameter. If this parameter is set to 50, for instance, features are computed for non-overlapping 50 bp bins along the genome. If input data are provided as a SAM/BAM file, coverage information is extracted and normalized locally in a similar fashion as proposed for the MACS peak caller. Output is provided as a gzipped file 'Chromatin_accessibility.tsv.gz' with columns chromosome, start position of the bin, minimum coverage and median coverage in the current bin, minimum coverage in 1000 bp regions before and after the current bin, maximum coverage in 1000 bp regions before and after the current bin, the number of steps in the coverage profile, and the number of monotonically increasing and decreasing steps in the coverage profile of the current bin. This output file together with a protocol of the tool run is saved to the specified output directory.<br />
<br />
''Chromatin accessibility'' may be called with<br />
<br />
java -jar Catchitt.jar access<br />
<br />
and has the following parameters<br />
<br />
<br />
<table border=0 cellpadding=10 align="center"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>Data source (The format of the input file containing the coverage information, range={BAM/SAM, Bigwig}, default = BAM/SAM)<table border=0 cellpadding=10 align="center"><br />
<tr><td colspan=3>Parameters for selection &quot;BAM/SAM&quot;:</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>Input SAM/BAM (The input file containing the mapped DNase-seq/ATAC-seq reads)</td><br />
<td>FILE</td><br />
</tr><br />
<tr><td colspan=3>Parameters for selection &quot;Bigwig&quot;:</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>Input Bigwig (The input file containing the mapped DNase-seq/ATAC-seq reads)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>FastA index (The genome index)</td><br />
<td>FILE</td><br />
</tr><br />
</table></td><td></td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bin width (The width of the genomic bins considered)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar Catchitt.jar access d="Bigwig" i=fold_enrich.bw f=hg19.fa.fai b=50 outdir=dnase<br />
<br />
<br />
=== Methylation levels ===<br />
''Methylation levels'' may be called with<br />
<br />
java -jar Catchitt.jar methyl<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>Input Bed.gz (The bedMethyl file (gzipped) containing the methylation levels, mime = bed.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>FastA index (The genome index, mime = fai)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bin width (The width of the genomic bins considered)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar Catchitt.jar methyl i=Input_Bed.gz f=hg19.fa.fai b=50<br />
<br />
<br />
=== Motif scores ===<br />
<br />
''Motif scores'' computes features based on motif scores of a given motif model scanning sub-sequences along the genome. Motif scores are aggregated in bins of the specified width as maximum score and log of the average exponential score (i.e., average log-likelihood in case of statistical models). The motif model may be provided as PWMs in HOCOMOCO or PFMs in Jaspar format, or as [[Dimont]] motif models in XML format. For more complex motif models like [[Slim]] models, the current implementation uses several indexes to speed up the scanning process. However, computing these indexes is rather memory-consuming and often not worthwhile for simple PWM models. Hence, a low-memory variant of the tool is available, which is typically only slightly slower for PWM models but substantially slower for Slim models. Output is provided as a gzipped file 'Motif_scores.tsv.gz' containing columns chromosome, start position, maximum and average score. This output file together with a protocol of the tool run is saved to the specified output directory.<br />
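<br />
As an illustration of the two aggregates named above, the following Python sketch computes the maximum score and the log of the average exponential score for the scores of one bin in a numerically stable way. The function name is hypothetical, and whether Catchitt normalises by the number of scores in exactly this way is an assumption; dropping the final subtraction yields the log of the summed exponentials instead.<br />
<br />
```python
import math
from typing import Sequence, Tuple


def aggregate_bin(scores: Sequence[float]) -> Tuple[float, float]:
    """Return (maximum score, log of the average exponential score)."""
    if not scores:
        raise ValueError("a bin needs at least one score")
    top = max(scores)
    # Log-sum-exp trick: factor out the maximum so that exp() cannot
    # underflow to zero for strongly negative log-scores.
    log_sum = top + math.log(sum(math.exp(s - top) for s in scores))
    # Subtracting log(n) turns the log of the sum into the log of the mean.
    return top, log_sum - math.log(len(scores))
```
<br />
For example, aggregate_bin([math.log(1.0), math.log(3.0)]) returns the pair (log 3, log 2), since the average of 1 and 3 is 2.<br />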
<br />
<br />
''Motif scores'' may be called with<br />
<br />
java -jar Catchitt.jar motif<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Motif model (The motif model in Dimont, HOCOMOCO, or Jaspar format, range={Dimont, HOCOMOCO, Jaspar}, default = Dimont)<table border=0 cellpadding=10 align="center"><br />
<tr><td colspan=3>Parameters for selection &quot;Dimont&quot;:</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>Dimont motif (Dimont motif model description)</td><br />
<td>FILE</td><br />
</tr><br />
<tr><td colspan=3>Parameters for selection &quot;HOCOMOCO&quot;:</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">h</font></td><br />
<td>HOCOMOCO PWM (PWM from the HOCOMOCO database)</td><br />
<td>FILE</td><br />
</tr><br />
<tr><td colspan=3>Parameters for selection &quot;Jaspar&quot;:</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">j</font></td><br />
<td>Jaspar PFM (PFM in Jaspar format)</td><br />
<td>FILE</td><br />
</tr><br />
</table></td><td></td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>Genome (Genome as FastA file)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>FAI of genome (FastA index file of the genome)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bin width (The width of the genomic bins considered)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">l</font></td><br />
<td>Low-memory mode (Use slower mode with a smaller memory footprint, default = true)</td><br />
<td>BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar Catchitt.jar motif m=HOCOMOCO h=motif.pwm g=hg19.fa f=hg19.fa.fai b=50 outdir=motifs<br />
<br />
=== Iterative Training ===<br />
<br />
''Iterative Training'' performs an iterative training with the specified number of iterations to obtain a series of classifiers that may be used for predictions in the same cell type or in other cell types based on a corresponding set of feature files. The tool requires as input labels for the training chromosomes, a chromatin accessibility feature file and a set of motif feature files. From the labels, an initial set of training regions is extracted containing all positive examples labeled as 'S' (summit) and a sub-sample of negative examples of regions labeled as 'U' (unbound). During the iterations, the initial negative examples are complemented with additional negatives that obtain large binding probabilities, i.e., putative false positive predictions. As these additional negative examples are derived from predictions of the current set of classifiers, the number of bins used for aggregation needs to be specified and should be identical to those used for predictions later. Training chromosomes and chromosomes used for predictions in the iterative training may be specified, as well as the percentile of the scores of positives (i.e., summit or bound regions) that should be used to identify putative false positives. The specified bin width must be identical to the bin width specified when computing the corresponding feature files. Feature vectors for training regions may span several adjacent bins as specified by the number of bins parameter. Output is an XML file Classifiers.xml containing the set of trained classifiers. This output file together with a protocol of the tool run is saved to the specified output directory.<br />
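<br />
The selection of additional negatives can be sketched in Python as follows. This is only an illustration of the idea described above, not Catchitt's code: the nearest-rank percentile definition, the use of "greater than or equal" for the threshold comparison, and both function names are assumptions.<br />
<br />
```python
import math
from typing import List, Sequence


def percentile_threshold(positive_scores: Sequence[float], p: float) -> float:
    """Nearest-rank p-percentile (0 <= p <= 1) of the scores of positives."""
    if not positive_scores:
        raise ValueError("need at least one positive score")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    ordered = sorted(positive_scores)
    rank = max(1, math.ceil(p * len(ordered)))  # 1-based nearest rank
    return ordered[rank - 1]


def putative_false_positives(unbound_scores: Sequence[float],
                             threshold: float) -> List[int]:
    """Indices of unbound regions scoring at least as high as the threshold."""
    return [i for i, s in enumerate(unbound_scores) if s >= threshold]
```
<br />
With the default percentile of 0.01, the threshold is a score that almost all positives exceed, so unbound regions at or above it are natural candidates for additional negative examples in the next iteration.<br />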
<br />
''Iterative Training'' may be called with<br />
<br />
java -jar Catchitt.jar itrain<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>Accessibility (File containing accessibility features)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Motif (File containing motif features), MAY BE USED MULTIPLE TIMES</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">l</font></td><br />
<td>Labels (File containing the labels)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>FAI of genome (FastA index file of the genome)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bin width (The width of the genomic bins, valid range = [1, 1000], default = 50)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Number of bins (The number of adjacent bins, valid range = [1, 20], default = 5)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">abb</font></td><br />
<td>Aggregation: bins before (The number of bins before the current one considered in the aggregation, valid range = [1, 20], default = 1)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">aba</font></td><br />
<td>Aggregation: bins after (The number of bins after the current one considered in the aggregation, valid range = [1, 20], default = 4)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>Iterations (The number of iterations of the iterative training, valid range = [1, 20], default = 5)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>Training chromosomes (Training chromosomes, separated by commas, OPTIONAL)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">itc</font></td><br />
<td>Iterative training chromosomes (Chromosomes with predictions in iterative training, separated by commas, OPTIONAL)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Percentile (Percentile of the prediction scores of positives used as threshold in iterative training, valid range = [0.0, 1.0], default = 0.01)</td><br />
<td>DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar Catchitt.jar itrain a=dnase/Chromatin_accessibility.tsv.gz m=motif1/Motif_scores.tsv.gz m=motif2/Motif_scores.tsv.gz l=labels/Labels.tsv.gz f=hg19.fa.fai b=50 n=5 abb=1 aba=4 i=5 t="chr1,chr2,chr3" itc="chr1,chr2" p=0.01 outdir=cls<br />
<br />
=== Prediction ===<br />
<br />
''Prediction'' predicts binding probabilities of genomic regions as specified during training of the set of classifiers in iterative training. As input, Prediction requires a set of trained classifiers in XML format and the same (type of) feature files as used in training (motif files must be specified in the same order!). In addition, the chromosomes for which predictions are made may be specified, and the number of bins used for aggregation may be specified to deviate from those used during training. If these bin numbers are not specified, those from the training run are used. Finally, it is possible to restrict the number of classifiers considered to the first n ones. Output is provided as a gzipped file 'Predictions.tsv.gz' with columns chromosome, start position, binding probability. This output file together with a protocol of the tool run is saved to the specified output directory.<br />
<br />
''Prediction'' may be called with<br />
<br />
java -jar Catchitt.jar predict<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>Classifiers (The classifiers trained by iterative training)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>Accessibility (File containing accessibility features)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Motif (File containing motif features) MAY BE USED MULTIPLE TIMES</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>FAI of genome (FastA index file of the genome)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Prediction chromosomes (Prediction chromosomes, separated by commas, OPTIONAL)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">abb</font></td><br />
<td>Aggregation: bins before (Number of bins before the current one considered for aggregation., OPTIONAL)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">aba</font></td><br />
<td>Aggregation: bins after (Number of bins after the current one considered for aggregation., OPTIONAL)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Number of classifiers (Use only the first k classifiers for predictions., OPTIONAL)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar Catchitt.jar predict c=cls/Classifiers.xml a=dnase/Chromatin_accessibility.tsv.gz m=motif1/Motif_scores.tsv.gz m=motif2/Motif_scores.tsv.gz f=hg19.fa.fai p="chr8,chr21" abb=1 aba=4 n=3 outdir=predict<br />
<br />
== Standard pipeline ==<br />
<br />
The standard Catchitt pipeline comprises the following steps:<br />
<br />
* for a training cell type, collect ChIP-seq peak files (preferably ''conservative'' and ''relaxed'' peaks) in narrowPeak format and derive labels for genomic regions (''Derive labels'')<br />
* for the same cell type, collect chromatin accessibility data (DNase-seq or ATAC-seq) as fold-enrichment tracks or mapping files, and derive chromatin accessibility features from those data (''Chromatin accessibility'')<br />
* collect or learn (e.g., using [[Dimont]]) a set of motif models for the transcription factor of interest, and scan the genome using these motif models (''Motif scores'')<br />
* perform iterative training given the labels and feature files (''Iterative Training'')<br />
* predict binding probabilities of genomic regions in the same cell type or in other cell types. In the latter case, additional chromatin accessibility data for these target cell types need to be collected and features need to be derived as in the second step (''Prediction'')<br />
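<br />
The five steps above can be sketched as a small Python helper that assembles the corresponding command lines from the tool names and parameters documented in the Tools section. The helper names, the output directory names and the choice of a HOCOMOCO motif with a Bigwig track are assumptions made for illustration; nothing is executed by the sketch itself.<br />
<br />
```python
from typing import List, Sequence, Tuple

Param = Tuple[str, str]  # (parameter name, value); "m" may occur repeatedly


def catchitt_command(tool: str, params: Sequence[Param],
                     jar: str = "Catchitt.jar") -> List[str]:
    """Build `java -jar <jar> <toolname> key=value ...` as an argument list."""
    return ["java", "-jar", jar, tool] + [f"{k}={v}" for k, v in params]


def standard_pipeline(genome: str, fai: str, conservative: str, relaxed: str,
                      bigwig: str, pwm: str,
                      bin_width: int = 50) -> List[List[str]]:
    """Command lines for labels, access, motif, itrain and predict."""
    b = ("b", str(bin_width))
    access = "access/Chromatin_accessibility.tsv.gz"
    motif = "motif/Motif_scores.tsv.gz"
    return [
        catchitt_command("labels", [("c", conservative), ("r", relaxed),
                                    ("f", fai), b, ("rw", "200"),
                                    ("outdir", "labels")]),
        catchitt_command("access", [("d", "Bigwig"), ("i", bigwig),
                                    ("f", fai), b, ("outdir", "access")]),
        catchitt_command("motif", [("m", "HOCOMOCO"), ("h", pwm),
                                   ("g", genome), ("f", fai), b,
                                   ("outdir", "motif")]),
        catchitt_command("itrain", [("a", access), ("m", motif),
                                    ("l", "labels/Labels.tsv.gz"),
                                    ("f", fai), b, ("outdir", "itrain")]),
        catchitt_command("predict", [("c", "itrain/Classifiers.xml"),
                                     ("a", access), ("m", motif),
                                     ("f", fai), ("outdir", "predict")]),
    ]
```
<br />
Each returned list could be passed to subprocess.run(cmd, check=True) in turn. Parameters are kept as a list of pairs rather than a dictionary because the motif parameter "m" may be given several times.<br />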
<br />
<br />
== Tutorial using ENCODE data ==<br />
<br />
We describe a typical Catchitt pipeline using public ENCODE data for the transcription factor CTCF in two cell lines.<br />
This tutorial uses real-world data on the whole human genome (the GRCh38 version used by ENCODE), illustrating different DNase-seq input formats and different motif sources. Please note that this realistic scenario also entails real-world runtimes of the individual Catchitt steps.<br />
<br />
For best performance, we would further recommend<br />
* to use multiple motifs from different sources, including motifs derived from DNase-seq (available in our [http://www.jstacs.de/downloads/motifs.tgz motif collection] of the ENCODE-DREAM challenge in directory de-novo/DNase-peaks)<br />
* to use replicate information for DNase data, for instance using the [https://github.com/kundajelab/atac_dnase_pipelines pipeline of the Kundaje lab]<br />
<br />
In this tutorial, we concentrate on the Catchitt pipeline and illustrate its usage based on readily available data.<br />
<br />
=== Obtaining training and test data ===<br />
<br />
First, we need the GRCh38 genome version used by ENCODE. This genome is available as a gzipped FastA file from [https://www.encodeproject.org ENCODE] at<br />
https://www.encodeproject.org/files/GRCh38_no_alt_analysis_set_GCA_000001405.15/@@download/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.gz<br />
<br />
After download, the genome needs to be gunzipped and indexed using the [http://www.htslib.org samtools] faidx command:<br />
<br />
gunzip GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.gz<br />
samtools faidx GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta<br />
<br />
In the following, we assume that genome FastA and index are in the base directory.<br />
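<br />
Since all Catchitt tools take the FastA index as input, it can be handy to inspect it beforehand. The following Python sketch reads chromosome lengths from the text of a .fai file (first column: sequence name, second column: length) and estimates the number of bins per chromosome for a given bin width. That a partial bin at the chromosome end is counted is an assumption of this sketch, not a documented property of Catchitt, and the function names are hypothetical.<br />
<br />
```python
from typing import Dict


def read_fai_lengths(fai_text: str) -> Dict[str, int]:
    """Map sequence names to lengths from the content of a .fai file."""
    lengths = {}
    for line in fai_text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        lengths[fields[0]] = int(fields[1])
    return lengths


def bins_per_chromosome(lengths: Dict[str, int],
                        bin_width: int = 50) -> Dict[str, int]:
    """Estimated number of bins, counting a partial bin at the end."""
    if bin_width < 1:
        raise ValueError("bin width must be positive")
    # -(-a // b) is integer ceiling division.
    return {name: -(-length // bin_width) for name, length in lengths.items()}
```
<br />
For a file on disk, the text can be obtained with open(path).read() before calling read_fai_lengths.<br />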
<br />
In addition, we need the DNase-seq data. We consider two cell lines ("astrocyte of the spinal cord" and "fibroblast of villous mesenchyme"). The corresponding DNase-seq data are available from [https://www.encodeproject.org ENCODE] under accessions ENCSR000ENB and ENCSR000EOR, respectively.<br />
Here, we first consider the Bigwig files of the first replicate for each cell line, which can be downloaded from the following URLs:<br />
<br />
https://www.encodeproject.org/files/ENCFF901UBX/@@download/ENCFF901UBX.bigWig<br />
https://www.encodeproject.org/files/ENCFF652HJH/@@download/ENCFF652HJH.bigWig<br />
<br />
To obtain labels for CTCF binding, we further need ChIP-seq peaks. Here, we consider the ChIP-seq experiment with accession ENCSR000DSU for the astrocytes, which will become our training data in the following.<br />
The corresponding "conservative" and "relaxed" peak files for astrocytes are available from:<br />
https://www.encodeproject.org/files/ENCFF183YLB/@@download/ENCFF183YLB.bed.gz<br />
https://www.encodeproject.org/files/ENCFF600CYD/@@download/ENCFF600CYD.bed.gz<br />
<br />
Again, the peak files need to be gunzipped for the following steps.<br />
<br />
Finally, we need a motif model for CTCF, which we download from [http://hocomoco11.autosome.ru HOCOMOCO] in this case:<br />
http://hocomoco11.autosome.ru/final_bundle/hocomoco11/full/HUMAN/mono/pwm/CTCF_HUMAN.H11MO.0.A.pwm<br />
<br />
We organize all these files (and the Catchitt JAR) in the following directory structure<br />
<br />
.:<br />
Catchitt.jar<br />
GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta<br />
GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai<br />
<br />
./astrocytes:<br />
ENCFF183YLB.bed<br />
ENCFF600CYD.bed<br />
ENCFF901UBX.bigWig<br />
<br />
./fibroblasts:<br />
ENCFF652HJH.bigWig<br />
<br />
./motifs/CTCF/:<br />
CTCF_HUMAN.H11MO.0.A.pwm<br />
<br />
=== Deriving labels ===<br />
<br />
As we use supervised training of model parameters, we need labels for the genomic regions, qualifying these as bound (B) or unbound (U). Besides, we have additional labels for bound regions at the peak summit (S) and ambiguous regions (A) that are (partly) covered by relaxed but not by conservative peaks.<br />
<br />
For training purposes, we need to derive labels from the astrocyte ChIP-seq peaks by calling:<br />
java -jar Catchitt.jar labels c=astrocytes/ENCFF183YLB.bed \<br />
r=astrocytes/ENCFF600CYD.bed \<br />
f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai \<br />
b=50 rw=200 outdir=astrocytes/labels<br />
Here, we use a bin width of 50 bp (i.e., we resolve any feature or binding event with 50 bp resolution) and a region width of 200 bp as used in ENCODE-DREAM. A detailed description of the partitioning of the genome into non-overlapping bins and the logic behind the regions for which predictions are made can be found in the [https://doi.org/10.1186/s13059-018-1614-y Catchitt paper].<br />
The result is a file astrocytes/labels/Labels.tsv.gz with the following format<br />
chr1 0 U<br />
chr1 50 U<br />
chr1 100 U<br />
chr1 150 U<br />
chr1 200 U<br />
chr1 250 U<br />
where the columns contain chromosome, bin starting position, and corresponding label, and are separated by tabs.<br />
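<br />
To get a quick impression of the class balance, the label file can be summarised with a few lines of Python. This is a minimal sketch that assumes the three tab-separated columns described above and no header line; the function names are hypothetical.<br />
<br />
```python
import gzip
from collections import Counter
from typing import Iterable


def count_labels(lines: Iterable[str]) -> Counter:
    """Count how often each label (S, B, A, U) occurs."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        _chromosome, _start, label = line.split("\t")
        counts[label] += 1
    return counts


def count_labels_in_file(path: str) -> Counter:
    """Same as count_labels, reading a gzipped file such as Labels.tsv.gz."""
    with gzip.open(path, "rt") as handle:
        return count_labels(handle)
```
<br />
Calling count_labels_in_file("astrocytes/labels/Labels.tsv.gz") would then report how many bins carry each label; typically, unbound bins are expected to dominate by far.<br />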
<br />
=== Preparing DNase data from bigwig format ===<br />
<br />
We further derive DNase-seq features from the bigwig file that we downloaded in the first step. Again, we specify a bin width of 50 bp.<br />
<br />
java -jar Catchitt.jar access d="Bigwig" i=astrocytes/ENCFF901UBX.bigWig f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai \<br />
b=50 outdir=astrocytes/access<br />
The result is a file astrocytes/access/Chromatin_accessibility.tsv.gz with the following format<br />
<br />
chr1 1033400 0.03954650089144707 0.05627769976854324 0.009126120246946812 0.030420400202274323 0.06692489981651306 1.03125 3.0 1.0 0.0<br />
chr1 1033450 0.030420400202274323 0.03650449961423874 0.009126120246946812 0.030420400202274323 0.045630600303411484 1.03125 2.0 0.0 0.0<br />
chr1 1033500 0.024336300790309906 0.03346240147948265 0.009126120246946812 0.030420400202274323 0.045630600303411484 1.03125 2.0 1.0 0.0<br />
chr1 1033550 0.01825219951570034 0.024336300790309906 0.009126120246946812 0.024336300790309906 0.060840800404548645 1.03125 2.0 0.0 1.0<br />
<br />
where the first two columns, again, correspond to chromosome and starting position, and the remaining columns are<br />
* minimum DNase value in bin,<br />
* median DNase value in bin,<br />
* minimum in 1000 bp after bin start,<br />
* minimum in 1000 bp before bin start,<br />
* maximum in 1000 bp after bin start,<br />
* maximum in 1000 bp before bin start,<br />
* the number of steps in the bin profile,<br />
* the length of the longest monotonically increasing range in the bin,<br />
* the length of the longest monotonically decreasing range in the bin.<br />
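<br />
For downstream inspection, one line of this file can be turned into a record with named fields. The following Python sketch assumes tab-separated values without a header line; the column names are hypothetical labels chosen to mirror the list above and are not names used by Catchitt.<br />
<br />
```python
from typing import Dict, Union

# Hypothetical column names, in the order of the list above.
ACCESS_COLUMNS = (
    "chromosome", "start",
    "min_in_bin", "median_in_bin",
    "min_after", "min_before",
    "max_after", "max_before",
    "steps", "longest_increasing", "longest_decreasing",
)


def parse_access_line(line: str) -> Dict[str, Union[str, int, float]]:
    """Parse one line of Chromatin_accessibility.tsv.gz into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(ACCESS_COLUMNS):
        raise ValueError(f"expected {len(ACCESS_COLUMNS)} columns, got {len(fields)}")
    row: Dict[str, Union[str, int, float]] = {
        "chromosome": fields[0],
        "start": int(fields[1]),
    }
    # All remaining columns are numeric features.
    for name, value in zip(ACCESS_COLUMNS[2:], fields[2:]):
        row[name] = float(value)
    return row
```
<br />
Lines can be read from the gzipped file with gzip.open(path, "rt") and passed to parse_access_line one at a time.<br />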
<br />
=== Preparing motif scores ===<br />
<br />
We also compute motif scores along the genome for the PWM we downloaded from HOCOMOCO:<br />
<br />
java -jar Catchitt.jar motif m="HOCOMOCO" h=motifs/CTCF/CTCF_HUMAN.H11MO.0.A.pwm g=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta \<br />
f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai b=50 outdir=motifs/CTCF threads=3<br />
The result is a file motifs/CTCF/Motif_scores.tsv.gz with the following format<br />
<br />
chr1 46950 -4.996643 -4.9543528358429105<br />
chr1 47000 -5.984124 -5.451674735652041<br />
chr1 47050 -0.8633305 -0.4596223585537509<br />
chr1 47100 -4.9379983 -4.813470561120627<br />
<br />
where the first two columns, again, correspond to chromosome and starting position, and the remaining two columns are<br />
* the maximum motif score within the bin,<br />
* the logarithm of the aggregated exponentials of the individual scores within the bin; for scores that are log-likelihoods, this is proportional to the log-likelihood of the complete sequence.<br />
<br />
=== Iterative training ===<br />
<br />
With all the feature files prepared, we may now run the iterative training procedure. Here, we use all autosomes except chr1, chr8, and chr21 for training, use five of those chromosomes also for generating new negative examples in each of the iterations, and use 8 computation threads for the numeric optimization of model parameters.<br />
''At this stage, it is critical that all feature files have been generated from the same reference. This way, we may sweep in parallel over all feature files that, at each line, represent the identical genomic location. Otherwise, the iterative training will throw an error stating that the chromosomes do not match at a certain line of the input files.''<br />
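<br />
Whether a set of feature and label files is aligned line by line can be checked before training. The following Python sketch compares the first two tab-separated columns (chromosome and start position) across several line iterators and reports the first line at which they disagree. It is a hypothetical helper, not part of Catchitt, and assumes that none of the files has a header line.<br />
<br />
```python
from itertools import zip_longest
from typing import Iterable, Optional


def first_mismatch(*files: Iterable[str]) -> Optional[int]:
    """Return the 1-based number of the first inconsistent line, or None."""
    for number, lines in enumerate(zip_longest(*files), start=1):
        # zip_longest pads with None once a file has ended early.
        if any(line is None for line in lines):
            return number
        keys = {tuple(line.rstrip("\n").split("\t")[:2]) for line in lines}
        if len(keys) > 1:
            return number
    return None
```
<br />
The arguments may be open file handles, for instance obtained with gzip.open(path, "rt") for each of the .tsv.gz files, so that the files are swept in parallel without loading them into memory.<br />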
<br />
We start iterative training by calling<br />
java -jar Catchitt.jar itrain a=astrocytes/access/Chromatin_accessibility.tsv.gz m=motifs/CTCF/Motif_scores.tsv.gz\<br />
l=astrocytes/labels/Labels.tsv.gz f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\<br />
b=50 t='chr2,chr3,chr4,chr5,chr6,chr7,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr22'\<br />
itc='chr10,chr11,chr12,chr13,chr14' outdir=astrocytes/itrain threads=8<br />
which results in a file astrocytes/itrain/Classifiers.xml containing the trained classifiers.<br />
<br />
=== Predicting binding in new cell types ===<br />
Using the trained classifier from the previous step and the DNase data for fibroblasts prepared before, we may now predict binding in the fibroblast cell type. In the example, we generate predictions only for chromosome 8, which could be extended to other chromosomes using parameter "p":<br />
java -jar Catchitt.jar predict c=astrocytes/itrain/Classifiers.xml a=fibroblasts/access/Chromatin_accessibility.tsv.gz\<br />
m=motifs/CTCF/Motif_scores.tsv.gz f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\<br />
p="chr8" outdir=fibroblasts/predict<br />
This finally results in a file fibroblasts/predict/Predictions.tsv.gz containing the predicted binding probabilities per region.<br />
This file has three columns, corresponding to chromosome, starting position, and binding probability:<br />
<br />
chr8 265850 0.9866555574053496<br />
chr8 265900 0.9865107771922306<br />
chr8 265950 0.9864837006927715<br />
chr8 266000 0.8041139249973046<br />
chr8 266050 0.19870629729482686<br />
chr8 266100 0.1302269536110939<br />
chr8 266150 0.09693322015563202<br />
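<br />
For downstream analyses, per-bin probabilities are often turned into contiguous regions. The following is a minimal sketch of one way to do this and is not part of Catchitt: the function call_regions is hypothetical, the threshold of 0.5 is an arbitrary assumption, and we assume that each line refers to a bin of 50 bp beginning at the given start position.<br />

```python
def call_regions(predictions, threshold=0.5, bin_width=50):
    """Merge consecutive bins with probability >= threshold into regions.

    predictions: iterable of (chrom, start, probability), sorted by position.
    Returns a list of (chrom, region_start, region_end, max_probability),
    where region_end is exclusive.
    """
    regions = []
    current = None  # [chrom, start, end, max_probability] of the open region
    for chrom, start, prob in predictions:
        if prob >= threshold:
            if current and current[0] == chrom and current[2] == start:
                # Directly adjacent bin: extend the open region.
                current[2] = start + bin_width
                current[3] = max(current[3], prob)
            else:
                if current:
                    regions.append(tuple(current))
                current = [chrom, start, start + bin_width, prob]
        elif current:
            regions.append(tuple(current))
            current = None
    if current:
        regions.append(tuple(current))
    return regions


# The example rows shown above.
rows = [
    ("chr8", 265850, 0.9866555574053496),
    ("chr8", 265900, 0.9865107771922306),
    ("chr8", 265950, 0.9864837006927715),
    ("chr8", 266000, 0.8041139249973046),
    ("chr8", 266050, 0.19870629729482686),
    ("chr8", 266100, 0.1302269536110939),
    ("chr8", 266150, 0.09693322015563202),
]
print(call_regions(rows))
```

With these settings, the first four example bins are merged into a single region from 265850 to 266050.<br />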
<br />
<br />
=== Using DNase-seq BAM files and multiple motifs ===<br />
<br />
Instead of bigwig files, the "access" tool of Catchitt also accepts BAM files of mapped DNase-seq (or ATAC-seq) data. Internally, this tool counts 5' ends of reads, and performs local normalization of read depth and average smoothing.<br />
Here, we download the BAM files corresponding to the previous bigwig files from ENCODE<br />
https://www.encodeproject.org/files/ENCFF384CCQ/@@download/ENCFF384CCQ.bam<br />
https://www.encodeproject.org/files/ENCFF368XNE/@@download/ENCFF368XNE.bam<br />
<br />
and sort them into the directory structure.<br />
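<br />
The counting of 5' ends and the average smoothing mentioned above can be illustrated as follows. This is a simplified sketch under our own assumptions and not the code used by "access": reads are given as 0-based, half-open intervals with a strand, the window size of the moving average is arbitrary, and the local normalization of read depth is omitted.<br />

```python
from collections import Counter


def five_prime_counts(reads, bin_width=50):
    """Count 5' ends of reads per bin.

    reads: iterable of (start, end, strand) with 0-based, half-open coordinates.
    For reads on the reverse strand, the 5' end is the last aligned base.
    """
    counts = Counter()
    for start, end, strand in reads:
        five_prime = start if strand == "+" else end - 1
        counts[five_prime // bin_width * bin_width] += 1
    return counts


def smooth(values, window=3):
    """Centred moving average; windows are truncated at the borders."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Four toy reads of length 36.
reads = [(10, 46, "+"), (30, 66, "-"), (100, 136, "+"), (80, 116, "-")]
counts = five_prime_counts(reads)
profile = [counts[b] for b in (0, 50, 100)]
print(profile, smooth(profile))
```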
<br />
In addition, we use four motifs from the ''used-for-all-TFs'' directory of our [http://www.jstacs.de/downloads/motifs.tgz motif collection].<br />
<br />
Afterwards, the directory structure should look like<br />
<br />
.:<br />
Catchitt.jar<br />
GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta<br />
GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai<br />
<br />
./astrocytes:<br />
ENCFF183YLB.bed<br />
ENCFF600CYD.bed<br />
ENCFF901UBX.bigWig<br />
ENCFF384CCQ.bam<br />
<br />
./fibroblasts:<br />
ENCFF652HJH.bigWig<br />
ENCFF368XNE.bam<br />
<br />
./motifs/CTCF/:<br />
CTCF_HUMAN.H11MO.0.A.pwm<br />
<br />
./motifs/CTCF_Slim:<br />
Ctcf_H1hesc_shift20_bdeu_order-20_comp1-model-1.xml<br />
<br />
./motifs/JUND_Slim:<br />
Jund_K562_shift20_bdeu_order-20_comp1-model-1.xml<br />
<br />
./motifs/MAX_Slim:<br />
Max_K562_shift20_bdeu_order-20_comp1-model-1.xml<br />
<br />
./motifs/SP1:<br />
ENCSR000BHK_SP1-human_1_hg19-model-2.xml<br />
<br />
<br />
Now, we first compute the DNase-seq features from the BAM files using the "access" tool:<br />
<br />
java -jar Catchitt.jar access i=astrocytes/ENCFF384CCQ.bam b=50 outdir=astrocytes/access_bam/<br />
java -jar Catchitt.jar access i=fibroblasts/ENCFF368XNE.bam b=50 outdir=fibroblasts/access_bam/<br />
<br />
We also compute the motif-based features from the additional motif files. For the PWM model of SP1, we switch the input format to Dimont XMLs but still use the low-memory version of "motif" that we also used for the HOCOMOCO PWM:<br />
<br />
java -jar Catchitt.jar motif d=motifs/SP1/ENCSR000BHK_SP1-human_1_hg19-model-2.xml\<br />
g=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\<br />
b=50 outdir=motifs/SP1 threads=3<br />
<br />
The remaining motif models are [[Slim]] models, which are substantially more complex than PWMs. While scans for these models could be accomplished by the low-memory version of "motif" as well, this would require substantial runtime. Hence, we switch off the low-memory option in this case, which, in turn, requires increasing the memory reserved by Java:<br />
<br />
java -jar -Xms512M -Xmx64G Catchitt.jar motif d=motifs/CTCF_Slim/Ctcf_H1hesc_shift20_bdeu_order-20_comp1-model-1.xml\<br />
g=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\<br />
b=50 outdir=motifs/CTCF_Slim l=false threads=3<br />
java -jar -Xms512M -Xmx64G Catchitt.jar motif d=motifs/JUND_Slim/Jund_K562_shift20_bdeu_order-20_comp1-model-1.xml\<br />
g=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\<br />
b=50 outdir=motifs/JUND_Slim l=false threads=3<br />
java -jar -Xms512M -Xmx64G Catchitt.jar motif d=motifs/MAX_Slim/Max_K562_shift20_bdeu_order-20_comp1-model-1.xml\<br />
g=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\<br />
b=50 outdir=motifs/MAX_Slim l=false threads=3<br />
<br />
Finally, we start the iterative training using the new feature files:<br />
java -jar Catchitt.jar itrain a=astrocytes/access_bam/Chromatin_accessibility.tsv.gz\<br />
m=motifs/CTCF/Motif_scores.tsv.gz m=motifs/CTCF_Slim/Motif_scores.tsv.gz m=motifs/JUND_Slim/Motif_scores.tsv.gz\<br />
m=motifs/MAX_Slim/Motif_scores.tsv.gz m=motifs/SP1/Motif_scores.tsv.gz l=astrocytes/labels/Labels.tsv.gz\<br />
f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai b=50\<br />
t='chr2,chr3,chr4,chr5,chr6,chr7,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr22'\<br />
itc='chr10,chr11,chr12,chr13,chr14' outdir=astrocytes/itrain_bam_5motifs threads=8<br />
Please note that we used the parameter "m" multiple times to specify the different motif-based feature files.<br />
<br />
It is important to specify these motifs in the same order when calling "predict" afterwards, i.e.<br />
java -jar Catchitt.jar predict c=astrocytes/itrain_bam_5motifs/Classifiers.xml a=fibroblasts/access_bam/Chromatin_accessibility.tsv.gz\<br />
m=motifs/CTCF/Motif_scores.tsv.gz m=motifs/CTCF_Slim/Motif_scores.tsv.gz m=motifs/JUND_Slim/Motif_scores.tsv.gz\<br />
m=motifs/MAX_Slim/Motif_scores.tsv.gz m=motifs/SP1/Motif_scores.tsv.gz\<br />
f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai p="chr8" outdir=fibroblasts/predict_bam_5motifs<br />
<br />
The predictions based on the BAM files and the five motifs are then available from the file fibroblasts/predict_bam_5motifs/Predictions.tsv.gz in the format explained previously.<br />
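<br />
One way to make sure that "itrain" and "predict" receive the motif files in the same order is to generate both command lines from a single list. The following is a minimal sketch: the helper catchitt_command is hypothetical, while all parameter names and file paths are taken from the commands above. The script only prints the commands; they could be executed, for instance, with subprocess.run(cmd, check=True).<br />

```python
import shlex

# Single source of truth for the order of the motif feature files.
MOTIF_FILES = [
    "motifs/CTCF/Motif_scores.tsv.gz",
    "motifs/CTCF_Slim/Motif_scores.tsv.gz",
    "motifs/JUND_Slim/Motif_scores.tsv.gz",
    "motifs/MAX_Slim/Motif_scores.tsv.gz",
    "motifs/SP1/Motif_scores.tsv.gz",
]
FAI = "GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai"
TRAIN = ("chr2,chr3,chr4,chr5,chr6,chr7,chr9,chr10,chr11,chr12,chr13,"
         "chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr22")


def catchitt_command(tool, leading, motifs, trailing):
    """Assemble a Catchitt call with one m=... argument per motif file."""
    motif_args = [f"m={path}" for path in motifs]
    return ["java", "-jar", "Catchitt.jar", tool, *leading, *motif_args, *trailing]


itrain = catchitt_command(
    "itrain",
    ["a=astrocytes/access_bam/Chromatin_accessibility.tsv.gz"],
    MOTIF_FILES,
    ["l=astrocytes/labels/Labels.tsv.gz", f"f={FAI}", "b=50", f"t={TRAIN}",
     "itc=chr10,chr11,chr12,chr13,chr14",
     "outdir=astrocytes/itrain_bam_5motifs", "threads=8"],
)
predict = catchitt_command(
    "predict",
    ["c=astrocytes/itrain_bam_5motifs/Classifiers.xml",
     "a=fibroblasts/access_bam/Chromatin_accessibility.tsv.gz"],
    MOTIF_FILES,
    [f"f={FAI}", "p=chr8", "outdir=fibroblasts/predict_bam_5motifs"],
)

for cmd in (itrain, predict):
    print(" ".join(shlex.quote(arg) for arg in cmd))
```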
<br />
== Version history ==<br />
<br />
* Catchitt v0.1.1: Bugfixes for border cases; reduced debugging output<br />
<br />
* Catchitt v0.1: [http://www.jstacs.de/downloads/Catchitt_0.1.jar Initial release]</div>Grauhttps://www.jstacs.de/index.php?title=Catchitt&diff=1113Catchitt2020-10-05T11:53:35Z<p>Grau: /* Usage */</p>
<hr />
<div>Catchitt is a collection of tools for predicting cell type-specific binding regions of transcription factors (TFs) based on binding motifs and chromatin accessibility assays.<br />
The initial implementation of this methodology has been one of the winning approaches of the ENCODE-DREAM challenge ([https://www.synapse.org/#!Synapse:syn6131484/wiki/402026]) and is described in a preprint (https://www.biorxiv.org/content/early/2017/12/06/230011 doi: 10.1101/230011) and a recent [https://doi.org/10.1186/s13059-018-1614-y paper].<br />
The implementation in Catchitt has been streamlined and slightly simplified to make its application more straightforward. Specifically, we reduced the set of chromatin accessibility features to the most important ones, we simplified the sampling strategy of initial negative examples in the training step, and we omitted quantile normalization of chromatin accessibility features.<br />
<br />
== Catchitt tools ==<br />
<br />
Catchitt comprises five tools for the individual steps of the pipeline (see below). The tool "labels" computes labels for genomic regions from "conservative" (i.e., IDR-thresholded) and "relaxed" ChIP-seq peaks.<br />
The tool "access" computes chromatin accessibility features from DNase-seq or ATAC-seq data, either based on fold-enrichment tracks in Bigwig format (e.g., MACS output) or based on SAM/BAM files of mapped reads.<br />
The tool "motif" computes motif-based features from genomic sequence and PWMs in Jaspar or HOCOMOCO format, or motif models from [[Dimont]], including [[Slim]] models.<br />
The tool "itrain" performs iterative training of a series of classifiers based on labels, chromatin accessibility features, and motif features.<br />
The tool "predict" predicts binding probabilities of genomic regions based on trained classifiers and feature files. The feature files may either be measured on the training cell type (e.g., other chromosomes, "within cell type" case) or on a different cell type.<br />
<br />
== Downloads ==<br />
<br />
We provide Catchitt as a pre-compiled JAR file and also publish its source code under GPL 3. For compiling Catchitt from source files, Jstacs (v. 2.3 and later) and the corresponding external libraries are required.<br />
<br />
''Catchitt is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.''<br />
<br />
* [http://www.jstacs.de/downloads/Catchitt.jar JAR download]<br />
* the source code of Catchitt is available from [https://github.com/Jstacs/Jstacs github] in package projects.encodedream.<br />
* [http://www.jstacs.de/downloads/motifs.tgz motifs] used in the ENCODE-DREAM challenge<br />
<br />
== Citation ==<br />
<br />
If you use Catchitt in your research, please cite<br />
<br />
J. Keilwagen, S. Posch, and J. Grau. [https://doi.org/10.1186/s13059-018-1614-y Accurate prediction of cell type-specific transcription factor binding]. ''Genome Biology'', 20(1):9, 2019.<br />
<br />
== Usage ==<br />
<br />
Catchitt can be started by calling<br />
<br />
java -jar Catchitt.jar<br />
<br />
on the command line. This lists the names of the available tools with a short description:<br />
<br />
Available tools:<br />
<br />
access - Chromatin accessibility<br />
methyl - Methylation levels<br />
motif - Motif scores<br />
labels - Derive labels<br />
itrain - Iterative Training<br />
predict - Prediction<br />
<br />
Syntax: java -jar Catchitt.jar <toolname> [<parameter=value> ...]<br />
<br />
Further info about the tools is given with<br />
java -jar Catchitt.jar <toolname> info<br />
<br />
Tool parameters are listed with<br />
java -jar Catchitt.jar <toolname><br />
<br />
== Tools ==<br />
<br />
=== Derive labels ===<br />
<br />
''Derive labels'' computes labels for genomic regions based on ChIP-seq peak files. The input ChIP-seq peak files must be provided in narrowPeak format and may come in 'conservative', i.e., IDR-thresholded, and 'relaxed' flavors. In case only a single peak file is available, both of the corresponding parameters may be set to this one peak file. The parameter for the bin width defines the resolution at which genomic regions are assigned a label, while the parameter for the region width defines the size of the regions considered. If, for instance, the bin width is set to 50 and the region width to 100, regions of 100 bp shifted by 50 bp along the genome are labeled. The labels assigned may be 'S' (summit) if the current bin contains the annotated summit of a conservative peak, 'B' (bound) if the current region overlaps a conservative peak by at least half the region width, 'A' (ambiguous) if the current region overlaps a relaxed peak by at least 1 bp, or 'U' (unbound) if it overlaps with none of the peaks. The output is provided as a gzipped file 'Labels.tsv.gz' with columns chromosome, start position, and label. This output file together with a protocol of the tool run is saved to the specified output directory.<br />
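<br />
The labeling rules described above can be sketched in Python as follows. This is an illustration under stated assumptions and not the Catchitt implementation: the function label_bin is hypothetical, intervals are 0-based and half-open, summits are given as absolute positions (narrowPeak stores them as offsets from the peak start), and we assume that the region begins at the start of the bin, which the description above leaves open.<br />

```python
def overlap(a_start, a_end, b_start, b_end):
    """Number of bp shared by two half-open intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def label_bin(bin_start, bin_width, region_width, conservative, relaxed):
    """Return 'S', 'B', 'A' or 'U' for the bin beginning at bin_start.

    conservative: list of (start, end, summit) with absolute summit positions.
    relaxed: list of (start, end).
    """
    bin_end = bin_start + bin_width
    # Assumption: the region considered for overlaps begins at the bin start.
    region_start, region_end = bin_start, bin_start + region_width
    if any(bin_start <= summit < bin_end for _, _, summit in conservative):
        return "S"
    if any(2 * overlap(region_start, region_end, start, end) >= region_width
           for start, end, _ in conservative):
        return "B"
    if any(overlap(region_start, region_end, start, end) >= 1
           for start, end in relaxed):
        return "A"
    return "U"


# Toy peaks: one conservative peak contained in a wider relaxed peak.
conservative = [(1000, 1200, 1100)]
relaxed = [(900, 1300)]
for bin_start in (600, 850, 1000, 1100):
    print(bin_start, label_bin(bin_start, 50, 200, conservative, relaxed))
```

With a bin width of 50 and a region width of 200, the four toy bins receive the labels 'U', 'A', 'B', and 'S', respectively.<br />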
<br />
''Derive labels'' may be called with<br />
<br />
java -jar Catchitt.jar labels<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>Conservative peaks (NarrowPeak file containing the conservative peaks)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>Relaxed peaks (NarrowPeak file containing the relaxed peaks)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>FAI of genome (FastA index file of the genome)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bin width (The width of the genomic bins considered, valid range = [1, 10000], default = 50)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rw</font></td><br />
<td>Region width (The width of the genomic regions considered for overlaps, valid range = [1, 10000], default = 50)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar Catchitt.jar labels c=conservative.narrowPeak r=relaxed.narrowPeak f=hg19.fa.fai b=50 rw=200 outdir=labels<br />
<br />
<br />
=== Chromatin accessibility ===<br />
<br />
''Chromatin accessibility'' computes several chromatin accessibility features from DNase-seq or ATAC-seq data provided as fold-enrichment tracks or SAM/BAM files of mapped reads. Features are computed with a certain resolution defined by the bin width parameter. If this parameter is set to 50, for instance, features are computed for non-overlapping 50 bp bins along the genome. If input data are provided as a SAM/BAM file, coverage information is extracted and normalized locally in a similar fashion as proposed for the MACS peak caller. Output is provided as a gzipped file 'Chromatin_accessibility.tsv.gz' with columns chromosome, start position of the bin, minimum coverage and median coverage in the current bin, minimum coverage in 1000 bp regions before and after the current bin, maximum coverage in 1000 bp regions before and after the current bin, the number of steps in the coverage profile, and the number of monotonically increasing and decreasing steps in the coverage profile of the current bin. This output file together with a protocol of the tool run is saved to the specified output directory.<br />
<br />
''Chromatin accessibility'' may be called with<br />
<br />
java -jar Catchitt.jar access<br />
<br />
and has the following parameters<br />
<br />
<br />
<table border=0 cellpadding=10 align="center"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>Data source (The format of the input file containing the coverage information, range={BAM/SAM, Bigwig}, default = BAM/SAM)<table border=0 cellpadding=10 align="center"><br />
<tr><td colspan=3>Parameters for selection &quot;BAM/SAM&quot;:</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>Input SAM/BAM (The input file containing the mapped DNase-seq/ATAC-seq reads)</td><br />
<td>FILE</td><br />
</tr><br />
<tr><td colspan=3>Parameters for selection &quot;Bigwig&quot;:</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>Input Bigwig (The input file containing the mapped DNase-seq/ATAC-seq reads)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>FastA index (The genome index)</td><br />
<td>FILE</td><br />
</tr><br />
</table></td><td></td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bin width (The width of the genomic bins considered)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar Catchitt.jar access d="Bigwig" i=fold_enrich.bw f=hg19.fa.fai b=50 outdir=dnase<br />
<br />
=== Motif scores ===<br />
<br />
''Motif scores'' computes features based on motif scores of a given motif model scanning sub-sequences along the genome. Motif scores are aggregated in bins of the specified width as maximum score and log of the average exponential score (i.e., average log-likelihood in case of statistical models). The motif model may be provided as PWMs in HOCOMOCO or PFMs in Jaspar format, or as [[Dimont]] motif models in XML format. For more complex motif models like Slim models, the current implementation uses several indexes to speed up the scanning process. However, computation of these indexes is rather memory-consuming and often not reasonable for simple PWM models. Hence, a low-memory variant of the tool is available, which is typically only slightly slower for PWM models but substantially slower for Slim models. Output is provided as a gzipped file 'Motif_scores.tsv.gz' containing columns chromosome, start position, maximum and average score. This output file together with a protocol of the tool run is saved to the specified output directory.<br />
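<br />
The aggregation of motif scores per bin can be sketched as follows. This is a simplified illustration and not the Catchitt implementation: the function names and the toy PWM are made up, only the forward strand is scanned, each window is assigned to the bin that contains its start position, and ambiguous nucleotides are not handled.<br />

```python
import math


def score_window(pwm, window):
    """Sum of the per-position PWM entries for one window."""
    return sum(column[base] for column, base in zip(pwm, window))


def binned_motif_scores(sequence, pwm, bin_width):
    """Return (bin_start, maximum score, log of the mean exponential score)
    for every bin that contains at least one window start."""
    width = len(pwm)
    bins = {}
    for start in range(len(sequence) - width + 1):
        score = score_window(pwm, sequence[start:start + width])
        bins.setdefault(start // bin_width * bin_width, []).append(score)
    result = []
    for bin_start in sorted(bins):
        scores = bins[bin_start]
        top = max(scores)
        # Numerically stable: factor out the maximum before exponentiating.
        log_mean_exp = top + math.log(
            sum(math.exp(s - top) for s in scores) / len(scores))
        result.append((bin_start, top, log_mean_exp))
    return result


# Toy PWM of width 2 that prefers the dinucleotide "AC".
pwm = [
    {"A": 0.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": 0.0, "G": -1.0, "T": -1.0},
]
result = binned_motif_scores("ACGTAC", pwm, bin_width=3)
for bin_start, top, log_mean_exp in result:
    print(bin_start, top, log_mean_exp)
```

The log of the mean exponential score can never exceed the maximum score of the bin; using the sum instead of the mean would shift it by the logarithm of the number of windows.<br />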
<br />
<br />
''Motif scores'' may be called with<br />
<br />
java -jar Catchitt.jar motif<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Motif model (The motif model in Dimont, HOCOMOCO, or Jaspar format, range={Dimont, HOCOMOCO, Jaspar}, default = Dimont)<table border=0 cellpadding=10 align="center"><br />
<tr><td colspan=3>Parameters for selection &quot;Dimont&quot;:</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>Dimont motif (Dimont motif model description)</td><br />
<td>FILE</td><br />
</tr><br />
<tr><td colspan=3>Parameters for selection &quot;HOCOMOCO&quot;:</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">h</font></td><br />
<td>HOCOMOCO PWM (PWM from the HOCOMOCO database)</td><br />
<td>FILE</td><br />
</tr><br />
<tr><td colspan=3>Parameters for selection &quot;Jaspar&quot;:</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">j</font></td><br />
<td>Jaspar PFM (PFM in Jaspar format)</td><br />
<td>FILE</td><br />
</tr><br />
</table></td><td></td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>Genome (Genome as FastA file)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>FAI of genome (FastA index file of the genome)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bin width (The width of the genomic bins considered)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">l</font></td><br />
<td>Low-memory mode (Use slower mode with a smaller memory footprint, default = true)</td><br />
<td>BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example'''<br />
<br />
java -jar Catchitt.jar motif m=HOCOMOCO h=motif.pwm g=hg19.fa f=hg19.fa.fai b=50 outdir=motifs<br />
<br />
=== Iterative Training ===<br />
<br />
''Iterative Training'' performs an iterative training with the specified number of iterations to obtain a series of classifiers that may be used for predictions in the same cell type or in other cell types based on a corresponding set of feature files. The tool requires as input labels for the training chromosomes, a chromatin accessibility feature file and a set of motif feature files. From the labels, an initial set of training regions is extracted containing all positive examples labeled as 'S' (summit) and a sub-sample of negative examples of regions labeled as 'U' (unbound). During the iterations, the initial negative examples are complemented with additional negatives obtaining large binding probabilities, i.e., putative false positive predictions. As these additional negative examples are derived from predictions of the current set of classifiers, the number of bins used for aggregation needs to be specified and should be identical to those used for predictions later. Training chromosomes and chromosomes used for predictions in the iterative training may be specified, as well as the percentile of the scores of positives (i.e., summit or bound regions) that should be used to identify putative false positives. The specified bin width must be identical to the bin width specified when computing the corresponding feature files. Feature vectors for training regions may span several adjacent bins as specified by the number of bins parameter. Output is an XML file Classifiers.xml containing the set of trained classifiers. This output file together with a protocol of the tool run is saved to the specified output directory.<br />
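<br />
The selection of additional negative examples in one iteration can be sketched as follows. This is an illustration of the idea only and not the Catchitt implementation: the function names are hypothetical, the way the percentile is turned into an index is an assumption, and so is the choice to include regions scoring exactly at the threshold.<br />

```python
def percentile_threshold(positive_scores, p=0.01):
    """Score at percentile p (a fraction in [0, 1]) of the positive scores."""
    ordered = sorted(positive_scores)
    return ordered[int(p * (len(ordered) - 1))]


def select_new_negatives(predictions, labels, p=0.01):
    """Return unbound regions that score at least as high as the threshold.

    predictions: dict mapping region -> predicted binding probability.
    labels: dict mapping region -> 'S', 'B', 'A' or 'U'.
    """
    positives = [predictions[r] for r, lab in labels.items() if lab in ("S", "B")]
    threshold = percentile_threshold(positives, p)
    return sorted(r for r, lab in labels.items()
                  if lab == "U" and predictions[r] >= threshold)


# Toy example with seven regions identified by their start positions.
labels = {0: "S", 50: "B", 100: "B", 150: "A", 200: "U", 250: "U", 300: "U"}
predictions = {0: 0.9, 50: 0.8, 100: 0.7, 150: 0.99,
               200: 0.75, 250: 0.2, 300: 0.95}
print(select_new_negatives(predictions, labels))
```

In the toy example, the threshold is 0.7, so the unbound regions at 200 and 300 are proposed as additional negatives, while the ambiguous region at 150 is ignored despite its large score.<br />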
<br />
''Iterative Training'' may be called with<br />
<br />
java -jar Catchitt.jar itrain<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>Accessibility (File containing accessibility features)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Motif (File containing motif features), MAY BE USED MULTIPLE TIMES</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">l</font></td><br />
<td>Labels (File containing the labels)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>FAI of genome (FastA index file of the genome)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bin width (The width of the genomic bins, valid range = [1, 1000], default = 50)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Number of bins (The number of adjacent bins, valid range = [1, 20], default = 5)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">abb</font></td><br />
<td>Aggregation: bins before (The number of bins before the current one considered in the aggregation, valid range = [1, 20], default = 1)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">aba</font></td><br />
<td>Aggregation: bins after (The number of bins after the current one considered in the aggregation, valid range = [1, 20], default = 4)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>Iterations (The number of iterations of the iterative training, valid range = [1, 20], default = 5)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>Training chromosomes (Training chromosomes, separated by commas, OPTIONAL)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">itc</font></td><br />
<td>Iterative training chromosomes (Chromosomes with predictions in iterative training, separated by commas, OPTIONAL)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Percentile (Percentile of the prediction scores of positives used as threshold in iterative training, valid range = [0.0, 1.0], default = 0.01)</td><br />
<td>DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example'''<br />
<br />
java -jar Catchitt.jar itrain a=dnase/Chromatin_accessibility.tsv.gz m=motif1/Motif_scores.tsv.gz m=motif2/Motif_scores.tsv.gz l=labels/Labels.tsv.gz f=hg19.fa.fai b=50 n=5 abb=1 aba=4 i=5 t="chr1,chr2,chr3" itc="chr1,chr2" p=0.01 outdir=cls<br />
<br />
=== Prediction ===<br />
<br />
''Prediction'' predicts binding probabilities of genomic regions as specified during training of the set of classifiers in iterative training. As input, Prediction requires a set of trained classifiers in XML format and the same (type of) feature files as used in training (motif files must be specified in the same order!). In addition, the chromosomes for which predictions are made may be specified, and the number of bins used for aggregation may be specified to deviate from those used during training. If these bin numbers are not specified, those from the training run are used. Finally, it is possible to restrict the number of classifiers considered to the first n classifiers. Output is provided as a gzipped file 'Predictions.tsv.gz' with columns chromosome, start position, binding probability. This output file together with a protocol of the tool run is saved to the specified output directory.<br />
<br />
''Prediction'' may be called with<br />
<br />
java -jar Catchitt.jar predict<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>Classifiers (The classifiers trained by iterative training)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>Accessibility (File containing accessibility features)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Motif (File containing motif features) MAY BE USED MULTIPLE TIMES</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>FAI of genome (FastA index file of the genome)</td><br />
<td>FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Prediction chromosomes (Prediction chromosomes, separated by commas, OPTIONAL)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">abb</font></td><br />
<td>Aggregation: bins before (Number of bins before the current one considered for aggregation., OPTIONAL)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">aba</font></td><br />
<td>Aggregation: bins after (Number of bins after the current one considered for aggregation., OPTIONAL)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Number of classifiers (Use only the first k classifiers for predictions., OPTIONAL)</td><br />
<td>INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example'''<br />
<br />
java -jar Catchitt.jar predict c=cls/Classifiers.xml a=dnase/Chromatin_accessibility.tsv.gz m=motif1/Motif_scores.tsv.gz m=motif2/Motif_scores.tsv.gz f=hg19.fa.fai p="chr8,chr21" abb=1 aba=4 n=3 outdir=predict<br />
<br />
<br />
== Standard pipeline ==<br />
<br />
The standard Catchitt pipeline would comprise the following steps<br />
<br />
* for a training cell type, collect ChIP-seq peak files (preferably ''conservative'' and ''relaxed'' peaks) in narrowPeak format and derive labels for genomic regions (''Derive labels'')<br />
* for the same cell type, collect chromatin accessibility data (DNase-seq or ATAC-seq) as fold-enrichment tracks or mapping files, and derive chromatin accessibility features from those data (''Chromatin accessibility'')<br />
* collect or learn (e.g., using [[Dimont]]) a set of motif models for the transcription factor of interest, and scan the genome using these motif models (''Motif scores'')<br />
* perform iterative training given the labels and feature files (''Iterative Training'')<br />
* predict binding probabilities of genomic regions in the same cell type or in other cell types. In the latter case, additional chromatin accessibility data for these target cell types need to be collected and features need to be derived as in step 2. (''Prediction'')<br />
<br />
<br />
== Tutorial using ENCODE data ==<br />
<br />
We describe a typical Catchitt pipeline using public ENCODE data for the transcription factor CTCF in two cell lines.<br />
This tutorial uses real-world data on the whole ENCODE GRCh38 human genome version, illustrating different DNase-seq input formats and different motif sources. Please note that this realistic scenario also comes at the expense of real-world runtimes of the individual Catchitt steps.<br />
<br />
For best performance, we would further recommend<br />
* to use multiple motifs from different sources, including motifs derived from DNase-seq (available in our [http://www.jstacs.de/downloads/motifs.tgz motif collection] of the ENCODE-DREAM challenge in directory de-novo/DNase-peaks)<br />
* to use replicate information for DNase data, for instance using the [https://github.com/kundajelab/atac_dnase_pipelines pipeline of the Kundaje lab]<br />
<br />
In this tutorial, we concentrate on the Catchitt pipeline and illustrate its usage based on readily available data.<br />
<br />
=== Obtaining training and test data ===<br />
<br />
First, we need the GRCh38 genome version used by ENCODE. This genome is available as a gzipped FastA file from [https://www.encodeproject.org ENCODE] at<br />
https://www.encodeproject.org/files/GRCh38_no_alt_analysis_set_GCA_000001405.15/@@download/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.gz<br />
<br />
After download, the genome needs to be gunzipped and indexed using the [http://www.htslib.org samtools] faidx command:<br />
<br />
gunzip GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.gz<br />
samtools faidx GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta<br />
<br />
In the following, we assume that genome FastA and index are in the base directory.<br />
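<br />
The index written by samtools faidx is a plain tab-separated file whose first two columns are the sequence name and its length. As a minimal sketch for inspecting chromosome lengths before running Catchitt, the index could be read as follows; the function name <code>parse_fai</code> and the toy index lines are assumptions for illustration and not part of Catchitt.<br />
<br />
```python
def parse_fai(lines):
    """Map sequence name -> length from the lines of a samtools .fai index."""
    lengths = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # .fai columns: name, length, offset, bases per line, bytes per line
        name, length = line.split("\t")[:2]
        lengths[name] = int(length)
    return lengths


# Toy index lines for illustration only (not the real GRCh38 values).
example_fai = ["chrA\t1000\t6\t60\t61\n", "chrB\t250\t1030\t60\t61\n"]
print(parse_fai(example_fai))
```
<br />
For the real index, the lines would be read from GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai.<br />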
<br />
In addition, we need the DNase-seq data. We consider two cell lines ("astrocyte of the spinal cord" and "fibroblast of villous mesenchyme"). The corresponding DNase-seq data are available from [https://www.encodeproject.org ENCODE] under accessions ENCSR000ENB and ENCSR000EOR, respectively.<br />
Here, we first consider the Bigwig files of the first replicate for each cell line, which can be downloaded from the following URLs:<br />
<br />
https://www.encodeproject.org/files/ENCFF901UBX/@@download/ENCFF901UBX.bigWig<br />
https://www.encodeproject.org/files/ENCFF652HJH/@@download/ENCFF652HJH.bigWig<br />
<br />
For obtaining labels for CTCF binding, we further need ChIP-seq peaks. Here, we consider the ChIP-seq experiment with accession ENCSR000DSU for the astrocytes, which will become our training data in the following.<br />
The corresponding "conservative" and "relaxed" peak files for astrocytes are available from<br />
https://www.encodeproject.org/files/ENCFF183YLB/@@download/ENCFF183YLB.bed.gz<br />
https://www.encodeproject.org/files/ENCFF600CYD/@@download/ENCFF600CYD.bed.gz<br />
<br />
Again, the peak files need to be gunzipped for the following steps.<br />
<br />
Finally, we need a motif model for CTCF, which we download from [http://hocomoco11.autosome.ru HOCOMOCO] in this case<br />
http://hocomoco11.autosome.ru/final_bundle/hocomoco11/full/HUMAN/mono/pwm/CTCF_HUMAN.H11MO.0.A.pwm<br />
<br />
We organize all these files (and the Catchitt JAR) in the following directory structure<br />
<br />
.:<br />
Catchitt.jar<br />
GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta<br />
GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai<br />
<br />
./astrocytes:<br />
ENCFF183YLB.bed<br />
ENCFF600CYD.bed<br />
ENCFF901UBX.bigWig<br />
<br />
./fibroblasts:<br />
ENCFF652HJH.bigWig<br />
<br />
./motifs/CTCF/:<br />
CTCF_HUMAN.H11MO.0.A.pwm<br />
<br />
=== Deriving labels ===<br />
<br />
As we use supervised training of model parameters, we need labels for the genomic regions, qualifying these as bound (B) or unbound (U). Besides, we have additional labels for bound regions at the peak summit (S) and ambiguous regions (A) that are (partly) covered by relaxed but not by conservative peaks.<br />
<br />
For training purposes, we need to derive labels from the astrocyte ChIP-seq peaks by calling<br />
java -jar Catchitt.jar labels c=astrocytes/ENCFF183YLB.bed\<br />
r=astrocytes/ENCFF600CYD.bed\<br />
f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\<br />
b=50 rw=200 outdir=astrocytes/labels<br />
Here, we use a bin width of 50 bp (i.e., we resolve any feature or binding event with 50 bp resolution) and a region width of 200 bp as used in ENCODE-DREAM. A detailed description of the partitioning of the genome into non-overlapping bins and the logic behind the regions for which predictions are made may be found in the [https://doi.org/10.1186/s13059-018-1614-y Catchitt paper].<br />
The result is a file astrocytes/labels/Labels.tsv.gz with the following format<br />
chr1 0 U<br />
chr1 50 U<br />
chr1 100 U<br />
chr1 150 U<br />
chr1 200 U<br />
chr1 250 U<br />
where the columns contain chromosome, bin starting position, and corresponding label, and are separated by tabs.<br />
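<br />
As a quick sanity check of the derived labels, the label column can be tallied. The following is a minimal sketch that assumes the three-column, tab-separated layout shown above; the helper names are hypothetical and not part of Catchitt.<br />
<br />
```python
import gzip
from collections import Counter


def count_labels(lines):
    """Count how often each label (third column) occurs."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip empty or malformed lines
        counts[fields[2]] += 1
    return counts


def count_labels_in_file(path):
    """Same as count_labels, reading a gzipped label file."""
    with gzip.open(path, "rt") as handle:
        return count_labels(handle)


# Toy lines for illustration only.
example_labels = ["chr1\t0\tU\n", "chr1\t50\tU\n", "chr1\t100\tS\n", "chr1\t150\tB\n"]
print(count_labels(example_labels))
```
<br />
For the tutorial data, <code>count_labels_in_file("astrocytes/labels/Labels.tsv.gz")</code> would give the genome-wide counts of the labels.<br />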
<br />
=== Preparing DNase data from bigwig format ===<br />
<br />
We further derive DNase-seq features from the bigwig file that we downloaded in the first step. Again, we specify a bin width of 50 bp.<br />
<br />
java -jar Catchitt.jar access d="Bigwig" i=astrocytes/ENCFF901UBX.bigWig f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\<br />
b=50 outdir=astrocytes/access<br />
The result is a file astrocytes/access/Chromatin_accessibility.tsv.gz with the following format<br />
<br />
chr1 1033400 0.03954650089144707 0.05627769976854324 0.009126120246946812 0.030420400202274323 0.06692489981651306 1.03125 3.0 1.0 0.0<br />
chr1 1033450 0.030420400202274323 0.03650449961423874 0.009126120246946812 0.030420400202274323 0.045630600303411484 1.03125 2.0 0.0 0.0<br />
chr1 1033500 0.024336300790309906 0.03346240147948265 0.009126120246946812 0.030420400202274323 0.045630600303411484 1.03125 2.0 1.0 0.0<br />
chr1 1033550 0.01825219951570034 0.024336300790309906 0.009126120246946812 0.024336300790309906 0.060840800404548645 1.03125 2.0 0.0 1.0<br />
<br />
where the first two columns, again, correspond to chromosome and starting position, and the remaining columns are<br />
* minimum DNase value in bin,<br />
* median DNase value in bin,<br />
* minimum in 1000 bp after bin start,<br />
* minimum in 1000 bp before bin start,<br />
* maximum in 1000 bp after bin start,<br />
* maximum in 1000 bp before bin start,<br />
* the number of steps in the bin profile,<br />
* the length of the longest monotonically increasing range in the bin,<br />
* the length of the longest monotonically decreasing range in the bin.<br />
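<br />
To make the column order explicit, one row of this file can be parsed into named values. This is only a sketch: the column names in <code>ACCESS_COLUMNS</code> are our own shorthand for the list above and are not identifiers used by Catchitt.<br />
<br />
```python
# Hypothetical short names for the nine feature columns, in file order.
ACCESS_COLUMNS = (
    "bin_min", "bin_median",
    "min_after_1000", "min_before_1000",
    "max_after_1000", "max_before_1000",
    "n_steps", "longest_increase", "longest_decrease",
)


def parse_access_row(line):
    """Split one row into chromosome, bin start, and a dict of named features."""
    fields = line.split()
    chrom, start = fields[0], int(fields[1])
    values = [float(x) for x in fields[2:]]
    if len(values) != len(ACCESS_COLUMNS):
        raise ValueError("expected %d feature columns" % len(ACCESS_COLUMNS))
    return chrom, start, dict(zip(ACCESS_COLUMNS, values))


# First example row from the output shown above.
example_row = "\t".join([
    "chr1", "1033400",
    "0.03954650089144707", "0.05627769976854324",
    "0.009126120246946812", "0.030420400202274323",
    "0.06692489981651306", "1.03125",
    "3.0", "1.0", "0.0",
])
chrom, start, features = parse_access_row(example_row)
print(chrom, start, features["bin_median"])
```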
<br />
=== Preparing motif scores ===<br />
<br />
We also compute motif scores along the genome for the PWM we downloaded from HOCOMOCO:<br />
<br />
java -jar Catchitt.jar motif m="HOCOMOCO" h=motifs/CTCF/CTCF_HUMAN.H11MO.0.A.pwm g=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta\<br />
f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai b=50 outdir=motifs/CTCF threads=3<br />
The result is a file motifs/CTCF/Motif_scores.tsv.gz with the following format<br />
<br />
chr1 46950 -4.996643 -4.9543528358429105<br />
chr1 47000 -5.984124 -5.451674735652041<br />
chr1 47050 -0.8633305 -0.4596223585537509<br />
chr1 47100 -4.9379983 -4.813470561120627<br />
<br />
where the first two columns, again, correspond to chromosome and starting position, and the remaining two columns are<br />
* the maximum motif score within the bin,<br />
* the logarithm of the sum of the exponentials of the individual scores within the bin; for scores that are log-likelihoods, this is proportional to the log-likelihood of the complete sequence.<br />
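<br />
The two motif features can be illustrated on a handful of per-position scores. The sketch below assumes that the second feature is the logarithm of the summed exponentials (log-sum-exp) of the scores in a bin, which is consistent with the example rows, where the second value is never smaller than the maximum. The function name and the toy scores are hypothetical.<br />
<br />
```python
import math


def bin_motif_features(scores):
    """Return (maximum score, log of the summed exponentials) for one bin."""
    best = max(scores)
    # Subtract the maximum before exponentiating for numerical stability.
    log_sum = best + math.log(sum(math.exp(s - best) for s in scores))
    return best, log_sum


# Toy log-scores of the motif start positions within one bin.
toy_scores = [-4.9966, -8.2, -9.1]
print(bin_motif_features(toy_scores))
```
<br />
Because every exponential is positive, the log-sum feature is always at least as large as the maximum, and it approaches the maximum when one position dominates the bin.<br />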
<br />
=== Iterative training ===<br />
<br />
With all the feature files prepared, we may now run the iterative training procedure. Here, we train on all autosomes except chr1, chr8, and chr21 (parameter "t"), use five of those chromosomes also for generating new negative examples in each of the iterations (parameter "itc"), and use 8 computation threads for the numeric optimization of model parameters.<br />
''At this stage, it is critical that all feature files have been generated from the same reference. This way, we may sweep in parallel over all feature files that, at each line, represent the identical genomic location. Otherwise, the iterative training will throw an error stating that the chromosomes do not match at a certain line of the input files.''<br />
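<br />
Whether several feature files are aligned line by line can be checked before training. The following sketch compares the first two columns (chromosome and bin start) across any number of files and reports the first line where they differ. It is an independent check under the assumption that all files share this two-column prefix; it is not the test that Catchitt itself performs, and the function name is hypothetical.<br />
<br />
```python
from itertools import zip_longest


def first_mismatch(*row_iters):
    """Return the 1-based line number of the first disagreement, or None."""
    for number, rows in enumerate(zip_longest(*row_iters), start=1):
        keys = set()
        for row in rows:
            if row is None:
                keys.add(None)  # this file ended before the others
            else:
                fields = row.split()
                keys.add((fields[0], int(fields[1])))
        if len(keys) > 1:
            return number
    return None


# Toy rows for illustration only.
labels = ["chr1\t0\tU", "chr1\t50\tU"]
motifs = ["chr1\t0\t-5.0\t-4.9", "chr1\t50\t-6.0\t-5.4"]
print(first_mismatch(labels, motifs))
```
<br />
For the gzipped tutorial files, the iterators would be handles opened with <code>gzip.open(path, "rt")</code>.<br />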
<br />
We start iterative training by calling<br />
java -jar Catchitt.jar itrain a=astrocytes/access/Chromatin_accessibility.tsv.gz m=motifs/CTCF/Motif_scores.tsv.gz\<br />
l=astrocytes/labels/Labels.tsv.gz f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\<br />
 b=50 t='chr2,chr3,chr4,chr5,chr6,chr7,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr22'\<br />
itc='chr10,chr11,chr12,chr13,chr14' outdir=astrocytes/itrain threads=8<br />
which results in a file astrocytes/itrain/Classifiers.xml containing the trained classifiers.<br />
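<br />
Long chromosome lists are easy to mistype. As a small sketch, the value of parameter "t" can be generated from the set of held-out chromosomes instead; here we assume that chr1, chr8, and chr21 are held out, as in the call above, and that chromosome names follow the "chrN" pattern. The function name is hypothetical.<br />
<br />
```python
def chromosome_list(held_out, n_autosomes=22):
    """Comma-separated autosome names without the held-out chromosomes."""
    names = ["chr%d" % i for i in range(1, n_autosomes + 1)]
    return ",".join(name for name in names if name not in held_out)


train_chromosomes = chromosome_list({"chr1", "chr8", "chr21"})
print("t='" + train_chromosomes + "'")
```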
<br />
=== Predicting binding in new cell types ===<br />
Using the trained classifier from the previous step and the DNase data for fibroblasts prepared before, we may now predict binding in the fibroblast cell type. In the example, we generate predictions only for chromosome 8, which could be extended to other chromosomes using parameter "p":<br />
java -jar Catchitt.jar predict c=astrocytes/itrain/Classifiers.xml a=fibroblasts/access/Chromatin_accessibility.tsv.gz\<br />
m=motifs/CTCF/Motif_scores.tsv.gz f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\<br />
p="chr8" outdir=fibroblasts/predict<br />
This finally results in a file fibroblasts/predict/Predictions.tsv.gz containing the predicted binding probabilities per region.<br />
This file has three columns, corresponding to chromosome, starting position, and binding probability:<br />
<br />
chr8 265850 0.9866555574053496<br />
chr8 265900 0.9865107771922306<br />
chr8 265950 0.9864837006927715<br />
chr8 266000 0.8041139249973046<br />
chr8 266050 0.19870629729482686<br />
chr8 266100 0.1302269536110939<br />
chr8 266150 0.09693322015563202<br />
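<br />
For downstream analyses, it is often convenient to turn the per-position probabilities into intervals. The sketch below keeps rows with a probability of at least 0.5 and merges directly adjacent 50 bp steps; both the threshold and the merging rule are illustrative choices of ours and not part of Catchitt.<br />
<br />
```python
def bound_regions(rows, threshold=0.5, bin_width=50):
    """Merge consecutive rows at or above the threshold into intervals."""
    regions = []
    for line in rows:
        if not line.strip():
            continue
        chrom, start, prob = line.split()
        start, prob = int(start), float(prob)
        if prob < threshold:
            continue
        if regions and regions[-1][0] == chrom and regions[-1][2] == start:
            # Directly adjacent to the previous interval: extend it.
            regions[-1] = (chrom, regions[-1][1], start + bin_width)
        else:
            regions.append((chrom, start, start + bin_width))
    return regions


# The example rows shown above.
example_predictions = [
    "chr8\t265850\t0.9866555574053496",
    "chr8\t265900\t0.9865107771922306",
    "chr8\t265950\t0.9864837006927715",
    "chr8\t266000\t0.8041139249973046",
    "chr8\t266050\t0.19870629729482686",
    "chr8\t266100\t0.1302269536110939",
    "chr8\t266150\t0.09693322015563202",
]
print(bound_regions(example_predictions))
```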
<br />
<br />
=== Using DNase-seq BAM files and multiple motifs ===<br />
<br />
Instead of bigwig files, the "access" tool of Catchitt also accepts BAM files of mapped DNase-seq (or ATAC-seq) data. Internally, this tool counts 5' ends of reads, and performs local normalization of read depth and average smoothing.<br />
Here, we download the BAM files corresponding to the previous bigwig files from ENCODE<br />
https://www.encodeproject.org/files/ENCFF384CCQ/@@download/ENCFF384CCQ.bam<br />
https://www.encodeproject.org/files/ENCFF368XNE/@@download/ENCFF368XNE.bam<br />
<br />
and sort them into the directory structure.<br />
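<br />
The idea of counting 5' ends can be sketched without reading BAM files. Below, each read is given as a hypothetical tuple (start, end, strand) with 0-based, half-open coordinates, so the 5' end of a forward-strand read is its start and that of a reverse-strand read is its last aligned base. The local normalization and smoothing performed by the "access" tool are not reproduced here.<br />
<br />
```python
from collections import Counter


def five_prime_counts(reads, bin_width=50):
    """Count read 5' ends per bin, keyed by bin start position."""
    counts = Counter()
    for start, end, strand in reads:
        five_prime = start if strand == "+" else end - 1
        counts[(five_prime // bin_width) * bin_width] += 1
    return counts


# Toy reads on a single chromosome, for illustration only.
toy_reads = [(10, 46, "+"), (30, 66, "-"), (95, 131, "+"), (70, 106, "-")]
print(sorted(five_prime_counts(toy_reads).items()))
```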
<br />
In addition, we use four motifs from the ''used-for-all-TFs'' directory of our [http://www.jstacs.de/downloads/motifs.tgz motif collection].<br />
<br />
Afterwards, the directory structure should look like<br />
<br />
.:<br />
Catchitt.jar<br />
GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta<br />
GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai<br />
<br />
./astrocytes:<br />
ENCFF183YLB.bed<br />
ENCFF600CYD.bed<br />
ENCFF901UBX.bigWig<br />
ENCFF384CCQ.bam<br />
<br />
./fibroblasts:<br />
ENCFF652HJH.bigWig<br />
ENCFF368XNE.bam<br />
<br />
./motifs/CTCF/:<br />
CTCF_HUMAN.H11MO.0.A.pwm<br />
<br />
./motifs/CTCF_Slim:<br />
Ctcf_H1hesc_shift20_bdeu_order-20_comp1-model-1.xml<br />
<br />
./motifs/JUND_Slim:<br />
Jund_K562_shift20_bdeu_order-20_comp1-model-1.xml<br />
<br />
./motifs/MAX_Slim:<br />
Max_K562_shift20_bdeu_order-20_comp1-model-1.xml<br />
<br />
./motifs/SP1:<br />
ENCSR000BHK_SP1-human_1_hg19-model-2.xml<br />
<br />
<br />
Now, we first compute the DNase-seq features from the BAM files using the "access" tool:<br />
<br />
java -jar Catchitt.jar access i=astrocytes/ENCFF384CCQ.bam b=50 outdir=astrocytes/access_bam/<br />
java -jar Catchitt.jar access i=fibroblasts/ENCFF368XNE.bam b=50 outdir=fibroblasts/access_bam/<br />
<br />
We also compute the motif-based features from the additional motif files. For the PWM model of SP1, we switch the input format to Dimont XMLs but still use the low-memory version of "motif" that we also used for the HOCOMOCO PWM:<br />
<br />
java -jar Catchitt.jar motif d=motifs/SP1/ENCSR000BHK_SP1-human_1_hg19-model-2.xml\<br />
g=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\<br />
b=50 outdir=motifs/SP1 threads=3<br />
<br />
The remaining motif models are [[Slim]] models, which are substantially more complex than PWMs. While scans for these models could be accomplished by the low-memory version of "motif" as well, this would require substantial runtime. Hence, we switch off the low-memory option in this case, which, in turn, requires increasing the memory reserved for Java:<br />
<br />
java -jar -Xms512M -Xmx64G Catchitt.jar motif d=motifs/CTCF_Slim/Ctcf_H1hesc_shift20_bdeu_order-20_comp1-model-1.xml\<br />
g=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\<br />
b=50 outdir=motifs/CTCF_Slim l=false threads=3<br />
java -jar -Xms512M -Xmx64G Catchitt.jar motif d=motifs/JUND_Slim/Jund_K562_shift20_bdeu_order-20_comp1-model-1.xml\<br />
g=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\<br />
b=50 outdir=motifs/JUND_Slim l=false threads=3<br />
 java -jar -Xms512M -Xmx64G Catchitt.jar motif d=motifs/MAX_Slim/Max_K562_shift20_bdeu_order-20_comp1-model-1.xml\<br />
g=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\<br />
b=50 outdir=motifs/MAX_Slim l=false threads=3<br />
<br />
Finally, we start the iterative training using the new feature files:<br />
java -jar Catchitt.jar itrain a=astrocytes/access_bam/Chromatin_accessibility.tsv.gz\<br />
m=motifs/CTCF/Motif_scores.tsv.gz m=motifs/CTCF_Slim/Motif_scores.tsv.gz m=motifs/JUND_Slim/Motif_scores.tsv.gz\<br />
m=motifs/MAX_Slim/Motif_scores.tsv.gz m=motifs/SP1/Motif_scores.tsv.gz l=astrocytes/labels/Labels.tsv.gz\<br />
f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai b=50\<br />
 t='chr2,chr3,chr4,chr5,chr6,chr7,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr22'\<br />
itc='chr10,chr11,chr12,chr13,chr14' outdir=astrocytes/itrain_bam_5motifs threads=8<br />
Please note that we used the parameter "m" multiple times to specify the different motif-based features files.<br />
<br />
It is important to specify these motifs in the same order when calling the "predict" tool afterwards, i.e.<br />
java -jar Catchitt.jar predict c=astrocytes/itrain_bam_5motifs/Classifiers.xml a=fibroblasts/access_bam/Chromatin_accessibility.tsv.gz\<br />
m=motifs/CTCF/Motif_scores.tsv.gz m=motifs/CTCF_Slim/Motif_scores.tsv.gz m=motifs/JUND_Slim/Motif_scores.tsv.gz\<br />
m=motifs/MAX_Slim/Motif_scores.tsv.gz m=motifs/SP1/Motif_scores.tsv.gz\<br />
f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai p="chr8" outdir=fibroblasts/predict_bam_5motifs<br />
<br />
The predictions based on the BAM files and the five motifs are then available from the file fibroblasts/predict_bam_5motifs/Predictions.tsv.gz in the format explained previously.<br />
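<br />
One way to keep the motif order identical between training and prediction is to generate the "m" arguments of both calls from a single list. The sketch below only assembles argument lists and does not run Catchitt. The helper names are hypothetical, and the remaining arguments of both calls are abbreviated to "outdir".<br />
<br />
```python
# Single source of truth for the motif order.
MOTIF_DIRS = [
    "motifs/CTCF",
    "motifs/CTCF_Slim",
    "motifs/JUND_Slim",
    "motifs/MAX_Slim",
    "motifs/SP1",
]


def motif_args(motif_dirs):
    """One m=... argument per motif directory, in the given order."""
    return ["m=%s/Motif_scores.tsv.gz" % d for d in motif_dirs]


def catchitt_command(tool, other_args, motif_dirs=MOTIF_DIRS):
    """Assemble the argument list for one Catchitt call (not executed here)."""
    return ["java", "-jar", "Catchitt.jar", tool] + list(other_args) + motif_args(motif_dirs)


# Other required arguments are omitted for brevity.
itrain_cmd = catchitt_command("itrain", ["outdir=astrocytes/itrain_bam_5motifs"])
predict_cmd = catchitt_command("predict", ["outdir=fibroblasts/predict_bam_5motifs"])
print(" ".join(predict_cmd))
```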
<br />
== Version history ==<br />
<br />
* Catchitt v0.1.1: Bugfixes for border cases; reduced debugging output<br />
<br />
* Catchitt v0.1: [http://www.jstacs.de/downloads/Catchitt_0.1.jar Initial release]</div>Grauhttps://www.jstacs.de/index.php?title=MeDeMo&diff=1097MeDeMo2020-04-28T21:55:44Z<p>Grau: /* Methylation Sensitivity */</p>
<hr />
<div>Accurate models describing the binding specificity of transcription factors (TFs) are essential for a better understanding of transcriptional regulation. Aside from chromatin accessibility and sequence specificity, several studies suggested that DNA methylation influences TF binding in both activating and repressive ways. However, currently available TF motif inference and TF binding site prediction approaches do not adequately incorporate DNA methylation.<br />
<br />
We present MeDeMo (Methylation and Dependencies in Motifs), a novel framework for TF motif discovery and TFBS prediction that incorporates DNA methylation by extending [[Slim]] models. We show that dependencies between nucleotides, captured by MeDeMo, are essential to represent DNA methylation and that MeDeMo achieves superior prediction performance compared to related approaches. The inferred TF motifs are highly interpretable and can provide new insights into the relation between DNA methylation and TF binding.<br />
<br />
<br />
== Download ==<br />
<br />
MeDeMo is available as<br />
* [http://www.jstacs.de/downloads/MeDeMo-1.0.jar command line interface] version and<br />
* graphical user interface version: <br />
** [http://www.jstacs.de/downloads/MeDeMoGUI-1.0.jar JAR file] (requires installed Java >= 1.8 and JavaFX)<br />
** [http://www.jstacs.de/downloads/MeDeMo-1.0.zip Windows ZIP]: within the ZIP archive, you find the JAR and a custom Java runtime environment; to run MeDeMo, just double-click run.bat<br />
** [http://www.jstacs.de/downloads/MeDeMo-1.0.app.zip Mac App]: within the ZIP archive, you find a Mac-App, which you can copy anywhere you like (e.g., your /Applications folder) and run the app by double-clicking it; depending on your security settings, it might be necessary to use Right-click -> Open when opening MeDeMo for the first time and explicitly allow it to run; it might also be necessary to disable "App Nap" (Right-click -> GetInfo -> Prevent App Nap)<br />
<br />
Source code is available from the [https://github.com/Jstacs/Jstacs Jstacs github page] in package <code>projects.methyl</code>.<br />
<br />
Example data (also used for the code examples below) are [http://www.jstacs.de/downloads/MeDeMo-examples.zip available for download].<br />
<br />
== Tools ==<br />
<br />
The description of tools and tool parameters refers to the command line version, but the same parameters are also present in the GUI version. Additional help may be requested in the GUI version by clicking on the "?" button.<br />
<br />
<br />
=== Data Extractor ===<br />
<br />
'''Data Extractor''' prepares an annotated FastA file as required by Dimont from a genome (in FastA format, including methylated variants) and a tabular file (e.g., BED, GTF, narrowPeak,...). The regions specified in the tabular file are used to determine the center of the extracted sequences. All extracted sequences have the same length as specified by parameter &quot;Width&quot;.<br />
<br />
In case of ChIP data, the center position could for instance be the peak summit.<br />
An annotated FastA file for ChIP-seq data comprising sequences of length 100 centered around the peak summit might look like:<br />
<br />
> peak: 50; signal: 515<br />
ggccatgtgtatttttttaaatttccac...<br />
> peak: 50; signal: 199<br />
GGTCCCCTGGGAGGATGGGGACGTGCTG...<br />
...<br />
<br />
where the center is given as 50 for the first two sequences, and the confidence amounts to 515 and 199, respectively.<br />
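<br />
The relation between a peak, its center, and the extracted fixed-width window can be sketched as follows. The rounding of the midpoint, the 0-based half-open coordinates, and the absence of clipping at chromosome ends are assumptions of this sketch and may differ from what Data Extractor does internally. The function name is hypothetical.<br />
<br />
```python
def extraction_window(start, width, end=None, center_offset=None):
    """Return (left, right) of a window of the given width around the peak center.

    Either the peak end (center = midpoint of start and end) or the offset of
    the center relative to the start position must be given.
    """
    if center_offset is not None:
        center = start + center_offset
    elif end is not None:
        center = (start + end) // 2
    else:
        raise ValueError("need either end or center_offset")
    left = center - width // 2
    return left, left + width


print(extraction_window(1000, 100, end=1200))
print(extraction_window(1000, 100, center_offset=30))
```
<br />
With a width of 100, the center lies at offset 50 within the extracted sequence, which matches the value "peak: 50" in the example headers above.<br />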
<br />
<br />
If you experience problems using Data Extractor, please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Data Extractor'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar extract<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>Genome (The FastA containing all chromosome sequences, may be gzipped)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Peaks (The file containing the peaks in tabular format)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>Chromosome column (The column of the peaks file containing the chromosome, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Start column (The column of the peaks file containing the start position relative to the chromosome start, default = 2)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pp</font></td><br />
<td>Peak position (How the peak is specified, range={Peak center, End of peak}, default = End of peak)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;Peak center&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">cc</font></td><br />
<td>Center column (The column of the peaks file containing the peak center relative to the start position)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;End of peak&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">e</font></td><br />
<td>End column (The column of the peaks file containing the end position relative to the chromosome start, default = 3)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>Width (The fixed width of all extracted regions, valid range = [1, 10000], default = 1000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sc</font></td><br />
<td>Statistics column (The column of the peaks file containing the peak statistic or a similar measure of confidence, default = 7)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar extract g=0data/genomes/HepG2_converted_genome.unmasked.fa.gz p=0data/HepG2/NRF1_ENCFF313RFR_train1.bed outdir=1extracted/NRF1_HepG2_train1<br />
<br />
=== Methyl SlimDimont ===<br />
<br />
'''Methyl SlimDimont''' is a tool for de-novo motif discovery from DNA sequences including extended, e.g., methylation-aware alphabets.<br />
<br />
Input sequences must be supplied in an annotated FastA format as generated by the Data Extractor tool.<br />
Input sequences may also be obtained from other sources. In this case, the annotation of each sequence needs to provide a value that reflects the confidence that this sequence is bound by the factor of interest.<br />
Such confidences may be peak statistics (e.g., number of fragments under a peak) for ChIP data or signal intensities for PBM data. In addition, you need to provide an anchor position within the sequence. <br />
In case of ChIP data, this anchor position could for instance be the peak summit.<br />
An annotated FastA file for ChIP-seq data comprising sequences of length 100 centered around the peak summit could look like:<br />
<br />
> peak: 50; signal: 515<br />
ggccatgtgtatttttttaaatttccac...<br />
> peak: 50; signal: 199<br />
GGTCCCCTGGGAGGATGGGGACGTGCTG...<br />
...<br />
<br />
where the anchor point is given as 50 for the first two sequences, and the confidence amounts to 515 and 199, respectively.<br />
The FastA comment may contain additional annotations of the format <code>key1 : value1; key2: value2;...</code>.<br />
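<br />
The annotation format can be read with a few lines of code. The following sketch splits a FastA comment of the form shown above into key/value pairs; the function name is hypothetical, and all values are kept as strings.<br />
<br />
```python
def parse_annotation(header):
    """Parse '> key1: value1; key2: value2' into a dict of strings."""
    text = header.lstrip(">").strip()
    annotation = {}
    for part in text.split(";"):
        if ":" not in part:
            continue  # ignore empty or malformed parts
        key, value = part.split(":", 1)
        annotation[key.strip()] = value.strip()
    return annotation


example_annotation = parse_annotation("> peak: 50; signal: 515")
print(example_annotation)
```
<br />
The position and confidence are then available as <code>int(example_annotation["peak"])</code> and <code>float(example_annotation["signal"])</code>.<br />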
<br />
Accordingly, you would need to set the parameter &quot;Position tag&quot; to <code>peak</code> and the parameter &quot;Value tag&quot; to <code>signal</code> for the input file (default values).<br />
The parameter Alphabet specifies the symbols of the (extended) alphabet and their complementary symbols. Default is standard DNA alphabet.<br />
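<br />
The alphabet specification pairs each symbol with its complement by position. As a sketch, the string <code>ACGTMH,TGCAHM</code> (taken from the parameter description of the tool) can be turned into a complement table and used for reverse complementing. The function names are hypothetical, and only upper-case symbols are handled.<br />
<br />
```python
def complement_table(spec):
    """Map each forward symbol to the complement at the same position."""
    forward, reverse = spec.split(",")
    if len(forward) != len(reverse):
        raise ValueError("both halves of the alphabet must have equal length")
    return dict(zip(forward, reverse))


def reverse_complement(seq, spec="ACGTMH,TGCAHM"):
    """Reverse complement of seq under the given alphabet specification."""
    table = complement_table(spec)
    return "".join(table[symbol] for symbol in reversed(seq))


print(reverse_complement("ACGM"))
```
<br />
Under this specification, M and H are complements of each other, so reverse complementing twice returns the original sequence.<br />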
<br />
For the standard deviation of the position prior, the initial motif length and the number of pre-optimization runs, we provide default values that worked well in our studies on ChIP and PBM data. <br />
However, you may want to adjust these parameters to meet your prior information.<br />
<br />
The parameter &quot;Markov order of the motif model&quot; sets the order of the inhomogeneous Markov model used for modeling the motif. If this parameter is set to <code>0</code>, you obtain a position weight matrix (PWM) model. <br />
If it is set to <code>1</code>, you obtain a weight array matrix (WAM) model. You can set the order of the motif model to at most <code>3</code>.<br />
<br />
The parameter &quot;Markov order of the background model&quot; sets the order of the homogeneous Markov model used for modeling positions not covered by a motif. <br />
If this parameter is set to <code>-1</code>, you obtain a uniform distribution, which worked well for ChIP data. For PBM data, orders of up to <code>4</code> resulted in an increased prediction performance in our case studies. The maximum allowed value is <code>5</code>.<br />
<br />
The parameter &quot;Weighting factor&quot; defines the proportion of sequences that you expect to be bound by the targeted factor with high confidence. For ChIP data, the default value of <code>0.2</code> typically works well. <br />
For PBM data, containing a large number of unspecific probes, this parameter should be set to a lower value, e.g. <code>0.01</code>.<br />
<br />
The &quot;Equivalent sample size&quot; reflects the strength of the influence of the prior on the model parameters, where higher values smooth out the parameters to a greater extent.<br />
<br />
The parameter &quot;Delete BSs from profile&quot; defines if BSs of already discovered motifs should be deleted, i.e., &quot;blanked out&quot;, from the sequence before searching for further motifs.<br />
<br />
You can also install this web-application within your local Galaxy server. Instructions can be found at the [[Dimont]] page of Jstacs. <br />
There you can also download a command line version of Dimont.<br />
<br />
If you experience problems using Methyl SlimDimont, please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Methyl SlimDimont'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar slimdimont<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>Alphabet (Characters of the alphabet as a string of unseparated characters, first listing the symbols in forward orientation and then their complement in the same order. For instance, a methylation-aware alphabet would be specified as ACGTMH,TGCAHM and a standard DNA alphabet as ACGT,TGCA, default = ACGTMH,TGCAHM)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>Input file (The file name of the file containing the input sequences in annotated FastA format as generated by the Data Extractor tool)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Background sample (Background sample containing negative examples, may be di-nucleotide shuffled input sequences, range={background file, shuffled input}, default = shuffled input)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;background file&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bf</font></td><br />
<td>Background file (The file name of the file containing background sequences in annotated FastA format., OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;shuffled input&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Position tag (The tag for the position information in the FastA-annotation of the input file, default = peak)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>Value tag (The tag for the value information in the FastA-annotation of the input file, default = signal)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Standard deviation (The standard deviation of the position distribution centered at the position specified by the position tag, valid range = [1.0, 10000.0], default = 75.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>Weighting factor (The value for weighting the data, between 0 and 1, valid range = [0.0, 1.0], default = 0.2)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Starts</font></td><br />
<td>Starts (The number of pre-optimization runs., valid range = [1, 100], default = 20)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">imw</font></td><br />
<td>Initial motif width (The motif width that is used initially, may be adjusted during optimization., valid range = [1, 50], default = 20)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Model type (The type of the motif model; a PWM model corresponds to a Markov model of order 0., range={LSlim model, Markov model}, default = LSlim model)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;LSlim model&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">md</font></td><br />
<td>Maximum distance (The maximum distance considered in the LSlim model, valid range = [1, 2147483647], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;Markov model&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">o</font></td><br />
<td>Order (The order of the Markov model, valid range = [0, 5], default = 0)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">moobm</font></td><br />
<td>Markov order of background model (The Markov order of the model for the background sequences, -1 defines a uniform distribution., valid range = [-1, 5], default = -1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">e</font></td><br />
<td>Equivalent sample size (Reflects the strength of the prior on the model parameters., valid range = [0.0, Infinity], default = 4.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>Delete BSs from profile (A switch for deleting binding site positions of discovered motifs from the profile before searching for further motifs., default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">afs</font></td><br />
<td>Adjust for shifts (Adjust for shifts of the motif., default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>INT</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar slimdimont i=1extracted/NRF1_HepG2_train1/Extracted_sequences.fasta m="Markov model" outdir=2train threads=8<br />
<br />
=== Sequence Scoring ===<br />
<br />
'''Sequence Scoring''' scans a set of input sequences (e.g., sequences under ChIP-seq peaks) for a given motif model (provided as XML as output by &quot;Methyl SlimDimont&quot;) and provides per-sequence information on i) the start position and strand of the best motif match, ii) the corresponding maximum score, iii) the log-sum occupancy score, iv) the matching sequence, and v) the ID (FastA header) of the sequence.<br />
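<br />
The reported quantities can be illustrated with a toy scan. The sketch below uses a simple PWM over the standard DNA alphabet instead of the models learned by Methyl SlimDimont, reports 0-based start positions, and assumes that the log-sum occupancy score is the logarithm of the summed exponentials of all match scores on both strands. All names and the toy PWM are hypothetical.<br />
<br />
```python
import math

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def revcomp(seq):
    return "".join(COMPLEMENT[c] for c in reversed(seq))


def scan(pwm, seq):
    """Scan both strands; pwm is a list of dicts mapping symbol -> log-score."""
    width = len(pwm)
    if len(seq) < width:
        raise ValueError("sequence shorter than motif")
    scores, best = [], None
    for start in range(len(seq) - width + 1):
        site = seq[start:start + width]
        for strand, oriented in (("+", site), ("-", revcomp(site))):
            score = sum(col[c] for col, c in zip(pwm, oriented))
            scores.append(score)
            if best is None or score > best[0]:
                best = (score, start, strand, oriented)
    top = best[0]
    log_sum = top + math.log(sum(math.exp(s - top) for s in scores))
    return {"start": best[1], "strand": best[2], "max_score": top,
            "log_sum": log_sum, "match": best[3]}


def toy_column(preferred):
    """Log-probabilities with 0.7 for the preferred symbol and 0.1 otherwise."""
    return {c: math.log(0.7 if c == preferred else 0.1) for c in "ACGT"}


toy_pwm = [toy_column(c) for c in "ACG"]
result = scan(toy_pwm, "TTACGAA")
print(result)
```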
<br />
The purpose of this tool mainly is to determine per-sequence scores for classification, for instance, distinguishing bound from unbound sequences.<br />
<br />
If you experience problems using Sequence Scoring, please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Sequence Scoring'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar score<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>Input sequences (Input sequences in FastA format)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Model (Model XML as output by Methyl SlimDimont)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar score i=1extracted/NRF1_GM12878_test1/Extracted_sequences.fasta m=2train/Motif_1/SlimDimont_1.xml outdir=3score/NRF1_GM12878<br />
<br />
=== Evaluate Scoring ===<br />
<br />
'''Evaluate Scoring''' computes the area under the ROC curve and the area under the precision-recall curve based on the scores of a positive and a negative set of sequences. Optionally, the curves may also be drawn.<br />
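
As a rough illustration of the first measure, the area under the ROC curve equals the probability that a randomly chosen positive sequence scores higher than a randomly chosen negative one, with ties counted as one half. The sketch below uses this pairwise definition; it is not MeDeMo's implementation, the function name is hypothetical, and the precision-recall curve is left out.

```python
def auc_roc(positive_scores, negative_scores):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))

# Hypothetical scores of two positive and two negative sequences.
print(auc_roc([0.9, 0.4], [0.5, 0.1]))
```

The quadratic loop is only meant for small examples; for large score files, a rank-based computation would be the more usual choice.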
<br />
''Evaluate Scoring'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar eval<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Positives (Output of "Sequence Scoring" for positive test sequences.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Negatives (Output of "Sequence Scoring" for negative test sequences.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>Curves (Also compute and draw curves, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>Use sum-occupancy (Use log-sum occupancy score instead of maximum, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar eval p=3score/NRF1_GM12878/Predictions.tsv n=3score/negatives/Predictions.tsv c=true outdir=4eval<br />
<br />
=== Motif scores ===<br />
<br />
'''Motif scores''' computes features based on motif scores of a given motif model scanning sub-sequences along the genome. Motif scores are aggregated in bins of the specified width as maximum score and log of the average exponential score (i.e., average log-likelihood in case of statistical models). The motif model may be provided as PWMs in HOCOMOCO or PFMs in Jaspar format, or as Dimont motif models in XML format. For more complex motif models like Slim models, the current implementation uses several indexes to speed up the scanning process. However, computation of these indexes is rather memory-consuming and often not reasonable for simple PWM models. Hence, a low-memory variant of the tool is available, which is typically only slightly slower for PWM models but substantially slower for Slim models. Output is provided as a gzipped file ''Motif_scores.tsv.gz'' containing columns chromosome, start position, maximum and average score. This output file together with a protocol of the tool run is saved to the specified output directory.<br />
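
The bin-wise aggregation might be sketched as follows, assuming one score per start position and non-overlapping bins. How the tool treats matches that span bin boundaries is not specified here, and the function name is hypothetical.

```python
import math

def aggregate_bins(scores, bin_width):
    """Return (bin start, maximum score, log of mean exp-score) for each bin."""
    rows = []
    for start in range(0, len(scores), bin_width):
        chunk = scores[start:start + bin_width]
        top = max(chunk)
        # log(mean(exp(s))) computed stably by shifting with the bin maximum.
        log_mean_exp = top + math.log(sum(math.exp(s - top) for s in chunk) / len(chunk))
        rows.append((start, top, log_mean_exp))
    return rows

# Hypothetical per-position scores along a short stretch of sequence.
for row in aggregate_bins([0.0, 0.0, 1.0, 3.0, 2.0], bin_width=2):
    print(row)
```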
<br />
''Motif scores'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar motif<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Motif model (The motif model in Dimont, HOCOMOCO, or Jaspar format, range={Dimont, HOCOMOCO, Jaspar}, default = Dimont)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;Dimont&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>Dimont motif (Dimont motif model description)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;HOCOMOCO&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">h</font></td><br />
<td>HOCOMOCO PWM (PWM from the HOCOMOCO database)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;Jaspar&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">j</font></td><br />
<td>Jaspar PFM (PFM in Jaspar format)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>Genome (Genome as FastA file)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>FAI of genome (FastA index file of the genome)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bin width (The width of the genomic bins considered)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">l</font></td><br />
<td>Low-memory mode (Use slower mode with a smaller memory footprint, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>INT</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar motif d=2train/Motif_1/SlimDimont_1.xml g=0data/genomes/HepG2_converted_genome.unmasked.fa.gz f=0data/genomes/HepG2_converted_genome.unmasked.fa.fai outdir=7scores b=50<br />
<br />
=== Quick Prediction Tool ===<br />
<br />
'''Quick Prediction Tool''' predicts binding sites of a transcription factor based on a motif model and is also suited for genome-wide predictions. The motif model is provided as the XML output of (Slim) Dimont. <br />
<br />
The tool outputs a list of predictions including, for every prediction, the ID of the sequence (e.g., chromosome) containing the binding site, position and strand of the matching sub-sequence, its score according to the model, the sub-sequence itself (in strand orientation according to the model), and a p-value from a normal distribution fitted to the score distribution of the provided negative examples or a sub-sample of the input data (parameter &quot;Background sample&quot;).<br />
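
The p-value computation can be sketched under simple assumptions: the normal distribution is fitted by the mean and the sample standard deviation of the background scores, and the p-value is the upper-tail probability of the observed score. The tool's actual fitting procedure may differ, and the function name is hypothetical.

```python
import math
import statistics

def normal_p_value(score, background_scores):
    """Upper-tail p-value of `score` under a normal fit to the background."""
    mean = statistics.mean(background_scores)
    sd = statistics.stdev(background_scores)  # sample standard deviation (assumption)
    z = (score - mean) / sd
    # Survival function of the standard normal distribution.
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# Hypothetical background scores with mean 0 and standard deviation 1.
background = [-1.0, 0.0, 1.0]
print(normal_p_value(3.0, background))
```

A significance level such as <code>sl=1e-5</code> then corresponds to keeping only sites whose p-value falls below that level.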
<br />
If you experience problems using Quick Prediction Tool, please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
''Quick Prediction Tool'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar quickpred<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>Dimont model (The model returned by Dimont (in XML format))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Sequences (The sequences (e.g., a genome) to scan for binding sites)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Background sample (The sequences for determining the prediction threshold. Either a sub-sample of the input sequences or a dedicated background data set., range={sub-sample, background sequences}, default = sub-sample)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;sub-sample&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;background sequences&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bs</font></td><br />
<td>Background sequences (The sequences (e.g., a genome) for determining the prediction threshold)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>Threshold specification (The way of defining the prediction threshold. Either by explicitly defining a significance level or by specifying the number of expected sites, range={significance level, number of sites}, default = significance level)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;significance level&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sl</font></td><br />
<td>Significance level (The significance level for determining the prediction threshold, valid range = [0.0, 1.0E-4], default = 1.0E-6)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;number of sites&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Number of sites (The number of expected binding sites for determining the prediction threshold, valid range = [1, 1000000], default = 10000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar quickpred d=2train/Motif_1/SlimDimont_1.xml s=0data/genomes/HepG2_converted_genome.unmasked.fa.gz sl=1e-5 outdir=6predict<br />
<br />
=== Methylation Sensitivity ===<br />
<br />
'''Methylation Sensitivity''' determines average methylation sensitivity profiles for CpG dinucleotides converted to MpG, CpH, and MpH. As input, it needs a model XML as generated by &quot;Methyl SlimDimont&quot;, and a prediction file as output from the corresponding training run.<br />
<br />
Optionally, Methylation Sensitivity also generates per-sequence methylation sensitivity profiles for the MpH context.<br />
<br />
If you experience problems using Methylation Sensitivity, please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Methylation Sensitivity'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar msens<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Model (The XML file containing the model)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Predictions (The file containing the predictions from the training run)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Sequence column (The column of the predictions file containing the sequences in adjusted strand orientation, default = 8)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>Verbose (Output MpH sensitivity profile for every input sequence, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar msens m=2train/Motif_1/SlimDimont_1.xml p=2train/Motif_1/Predictions_for_motif_1.tsv outdir=5msens</div>Grauhttps://www.jstacs.de/index.php?title=MeDeMo&diff=1089MeDeMo2020-03-23T23:00:08Z<p>Grau: /* Download */</p>
<hr />
<div>Accurate models describing the binding specificity of transcription factors (TFs) are essential for a better understanding of transcriptional regulation. Aside from chromatin accessibility and sequence specificity, several studies suggested that DNA methylation influences TF binding in both activating and repressive ways. However, currently available TF motif inference and TF binding site prediction approaches do not adequately incorporate DNA methylation.<br />
<br />
We present MeDeMo (Methylation and Dependencies in Motifs), a novel framework for TF motif discovery and TF binding site (TFBS) prediction that incorporates DNA methylation by extending [[Slim]] models. We show that dependencies between nucleotides, as captured by MeDeMo, are essential to represent DNA methylation and that MeDeMo achieves superior prediction performance compared to related approaches. The inferred TF motifs are highly interpretable and can provide new insights into the relation between DNA methylation and TF binding.<br />
<br />
<br />
== Download ==<br />
<br />
MeDeMo is available as<br />
* [http://www.jstacs.de/downloads/MeDeMo-1.0.jar command line interface] version and<br />
* graphical user interface version: <br />
** [http://www.jstacs.de/downloads/MeDeMoGUI-1.0.jar JAR file] (requires installed Java >= 1.8 and JavaFX)<br />
** [http://www.jstacs.de/downloads/MeDeMo-1.0.zip Windows ZIP]: within the ZIP archive, you find the JAR and a custom Java runtime environment; to run MeDeMo, just double-click run.bat<br />
** [http://www.jstacs.de/downloads/MeDeMo-1.0.app.zip Mac App]: within the ZIP archive, you find a Mac app, which you can copy anywhere you like (e.g., your /Applications folder) and run by double-clicking it; depending on your security settings, it might be necessary to use Right-click -> Open when opening MeDeMo for the first time and to explicitly allow it to run; it might also be necessary to disable "App Nap" (Right-click -> Get Info -> Prevent App Nap)<br />
<br />
Source code is available from the [https://github.com/Jstacs/Jstacs Jstacs github page] in package <code>projects.methyl</code>.<br />
<br />
Example data (also used for the code examples below) are [http://www.jstacs.de/downloads/MeDeMo-examples.zip available for download].<br />
<br />
== Tools ==<br />
<br />
The description of tools and tool parameters refers to the command line version, but the same parameters are also present in the GUI version. Additional help may be requested in the GUI version by clicking on the "?" button.<br />
<br />
<br />
=== Data Extractor ===<br />
<br />
'''Data Extractor''' prepares an annotated FastA file as required by Dimont from a genome (in FastA format, including methylated variants) and a tabular file (e.g., BED, GTF, narrowPeak,...). The regions specified in the tabular file are used to determine the center of the extracted sequences. All extracted sequences have the same length as specified by parameter &quot;Width&quot;.<br />
<br />
In case of ChIP data, the center position could for instance be the peak summit.<br />
An annotated FastA file for ChIP-seq data comprising sequences of length 100 centered around the peak summit might look like:<br />
<br />
> peak: 50; signal: 515<br />
ggccatgtgtatttttttaaatttccac...<br />
> peak: 50; signal: 199<br />
GGTCCCCTGGGAGGATGGGGACGTGCTG...<br />
...<br />
<br />
where the center is given as 50 for the first two sequences, and the confidence amounts to 515 and 199, respectively.<br />
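
For readers who prepare or check such files themselves, the annotation in the FastA header can be read with a few lines of code. This is a minimal sketch, not part of MeDeMo; the function name is hypothetical.

```python
def parse_annotation(header):
    """Parse a header such as '> peak: 50; signal: 515' into a dictionary."""
    fields = {}
    for part in header.lstrip(">").split(";"):
        if ":" in part:
            key, value = part.split(":", 1)
            fields[key.strip()] = value.strip()
    return fields

annotation = parse_annotation("> peak: 50; signal: 515")
center = int(annotation["peak"])          # center position within the sequence
confidence = float(annotation["signal"])  # peak statistic used as confidence
print(center, confidence)
```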
<br />
<br />
If you experience problems using Data Extractor, please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Data Extractor'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar extract<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>Genome (The FastA containing all chromosome sequences, may be gzipped)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Peaks (The file containing the peaks in tabular format)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>Chromosome column (The column of the peaks file containing the chromosome, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Start column (The column of the peaks file containing the start position relative to the chromosome start, default = 2)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pp</font></td><br />
<td>Peak position (How the peak is specified, range={Peak center, End of peak}, default = End of peak)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;Peak center&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">cc</font></td><br />
<td>Center column (The column of the peaks file containing the peak center relative to the start position)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;End of peak&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">e</font></td><br />
<td>End column (The column of the peaks file containing the end position relative to the chromosome start, default = 3)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>Width (The fixed width of all extracted regions, valid range = [1, 10000], default = 1000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sc</font></td><br />
<td>Statistics column (The column of the peaks file containing the peak statistic or a similar measure of confidence, default = 7)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar extract g=0data/genomes/HepG2_converted_genome.unmasked.fa.gz p=0data/HepG2/NRF1_ENCFF313RFR_train1.bed outdir=1extracted/NRF1_HepG2_train1<br />
<br />
=== Methyl SlimDimont ===<br />
<br />
'''Methyl SlimDimont''' is a tool for de-novo motif discovery from DNA sequences including extended, e.g., methylation-aware alphabets.<br />
<br />
Input sequences must be supplied in an annotated FastA format as generated by the Data Extractor tool.<br />
Input sequences may also be obtained from other sources. In this case, the annotation of each sequence needs to provide a value that reflects the confidence that this sequence is bound by the factor of interest.<br />
Such confidences may be peak statistics (e.g., number of fragments under a peak) for ChIP data or signal intensities for PBM data. In addition, you need to provide an anchor position within the sequence. <br />
In case of ChIP data, this anchor position could for instance be the peak summit.<br />
An annotated FastA file for ChIP-seq data comprising sequences of length 100 centered around the peak summit could look like:<br />
<br />
> peak: 50; signal: 515<br />
ggccatgtgtatttttttaaatttccac...<br />
> peak: 50; signal: 199<br />
GGTCCCCTGGGAGGATGGGGACGTGCTG...<br />
...<br />
<br />
where the anchor point is given as 50 for the first two sequences, and the confidence amounts to 515 and 199, respectively.<br />
The FastA comment may contain additional annotations of the format <code>key1 : value1; key2: value2;...</code>.<br />
<br />
Accordingly, you would need to set the parameter &quot;Position tag&quot; to <code>peak</code> and the parameter &quot;Value tag&quot; to <code>signal</code> for the input file (default values).<br />
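The parameter &quot;Alphabet&quot; specifies the symbols of the (extended) alphabet and their complementary symbols. The default is the methylation-aware alphabet <code>ACGTMH,TGCAHM</code>; a standard DNA alphabet is specified as <code>ACGT,TGCA</code>.<br />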
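
The alphabet string lists the symbols first and their complements second, in the same order. A minimal sketch of how such a specification defines reverse complements is given below; the function names are hypothetical, and the sketch assumes upper-case sequences.

```python
def complement_table(alphabet_spec):
    """Map each symbol to its complement, e.g. for 'ACGTMH,TGCAHM'."""
    forward, complement = alphabet_spec.split(",")
    if len(forward) != len(complement):
        raise ValueError("symbols and complements must have the same length")
    return dict(zip(forward, complement))

def reverse_complement(sequence, alphabet_spec="ACGTMH,TGCAHM"):
    """Reverse the sequence and replace every symbol by its complement."""
    table = complement_table(alphabet_spec)
    return "".join(table[symbol] for symbol in reversed(sequence))

# In the methylation-aware alphabet, M pairs with H on the opposite strand.
print(reverse_complement("MG"))
```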
<br />
For the standard deviation of the position prior, the initial motif length and the number of pre-optimization runs, we provide default values that worked well in our studies on ChIP and PBM data. <br />
However, you may want to adjust these parameters to meet your prior information.<br />
<br />
The parameter &quot;Markov order of the motif model&quot; sets the order of the inhomogeneous Markov model used for modeling the motif. If this parameter is set to <code>0</code>, you obtain a position weight matrix (PWM) model. <br />
If it is set to <code>1</code>, you obtain a weight array model (WAM). You can set the order of the motif model to at most <code>3</code>.<br />
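
To make the order-0 case concrete: a PWM model treats the positions of a site as independent, so the log-likelihood of a site is the sum of per-position log-probabilities. The following sketch uses a hypothetical two-column PWM and is not taken from the Jstacs code.

```python
import math

def pwm_log_score(site, pwm):
    """Sum of per-position log-probabilities under an order-0 (PWM) model."""
    return sum(math.log(column[symbol]) for symbol, column in zip(site, pwm))

# Hypothetical PWM with two positions; each column sums to 1.
pwm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
print(pwm_log_score("AG", pwm), pwm_log_score("CT", pwm))
```

With order 1 or higher, the probability at each position would additionally be conditioned on the preceding symbol(s), which is what allows such models to capture dependencies between neighbouring positions.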
<br />
The parameter &quot;Markov order of the background model&quot; sets the order of the homogeneous Markov model used for modeling positions not covered by a motif. <br />
If this parameter is set to <code>-1</code>, you obtain a uniform distribution, which worked well for ChIP data. For PBM data, orders of up to <code>4</code> resulted in an increased prediction performance in our case studies. The maximum allowed value is <code>5</code>.<br />
<br />
The parameter &quot;Weighting factor&quot; defines the proportion of sequences that you expect to be bound by the targeted factor with high confidence. For ChIP data, the default value of <code>0.2</code> typically works well. <br />
For PBM data, containing a large number of unspecific probes, this parameter should be set to a lower value, e.g. <code>0.01</code>.<br />
<br />
The &quot;Equivalent sample size&quot; reflects the strength of the influence of the prior on the model parameters, where higher values smooth out the parameters to a greater extent.<br />
<br />
The parameter &quot;Delete BSs from profile&quot; defines if BSs of already discovered motifs should be deleted, i.e., &quot;blanked out&quot;, from the sequence before searching for further motifs.<br />
<br />
You can also install this web application within your local Galaxy server. Instructions can be found at the Dimont page of Jstacs. <br />
There you can also download a command line version of Dimont.<br />
<br />
If you experience problems using Methyl SlimDimont, please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Methyl SlimDimont'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar slimdimont<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>Alphabet (Characters of the alphabet as a string of unseparated characters, first listing the symbols in forward orientation and then their complement in the same order. For instance, a methylation-aware alphabet would be specified as ACGTMH,TGCAHM and a standard DNA alphabet as ACGT,TGCA, default = ACGTMH,TGCAHM)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>Input file (The file name of the file containing the input sequences in annotated FastA format as generated by the Data Extractor tool)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Background sample (Background sample containing negative examples, may be di-nucleotide shuffled input sequences, range={background file, shuffled input}, default = shuffled input)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;background file&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bf</font></td><br />
<td>Background file (The file name of the file containing background sequences in annotated FastA format., OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;shuffled input&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Position tag (The tag for the position information in the FastA-annotation of the input file, default = peak)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>Value tag (The tag for the value information in the FastA-annotation of the input file, default = signal)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Standard deviation (The standard deviation of the position distribution centered at the position specified by the position tag, valid range = [1.0, 10000.0], default = 75.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>Weighting factor (The value for weighting the data, between 0 and 1, valid range = [0.0, 1.0], default = 0.2)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Starts</font></td><br />
<td>Starts (The number of pre-optimization runs., valid range = [1, 100], default = 20)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">imw</font></td><br />
<td>Initial motif width (The motif width that is used initially, may be adjusted during optimization., valid range = [1, 50], default = 20)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Model type (The type of the motif model; a PWM model corresponds to a Markov model of order 0., range={LSlim model, Markov model}, default = LSlim model)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;LSlim model&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">md</font></td><br />
<td>Maximum distance (The maximum distance considered in the LSlim model, valid range = [1, 2147483647], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;Markov model&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">o</font></td><br />
<td>Order (The order of the Markov model, valid range = [0, 5], default = 0)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">moobm</font></td><br />
<td>Markov order of background model (The Markov order of the model for the background sequence and the background sequence, -1 defines uniform distribution., valid range = [-1, 5], default = -1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">e</font></td><br />
<td>Equivalent sample size (Reflects the strength of the prior on the model parameters., valid range = [0.0, Infinity], default = 4.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>Delete BSs from profile (A switch for deleting binding site positions of discovered motifs from the profile before searching for further motifs., default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">afs</font></td><br />
<td>Adjust for shifts (Adjust for shifts of the motif., default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>INT</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar slimdimont i=1extracted/NRF1_HepG2_train1/Extracted_sequences.fasta m="Markov model" outdir=2train threads=8<br />
<br />
=== Sequence Scoring ===<br />
<br />
'''Sequence Scoring''' scans a set of input sequences (e.g., sequences under ChIP-seq peaks) for a given motif model (provided as XML as output by &quot;Methyl SlimDimont&quot;) and provides, for each sequence, i) the start position and strand of the best motif match, ii) the corresponding maximum score, iii) the log-sum occupancy score, iv) the matching sequence, and v) the ID (FastA header) of the sequence.<br />
<br />
The main purpose of this tool is to determine per-sequence scores for classification, for instance, for distinguishing bound from unbound sequences.<br />
<br />
If you experience problems using Sequence Scoring, please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Sequence Scoring'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar score<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>Input sequences (Input sequences in FastA format)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Model (Model XML as output by Methyl SlimDimont)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar score i=1extracted/NRF1_GM12878_test1/Extracted_sequences.fasta m=2train/Motif_1/SlimDimont_1.xml outdir=3score/NRF1_GM12878<br />
<br />
=== Evaluate Scoring ===<br />
<br />
'''Evaluate Scoring''' computes the area under the ROC curve and under the precision-recall curve based on the scoring of a positive and a negative set of sequences. Optionally, the curves may also be drawn.<br />
<br />
''Evaluate Scoring'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar eval<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Positives (Output of "Sequence Scoring" for positive test sequences.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Negatives (Output of "Sequence Scoring" for negative test sequences.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>Curves (Also compute and draw curves, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>Use sum-occupancy (Use log-sum occupancy score instead of maximum, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar eval p=3score/NRF1_GM12878/Predictions.tsv n=3score/negatives/Predictions.tsv c=true outdir=4eval<br />
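<br />
To illustrate what the area under the ROC curve measures, the following is a minimal sketch and not the implementation used by Evaluate Scoring. It relies on the fact that this area equals the probability that a randomly chosen positive sequence scores higher than a randomly chosen negative one, with ties counted as one half. The function name <code>auc_roc</code> and the input layout (one score per line) are assumptions made for illustration; extracting the score column from the &quot;Sequence Scoring&quot; output is left out because its column layout is not specified here.<br />
<br />
```shell
# auc_roc POS_FILE NEG_FILE
# Hypothetical helper: each file holds one numeric score per line.
# Prints the fraction of (positive, negative) pairs in which the positive
# scores higher; ties count as 0.5. Quadratic run time, so small inputs only.
auc_roc() {
  awk '
    NR == FNR { pos[++np] = $1 + 0; next }   # first file: positive scores
              { neg[++nn] = $1 + 0 }         # second file: negative scores
    END {
      if (np == 0 || nn == 0) { print "NA"; exit }
      wins = 0
      for (i = 1; i <= np; i++)
        for (j = 1; j <= nn; j++) {
          if (pos[i] > neg[j]) wins += 1
          else if (pos[i] == neg[j]) wins += 0.5
        }
      printf "%.4f\n", wins / (np * nn)
    }' "$1" "$2"
}

# Toy data: 8 of the 9 pairs are ranked correctly.
tmpdir=$(mktemp -d)
printf '%s\n' 0.9 0.8 0.4 > "$tmpdir/pos.txt"
printf '%s\n' 0.7 0.3 0.2 > "$tmpdir/neg.txt"
auc_roc "$tmpdir/pos.txt" "$tmpdir/neg.txt"   # prints 0.8889
```
<br />
A value of 0.5 corresponds to random ranking, and 1.0 to a perfect separation of positives and negatives.<br />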
<br />
=== Motif scores ===<br />
<br />
'''Motif scores''' computes features based on motif scores of a given motif model scanning sub-sequences along the genome. Motif scores are aggregated in bins of the specified width as maximum score and log of the average exponential score (i.e., average log-likelihood in case of statistical models). The motif model may be provided as PWMs in HOCOMOCO or PFMs in Jaspar format, or as Dimont motif models in XML format. For more complex motif models like Slim models, the current implementation uses several indexes to speed-up the scanning process. However, computation of these indexes is rather memory-consuming and often not reasonable for simple PWM models. Hence, a low-memory variant of the tool is available, which is typically only slightly slower for PWM models but substantially slower for Slim models. Output is provided as a gzipped file ''Motif_scores.tsv.gz'' containing columns chromosome, start position, maximum and average score. This output file together with a protocol of the tool run is saved to the specified output directory.<br />
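<br />
The two aggregates per bin can be illustrated with a small sketch. It only reproduces the two formulas named above (maximum score, and log of the average exponential score) on a plain list of per-position scores; it does not reproduce how Motif scores scans the genome, handles strands, or treats windows at bin borders. The function name <code>bin_scores</code> and the input layout (one score per line, first line is position 0) are assumptions for illustration.<br />
<br />
```shell
# bin_scores WIDTH < scores.txt
# Hypothetical helper: reads one score per position and prints, per bin,
# the bin start, the maximum score, and log(mean(exp(score))).
# The maximum is subtracted before exponentiating for numerical stability.
bin_scores() {
  awk -v w="$1" '
    function emit(   i, sum) {
      if (n == 0) return
      sum = 0
      for (i = 1; i <= n; i++) sum += exp(s[i] - max)
      printf "%d\t%.4f\t%.4f\n", start, max, max + log(sum / n)
      n = 0
    }
    {
      if (n == 0) { start = NR - 1; max = $1 + 0 }
      s[++n] = $1 + 0
      if (s[n] > max) max = s[n]
      if (n == w) emit()
    }
    END { emit() }   # last bin may be shorter than WIDTH
  '
}

# Bin width 2: the first bin holds scores log(1) and log(3),
# so the average exponential score is 2 and its log is about 0.6931.
printf '%s\n' 0 1.0986123 -2 -2 -2 | bin_scores 2
```
<br />
The log of the average exponential score is never larger than the maximum score of the bin, and both coincide when all scores in a bin are equal.<br />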
<br />
''Motif scores'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar motif<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Motif model (The motif model in Dimont, HOCOMOCO, or Jaspar format, range={Dimont, HOCOMOCO, Jaspar}, default = Dimont)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;Dimont&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>Dimont motif (Dimont motif model description)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;HOCOMOCO&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">h</font></td><br />
<td>HOCOMOCO PWM (PWM from the HOCOMOCO database)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;Jaspar&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">j</font></td><br />
<td>Jaspar PFM (PFM in Jaspar format)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>Genome (Genome as FastA file)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>FAI of genome (FastA index file of the genome)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bin width (The width of the genomic bins considered)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">l</font></td><br />
<td>Low-memory mode (Use slower mode with a smaller memory footprint, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>INT</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar motif d=2train/Motif_1/SlimDimont_1.xml g=0data/genomes/HepG2_converted_genome.unmasked.fa.gz f=0data/genomes/HepG2_converted_genome.unmasked.fa.fai outdir=7scores b=50<br />
<br />
=== Quick Prediction Tool ===<br />
<br />
'''Quick Prediction Tool''' predicts binding sites of a transcription factor based on a motif model and is also suited for genome-wide predictions. The motif model is provided as the XML output of (Slim) Dimont. <br />
<br />
The tool outputs a list of predictions including, for every prediction, the ID of the sequence (e.g., chromosome) containing the binding site, position and strand of the matching sub-sequence, its score according to the model, the sub-sequence itself (in strand orientation according to the model), and a p-value from a normal distribution fitted to the score distribution of the provided negative examples or a sub-sample of the input data (parameter &quot;Background sample&quot;).<br />
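<br />
One plausible reading of this p-value, stated here as an assumption rather than as the tool's documented formula, is an upper-tail probability under the fitted normal distribution. The symbols are introduced for illustration only: <math>\mu</math> and <math>\sigma</math> denote mean and standard deviation of the scores in the background sample, <math>\Phi</math> is the standard normal distribution function, <math>s</math> is the score of a candidate site, and <math>\alpha</math> is the significance level.<br />
<br />
```latex
p(s) = 1 - \Phi\!\left(\frac{s - \mu}{\sigma}\right),
\qquad
t_\alpha = \mu + \sigma \, \Phi^{-1}(1 - \alpha)
```
<br />
Under this reading, a site is reported if <math>s \ge t_\alpha</math>, which is equivalent to <math>p(s) \le \alpha</math>.<br />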
<br />
If you experience problems using Quick Prediction Tool, please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
''Quick Prediction Tool'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar quickpred<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>Dimont model (The model returned by Dimont (in XML format))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Sequences (The sequences (e.g., a genome) to scan for binding sites)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Background sample (The sequences for determining the prediction threshold. Either a sub-sample of the input sequences or a dedicated background data set., range={sub-sample, background sequences}, default = sub-sample)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;sub-sample&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;background sequences&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bs</font></td><br />
<td>Background sequences (The sequences (e.g., a genome) for determining the prediction threshold)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>Threshold specification (The way of defining the prediction threshold. Either by explicitly defining a significance level or by specifying the number of expected sites, range={significance level, number of sites}, default = significance level)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;significance level&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sl</font></td><br />
<td>Significance level (The significance level for determining the prediction threshold, valid range = [0.0, 1.0E-4], default = 1.0E-6)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;number of sites&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Number of sites (The number of expected binding sites for determining the prediction threshold, valid range = [1, 1000000], default = 10000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar quickpred d=2train/Motif_1/SlimDimont_1.xml s=0data/genomes/HepG2_converted_genome.unmasked.fa.gz sl=1e-5 outdir=6predict<br />
<br />
=== Methylation Sensitivity ===<br />
<br />
'''Methylation Sensitivity''' determines average methylation sensitivity profiles for CpG dinucleotides converted to MpG, CpH, and MpH. As input, it needs a model XML as generated by &quot;Methyl SlimDimont&quot;, and a prediction file as output from the corresponding training run.<br />
<br />
Optionally, Methylation Sensitivity also generates per-sequence methylation sensitivity profiles for the MpH context.<br />
<br />
If you experience problems using Methylation Sensitivity, please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Methylation Sensitivity'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar msens<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Model (The XML file containing the model)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Predictions (The file containing the predictions from the training run)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>Verbose (Output MpH sensitivity profile for every input sequence, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar msens m=2train/Motif_1/SlimDimont_1.xml p=2train/Motif_1/Predictions_for_motif_1.tsv outdir=5msens</div>Grauhttps://www.jstacs.de/index.php?title=MeDeMo&diff=1088MeDeMo2020-03-21T13:16:22Z<p>Grau: /* Download */</p>
<hr />
<div>Accurate models describing the binding specificity of transcription factors (TFs) are essential for a better understanding of transcriptional regulation. Aside from chromatin accessibility and sequence specificity, several studies suggested that DNA methylation influences TF binding in both activating and repressive ways. However, currently available TF motif inference and TF binding site prediction approaches do not adequately incorporate DNA methylation.<br />
<br />
We present MeDeMo (Methylation and Dependencies in Motifs), a novel framework for TF motif discovery and TFBS prediction that incorporates DNA methylation by extending [[Slim]] models. We show that dependencies between nucleotides, captured by MeDeMo, are essential to represent DNA methylation and that MeDeMo achieves superior prediction performance compared to related approaches. The inferred TF motifs are highly interpretable and can provide new insights into the relation between DNA methylation and TF binding.<br />
<br />
<br />
== Download ==<br />
<br />
MeDeMo is available as<br />
* [http://www.jstacs.de/downloads/MeDeMo-1.0.jar command line interface] version and<br />
* graphical user interface version: <br />
** [http://www.jstacs.de/downloads/MeDeMoGUI-1.0.jar JAR file] (requires installed Java >= 1.8 and JavaFX)<br />
** [http://www.jstacs.de/downloads/MeDeMo-1.0.zip Windows ZIP]: within the ZIP archive, you find the JAR and a custom Java runtime environment; to run MeDeMo, just double-click run.bat<br />
** [http://www.jstacs.de/downloads/MeDeMo-1.0.app.zip Mac App]: within the ZIP archive, you find a Mac-App, which you can copy anywhere you like (e.g., your /Applications folder) and run the app by double-clicking it; depending on your security settings, it might be necessary to use Right-click -> Open when opening MeDeMo for the first time and explicitly allow it to run; it might also be necessary to disable "App Nap" (Right-click -> GetInfo -> Prevent App Nap)<br />
<br />
Source code is available from the [https://github.com/Jstacs/Jstacs Jstacs github page] in package <code>projects.methyl</code>.<br />
<br />
Example data (also used for the code examples below) is [http://www.jstacs.de/downloads/MeDeMo-examples.tgz available for download].<br />
<br />
== Tools ==<br />
<br />
The description of tools and tool parameters refers to the command line version, but the same parameters are also present in the GUI version. Additional help may be requested in the GUI version by clicking on the "?" button.<br />
<br />
<br />
=== Data Extractor ===<br />
<br />
'''Data Extractor''' prepares an annotated FastA file as required by Dimont from a genome (in FastA format, including methylated variants) and a tabular file (e.g., BED, GTF, narrowPeak,...). The regions specified in the tabular file are used to determine the center of the extracted sequences. All extracted sequences have the same length as specified by parameter &quot;Width&quot;.<br />
<br />
In case of ChIP data, the center position could for instance be the peak summit.<br />
An annotated FastA file for ChIP-seq data comprising sequences of length 100 centered around the peak summit might look like:<br />
<br />
> peak: 50; signal: 515<br />
ggccatgtgtatttttttaaatttccac...<br />
> peak: 50; signal: 199<br />
GGTCCCCTGGGAGGATGGGGACGTGCTG...<br />
...<br />
<br />
where the center is given as 50 for the first two sequences, and the confidence amounts to 515 and 199, respectively.<br />
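<br />
How a fixed-width window and the matching FastA annotation could be derived from a peak is sketched below. This is an illustration under stated assumptions, not the code of Data Extractor: the center is taken as the midpoint of start and end, coordinates are used as given, peaks whose window would start before position 0 are skipped, and the anchor is written as half the width (as in the example above, where sequences of length 100 carry <code>peak: 50</code>). The function name <code>peak_windows</code> is hypothetical; the column numbers follow the defaults of the parameter table below (chromosome 1, start 2, end 3, statistic 7).<br />
<br />
```shell
# peak_windows WIDTH [STAT_COLUMN] < peaks.bed
# Hypothetical helper: for every tab-separated peak line, prints chromosome,
# window start, window end, and the annotated FastA header for that window.
peak_windows() {
  awk -v w="$1" -v sc="${2:-7}" 'BEGIN { FS = OFS = "\t" }
    !/^#/ && NF >= sc {
      center = int(($2 + $3) / 2)      # assumption: midpoint of the region
      from = center - int(w / 2)
      if (from < 0) next               # assumption: skip truncated windows
      print $1, from, from + w, "> peak: " int(w / 2) "; signal: " $sc
    }'
}

# One peak from 1000 to 1200 with statistic 515, window width 100
printf 'chr1\t1000\t1200\tpeak1\t0\t.\t515\n' | peak_windows 100
```
<br />
For the toy peak, the window runs from 1050 to 1150 and the header reads <code>> peak: 50; signal: 515</code>.<br />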
<br />
<br />
If you experience problems using Data Extractor, please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Data Extractor'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar extract<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>Genome (The FastA containing all chromosome sequences, may be gzipped)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Peaks (The file containing the peaks in tabular format)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>Chromosome column (The column of the peaks file containing the chromosome, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Start column (The column of the peaks file containing the start position relative to the chromosome start, default = 2)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pp</font></td><br />
<td>Peak position (How the peak position is specified, range={Peak center, End of peak}, default = End of peak)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;Peak center&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">cc</font></td><br />
<td>Center column (The column of the peaks file containing the peak center relative to the start position)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;End of peak&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">e</font></td><br />
<td>End column (The column of the peaks file containing the end position relative to the chromosome start, default = 3)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>Width (The fixed width of all extracted regions, valid range = [1, 10000], default = 1000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sc</font></td><br />
<td>Statistics column (The column of the peaks file containing the peak statistic or a similar measure of confidence, default = 7)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar extract g=0data/genomes/HepG2_converted_genome.unmasked.fa.gz p=0data/HepG2/NRF1_ENCFF313RFR_train1.bed outdir=1extracted/NRF1_HepG2_train1<br />
<br />
=== Methyl SlimDimont ===<br />
<br />
'''Methyl SlimDimont''' is a tool for de-novo motif discovery from DNA sequences including extended, e.g., methylation-aware alphabets.<br />
<br />
Input sequences must be supplied in an annotated FastA format as generated by the Data Extractor tool.<br />
Input sequences may also be obtained from other sources. In this case, the annotation of each sequence needs to provide a value that reflects the confidence that this sequence is bound by the factor of interest.<br />
Such confidences may be peak statistics (e.g., number of fragments under a peak) for ChIP data or signal intensities for PBM data. In addition, you need to provide an anchor position within the sequence. <br />
In case of ChIP data, this anchor position could for instance be the peak summit.<br />
An annotated FastA file for ChIP-seq data comprising sequences of length 100 centered around the peak summit could look like:<br />
<br />
> peak: 50; signal: 515<br />
ggccatgtgtatttttttaaatttccac...<br />
> peak: 50; signal: 199<br />
GGTCCCCTGGGAGGATGGGGACGTGCTG...<br />
...<br />
<br />
where the anchor point is given as 50 for the first two sequences, and the confidence amounts to 515 and 199, respectively.<br />
The FastA comment may contain additional annotations of the format <code>key1 : value1; key2: value2;...</code>.<br />
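<br />
When input sequences come from other sources, it can help to check that every header carries both annotations before starting a training run. The following sketch extracts the values of two tags from <code>key: value;</code> annotations. The function name <code>fasta_tags</code> is hypothetical, and the parsing rules (split at <code>;</code>, then at the first <code>:</code>, trim surrounding whitespace) are an assumption based on the format shown above, not the parser used by the tool.<br />
<br />
```shell
# fasta_tags POSITION_TAG VALUE_TAG < annotated.fasta
# Hypothetical helper: prints "position<TAB>value" for every FastA header.
# A missing tag yields an empty field, which makes incomplete headers visible.
fasta_tags() {
  awk -v ptag="$1" -v vtag="$2" '
    /^>/ {
      pos = ""; val = ""
      n = split(substr($0, 2), parts, ";")
      for (i = 1; i <= n; i++) {
        c = index(parts[i], ":")
        if (c == 0) continue
        k = substr(parts[i], 1, c - 1)
        v = substr(parts[i], c + 1)
        gsub(/^[ \t]+|[ \t]+$/, "", k)   # trim key
        gsub(/^[ \t]+|[ \t]+$/, "", v)   # trim value
        if (k == ptag) pos = v
        if (k == vtag) val = v
      }
      print pos "\t" val
    }'
}

# The two headers from the example above
printf '> peak: 50; signal: 515\nggccatgtgt\n> peak: 50; signal: 199\nGGTCCCCTGG\n' |
  fasta_tags peak signal
```
<br />
For the example headers, this prints <code>50</code> and <code>515</code> on the first line and <code>50</code> and <code>199</code> on the second line, separated by a tab.<br />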
<br />
Accordingly, you would need to set the parameter &quot;Position tag&quot; to <code>peak</code> and the parameter &quot;Value tag&quot; to <code>signal</code> for the input file (default values).<br />
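The parameter &quot;Alphabet&quot; specifies the symbols of the (extended) alphabet and their complementary symbols. The default is the methylation-aware alphabet <code>ACGTMH,TGCAHM</code>; a standard DNA alphabet is specified as <code>ACGT,TGCA</code>.<br />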
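<br />
The alphabet specification pairs every symbol before the comma with the symbol at the same index after the comma. As an illustration of what this pairing implies, the sketch below computes the reverse complement of a sequence under such a specification. The function name <code>revcomp_ext</code> is hypothetical, the sketch is case-sensitive, and symbols outside the alphabet are marked with <code>?</code>; none of this is taken from the MeDeMo code.<br />
<br />
```shell
# revcomp_ext SEQUENCE [ALPHABET_SPEC]
# Hypothetical helper: reverse complement under an alphabet specification of
# the form FORWARD,COMPLEMENT (default: the methylation-aware ACGTMH,TGCAHM).
revcomp_ext() {
  spec="${2:-ACGTMH,TGCAHM}"
  fwd="${spec%%,*}"   # symbols in forward orientation
  cmp="${spec#*,}"    # their complements, in the same order
  printf '%s\n' "$1" | awk -v f="$fwd" -v c="$cmp" '{
    out = ""
    for (i = length($0); i >= 1; i--) {          # walk the sequence backwards
      p = index(f, substr($0, i, 1))
      out = out (p ? substr(c, p, 1) : "?")      # "?" marks unknown symbols
    }
    print out
  }'
}

revcomp_ext ACGM              # prints HCGT
revcomp_ext ACGT ACGT,TGCA    # prints ACGT
```
<br />
With the methylation-aware alphabet, <code>M</code> on one strand pairs with <code>H</code> on the other, so <code>ACGM</code> becomes <code>HCGT</code> on the reverse strand.<br />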
<br />
For the standard deviation of the position prior, the initial motif length and the number of pre-optimization runs, we provide default values that worked well in our studies on ChIP and PBM data. <br />
However, you may want to adjust these parameters to meet your prior information.<br />
<br />
The parameter &quot;Markov order of the motif model&quot; sets the order of the inhomogeneous Markov model used for modeling the motif. If this parameter is set to <code>0</code>, you obtain a position weight matrix (PWM) model. <br />
If it is set to <code>1</code>, you obtain a weight array matrix (WAM) model. You can set the order of the motif model to at most <code>3</code>.<br />
<br />
The parameter &quot;Markov order of the background model&quot; sets the order of the homogeneous Markov model used for modeling positions not covered by a motif. <br />
If this parameter is set to <code>-1</code>, you obtain a uniform distribution, which worked well for ChIP data. For PBM data, orders of up to <code>4</code> resulted in an increased prediction performance in our case studies. The maximum allowed value is <code>5</code>.<br />
<br />
The parameter &quot;Weighting factor&quot; defines the proportion of sequences that you expect to be bound by the targeted factor with high confidence. For ChIP data, the default value of <code>0.2</code> typically works well. <br />
For PBM data, containing a large number of unspecific probes, this parameter should be set to a lower value, e.g. <code>0.01</code>.<br />
<br />
The &quot;Equivalent sample size&quot; reflects the strength of the influence of the prior on the model parameters, where higher values smooth out the parameters to a greater extent.<br />
<br />
The parameter &quot;Delete BSs from profile&quot; defines whether BSs of already discovered motifs should be deleted, i.e., &quot;blanked out&quot;, from the sequence before searching for further motifs.<br />
<br />
You can also install this web application within your local Galaxy server. Instructions can be found at the Dimont page of Jstacs. <br />
There you can also download a command line version of Dimont.<br />
<br />
If you experience problems using Methyl SlimDimont, please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Methyl SlimDimont'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar slimdimont<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>Alphabet (Characters of the alphabet as a string of unseparated characters, first listing the symbols in forward orientation and then their complement in the same order. For instance, a methylation-aware alphabet would be specified as ACGTMH,TGCAHM and a standard DNA alphabet as ACGT,TGCA, default = ACGTMH,TGCAHM)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>Input file (The file name of the file containing the input sequences in annotated FastA format as generated by the Data Extractor tool)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Background sample (Background sample containing negative examples, may be di-nucleotide shuffled input sequences, range={background file, shuffled input}, default = shuffled input)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;background file&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bf</font></td><br />
<td>Background file (The file name of the file containing background sequences in annotated FastA format., OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;shuffled input&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Position tag (The tag for the position information in the FastA-annotation of the input file, default = peak)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>Value tag (The tag for the value information in the FastA-annotation of the input file, default = signal)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Standard deviation (The standard deviation of the position distribution centered at the position specified by the position tag, valid range = [1.0, 10000.0], default = 75.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>Weighting factor (The value for weighting the data, between 0 and 1, valid range = [0.0, 1.0], default = 0.2)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Starts</font></td><br />
<td>Starts (The number of pre-optimization runs., valid range = [1, 100], default = 20)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">imw</font></td><br />
<td>Initial motif width (The motif width that is used initially, may be adjusted during optimization., valid range = [1, 50], default = 20)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Model type (The type of the motif model; a PWM model corresponds to a Markov model of order 0., range={LSlim model, Markov model}, default = LSlim model)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;LSlim model&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">md</font></td><br />
<td>Maximum distance (The maximum distance considered in the LSlim model, valid range = [1, 2147483647], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;Markov model&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">o</font></td><br />
<td>Order (The order of the Markov model, valid range = [0, 5], default = 0)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">moobm</font></td><br />
<td>Markov order of background model (The Markov order of the model for the background sequence and the background sequence, -1 defines uniform distribution., valid range = [-1, 5], default = -1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">e</font></td><br />
<td>Equivalent sample size (Reflects the strength of the prior on the model parameters., valid range = [0.0, Infinity], default = 4.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>Delete BSs from profile (A switch for deleting binding site positions of discovered motifs from the profile before searching for further motifs., default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">afs</font></td><br />
<td>Adjust for shifts (Adjust for shifts of the motif., default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>INT</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar slimdimont i=1extracted/NRF1_HepG2_train1/Extracted_sequences.fasta m="Markov model" outdir=2train threads=8<br />
<br />
=== Sequence Scoring ===<br />
<br />
'''Sequence Scoring''' scans a set of input sequences (e.g., sequences under ChIP-seq peaks) for a given motif model (provided as XML as output by &quot;Methyl SlimDimont&quot;) and provides per-sequence information of i) the start position and strand of the best motif match, ii) the corresponding maximum score, iii) the log-sum occupancy score, iv) the matching sequence, and v) the ID (FastA header) of the sequence.<br />
<br />
The main purpose of this tool is to determine per-sequence scores for classification, for instance, for distinguishing bound from unbound sequences.<br />
<br />
If you experience problems using Sequence Scoring, please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Sequence Scoring'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar score<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>Input sequences (Input sequences in FastA format)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Model (Model XML as output by Methyl SlimDimont)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar score i=1extracted/NRF1_GM12878_test1/Extracted_sequences.fasta m=2train/Motif_1/SlimDimont_1.xml outdir=3score/NRF1_GM12878<br />
<br />
=== Evaluate Scoring ===<br />
<br />
'''Evaluate Scoring''' computes the area under the ROC curve and under the precision-recall curve based on the scores of a positive and a negative set of sequences. Optionally, the curves may also be drawn.<br />
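<br />
As a rough illustration of the first measure, the area under the ROC curve equals the probability that a randomly chosen positive sequence scores higher than a randomly chosen negative one, with ties counted as one half. The following Python sketch computes it directly from two score lists; the function name <code>roc_auc</code> is hypothetical, and the sketch does not reflect how MeDeMo computes or reads the scores.<br />
<br />
```python
def roc_auc(positive_scores, negative_scores):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    if not positive_scores or not negative_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Three of the four positive/negative pairs are ranked correctly
auc = roc_auc([3.0, 2.0], [1.0, 2.5])
```
<br />
The quadratic loop keeps the sketch short; for large score lists a rank-based computation would be preferable.<br />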
<br />
''Evaluate Scoring'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar eval<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Positives (Output of "Sequence Scoring" for positive test sequences.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Negatives (Output of "Sequence Scoring" for negative test sequences.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>Curves (Also compute and draw curves, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>Use sum-occupancy (Use log-sum occupancy score instead of maximum, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar eval p=3score/NRF1_GM12878/Predictions.tsv n=3score/negatives/Predictions.tsv c=true outdir=4eval<br />
<br />
=== Motif scores ===<br />
<br />
'''Motif scores''' computes features based on motif scores of a given motif model scanning sub-sequences along the genome. Motif scores are aggregated in bins of the specified width as maximum score and log of the average exponential score (i.e., the log of the average likelihood in case of statistical models). The motif model may be provided as PWMs in HOCOMOCO or PFMs in Jaspar format, or as Dimont motif models in XML format. For more complex motif models like Slim models, the current implementation uses several indexes to speed up the scanning process. However, computation of these indexes is rather memory-consuming and often not reasonable for simple PWM models. Hence, a low-memory variant of the tool is available, which is typically only slightly slower for PWM models but substantially slower for Slim models. Output is provided as a gzipped file ''Motif_scores.tsv.gz'' containing columns chromosome, start position, maximum and average score. This output file together with a protocol of the tool run is saved to the specified output directory.<br />
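<br />
The bin-wise aggregation can be sketched in Python as follows. This is only an illustration under stated assumptions: scores are given as one log-score per start position, a position is assigned to the bin containing its start, and the function names <code>log_mean_exp</code> and <code>bin_features</code> are hypothetical rather than part of MeDeMo.<br />
<br />
```python
import math


def log_mean_exp(scores):
    """Log of the average exponentiated score, computed stably."""
    top = max(scores)
    return top + math.log(sum(math.exp(s - top) for s in scores) / len(scores))


def bin_features(scores, bin_width):
    """Return (bin start, maximum score, log-average-exp score) per bin."""
    features = []
    for start in range(0, len(scores), bin_width):
        window = scores[start:start + bin_width]
        features.append((start, max(window), log_mean_exp(window)))
    return features


# Three positions with bin width 2 give one full and one partial bin
features = bin_features([1.0, 3.0, 2.0], 2)
```
<br />
Subtracting the maximum before exponentiating avoids overflow or underflow for scores of large magnitude.<br />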
<br />
''Motif scores'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar motif<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Motif model (The motif model in Dimont, HOCOMOCO, or Jaspar format, range={Dimont, HOCOMOCO, Jaspar}, default = Dimont)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;Dimont&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>Dimont motif (Dimont motif model description)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;HOCOMOCO&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">h</font></td><br />
<td>HOCOMOCO PWM (PWM from the HOCOMOCO database)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;Jaspar&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">j</font></td><br />
<td>Jaspar PFM (PFM in Jaspar format)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>Genome (Genome as FastA file)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>FAI of genome (FastA index file of the genome)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bin width (The width of the genomic bins considered)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">l</font></td><br />
<td>Low-memory mode (Use slower mode with a smaller memory footprint, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>INT</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar motif d=2train/Motif_1/SlimDimont_1.xml g=0data/genomes/HepG2_converted_genome.unmasked.fa.gz f=0data/genomes/HepG2_converted_genome.unmasked.fa.fai outdir=7scores b=50<br />
<br />
=== Quick Prediction Tool ===<br />
<br />
'''Quick Prediction Tool''' predicts binding sites of a transcription factor based on a motif model and is also suited for genome-wide predictions. The motif model is provided as the XML output of (Slim) Dimont. <br />
<br />
The tool outputs a list of predictions including, for every prediction, the ID of the sequence (e.g., chromosome) containing the binding site, position and strand of the matching sub-sequence, its score according to the model, the sub-sequence itself (in strand orientation according to the model), and a p-value from a normal distribution fitted to the score distribution of the provided negative examples or a sub-sample of the input data (parameter &quot;Background sample&quot;).<br />
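<br />
How a p-value and a score threshold may be derived from a fitted normal distribution is sketched below in Python. The sketch assumes a one-sided upper-tail p-value and the sample standard deviation; whether MeDeMo uses exactly these estimators is not stated here, and all function names are hypothetical.<br />
<br />
```python
import math
import statistics


def fit_normal(background_scores):
    """Estimate mean and sample standard deviation of background scores."""
    mu = statistics.mean(background_scores)
    sigma = statistics.stdev(background_scores)
    return mu, sigma


def upper_tail_p_value(score, mu, sigma):
    """P(X >= score) for X ~ Normal(mu, sigma), via the complementary error function."""
    return 0.5 * math.erfc((score - mu) / (sigma * math.sqrt(2.0)))


def score_threshold(significance_level, mu, sigma):
    """Score whose upper-tail probability equals the significance level."""
    return statistics.NormalDist(mu, sigma).inv_cdf(1.0 - significance_level)


mu, sigma = fit_normal([0.0, 1.0, 2.0, 3.0, 4.0])
p_at_mean = upper_tail_p_value(2.0, mu, sigma)
threshold = score_threshold(1e-3, mu, sigma)
```
<br />
Using <code>math.erfc</code> rather than one minus the cumulative distribution function keeps small p-values accurate.<br />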
<br />
If you experience problems using Quick Prediction Tool, please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
''Quick Prediction Tool'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar quickpred<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>Dimont model (The model returned by Dimont (in XML format))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Sequences (The sequences (e.g., a genome) to scan for binding sites)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Background sample (The sequences for determining the prediction threshold. Either a sub-sample of the input sequences or a dedicated background data set., range={sub-sample, background sequences}, default = sub-sample)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;sub-sample&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;background sequences&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bs</font></td><br />
<td>Background sequences (The sequences (e.g., a genome) for determining the prediction threshold)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>Threshold specification (The way of defining the prediction threshold. Either by explicitly defining a significance level or by specifying the number of expected sites, range={significance level, number of sites}, default = significance level)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;significance level&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sl</font></td><br />
<td>Significance level (The significance level for determining the prediction threshold, valid range = [0.0, 1.0E-4], default = 1.0E-6)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;number of sites&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Number of sites (The number of expected binding sites for determining the prediction threshold, valid range = [1, 1000000], default = 10000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar quickpred d=2train/Motif_1/SlimDimont_1.xml s=0data/genomes/HepG2_converted_genome.unmasked.fa.gz sl=1e-5 outdir=6predict<br />
<br />
=== Methylation Sensitivity ===<br />
<br />
'''Methylation Sensitivity''' determines average methylation sensitivity profiles for CpG dinucleotides converted to MpG, CpH, and MpH. As input, it needs a model XML as generated by &quot;Methyl SlimDimont&quot;, and a prediction file as output from the corresponding training run.<br />
<br />
Optionally, Methylation Sensitivity also generates per-sequence methylation sensitivity profiles for the MpH context.<br />
<br />
If you experience problems using Methylation Sensitivity, please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Methylation Sensitivity'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar msens<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Model (The XML file containing the model)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Predictions (The file containing the predictions from the training run)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>Verbose (Output MpH sensitivity profile for every input sequence, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar msens m=2train/Motif_1/SlimDimont_1.xml p=2train/Motif_1/Predictions_for_motif_1.tsv outdir=5msens</div>Grauhttps://www.jstacs.de/index.php?title=MeDeMo&diff=1087MeDeMo2020-03-21T11:21:14Z<p>Grau: /* Download */</p>
<hr />
<div>Accurate models describing the binding specificity of transcription factors (TFs) are essential for a better understanding of transcriptional regulation. Aside from chromatin accessibility and sequence specificity, several studies suggested that DNA methylation influences TF binding in both activating and repressive ways. However, currently available TF motif inference and TF binding site prediction approaches do not adequately incorporate DNA methylation.<br />
<br />
We present MeDeMo (Methylation and Dependencies in Motifs), a novel framework for TF motif discovery and TFBS prediction that incorporates DNA methylation by extending [[Slim]] models. We show that dependencies between nucleotides, captured by MeDeMo, are essential to represent DNA methylation and that MeDeMo achieves superior prediction performance compared to related approaches. The inferred TF motifs are highly interpretable and can provide new insights into the relation between DNA methylation and TF binding.<br />
<br />
<br />
== Download ==<br />
<br />
MeDeMo is available as<br />
* [http://www.jstacs.de/downloads/MeDeMo-1.0.jar command line interface] version and<br />
* graphical user interface version: <br />
** [http://www.jstacs.de/downloads/MeDeMoGUI-1.0.jar JAR file] (requires installed Java >= 1.8 and JavaFX)<br />
** [http://www.jstacs.de/downloads/MeDeMo-1.0.zip Windows ZIP]: within the ZIP archive, you find the JAR and a custom Java runtime environment; to run MeDeMo, just double-click run.bat<br />
** [http://www.jstacs.de/downloads/MeDeMo-1.0.app.zip Mac App]: within the ZIP archive, you find a Mac app, which you can copy anywhere you like (e.g., your /Applications folder) and run by double-clicking it; depending on your security settings, it might be necessary to use Right-click -> Open when opening MeDeMo for the first time and explicitly allow it to run; it might also be necessary to disable App Nap (Right-click -> Get Info -> Prevent App Nap)<br />
<br />
Source code is available from the [https://github.com/Jstacs/Jstacs Jstacs github page] in package <code>projects.methyl</code>.<br />
<br />
Example data (also used for the code examples below) is [http://www.jstacs.de/downloads/MeDeMo-examples.tgz available for download].<br />
<br />
== Tools ==<br />
<br />
The description of tools and tool parameters refers to the command line version, but the same parameters are also present in the GUI version. Additional help may be requested in the GUI version by clicking on the "?" button.<br />
<br />
<br />
=== Data Extractor ===<br />
<br />
'''Data Extractor''' prepares an annotated FastA file as required by Dimont from a genome (in FastA format, including methylated variants) and a tabular file (e.g., BED, GTF, narrowPeak,...). The regions specified in the tabular file are used to determine the center of the extracted sequences. All extracted sequences have the same length as specified by parameter &quot;Width&quot;.<br />
<br />
In case of ChIP data, the center position could for instance be the peak summit.<br />
An annotated FastA file for ChIP-seq data comprising sequences of length 100 centered around the peak summit might look like:<br />
<br />
> peak: 50; signal: 515<br />
ggccatgtgtatttttttaaatttccac...<br />
> peak: 50; signal: 199<br />
GGTCCCCTGGGAGGATGGGGACGTGCTG...<br />
...<br />
<br />
where the center is given as 50 for the first two sequences, and the confidence amounts to 515 and 199, respectively.<br />
<br />
<br />
If you experience problems using Data Extractor, please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Data Extractor'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar extract<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>Genome (The FastA containing all chromosome sequences, may be gzipped)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Peaks (The file containing the peaks in tabular format)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>Chromosome column (The column of the peaks file containing the chromosome, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Start column (The column of the peaks file containing the start position relative to the chromosome start, default = 2)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pp</font></td><br />
<td>Peak position (How the peak is specified, range={Peak center, End of peak}, default = End of peak)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;Peak center&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">cc</font></td><br />
<td>Center column (The column of the peaks file containing the peak center relative to the start position)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;End of peak&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">e</font></td><br />
<td>End column (The column of the peaks file containing the end position relative to the chromosome start, default = 3)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>Width (The fixed width of all extracted regions, valid range = [1, 10000], default = 1000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sc</font></td><br />
<td>Statistics column (The column of the peaks file containing the peak statistic or a similar measure of confidence, default = 7)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar extract g=0data/genomes/HepG2_converted_genome.unmasked.fa.gz p=0data/HepG2/NRF1_ENCFF313RFR_train1.bed outdir=1extracted/NRF1_HepG2_train1<br />
<br />
=== Methyl SlimDimont ===<br />
<br />
'''Methyl SlimDimont''' is a tool for de-novo motif discovery from DNA sequences including extended, e.g., methylation-aware alphabets.<br />
<br />
Input sequences must be supplied in an annotated FastA format as generated by the Data Extractor tool.<br />
Input sequences may also be obtained from other sources. In this case, the annotation of each sequence needs to provide a value that reflects the confidence that this sequence is bound by the factor of interest.<br />
Such confidences may be peak statistics (e.g., number of fragments under a peak) for ChIP data or signal intensities for PBM data. In addition, you need to provide an anchor position within the sequence. <br />
In case of ChIP data, this anchor position could for instance be the peak summit.<br />
An annotated FastA file for ChIP-seq data comprising sequences of length 100 centered around the peak summit could look like:<br />
<br />
> peak: 50; signal: 515<br />
ggccatgtgtatttttttaaatttccac...<br />
> peak: 50; signal: 199<br />
GGTCCCCTGGGAGGATGGGGACGTGCTG...<br />
...<br />
<br />
where the anchor point is given as 50 for the first two sequences, and the confidence amounts to 515 and 199, respectively.<br />
The FastA comment may contain additional annotations of the format <code>key1 : value1; key2: value2;...</code>.<br />
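<br />
For readers preparing such files from other sources, the following Python sketch shows one way to read the annotated FastA format described here. It is a hypothetical helper, not part of MeDeMo; the function names are assumptions, and the tag names <code>peak</code> and <code>signal</code> follow the example.<br />
<br />
```python
def parse_annotation(header):
    """Parse a header such as '> peak: 50; signal: 515' into a dict of strings."""
    fields = {}
    for part in header.lstrip(">").split(";"):
        if ":" not in part:
            continue  # skip empty or malformed fragments
        key, value = part.split(":", 1)
        fields[key.strip()] = value.strip()
    return fields


def _record(header, chunks, position_tag, value_tag):
    """Build one (position, value, sequence) tuple from a header and its lines."""
    ann = parse_annotation(header)
    return int(ann[position_tag]), float(ann[value_tag]), "".join(chunks)


def read_annotated_fasta(lines, position_tag="peak", value_tag="signal"):
    """Yield (position, value, sequence) for each record in an iterable of lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield _record(header, chunks, position_tag, value_tag)
            header, chunks = line, []
        elif line:
            chunks.append(line)
    if header is not None:
        yield _record(header, chunks, position_tag, value_tag)


example = [
    "> peak: 50; signal: 515",
    "ggccatgtgtatttttttaaatttccac",
    "> peak: 50; signal: 199",
    "GGTCCCCTGGGAGGATGGGGACGTGCTG",
]
records = list(read_annotated_fasta(example))
```
<br />
Each record then carries the anchor position, the confidence value and the sequence, so a quick sanity check of a self-made input file is possible before running the tool.<br />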
<br />
Accordingly, you would need to set the parameter &quot;Position tag&quot; to <code>peak</code> and the parameter &quot;Value tag&quot; to <code>signal</code> for the input file (default values).<br />
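The parameter &quot;Alphabet&quot; specifies the symbols of the (extended) alphabet and their complementary symbols. According to the parameter table below, the default is the methylation-aware alphabet <code>ACGTMH,TGCAHM</code>; the standard DNA alphabet is specified as <code>ACGT,TGCA</code>.<br />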
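<br />
The alphabet specification lists the forward symbols first and their complements in the same order after the comma. A minimal Python sketch of how such a specification could be turned into a reverse-complement function is given here; the function names are hypothetical, and the sketch only illustrates the format, not the internal handling in MeDeMo.<br />
<br />
```python
def complement_table(alphabet_spec):
    """Map each forward symbol to its complement, e.g. for 'ACGTMH,TGCAHM'."""
    forward, complement = alphabet_spec.split(",")
    if len(forward) != len(complement):
        raise ValueError("both halves of the alphabet must have equal length")
    return dict(zip(forward, complement))


def reverse_complement(sequence, alphabet_spec="ACGTMH,TGCAHM"):
    """Reverse complement of a sequence over the given (extended) alphabet."""
    table = complement_table(alphabet_spec)
    return "".join(table[symbol] for symbol in reversed(sequence))


# M pairs with H in the methylation-aware alphabet
rc = reverse_complement("ACMG")
```
<br />
With the methylation-aware alphabet, <code>M</code> on one strand corresponds to <code>H</code> on the other, so <code>ACMG</code> becomes <code>CHGT</code>.<br />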
<br />
For the standard deviation of the position prior, the initial motif length and the number of pre-optimization runs, we provide default values that worked well in our studies on ChIP and PBM data. <br />
However, you may want to adjust these parameters to meet your prior information.<br />
<br />
The parameter &quot;Markov order of the motif model&quot; sets the order of the inhomogeneous Markov model used for modeling the motif. If this parameter is set to <code>0</code>, you obtain a position weight matrix (PWM) model. <br />
If it is set to <code>1</code>, you obtain a weight array model (WAM). You can set the order of the motif model to at most <code>3</code>.<br />
<br />
The parameter &quot;Markov order of the background model&quot; sets the order of the homogeneous Markov model used for modeling positions not covered by a motif. <br />
If this parameter is set to <code>-1</code>, you obtain a uniform distribution, which worked well for ChIP data. For PBM data, orders of up to <code>4</code> resulted in an increased prediction performance in our case studies. The maximum allowed value is <code>5</code>.<br />
<br />
The parameter &quot;Weighting factor&quot; defines the proportion of sequences that you expect to be bound by the targeted factor with high confidence. For ChIP data, the default value of <code>0.2</code> typically works well. <br />
For PBM data, containing a large number of unspecific probes, this parameter should be set to a lower value, e.g. <code>0.01</code>.<br />
<br />
The &quot;Equivalent sample size&quot; reflects the strength of the influence of the prior on the model parameters, where higher values smooth out the parameters to a greater extent.<br />
<br />
The parameter &quot;Delete BSs from profile&quot; defines whether binding sites (BSs) of already discovered motifs should be deleted, i.e., &quot;blanked out&quot;, from the sequence before searching for further motifs.<br />
<br />
You can also install this web application within your local Galaxy server. Instructions can be found at the Dimont page of Jstacs. <br />
There you can also download a command line version of Dimont.<br />
<br />
If you experience problems using Methyl SlimDimont, please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Methyl SlimDimont'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar slimdimont<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>Alphabet (Characters of the alphabet as a string of unseparated characters, first listing the symbols in forward orientation and then their complement in the same order. For instance, a methylation-aware alphabet would be specified as ACGTMH,TGCAHM and a standard DNA alphabet as ACGT,TGCA, default = ACGTMH,TGCAHM)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>Input file (The file name of the file containing the input sequences in annotated FastA format as generated by the Data Extractor tool)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Background sample (Background sample containing negative examples, may be di-nucleotide shuffled input sequences, range={background file, shuffled input}, default = shuffled input)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;background file&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bf</font></td><br />
<td>Background file (The file name of the file containing background sequences in annotated FastA format., OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;shuffled input&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Position tag (The tag for the position information in the FastA-annotation of the input file, default = peak)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>Value tag (The tag for the value information in the FastA-annotation of the input file, default = signal)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Standard deviation (The standard deviation of the position distribution centered at the position specified by the position tag, valid range = [1.0, 10000.0], default = 75.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>Weighting factor (The value for weighting the data, between 0 and 1, valid range = [0.0, 1.0], default = 0.2)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Starts</font></td><br />
<td>Starts (The number of pre-optimization runs., valid range = [1, 100], default = 20)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">imw</font></td><br />
<td>Initial motif width (The motif width that is used initially, may be adjusted during optimization., valid range = [1, 50], default = 20)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Model type (The type of the motif model; a PWM model corresponds to a Markov model of order 0., range={LSlim model, Markov model}, default = LSlim model)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;LSlim model&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">md</font></td><br />
<td>Maximum distance (The maximum distance considered in the LSlim model, valid range = [1, 2147483647], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;Markov model&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">o</font></td><br />
<td>Order (The order of the Markov model, valid range = [0, 5], default = 0)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">moobm</font></td><br />
<td>Markov order of background model (The Markov order of the model for the background sequence; -1 defines a uniform distribution., valid range = [-1, 5], default = -1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">e</font></td><br />
<td>Equivalent sample size (Reflects the strength of the prior on the model parameters., valid range = [0.0, Infinity], default = 4.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>Delete BSs from profile (A switch for deleting binding site positions of discovered motifs from the profile before searching for further motifs., default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">afs</font></td><br />
<td>Adjust for shifts (Adjust for shifts of the motif., default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>INT</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar slimdimont i=1extracted/NRF1_HepG2_train1/Extracted_sequences.fasta m="Markov model" outdir=2train threads=8<br />
<br />
=== Sequence Scoring ===<br />
<br />
'''Sequence Scoring''' scans a set of input sequences (e.g., sequences under ChIP-seq peaks) for a given motif model (provided as XML as output by &quot;Methyl SlimDimont&quot;) and provides, for each sequence, i) the start position and strand of the best motif match, ii) the corresponding maximum score, iii) the log-sum occupancy score, iv) the matching sequence, and v) the ID (FastA header) of the sequence.<br />
<br />
The main purpose of this tool is to determine per-sequence scores for classification, for instance, for distinguishing bound from unbound sequences.<br />
<br />
If you experience problems using Sequence Scoring, please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Sequence Scoring'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar score<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>Input sequences (Input sequences in FastA format)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Model (Model XML as output by Methyl SlimDimont)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar score i=1extracted/NRF1_GM12878_test1/Extracted_sequences.fasta m=2train/Motif_1/SlimDimont_1.xml outdir=3score/NRF1_GM12878<br />
<br />
=== Evaluate Scoring ===<br />
<br />
'''Evaluate Scoring''' computes the area under the ROC curve and under the precision-recall curve based on the scores of a positive and a negative set of sequences. Optionally, the curves may also be drawn.<br />
<br />
''Evaluate Scoring'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar eval<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Positives (Output of "Sequence Scoring" for positive test sequences.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Negatives (Output of "Sequence Scoring" for negative test sequences.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>Curves (Also compute and draw curves, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>Use sum-occupancy (Use log-sum occupancy score instead of maximum, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar eval p=3score/NRF1_GM12878/Predictions.tsv n=3score/negatives/Predictions.tsv c=true outdir=4eval<br />
<br />
=== Motif scores ===<br />
<br />
'''Motif scores''' computes features based on motif scores of a given motif model scanning sub-sequences along the genome. Motif scores are aggregated in bins of the specified width as maximum score and log of the average exponential score (i.e., average log-likelihood in case of statistical models). The motif model may be provided as PWMs in HOCOMOCO or PFMs in Jaspar format, or as Dimont motif models in XML format. For more complex motif models like Slim models, the current implementation uses several indexes to speed up the scanning process. However, computation of these indexes is rather memory-consuming and often not reasonable for simple PWM models. Hence, a low-memory variant of the tool is available, which is typically only slightly slower for PWM models but substantially slower for Slim models. Output is provided as a gzipped file ''Motif_scores.tsv.gz'' containing the columns chromosome, start position, maximum score, and average score. This output file together with a protocol of the tool run is saved to the specified output directory.<br />
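<br />
The per-bin aggregation described above can be illustrated with a minimal Python sketch. It is not the MeDeMo implementation: the function name <code>aggregate_bins</code> and the assumption that one log-scale score is available per genomic position are hypothetical choices made for illustration only.<br />
<br />
```python
import math


def aggregate_bins(scores, bin_width):
    """Aggregate per-position log-scale scores into bins (illustrative only).

    Returns one (bin start, maximum score, log of the average exponential
    score) tuple per bin. The last bin may be shorter than bin_width.
    """
    rows = []
    for start in range(0, len(scores), bin_width):
        chunk = scores[start:start + bin_width]
        best = max(chunk)
        # Subtract the maximum before exponentiating for numerical stability.
        mean_exp = sum(math.exp(s - best) for s in chunk) / len(chunk)
        rows.append((start, best, best + math.log(mean_exp)))
    return rows


rows = aggregate_bins([0.0, 0.0, 1.0, 1.0], bin_width=2)
```
<br />
In this sketch, the log of the average exponential score can never exceed the maximum score of the same bin, and both values coincide when all scores in a bin are identical.<br />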
<br />
''Motif scores'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar motif<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Motif model (The motif model in Dimont, HOCOMOCO, or Jaspar format, range={Dimont, HOCOMOCO, Jaspar}, default = Dimont)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;Dimont&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>Dimont motif (Dimont motif model description)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;HOCOMOCO&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">h</font></td><br />
<td>HOCOMOCO PWM (PWM from the HOCOMOCO database)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;Jaspar&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">j</font></td><br />
<td>Jaspar PFM (PFM in Jaspar format)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>Genome (Genome as FastA file)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>FAI of genome (FastA index file of the genome)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bin width (The width of the genomic bins considered)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">l</font></td><br />
<td>Low-memory mode (Use slower mode with a smaller memory footprint, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>INT</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar motif d=2train/Motif_1/SlimDimont_1.xml g=0data/genomes/HepG2_converted_genome.unmasked.fa.gz f=0data/genomes/HepG2_converted_genome.unmasked.fa.fai outdir=7scores b=50<br />
<br />
=== Quick Prediction Tool ===<br />
<br />
'''Quick Prediction Tool''' predicts binding sites of a transcription factor based on a motif model and is also suited for genome-wide predictions. The motif model is provided as the XML output of (Slim) Dimont. <br />
<br />
The tool outputs a list of predictions including, for every prediction, the ID of the sequence (e.g., chromosome) containing the binding site, position and strand of the matching sub-sequence, its score according to the model, the sub-sequence itself (in strand orientation according to the model), and a p-value from a normal distribution fitted to the score distribution of the provided negative examples or a sub-sample of the input data (parameter &quot;Background sample&quot;).<br />
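<br />
How a p-value can be derived from a normal distribution fitted to background scores may be sketched in Python as follows. This is an illustration under stated assumptions, not the code used by the tool: the function name <code>normal_p_value</code> is hypothetical, the fit uses the sample mean and sample standard deviation, and the p-value is taken as the upper tail probability.<br />
<br />
```python
import math
from statistics import mean, stdev


def normal_p_value(score, background_scores):
    """Upper-tail p-value of a score under a normal fit (illustrative only)."""
    mu = mean(background_scores)
    sigma = stdev(background_scores)  # sample standard deviation (assumption)
    z = (score - mu) / sigma
    # P(X >= score) for X ~ N(mu, sigma^2), via the complementary error function.
    return 0.5 * math.erfc(z / math.sqrt(2.0))


background = [1.0, 2.0, 3.0, 4.0, 5.0]
p_at_mean = normal_p_value(3.0, background)
p_high = normal_p_value(10.0, background)
```
<br />
A score equal to the background mean yields a p-value of 0.5 in this sketch, and higher scores yield smaller p-values, which can then be compared against a significance level.<br />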
<br />
If you experience problems using Quick Prediction Tool, please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
''Quick Prediction Tool'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar quickpred<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>Dimont model (The model returned by Dimont (in XML format))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Sequences (The sequences (e.g., a genome) to scan for binding sites)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Background sample (The sequences for determining the prediction threshold. Either a sub-sample of the input sequences or a dedicated background data set., range={sub-sample, background sequences}, default = sub-sample)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;sub-sample&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;background sequences&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bs</font></td><br />
<td>Background sequences (The sequences (e.g., a genome) for determining the prediction threshold)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>Threshold specification (The way of defining the prediction threshold. Either by explicitly defining a significance level or by specifying the number of expected sites, range={significance level, number of sites}, default = significance level)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;significance level&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sl</font></td><br />
<td>Significance level (The significance level for determining the prediction threshold, valid range = [0.0, 1.0E-4], default = 1.0E-6)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;number of sites&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Number of sites (The number of expected binding sites for determining the prediction threshold, valid range = [1, 1000000], default = 10000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar quickpred d=2train/Motif_1/SlimDimont_1.xml s=0data/genomes/HepG2_converted_genome.unmasked.fa.gz sl=1e-5 outdir=6predict<br />
<br />
=== Methylation Sensitivity ===<br />
<br />
'''Methylation Sensitivity''' determines average methylation sensitivity profiles for CpG dinucleotides converted to MpG, CpH, and MpH. As input, it needs a model XML as generated by &quot;Methyl SlimDimont&quot;, and a prediction file as output from the corresponding training run.<br />
<br />
Optionally, Methylation Sensitivity also generates per-sequence methylation sensitivity profiles for the MpH context.<br />
<br />
If you experience problems using Methylation Sensitivity, please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Methylation Sensitivity'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar msens<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Model (The XML file containing the model)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Predictions (The file containing the predictions from the training run)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>Verbose (Output MpH sensitivity profile for every input sequence, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar msens m=2train/Motif_1/SlimDimont_1.xml p=2train/Motif_1/Predictions_for_motif_1.tsv outdir=5msens</div>Grauhttps://www.jstacs.de/index.php?title=MeDeMo&diff=1086MeDeMo2020-03-21T00:29:14Z<p>Grau: /* Download */</p>
<hr />
<div>Accurate models describing the binding specificity of transcription factors (TFs) are essential for a better understanding of transcriptional regulation. Aside from chromatin accessibility and sequence specificity, several studies suggested that DNA methylation influences TF binding in both activating and repressive ways. However, currently available TF motif inference and TF binding site prediction approaches do not adequately incorporate DNA methylation.<br />
<br />
We present MeDeMo (Methylation and Dependencies in Motifs), a novel framework for TF motif discovery and TFBS prediction that incorporates DNA methylation by extending [[Slim]] models. We show that dependencies between nucleotides, as captured by MeDeMo, are essential to represent DNA methylation and that MeDeMo achieves superior prediction performance compared to related approaches. The inferred TF motifs are highly interpretable and can provide new insights into the relation between DNA methylation and TF binding.<br />
<br />
<br />
== Download ==<br />
<br />
MeDeMo is available as<br />
* [http://www.jstacs.de/downloads/MeDeMo-1.0.jar command line interface] version and<br />
* graphical user interface version: <br />
** [http://www.jstacs.de/downloads/MeDeMoGUI-1.0.jar JAR file] (requires installed Java >= 1.8 and JavaFX)<br />
** [http://www.jstacs.de/downloads/MeDeMo-1.0.zip Windows ZIP]: within the ZIP archive, you find the JAR and a custom Java runtime environment; to run MeDeMo, just double-click run.bat<br />
** [http://www.jstacs.de/downloads/MeDeMo-1.0.dmg Mac DMG]: Mac disk image; on the mounted image, you find a Mac-App, which you can copy anywhere you like (e.g., your /Applications folder) and run the app by double-clicking it<br />
<br />
Source code is available from the [https://github.com/Jstacs/Jstacs Jstacs github page] in package <code>projects.methyl</code>.<br />
<br />
Example data (also used for the code examples below) is [http://www.jstacs.de/downloads/MeDeMo-examples.tgz available for download].<br />
<br />
== Tools ==<br />
<br />
The description of tools and tool parameters refers to the command line version, but the same parameters are also present in the GUI version. Additional help may be requested in the GUI version by clicking on the "?" button.<br />
<br />
<br />
=== Data Extractor ===<br />
<br />
'''Data Extractor''' prepares an annotated FastA file as required by Dimont from a genome (in FastA format, including methylated variants) and a tabular file (e.g., BED, GTF, narrowPeak,...). The regions specified in the tabular file are used to determine the center of the extracted sequences. All extracted sequences have the same length as specified by parameter &quot;Width&quot;.<br />
<br />
In case of ChIP data, the center position could for instance be the peak summit.<br />
An annotated FastA file for ChIP-seq data comprising sequences of length 100 centered around the peak summit might look like:<br />
<br />
> peak: 50; signal: 515<br />
ggccatgtgtatttttttaaatttccac...<br />
> peak: 50; signal: 199<br />
GGTCCCCTGGGAGGATGGGGACGTGCTG...<br />
...<br />
<br />
where the center is given as 50 for the first two sequences, and the confidence amounts to 515 and 199, respectively.<br />
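<br />
The relation between a peak region, the extraction width, and the <code>peak</code> annotation can be sketched in Python. This is an illustration only: the function name <code>extraction_window</code>, the use of the midpoint between start and end as the center for the &quot;End of peak&quot; case, and the 0-based, half-open coordinates are assumptions and may differ from what Data Extractor actually does.<br />
<br />
```python
def extraction_window(start, width, end=None, center_offset=None):
    """Return (left, right, anchor) of a fixed-width window (illustrative only).

    The center is either start + center_offset ("Peak center" case) or the
    midpoint of start and end ("End of peak" case, an assumption).
    """
    if center_offset is not None:
        center = start + center_offset
    elif end is not None:
        center = (start + end) // 2
    else:
        raise ValueError("either end or center_offset is required")
    left = center - width // 2
    # The anchor is the position of the center within the extracted sequence.
    return left, left + width, center - left


window = extraction_window(1000, 100, end=1200)
```
<br />
With a width of 100, the anchor within the extracted sequence is 50 in this sketch, matching the <code>peak: 50</code> annotation in the example above.<br />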
<br />
<br />
If you experience problems using Data Extractor, please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Data Extractor'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar extract<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>Genome (The FastA containing all chromosome sequences, may be gzipped)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Peaks (The file containing the peaks in tabular format)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>Chromosome column (The column of the peaks file containing the chromosome, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Start column (The column of the peaks file containing the start position relative to the chromosome start, default = 2)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pp</font></td><br />
<td>Peak position (How the peak position is specified, range={Peak center, End of peak}, default = End of peak)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;Peak center&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">cc</font></td><br />
<td>Center column (The column of the peaks file containing the peak center relative to the start position)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;End of peak&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">e</font></td><br />
<td>End column (The column of the peaks file containing the end position relative to the chromosome start, default = 3)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>Width (The fixed width of all extracted regions, valid range = [1, 10000], default = 1000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sc</font></td><br />
<td>Statistics column (The column of the peaks file containing the peak statistic or a similar measure of confidence, default = 7)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar extract g=0data/genomes/HepG2_converted_genome.unmasked.fa.gz p=0data/HepG2/NRF1_ENCFF313RFR_train1.bed outdir=1extracted/NRF1_HepG2_train1<br />
<br />
=== Methyl SlimDimont ===<br />
<br />
'''Methyl SlimDimont''' is a tool for de-novo motif discovery from DNA sequences including extended, e.g., methylation-aware alphabets.<br />
<br />
Input sequences must be supplied in an annotated FastA format as generated by the Data Extractor tool.<br />
Input sequences may also be obtained from other sources. In this case, the annotation of each sequence needs to provide a value that reflects the confidence that this sequence is bound by the factor of interest.<br />
Such confidences may be peak statistics (e.g., number of fragments under a peak) for ChIP data or signal intensities for PBM data. In addition, you need to provide an anchor position within the sequence. <br />
In case of ChIP data, this anchor position could for instance be the peak summit.<br />
An annotated FastA file for ChIP-seq data comprising sequences of length 100 centered around the peak summit could look like:<br />
<br />
> peak: 50; signal: 515<br />
ggccatgtgtatttttttaaatttccac...<br />
> peak: 50; signal: 199<br />
GGTCCCCTGGGAGGATGGGGACGTGCTG...<br />
...<br />
<br />
where the anchor point is given as 50 for the first two sequences, and the confidence amounts to 515 and 199, respectively.<br />
The FastA comment may contain additional annotations of the format <code>key1 : value1; key2: value2;...</code>.<br />
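<br />
The annotation format can be read with a few lines of Python. The sketch below is for illustration only; the function name <code>parse_annotation</code> is hypothetical and the parsing rules (split on <code>;</code>, then on the first <code>:</code>) are an assumption derived from the format shown above.<br />
<br />
```python
def parse_annotation(header):
    """Parse an annotated FastA comment into a dictionary (illustrative only)."""
    text = header.lstrip(">").strip()
    annotation = {}
    for field in text.split(";"):
        if ":" not in field:
            continue  # skip empty or malformed fields
        key, value = field.split(":", 1)
        annotation[key.strip()] = value.strip()
    return annotation


ann = parse_annotation("> peak: 50; signal: 515")
anchor = int(ann["peak"])           # value of the position tag
confidence = float(ann["signal"])   # value of the value tag
```
<br />
Additional annotations are simply kept as further dictionary entries, so other tags can be looked up by name.<br />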
<br />
Accordingly, you would need to set the parameter &quot;Position tag&quot; to <code>peak</code> and the parameter &quot;Value tag&quot; to <code>signal</code> for the input file (default values).<br />
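The parameter &quot;Alphabet&quot; specifies the symbols of the (extended) alphabet and their complementary symbols. According to the parameter table below, the default is the methylation-aware alphabet <code>ACGTMH,TGCAHM</code>; the standard DNA alphabet is specified as <code>ACGT,TGCA</code>.<br />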
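<br />
How an alphabet specification pairs each symbol with its complement can be sketched in Python. The function names <code>parse_alphabet</code> and <code>reverse_complement</code> are hypothetical; the sketch only assumes the format documented for the parameter, i.e., the forward symbols, a comma, and the complementary symbols in the same order.<br />
<br />
```python
def parse_alphabet(spec):
    """Map each symbol to its complement, e.g. for 'ACGTMH,TGCAHM'."""
    forward, complement = spec.split(",")
    if len(forward) != len(complement):
        raise ValueError("both halves must have the same length")
    return dict(zip(forward, complement))


def reverse_complement(seq, comp):
    """Reverse complement of seq under the given complement map."""
    return "".join(comp[symbol] for symbol in reversed(seq))


comp = parse_alphabet("ACGTMH,TGCAHM")
rc = reverse_complement("MG", comp)
```
<br />
With the methylation-aware alphabet, the reverse complement of <code>MG</code> is <code>CH</code> in this sketch, so methylation information is preserved when a sequence is read from the opposite strand.<br />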
<br />
For the standard deviation of the position prior, the initial motif length and the number of pre-optimization runs, we provide default values that worked well in our studies on ChIP and PBM data. <br />
However, you may want to adjust these parameters to meet your prior information.<br />
<br />
The parameter &quot;Markov order of the motif model&quot; sets the order of the inhomogeneous Markov model used for modeling the motif. If this parameter is set to <code>0</code>, you obtain a position weight matrix (PWM) model. <br />
If it is set to <code>1</code>, you obtain a weight array matrix (WAM) model. You can set the order of the motif model to at most <code>3</code>.<br />
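<br />
For the order-0 case, scoring with a PWM can be sketched in Python: the log-score of a site is the sum of position-specific log-probabilities, and a sequence is scanned by sliding the motif along it. The names <code>pwm_log_score</code> and <code>best_match</code> and the toy two-column PWM are hypothetical and only illustrate the general principle, not the Dimont implementation (for instance, strand handling and the background model are omitted).<br />
<br />
```python
import math


def pwm_log_score(pwm, site):
    """Sum of position-specific log-probabilities (order-0 model)."""
    return sum(math.log(column[symbol]) for column, symbol in zip(pwm, site))


def best_match(pwm, seq):
    """Return (best log-score, start position) over all windows of seq."""
    width = len(pwm)
    scored = [(pwm_log_score(pwm, seq[i:i + width]), i)
              for i in range(len(seq) - width + 1)]
    return max(scored)


# Toy PWM with two positions, preferring "AG" (made-up numbers).
toy_pwm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
score, position = best_match(toy_pwm, "TTAGC")
```
<br />
Models of higher order would condition each position on its predecessors instead of treating positions independently.<br />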
<br />
The parameter &quot;Markov order of the background model&quot; sets the order of the homogeneous Markov model used for modeling positions not covered by a motif. <br />
If this parameter is set to <code>-1</code>, you obtain a uniform distribution, which worked well for ChIP data. For PBM data, orders of up to <code>4</code> resulted in an increased prediction performance in our case studies. The maximum allowed value is <code>5</code>.<br />
<br />
The parameter &quot;Weighting factor&quot; defines the proportion of sequences that you expect to be bound by the targeted factor with high confidence. For ChIP data, the default value of <code>0.2</code> typically works well. <br />
For PBM data, containing a large number of unspecific probes, this parameter should be set to a lower value, e.g. <code>0.01</code>.<br />
<br />
The &quot;Equivalent sample size&quot; reflects the strength of the influence of the prior on the model parameters, where higher values smooth out the parameters to a greater extent.<br />
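<br />
The smoothing effect of an equivalent sample size can be illustrated with a simplified Python sketch. It assumes that the equivalent sample size is spread uniformly over the symbols as pseudocounts when estimating probabilities from counts; this is a common textbook construction and not necessarily how the prior enters the learning principle used by Methyl SlimDimont. The function name <code>smoothed_probabilities</code> is hypothetical.<br />
<br />
```python
def smoothed_probabilities(counts, ess):
    """Estimate symbol probabilities with uniform pseudocounts (illustrative).

    counts maps each symbol to its observed count; ess is the equivalent
    sample size, distributed equally over all symbols.
    """
    pseudo = ess / len(counts)
    total = sum(counts.values()) + ess
    return {symbol: (count + pseudo) / total for symbol, count in counts.items()}


observed = {"A": 6, "C": 2, "G": 0, "T": 0}
weak_prior = smoothed_probabilities(observed, ess=4.0)
strong_prior = smoothed_probabilities(observed, ess=400.0)
```
<br />
With a larger equivalent sample size, the estimates in this sketch move closer to the uniform distribution, which is the smoothing behaviour described above.<br />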
<br />
The parameter &quot;Delete BSs from profile&quot; defines whether binding sites (BSs) of already discovered motifs should be deleted, i.e., &quot;blanked out&quot;, from the sequence before searching for further motifs.<br />
<br />
You can also install this web application within your local Galaxy server. Instructions can be found on the Dimont page of Jstacs. <br />
There you can also download a command line version of Dimont.<br />
<br />
If you experience problems using Methyl SlimDimont, please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Methyl SlimDimont'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar slimdimont<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>Alphabet (Characters of the alphabet as a string of unseparated characters, first listing the symbols in forward orientation and then their complement in the same order. For instance, a methylation-aware alphabet would be specified as ACGTMH,TGCAHM and a standard DNA alphabet as ACGT,TGCA, default = ACGTMH,TGCAHM)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>Input file (The file name of the file containing the input sequences in annotated FastA format as generated by the Data Extractor tool)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Background sample (Background sample containing negative examples, may be di-nucleotide shuffled input sequences, range={background file, shuffled input}, default = shuffled input)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;background file&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bf</font></td><br />
<td>Background file (The file name of the file containing background sequences in annotated FastA format., OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;shuffled input&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Position tag (The tag for the position information in the FastA-annotation of the input file, default = peak)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>Value tag (The tag for the value information in the FastA-annotation of the input file, default = signal)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Standard deviation (The standard deviation of the position distribution centered at the position specified by the position tag, valid range = [1.0, 10000.0], default = 75.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>Weighting factor (The value for weighting the data, between 0 and 1, valid range = [0.0, 1.0], default = 0.2)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Starts</font></td><br />
<td>Starts (The number of pre-optimization runs., valid range = [1, 100], default = 20)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">imw</font></td><br />
<td>Initial motif width (The motif width that is used initially, may be adjusted during optimization., valid range = [1, 50], default = 20)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Model type (The type of the motif model; a PWM model corresponds to a Markov model of order 0., range={LSlim model, Markov model}, default = LSlim model)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;LSlim model&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">md</font></td><br />
<td>Maximum distance (The maximum distance considered in the LSlim model, valid range = [1, 2147483647], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;Markov model&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">o</font></td><br />
<td>Order (The order of the Markov model, valid range = [0, 5], default = 0)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">moobm</font></td><br />
<td>Markov order of background model (The Markov order of the model for the background sequence, -1 defines a uniform distribution., valid range = [-1, 5], default = -1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">e</font></td><br />
<td>Equivalent sample size (Reflects the strength of the prior on the model parameters., valid range = [0.0, Infinity], default = 4.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>Delete BSs from profile (A switch for deleting binding site positions of discovered motifs from the profile before searching for further motifs., default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">afs</font></td><br />
<td>Adjust for shifts (Adjust for shifts of the motif., default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>INT</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar slimdimont i=1extracted/NRF1_HepG2_train1/Extracted_sequences.fasta m="Markov model" outdir=2train threads=8<br />
<br />
=== Sequence Scoring ===<br />
<br />
'''Sequence Scoring''' scans a set of input sequences (e.g., sequences under ChIP-seq peaks) for a given motif model (provided as XML as output by &quot;Methyl SlimDimont&quot;) and provides, for each sequence, i) the start position and strand of the best motif match, ii) the corresponding maximum score, iii) the log-sum occupancy score, iv) the matching sequence, and v) the ID (FastA header) of the sequence.<br />
<br />
The main purpose of this tool is to determine per-sequence scores for classification, for instance, for distinguishing bound from unbound sequences.<br />
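<br />
The difference between the maximum score and a log-sum score can be sketched in Python. The exact definition of the log-sum occupancy score used by the tool is not given here, so the sketch assumes it is the logarithm of the sum of exponentiated per-position scores; the function name <code>max_and_log_sum</code> and the example scores are hypothetical.<br />
<br />
```python
import math


def max_and_log_sum(scores):
    """Return (maximum score, log of the sum of exponentiated scores)."""
    best = max(scores)
    # Factor out the maximum so that exp() cannot overflow.
    log_sum = best + math.log(sum(math.exp(s - best) for s in scores))
    return best, log_sum


# Made-up log-scale scores for all motif positions in one sequence.
site_scores = [-12.0, -3.5, -9.1, -3.5]
best, occupancy = max_and_log_sum(site_scores)
```
<br />
In this sketch, the maximum reflects only the single best match, whereas the log-sum also rewards sequences with several good matches, as in the example with two equally good sites.<br />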
<br />
If you experience problems using Sequence Scoring, please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Sequence Scoring'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar score<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>Input sequences (Input sequences in FastA format)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Model (Model XML as output by Methyl SlimDimont)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar score i=1extracted/NRF1_GM12878_test1/Extracted_sequences.fasta m=2train/Motif_1/SlimDimont_1.xml outdir=3score/NRF1_GM12878<br />
<br />
=== Evaluate Scoring ===<br />
<br />
'''Evaluate Scoring''' computes the area under the ROC curve and under the precision-recall curve based on the scores of a positive and a negative set of sequences. Optionally, the curves may also be drawn.<br />
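<br />
What the area under the ROC curve measures can be sketched in Python using its rank-based interpretation: the probability that a randomly chosen positive sequence scores higher than a randomly chosen negative one, counting ties as one half. The function name <code>auc_roc</code> is hypothetical, and the quadratic-time loop is chosen for clarity; it is not how Evaluate Scoring is implemented.<br />
<br />
```python
def auc_roc(positive_scores, negative_scores):
    """Area under the ROC curve via pairwise comparisons (illustrative only)."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positive_scores) * len(negative_scores))


auc = auc_roc([0.9, 0.4], [0.5, 0.1])
```
<br />
A value of 1.0 corresponds to a perfect separation of positives and negatives, and 0.5 to a scoring that is no better than random.<br />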
<br />
''Evaluate Scoring'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar eval<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Positives (Output of "Sequence Scoring" for positive test sequences.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Negatives (Output of "Sequence Scoring" for negative test sequences.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>Curves (Also compute and draw curves, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>Use sum-occupancy (Use log-sum occupancy score instead of maximum, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar eval p=3score/NRF1_GM12878/Predictions.tsv n=3score/negatives/Predictions.tsv c=true outdir=4eval<br />
<br />
=== Motif scores ===<br />
<br />
'''Motif scores''' computes features based on motif scores of a given motif model scanning sub-sequences along the genome. Motif scores are aggregated in bins of the specified width as maximum score and log of the average exponential score (i.e., average log-likelihood in case of statistical models). The motif model may be provided as PWMs in HOCOMOCO or PFMs in Jaspar format, or as Dimont motif models in XML format. For more complex motif models like Slim models, the current implementation uses several indexes to speed up the scanning process. However, computation of these indexes is rather memory-consuming and often not reasonable for simple PWM models. Hence, a low-memory variant of the tool is available, which is typically only slightly slower for PWM models but substantially slower for Slim models. Output is provided as a gzipped file ''Motif_scores.tsv.gz'' containing the columns chromosome, start position, maximum score, and average score. This output file together with a protocol of the tool run is saved to the specified output directory.<br />
<br />
''Motif scores'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar motif<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Motif model (The motif model in Dimont, HOCOMOCO, or Jaspar format, range={Dimont, HOCOMOCO, Jaspar}, default = Dimont)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;Dimont&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>Dimont motif (Dimont motif model description)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;HOCOMOCO&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">h</font></td><br />
<td>HOCOMOCO PWM (PWM from the HOCOMOCO database)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;Jaspar&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">j</font></td><br />
<td>Jaspar PFM (PFM in Jaspar format)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>Genome (Genome as FastA file)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>FAI of genome (FastA index file of the genome)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bin width (The width of the genomic bins considered)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">l</font></td><br />
<td>Low-memory mode (Use slower mode with a smaller memory footprint, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>INT</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar motif d=2train/Motif_1/SlimDimont_1.xml g=0data/genomes/HepG2_converted_genome.unmasked.fa.gz f=0data/genomes/HepG2_converted_genome.unmasked.fa.fai outdir=7scores b=50<br />
<br />
=== Quick Prediction Tool ===<br />
<br />
'''Quick Prediction Tool''' predicts binding sites of a transcription factor based on a motif model and is also suited for genome-wide predictions. The motif model is provided as the XML output of (Slim) Dimont. <br />
<br />
The tool outputs a list of predictions including, for every prediction, the ID of the sequence (e.g., chromosome) containing the binding site, position and strand of the matching sub-sequence, its score according to the model, the sub-sequence itself (in strand orientation according to the model), and a p-value from a normal distribution fitted to the score distribution of the provided negative examples or a sub-sample of the input data (parameter &quot;Background sample&quot;).<br />
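<br />
The p-value and threshold computation described above can be sketched as follows. This is a minimal illustration and not the Jstacs implementation: the function names are hypothetical, and the use of the sample standard deviation and of the upper tail of the fitted normal distribution are assumptions.<br />

```python
import math
import statistics


def fit_normal(background_scores):
    """Fit a normal distribution to background scores (assumption: sample standard deviation)."""
    mu = statistics.fmean(background_scores)
    sigma = statistics.stdev(background_scores)
    return mu, sigma


def upper_tail_p_value(score, mu, sigma):
    """Probability that a background score is at least as large as `score`."""
    z = (score - mu) / sigma
    # erfc keeps precision for very small p-values better than 1 - cdf
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def threshold_for_level(level, mu, sigma):
    """Score threshold whose upper-tail probability equals the significance level."""
    return statistics.NormalDist(mu, sigma).inv_cdf(1.0 - level)


# Toy background scores (made up for illustration only)
mu, sigma = fit_normal([1.0, 2.0, 3.0])
print(upper_tail_p_value(4.0, mu, sigma))
print(threshold_for_level(1e-5, mu, sigma))
```

In this sketch, a site would be reported if its score is at least the threshold derived from the chosen significance level.<br />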
<br />
If you experience problems using Quick Prediction Tool, please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
''Quick Prediction Tool'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar quickpred<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>Dimont model (The model returned by Dimont (in XML format))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Sequences (The sequences (e.g., a genome) to scan for binding sites)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Background sample (The sequences for determining the prediction threshold. Either a sub-sample of the input sequences or a dedicated background data set., range={sub-sample, background sequences}, default = sub-sample)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;sub-sample&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;background sequences&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bs</font></td><br />
<td>Background sequences (The sequences (e.g., a genome) for determining the prediction threshold)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>Threshold specification (The way of defining the prediction threshold. Either by explicitly defining a significance level or by specifying the number of expected sites, range={significance level, number of sites}, default = significance level)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;significance level&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sl</font></td><br />
<td>Significance level (The significance level for determining the prediction threshold, valid range = [0.0, 1.0E-4], default = 1.0E-6)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;number of sites&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Number of sites (The number of expected binding sites for determining the prediction threshold, valid range = [1, 1000000], default = 10000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar quickpred d=2train/Motif_1/SlimDimont_1.xml s=0data/genomes/HepG2_converted_genome.unmasked.fa.gz sl=1e-5 outdir=6predict<br />
<br />
=== Methylation Sensitivity ===<br />
<br />
'''Methylation Sensitivity''' determines average methylation sensitivity profiles for CpG dinucleotides converted to MpG, CpH, and MpH. As input, it needs a model XML as generated by &quot;Methyl SlimDimont&quot;, and a prediction file as output from the corresponding training run.<br />
<br />
Optionally, Methylation Sensitivity also generates per-sequence methylation sensitivity profiles for the MpH context.<br />
<br />
If you experience problems using Methylation Sensitivity, please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Methylation Sensitivity'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar msens<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Model (The XML file containing the model)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Predictions (The file containing the predictions from the training run)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>Verbose (Output MpH sensitivity profile for every input sequence, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar msens m=2train/Motif_1/SlimDimont_1.xml p=2train/Motif_1/Predictions_for_motif_1.tsv outdir=5msens</div>Grauhttps://www.jstacs.de/index.php?title=MeDeMo&diff=1085MeDeMo2020-03-19T13:15:40Z<p>Grau: /* Download */</p>
<hr />
<div>Accurate models describing the binding specificity of transcription factors (TFs) are essential for a better understanding of transcriptional regulation. Aside from chromatin accessibility and sequence specificity, several studies suggested that DNA methylation influences TF binding in both activating and repressive ways. However, currently available TF motif inference and TF binding site prediction approaches do not adequately incorporate DNA methylation.<br />
<br />
We present MeDeMo (Methylation and Dependencies in Motifs), a novel framework for TF motif discovery and TFBS prediction that incorporates DNA methylation by extending [[Slim]] models. We show that dependencies between nucleotides, as captured by MeDeMo, are essential to represent DNA methylation and that MeDeMo achieves superior prediction performance compared to related approaches. The inferred TF motifs are highly interpretable and can provide new insights into the relation between DNA methylation and TF binding.<br />
<br />
<br />
== Download ==<br />
<br />
MeDeMo is available as<br />
* [http://www.jstacs.de/downloads/MeDeMo-1.0.jar command line interface] version and<br />
* graphical user interface version: [http://www.jstacs.de/downloads/MeDeMoGUI-1.0.jar JAR file] (requires Java >= 1.8 and JavaFX), [http://www.jstacs.de/downloads/MeDeMoGUI-1.0.exe Windows installer], [http://www.jstacs.de/downloads/MeDeMoGUI-1.0.dmg Mac DMG].<br />
<br />
Source code is available from the [https://github.com/Jstacs/Jstacs Jstacs github page] in package <code>projects.methyl</code>.<br />
<br />
Example data (also used for the code examples below) are [http://www.jstacs.de/downloads/MeDeMo-examples.tgz available for download].<br />
<br />
== Tools ==<br />
<br />
The description of tools and tool parameters refers to the command line version, but the same parameters are also present in the GUI version. Additional help may be requested in the GUI version by clicking on the "?" button.<br />
<br />
<br />
=== Data Extractor ===<br />
<br />
'''Data Extractor''' prepares an annotated FastA file as required by Dimont from a genome (in FastA format, including methylated variants) and a tabular file (e.g., BED, GTF, narrowPeak,...). The regions specified in the tabular file are used to determine the center of the extracted sequences. All extracted sequences have the same length as specified by parameter &quot;Width&quot;.<br />
<br />
In case of ChIP data, the center position could for instance be the peak summit.<br />
An annotated FastA file for ChIP-seq data comprising sequences of length 100 centered around the peak summit might look like:<br />
<br />
> peak: 50; signal: 515<br />
ggccatgtgtatttttttaaatttccac...<br />
> peak: 50; signal: 199<br />
GGTCCCCTGGGAGGATGGGGACGTGCTG...<br />
...<br />
<br />
where the center is given as 50 for the first two sequences, and the confidence amounts to 515 and 199, respectively.<br />
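<br />
How a fixed-width window could be placed around a peak center is sketched below. This is an illustration only, not the code of Data Extractor: the function name, the midpoint rounding, the half-open coordinates, and the absence of clipping at chromosome ends are all assumptions.<br />

```python
def extraction_window(start, width, end=None, center_offset=None):
    """Return (left, right) of a window of `width` bases around the peak center.

    Hypothetical helper: the center is either start + center_offset
    (selection "Peak center") or the midpoint of start and end
    (selection "End of peak"). Coordinates are half-open by assumption.
    """
    if center_offset is not None:
        center = start + center_offset
    elif end is not None:
        center = (start + end) // 2
    else:
        raise ValueError("either end or center_offset is required")
    left = center - width // 2
    return left, left + width


left, right = extraction_window(1000, 100, end=1200)
print(left, right)  # window of length 100 around position 1100
```

With a width of 100, the center lies 50 positions after the left border of the window, which matches the annotation <code>peak: 50</code> in the example above.<br />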
<br />
<br />
If you experience problems using Data Extractor, please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Data Extractor'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar extract<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>Genome (The FastA containing all chromosome sequences, may be gzipped)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Peaks (The file containing the peaks in tabular format)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>Chromosome column (The column of the peaks file containing the chromosome, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Start column (The column of the peaks file containing the start position relative to the chromosome start, default = 2)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pp</font></td><br />
<td>Peak position (How the peak position is specified, range={Peak center, End of peak}, default = End of peak)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;Peak center&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">cc</font></td><br />
<td>Center column (The column of the peaks file containing the peak center relative to the start position)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;End of peak&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">e</font></td><br />
<td>End column (The column of the peaks file containing the end position relative to the chromosome start, default = 3)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>Width (The fixed width of all extracted regions, valid range = [1, 10000], default = 1000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sc</font></td><br />
<td>Statistics column (The column of the peaks file containing the peak statistic or a similar measure of confidence, default = 7)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar extract g=0data/genomes/HepG2_converted_genome.unmasked.fa.gz p=0data/HepG2/NRF1_ENCFF313RFR_train1.bed outdir=1extracted/NRF1_HepG2_train1<br />
<br />
=== Methyl SlimDimont ===<br />
<br />
'''Methyl SlimDimont''' is a tool for de-novo motif discovery from DNA sequences including extended, e.g., methylation-aware alphabets.<br />
<br />
Input sequences must be supplied in an annotated FastA format as generated by the Data Extractor tool.<br />
Input sequences may also be obtained from other sources. In this case, the annotation of each sequence needs to provide a value that reflects the confidence that this sequence is bound by the factor of interest.<br />
Such confidences may be peak statistics (e.g., number of fragments under a peak) for ChIP data or signal intensities for PBM data. In addition, you need to provide an anchor position within the sequence. <br />
In case of ChIP data, this anchor position could for instance be the peak summit.<br />
An annotated FastA file for ChIP-seq data comprising sequences of length 100 centered around the peak summit could look like:<br />
<br />
> peak: 50; signal: 515<br />
ggccatgtgtatttttttaaatttccac...<br />
> peak: 50; signal: 199<br />
GGTCCCCTGGGAGGATGGGGACGTGCTG...<br />
...<br />
<br />
where the anchor point is given as 50 for the first two sequences, and the confidence amounts to 515 and 199, respectively.<br />
The FastA comment may contain additional annotations of the format <code>key1 : value1; key2: value2;...</code>.<br />
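<br />
The annotation format can be read with a few lines of code. The following is a minimal sketch, not the parser used by Jstacs; the function name <code>parse_annotation</code> is hypothetical.<br />

```python
def parse_annotation(header):
    """Parse a FastA comment like '> key1: value1; key2: value2' into a dict of strings."""
    text = header.lstrip(">").strip()
    annotation = {}
    for field in text.split(";"):
        if ":" not in field:
            continue  # skip empty fields, e.g. after a trailing ';'
        key, value = field.split(":", 1)
        annotation[key.strip()] = value.strip()
    return annotation


ann = parse_annotation("> peak: 50; signal: 515")
anchor = int(ann["peak"])          # value of the position tag
confidence = float(ann["signal"])  # value of the value tag
print(anchor, confidence)  # prints: 50 515.0
```

Additional keys are simply kept in the dictionary, so a file with further annotations can be handled in the same way.<br />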
<br />
Accordingly, you would need to set the parameter &quot;Position tag&quot; to <code>peak</code> and the parameter &quot;Value tag&quot; to <code>signal</code> for the input file (default values).<br />
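The parameter &quot;Alphabet&quot; specifies the symbols of the (extended) alphabet and their complementary symbols. According to the parameter table below, the default is the methylation-aware alphabet <code>ACGTMH,TGCAHM</code>; a standard DNA alphabet is specified as <code>ACGT,TGCA</code>.<br />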
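<br />
To illustrate how such an alphabet specification pairs each symbol with its complement, the following sketch builds a complement map and computes a reverse complement. The function names are hypothetical, and the sketch assumes upper-case input sequences.<br />

```python
def complement_map(spec):
    """Map each forward symbol to its complement, e.g. for 'ACGTMH,TGCAHM'."""
    forward, complement = spec.split(",")
    if len(forward) != len(complement):
        raise ValueError("both halves of the alphabet must have the same length")
    return dict(zip(forward, complement))


def reverse_complement(seq, spec="ACGTMH,TGCAHM"):
    """Reverse complement of an upper-case sequence over the given alphabet."""
    comp = complement_map(spec)
    return "".join(comp[symbol] for symbol in reversed(seq))


print(reverse_complement("ACMGT"))  # prints: ACHGT
```

In the methylation-aware alphabet, <code>M</code> and <code>H</code> are complementary to each other, so a methylated CpG written as <code>MG</code> on one strand reads <code>CH</code> on the other.<br />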
<br />
For the standard deviation of the position prior, the initial motif length and the number of pre-optimization runs, we provide default values that worked well in our studies on ChIP and PBM data. <br />
However, you may want to adjust these parameters to reflect your prior knowledge.<br />
<br />
For the model type &quot;Markov model&quot;, the parameter &quot;Order&quot; sets the order of the inhomogeneous Markov model used for modeling the motif. If this parameter is set to <code>0</code>, you obtain a position weight matrix (PWM) model. <br />
If it is set to <code>1</code>, you obtain a weight array matrix (WAM) model. The parameter table below lists a valid range of <code>[0, 5]</code> for the order.<br />
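<br />
For an order of <code>0</code>, every motif position contributes independently to the score. The following sketch shows this for a toy PWM; the probabilities are made up for illustration and the function name is hypothetical, so this is not the scoring code of Jstacs.<br />

```python
import math


def pwm_log_score(site, pwm):
    """Log-probability of `site` under a PWM given as one dict of probabilities per position."""
    if len(site) != len(pwm):
        raise ValueError("site and PWM must have the same length")
    return sum(math.log(column[symbol]) for symbol, column in zip(site, pwm))


# Toy PWM of width 2 (hypothetical values)
toy_pwm = [
    {"A": 0.5, "C": 0.25, "G": 0.125, "T": 0.125},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]
print(pwm_log_score("AC", toy_pwm))
```

A model of order <code>1</code> would instead condition the probability of each symbol on the symbol at the preceding position.<br />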
<br />
The parameter &quot;Markov order of the background model&quot; sets the order of the homogeneous Markov model used for modeling positions not covered by a motif. <br />
If this parameter is set to <code>-1</code>, you obtain a uniform distribution, which worked well for ChIP data. For PBM data, orders of up to <code>4</code> resulted in an increased prediction performance in our case studies. The maximum allowed value is <code>5</code>.<br />
<br />
The parameter &quot;Weighting factor&quot; defines the proportion of sequences that you expect to be bound by the targeted factor with high confidence. For ChIP data, the default value of <code>0.2</code> typically works well. <br />
For PBM data, which contain a large number of unspecific probes, this parameter should be set to a lower value, e.g., <code>0.01</code>.<br />
<br />
The &quot;Equivalent sample size&quot; reflects the strength of the influence of the prior on the model parameters, where higher values smooth out the parameters to a greater extent.<br />
<br />
The parameter &quot;Delete BSs from profile&quot; defines whether binding sites (BSs) of already discovered motifs should be deleted, i.e., &quot;blanked out&quot;, from the sequence before searching for further motifs.<br />
<br />
You can also install this web application within your local Galaxy server. Instructions can be found on the Dimont page of Jstacs. <br />
There you can also download a command line version of Dimont.<br />
<br />
If you experience problems using Methyl SlimDimont, please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Methyl SlimDimont'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar slimdimont<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>Alphabet (Characters of the alphabet as a string of unseparated characters, first listing the symbols in forward orientation and then their complement in the same order. For instance, a methylation-aware alphabet would be specified as ACGTMH,TGCAHM and a standard DNA alphabet as ACGT,TGCA, default = ACGTMH,TGCAHM)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>Input file (The file name of the file containing the input sequences in annotated FastA format as generated by the Data Extractor tool)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Background sample (Background sample containing negative examples, may be di-nucleotide shuffled input sequences, range={background file, shuffled input}, default = shuffled input)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;background file&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bf</font></td><br />
<td>Background file (The file name of the file containing background sequences in annotated FastA format., OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;shuffled input&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Position tag (The tag for the position information in the FastA-annotation of the input file, default = peak)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>Value tag (The tag for the value information in the FastA-annotation of the input file, default = signal)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Standard deviation (The standard deviation of the position distribution centered at the position specified by the position tag, valid range = [1.0, 10000.0], default = 75.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>Weighting factor (The value for weighting the data, between 0 and 1, valid range = [0.0, 1.0], default = 0.2)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Starts</font></td><br />
<td>Starts (The number of pre-optimization runs., valid range = [1, 100], default = 20)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">imw</font></td><br />
<td>Initial motif width (The motif width that is used initially, may be adjusted during optimization., valid range = [1, 50], default = 20)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Model type (The type of the motif model; a PWM model corresponds to a Markov model of order 0., range={LSlim model, Markov model}, default = LSlim model)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;LSlim model&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">md</font></td><br />
<td>Maximum distance (The maximum distance considered in the LSlim model, valid range = [1, 2147483647], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;Markov model&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">o</font></td><br />
<td>Order (The order of the Markov model, valid range = [0, 5], default = 0)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">moobm</font></td><br />
<td>Markov order of background model (The Markov order of the model for the background sequence, -1 defines a uniform distribution., valid range = [-1, 5], default = -1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">e</font></td><br />
<td>Equivalent sample size (Reflects the strength of the prior on the model parameters., valid range = [0.0, Infinity], default = 4.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>Delete BSs from profile (A switch for deleting binding site positions of discovered motifs from the profile before searching for further motifs., default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">afs</font></td><br />
<td>Adjust for shifts (Adjust for shifts of the motif., default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>INT</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar slimdimont i=1extracted/NRF1_HepG2_train1/Extracted_sequences.fasta m="Markov model" outdir=2train threads=8<br />
<br />
=== Sequence Scoring ===<br />
<br />
'''Sequence Scoring''' scans a set of input sequences (e.g., sequences under ChIP-seq peaks) for a given motif model (provided as XML as output by &quot;Methyl SlimDimont&quot;) and reports, for each sequence, i) the start position and strand of the best motif match, ii) the corresponding maximum score, iii) the log-sum occupancy score, iv) the matching sequence, and v) the ID (FastA header) of the sequence.<br />
<br />
The main purpose of this tool is to determine per-sequence scores for classification, for instance, for distinguishing bound from unbound sequences.<br />
<br />
If you experience problems using Sequence Scoring, please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Sequence Scoring'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar score<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>Input sequences (Input sequences in FastA format)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Model (Model XML as output by Methyl SlimDimont)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar score i=1extracted/NRF1_GM12878_test1/Extracted_sequences.fasta m=2train/Motif_1/SlimDimont_1.xml outdir=3score/NRF1_GM12878<br />
<br />
=== Evaluate Scoring ===<br />
<br />
'''Evaluate Scoring''' computes the area under the ROC curve and under the precision-recall curve based on the scoring of a positive and a negative set of sequences. Optionally, the curves may also be drawn.<br />
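<br />
As an illustration of the first of these measures, the area under the ROC curve equals the fraction of positive-negative pairs in which the positive sequence receives the higher score. The sketch below uses this pairwise definition and counts ties as one half; it is not the implementation used by the tool, the function name is hypothetical, and the precision-recall curve is not covered.<br />

```python
def auc_roc(positive_scores, negative_scores):
    """Area under the ROC curve as the fraction of correctly ordered pairs (ties count 0.5)."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Toy scores (made up): three of the four pairs are ordered correctly
print(auc_roc([3.0, 2.0], [1.0, 2.5]))  # prints: 0.75
```

The quadratic loop is only suitable for small examples; for large score files, a rank-based computation after sorting would be preferable.<br />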
<br />
''Evaluate Scoring'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar eval<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Positives (Output of "Sequence Scoring" for positive test sequences.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Negatives (Output of "Sequence Scoring" for negative test sequences.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>Curves (Also compute and draw curves, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>Use sum-occupancy (Use log-sum occupancy score instead of maximum, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar eval p=3score/NRF1_GM12878/Predictions.tsv n=3score/negatives/Predictions.tsv c=true outdir=4eval<br />
<br />
=== Motif scores ===<br />
<br />
'''Motif scores''' computes features based on motif scores of a given motif model scanning sub-sequences along the genome. Motif scores are aggregated in bins of the specified width as maximum score and log of the average exponential score (i.e., average log-likelihood in case of statistical models). The motif model may be provided as PWMs in HOCOMOCO or PFMs in Jaspar format, or as Dimont motif models in XML format. For more complex motif models like Slim models, the current implementation uses several indexes to speed up the scanning process. However, computation of these indexes is rather memory-consuming and often not reasonable for simple PWM models. Hence, a low-memory variant of the tool is available, which is typically only slightly slower for PWM models but substantially slower for Slim models. Output is provided as a gzipped file ''Motif_scores.tsv.gz'' containing the columns chromosome, start position, maximum score, and average score. This output file together with a protocol of the tool run is saved to the specified output directory.<br />
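<br />
The two per-bin aggregates described above can be sketched as follows. This is an illustration only: the function name <code>aggregate_bins</code>, the assumption that one score is available per start position, and the handling of the last (possibly shorter) bin are not taken from the MeDeMo source.<br />

```python
import math


def aggregate_bins(scores, bin_width):
    """Return one (bin start, maximum score, log of mean exp(score)) tuple per bin.

    `scores` is assumed to hold one log-score per start position.
    """
    rows = []
    for start in range(0, len(scores), bin_width):
        chunk = scores[start:start + bin_width]
        top = max(chunk)
        # log of the average exponential score, shifted by the maximum
        # so that exp() cannot overflow for large scores
        log_mean = top + math.log(sum(math.exp(s - top) for s in chunk) / len(chunk))
        rows.append((start, top, log_mean))
    return rows


# Toy scores: two bins of width 2
rows = aggregate_bins([0.0, 0.0, math.log(1.0), math.log(3.0)], 2)
for row in rows:
    print(row)
```

Subtracting the maximum before exponentiating leaves the result unchanged but keeps the computation numerically stable.<br />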
<br />
''Motif scores'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar motif<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Motif model (The motif model in Dimont, HOCOMOCO, or Jaspar format, range={Dimont, HOCOMOCO, Jaspar}, default = Dimont)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;Dimont&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>Dimont motif (Dimont motif model description)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;HOCOMOCO&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">h</font></td><br />
<td>HOCOMOCO PWM (PWM from the HOCOMOCO database)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;Jaspar&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">j</font></td><br />
<td>Jaspar PFM (PFM in Jaspar format)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>Genome (Genome as FastA file)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>FAI of genome (FastA index file of the genome)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bin width (The width of the genomic bins considered)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">l</font></td><br />
<td>Low-memory mode (Use slower mode with a smaller memory footprint, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>INT</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar motif d=2train/Motif_1/SlimDimont_1.xml g=0data/genomes/HepG2_converted_genome.unmasked.fa.gz f=0data/genomes/HepG2_converted_genome.unmasked.fa.fai outdir=7scores b=50<br />
<br />
=== Quick Prediction Tool ===<br />
<br />
'''Quick Prediction Tool''' predicts binding sites of a transcription factor based on a motif model and is also suited for genome-wide predictions. The motif model is provided as the XML output of (Slim) Dimont. <br />
<br />
The tool outputs a list of predictions including, for every prediction, the ID of the sequence (e.g., chromosome) containing the binding site, position and strand of the matching sub-sequence, its score according to the model, the sub-sequence itself (in strand orientation according to the model), and a p-value from a normal distribution fitted to the score distribution of the provided negative examples or a sub-sample of the input data (parameter &quot;Background sample&quot;).<br />
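<br />
A p-value of this kind can be sketched as an upper-tail probability under a normal distribution fitted to background scores. The helper names <code>fit_normal</code> and <code>upper_tail_p_value</code> and the use of the sample standard deviation are assumptions for illustration; the estimator used by the tool itself may differ.<br />

```python
import math
import statistics


def fit_normal(background_scores):
    """Estimate mean and (sample) standard deviation of the background scores."""
    mean = statistics.fmean(background_scores)
    sd = statistics.stdev(background_scores)
    return mean, sd


def upper_tail_p_value(score, mean, sd):
    """P(X >= score) for X ~ Normal(mean, sd^2)."""
    z = (score - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Toy background scores, not taken from a real data set
mean, sd = fit_normal([-1.0, 0.0, 1.0])
print(upper_tail_p_value(1.96, mean, sd))
```

A prediction would then be reported if its p-value falls below the chosen significance level (parameter <code>sl</code>).<br />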
<br />
If you experience problems using Quick Prediction Tool, please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
''Quick Prediction Tool'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar quickpred<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>Dimont model (The model returned by Dimont (in XML format))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Sequences (The sequences (e.g., a genome) to scan for binding sites)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Background sample (The sequences for determining the prediction threshold. Either a sub-sample of the input sequences or a dedicated background data set., range={sub-sample, background sequences}, default = sub-sample)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;sub-sample&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;background sequences&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bs</font></td><br />
<td>Background sequences (The sequences (e.g., a genome) for determining the prediction threshold)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>Threshold specification (The way of defining the prediction threshold. Either by explicitly defining a significance level or by specifying the number of expected sites, range={significance level, number of sites}, default = significance level)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;significance level&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sl</font></td><br />
<td>Significance level (The significance level for determining the prediction threshold, valid range = [0.0, 1.0E-4], default = 1.0E-6)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;number of sites&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Number of sites (The number of expected binding sites for determining the prediction threshold, valid range = [1, 1000000], default = 10000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar quickpred d=2train/Motif_1/SlimDimont_1.xml s=0data/genomes/HepG2_converted_genome.unmasked.fa.gz sl=1e-5 outdir=6predict<br />
<br />
=== Methylation Sensitivity ===<br />
<br />
'''Methylation Sensitivity''' determines average methylation sensitivity profiles for CpG dinucleotides converted to MpG, CpH, and MpH. As input, it needs a model XML as generated by &quot;Methyl SlimDimont&quot;, and a prediction file as output from the corresponding training run.<br />
<br />
Optionally, Methylation Sensitivity also generates per-sequence methylation sensitivity profiles for the MpH context.<br />
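<br />
The three dinucleotide contexts can be enumerated for a given sequence as sketched below. The sketch assumes the methylation-aware alphabet <code>ACGTMH,TGCAHM</code>, in which <code>M</code> and <code>H</code> are complementary symbols; the function names are hypothetical, and how the tool turns such variants into a sensitivity profile is not shown here.<br />

```python
def cpg_positions(seq):
    """Indices at which a CpG dinucleotide starts."""
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "CG"]


def cpg_variants(seq, pos):
    """Return the MpG, CpH and MpH variants of the CpG starting at index pos."""
    if seq[pos:pos + 2] != "CG":
        raise ValueError("no CpG at this position")
    return {
        context: seq[:pos] + context + seq[pos + 2:]
        for context in ("MG", "CH", "MH")
    }


seq = "ACGTCG"
for pos in cpg_positions(seq):
    print(pos, cpg_variants(seq, pos))
```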
<br />
If you experience problems using Methylation Sensitivity, please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Methylation Sensitivity'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar msens<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Model (The XML file containing the model)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Predictions (The file containing the predictions from the training run)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>Verbose (Output MpH sensitivity profile for every input sequence, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
 java -jar MeDeMo-1.0.jar msens m=2train/Motif_1/SlimDimont_1.xml p=2train/Motif_1/Predictions_for_motif_1.tsv outdir=5msens
</div>
Grau https://www.jstacs.de/index.php?title=MeDeMo&diff=1084 MeDeMo 2020-03-19T13:14:29Z <p>Grau: /* Methylation Sensitivity */</p>
<hr />
<div>Accurate models describing the binding specificity of transcription factors (TFs) are essential for a better understanding of transcriptional regulation. Aside from chromatin accessibility and sequence specificity, several studies suggested that DNA methylation influences TF binding in both activating and repressive ways. However, currently available TF motif inference and TF binding site prediction approaches do not adequately incorporate DNA methylation.<br />
<br />
We present MeDeMo (Methylation and Dependencies in Motifs), a novel framework for TF motif discovery and TFBS prediction that incorporates DNA methylation by extending [[Slim]] models. We show that dependencies between nucleotides, as captured by MeDeMo, are essential to represent DNA methylation and that MeDeMo achieves superior prediction performance compared to related approaches. The inferred TF motifs are highly interpretable and can provide new insights into the relation between DNA methylation and TF binding.<br />
<br />
<br />
== Download ==<br />
<br />
MeDeMo is available as<br />
* [http://www.jstacs.de/downloads/MeDeMo-1.0.jar command line interface] version and<br />
* graphical user interface version: [http://www.jstacs.de/downloads/MeDeMoGUI-1.0.jar JAR file] (requires Java >= 1.8 and JavaFX), [http://www.jstacs.de/downloads/MeDeMoGUI-1.0.exe Windows installer], [http://www.jstacs.de/downloads/MeDeMoGUI-1.0.dmg Mac DMG].<br />
<br />
Source code is available from the [https://github.com/Jstacs/Jstacs Jstacs github page] in package <code>projects.methyl</code>.<br />
<br />
<br />
== Tools ==<br />
<br />
The description of tools and tool parameters refers to the command line version, but the same parameters are also present in the GUI version. Additional help may be requested in the GUI version by clicking on the "?" button.<br />
<br />
<br />
=== Data Extractor ===<br />
<br />
'''Data Extractor''' prepares an annotated FastA file as required by Dimont from a genome (in FastA format, including methylated variants) and a tabular file (e.g., BED, GTF, narrowPeak,...). The regions specified in the tabular file are used to determine the center of the extracted sequences. All extracted sequences have the same length as specified by parameter &quot;Width&quot;.<br />
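<br />
The derivation of a fixed-width region from a peak can be sketched as follows. Coordinate conventions (0-based, half-open intervals), the rounding of the center, and the function names are assumptions for illustration; the behaviour of the tool at chromosome borders is not covered.<br />

```python
def window_from_end(start, end, width):
    """Selection "End of peak": center the window on the midpoint of [start, end)."""
    center = (start + end) // 2
    left = center - width // 2
    return left, left + width


def window_from_center(start, center_offset, width):
    """Selection "Peak center": the center is given relative to the start position."""
    center = start + center_offset
    left = center - width // 2
    return left, left + width


print(window_from_end(1000, 1200, 100))
print(window_from_center(1000, 80, 100))
```

In both cases, every extracted region has exactly the requested width, regardless of the length of the original peak.<br />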
<br />
In case of ChIP data, the center position could for instance be the peak summit.<br />
An annotated FastA file for ChIP-seq data comprising sequences of length 100 centered around the peak summit might look like:<br />
<br />
> peak: 50; signal: 515<br />
ggccatgtgtatttttttaaatttccac...<br />
> peak: 50; signal: 199<br />
GGTCCCCTGGGAGGATGGGGACGTGCTG...<br />
...<br />
<br />
where the center is given as 50 for the first two sequences, and the confidence amounts to 515 and 199, respectively.<br />
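<br />
Reading such annotations back from a FastA header can be sketched as below. The function name <code>parse_annotation</code> is hypothetical, and values are kept as strings because the tags and their types depend on the input file.<br />

```python
def parse_annotation(header):
    """Parse a header like '> peak: 50; signal: 515' into a dictionary."""
    fields = {}
    for part in header.lstrip(">").split(";"):
        if ":" not in part:
            continue  # skip empty fragments, e.g. after a trailing ';'
        key, value = part.split(":", 1)
        fields[key.strip()] = value.strip()
    return fields


annotation = parse_annotation("> peak: 50; signal: 515")
print(int(annotation["peak"]), float(annotation["signal"]))
```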
<br />
<br />
If you experience problems using Data Extractor, please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Data Extractor'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar extract<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>Genome (The FastA containing all chromosome sequences, may be gzipped)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Peaks (The file containing the peaks in tabular format)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>Chromosome column (The column of the peaks file containing the chromosome, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Start column (The column of the peaks file containing the start position relative to the chromosome start, default = 2)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pp</font></td><br />
<td>Peak position (How the peak position is specified, range={Peak center, End of peak}, default = End of peak)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;Peak center&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">cc</font></td><br />
<td>Center column (The column of the peaks file containing the peak center relative to the start position)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;End of peak&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">e</font></td><br />
<td>End column (The column of the peaks file containing the end position relative to the chromosome start, default = 3)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>Width (The fixed width of all extracted regions, valid range = [1, 10000], default = 1000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sc</font></td><br />
<td>Statistics column (The column of the peaks file containing the peak statistic or a similar measure of confidence, default = 7)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar extract g=0data/genomes/HepG2_converted_genome.unmasked.fa.gz p=0data/HepG2/NRF1_ENCFF313RFR_train1.bed outdir=1extracted/NRF1_HepG2_train1<br />
<br />
=== Methyl SlimDimont ===<br />
<br />
'''Methyl SlimDimont''' is a tool for de-novo motif discovery from DNA sequences including extended, e.g., methylation-aware alphabets.<br />
<br />
Input sequences must be supplied in an annotated FastA format as generated by the Data Extractor tool.<br />
Input sequences may also be obtained from other sources. In this case, the annotation of each sequence needs to provide a value that reflects the confidence that this sequence is bound by the factor of interest.<br />
Such confidences may be peak statistics (e.g., number of fragments under a peak) for ChIP data or signal intensities for PBM data. In addition, you need to provide an anchor position within the sequence. <br />
In case of ChIP data, this anchor position could for instance be the peak summit.<br />
An annotated FastA file for ChIP-seq data comprising sequences of length 100 centered around the peak summit could look like:<br />
<br />
> peak: 50; signal: 515<br />
ggccatgtgtatttttttaaatttccac...<br />
> peak: 50; signal: 199<br />
GGTCCCCTGGGAGGATGGGGACGTGCTG...<br />
...<br />
<br />
where the anchor point is given as 50 for the first two sequences, and the confidence amounts to 515 and 199, respectively.<br />
The FastA comment may contain additional annotations of the format <code>key1 : value1; key2: value2;...</code>.<br />
<br />
Accordingly, you would need to set the parameter &quot;Position tag&quot; to <code>peak</code> and the parameter &quot;Value tag&quot; to <code>signal</code> for the input file (default values).<br />
The parameter Alphabet specifies the symbols of the (extended) alphabet and their complementary symbols. Default is standard DNA alphabet.<br />
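<br />
How such an alphabet specification defines the reverse complement of a sequence can be sketched as follows. The helper name <code>make_reverse_complementer</code> is hypothetical, and only upper-case symbols are handled in this sketch.<br />

```python
def make_reverse_complementer(alphabet_spec):
    """Build a reverse-complement function from a spec like 'ACGTMH,TGCAHM'."""
    forward, complement = alphabet_spec.split(",")
    if len(forward) != len(complement):
        raise ValueError("symbols and complements must have the same length")
    table = str.maketrans(forward, complement)

    def reverse_complement(seq):
        # complement every symbol, then reverse the order
        return seq.translate(table)[::-1]

    return reverse_complement


rc = make_reverse_complementer("ACGTMH,TGCAHM")
print(rc("AMGT"))
```

Because <code>M</code> and <code>H</code> are listed as each other's complements, the symbol <code>M</code> on one strand is read as <code>H</code> on the opposite strand.<br />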
<br />
For the standard deviation of the position prior, the initial motif length and the number of pre-optimization runs, we provide default values that worked well in our studies on ChIP and PBM data. <br />
However, you may want to adjust these parameters to meet your prior information.<br />
<br />
The parameter &quot;Markov order of the motif model&quot; sets the order of the inhomogeneous Markov model used for modeling the motif. If this parameter is set to <code>0</code>, you obtain a position weight matrix (PWM) model. <br />
If it is set to <code>1</code>, you obtain a weight array matrix (WAM) model. You can set the order of the motif model to at most <code>3</code>.<br />
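<br />
The difference between order <code>0</code> and order <code>1</code> can be sketched with toy probabilities. The data layout (dictionaries per motif position), the function names, and the numbers are assumptions for illustration and do not reflect the internal model representation of the tool.<br />

```python
import math


def log_prob_order0(site, pwm):
    """PWM: every position contributes independently; pwm[i][symbol] is a probability."""
    return sum(math.log(pwm[i][symbol]) for i, symbol in enumerate(site))


def log_prob_order1(site, first, cond):
    """WAM: position i >= 1 depends on the preceding symbol via cond[i][(prev, symbol)]."""
    total = math.log(first[site[0]])
    for i in range(1, len(site)):
        total += math.log(cond[i][(site[i - 1], site[i])])
    return total


# Toy motif of width 2 (made-up probabilities)
pwm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
cond = {1: {("A", "G"): 0.9, ("A", "A"): 0.1 / 3, ("A", "C"): 0.1 / 3, ("A", "T"): 0.1 / 3}}
print(log_prob_order0("AG", pwm), log_prob_order1("AG", pwm[0], cond))
```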
<br />
The parameter &quot;Markov order of the background model&quot; sets the order of the homogeneous Markov model used for modeling positions not covered by a motif. <br />
If this parameter is set to <code>-1</code>, you obtain a uniform distribution, which worked well for ChIP data. For PBM data, orders of up to <code>4</code> resulted in an increased prediction performance in our case studies. The maximum allowed value is <code>5</code>.<br />
<br />
The parameter &quot;Weighting factor&quot; defines the proportion of sequences that you expect to be bound by the targeted factor with high confidence. For ChIP data, the default value of <code>0.2</code> typically works well. <br />
For PBM data, containing a large number of unspecific probes, this parameter should be set to a lower value, e.g. <code>0.01</code>.<br />
<br />
The &quot;Equivalent sample size&quot; reflects the strength of the influence of the prior on the model parameters, where higher values smooth out the parameters to a greater extent.<br />
<br />
The parameter &quot;Delete BSs from profile&quot; defines if BSs of already discovered motifs should be deleted, i.e., &quot;blanked out&quot;, from the sequence before searching for further motifs.<br />
<br />
You can also install this web-application within your local Galaxy server. Instructions can be found on the Dimont page of Jstacs. <br />
There you can also download a command line version of Dimont.<br />
<br />
If you experience problems using Methyl SlimDimont, please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Methyl SlimDimont'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar slimdimont<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>Alphabet (Characters of the alphabet as a string of unseparated characters, first listing the symbols in forward orientation and then their complement in the same order. For instance, a methylation-aware alphabet would be specified as ACGTMH,TGCAHM and a standard DNA alphabet as ACGT,TGCA, default = ACGTMH,TGCAHM)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>Input file (The file name of the file containing the input sequences in annotated FastA format as generated by the Data Extractor tool)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Background sample (Background sample containing negative examples, may be di-nucleotide shuffled input sequences, range={background file, shuffled input}, default = shuffled input)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;background file&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bf</font></td><br />
<td>Background file (The file name of the file containing background sequences in annotated FastA format., OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;shuffled input&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Position tag (The tag for the position information in the FastA-annotation of the input file, default = peak)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>Value tag (The tag for the value information in the FastA-annotation of the input file, default = signal)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Standard deviation (The standard deviation of the position distribution centered at the position specified by the position tag, valid range = [1.0, 10000.0], default = 75.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>Weighting factor (The value for weighting the data, between 0 and 1, valid range = [0.0, 1.0], default = 0.2)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Starts</font></td><br />
<td>Starts (The number of pre-optimization runs., valid range = [1, 100], default = 20)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">imw</font></td><br />
<td>Initial motif width (The motif width that is used initially, may be adjusted during optimization., valid range = [1, 50], default = 20)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Model type (The type of the motif model; a PWM model corresponds to a Markov model of order 0., range={LSlim model, Markov model}, default = LSlim model)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;LSlim model&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">md</font></td><br />
<td>Maximum distance (The maximum distance considered in the LSlim model, valid range = [1, 2147483647], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;Markov model&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">o</font></td><br />
<td>Order (The order of the Markov model, valid range = [0, 5], default = 0)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">moobm</font></td><br />
<td>Markov order of background model (The Markov order of the model for the background sequence; -1 defines a uniform distribution, valid range = [-1, 5], default = -1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">e</font></td><br />
<td>Equivalent sample size (Reflects the strength of the prior on the model parameters., valid range = [0.0, Infinity], default = 4.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>Delete BSs from profile (A switch for deleting binding site positions of discovered motifs from the profile before searching for further motifs., default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">afs</font></td><br />
<td>Adjust for shifts (Adjust for shifts of the motif., default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>INT</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar slimdimont i=1extracted/NRF1_HepG2_train1/Extracted_sequences.fasta m="Markov model" outdir=2train threads=8<br />
<br />
=== Sequence Scoring ===<br />
<br />
'''Sequence Scoring''' scans a set of input sequences (e.g., sequences under ChIP-seq peaks) for a given motif model (provided as XML as output by &quot;Methyl SlimDimont&quot;) and provides per-sequence information of i) the start position and strand of the best motif match, ii) the corresponding maximum score, iii) the log-sum occupancy score, iv) the matching sequence, and v) the ID (FastA header) of the sequence.<br />
<br />
The purpose of this tool mainly is to determine per-sequence scores for classification, for instance, distinguishing bound from unbound sequences.<br />
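<br />
The two per-sequence scores can be sketched from a list of site scores. Here, the log-sum occupancy score is taken to be the logarithm of the sum of exponentiated site scores, which is an assumption for illustration; the function name <code>summarize_scores</code> and the toy values are hypothetical as well.<br />

```python
import math


def summarize_scores(site_scores):
    """site_scores: list of (position, strand, log-score) for every candidate site.

    Returns (best position, best strand, maximum score, log-sum score).
    """
    best_pos, best_strand, best = max(site_scores, key=lambda t: t[2])
    # log of the summed exponentiated scores, shifted by the maximum for stability
    log_sum = best + math.log(sum(math.exp(s - best) for _, _, s in site_scores))
    return best_pos, best_strand, best, log_sum


toy = [(0, "+", math.log(1.0)), (1, "-", math.log(3.0)), (2, "+", math.log(4.0))]
print(summarize_scores(toy))
```

The maximum reflects only the single best match, whereas the log-sum also rewards sequences with several weaker matches.<br />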
<br />
If you experience problems using Sequence Scoring, please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Sequence Scoring'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar score<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>Input sequences (Input sequences in FastA format)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Model (Model XML as output by Methyl SlimDimont)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar score i=1extracted/NRF1_GM12878_test1/Extracted_sequences.fasta m=2train/Motif_1/SlimDimont_1.xml outdir=3score/NRF1_GM12878<br />
<br />
=== Evaluate Scoring ===<br />
<br />
'''Evaluate Scoring''' computes the area under the ROC curve and under the precision-recall curve based on the scores of a positive and a negative set of sequences. Optionally, the curves may also be drawn.<br />
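<br />
An area under the precision-recall curve can be sketched as below. Several integration conventions exist; this sketch uses a step-wise sum (often called average precision) and processes tied scores together, which may differ from the interpolation used by the tool. The function name <code>average_precision</code> is hypothetical.<br />

```python
def average_precision(pos_scores, neg_scores):
    """Step-wise area under the precision-recall curve."""
    if not pos_scores:
        raise ValueError("need at least one positive score")
    labeled = sorted(
        [(s, 1) for s in pos_scores] + [(s, 0) for s in neg_scores],
        key=lambda t: -t[0],
    )
    tp = fp = 0
    area = 0.0
    prev_recall = 0.0
    i = 0
    while i < len(labeled):
        threshold = labeled[i][0]
        # consume all items tied at this threshold together
        while i < len(labeled) and labeled[i][0] == threshold:
            tp += labeled[i][1]
            fp += 1 - labeled[i][1]
            i += 1
        recall = tp / len(pos_scores)
        precision = tp / (tp + fp)
        area += (recall - prev_recall) * precision
        prev_recall = recall
    return area


print(average_precision([3.0, 1.0], [2.0]))
```

Unlike the area under the ROC curve, this measure depends on the ratio of positive to negative sequences, so values are only comparable between data sets with similar class proportions.<br />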
<br />
''Evaluate Scoring'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar eval<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Positives (Output of "Sequence Scoring" for positive test sequences.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Negatives (Output of "Sequence Scoring" for negative test sequences.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>Curves (Also compute and draw curves, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>Use sum-occupancy (Use log-sum occupancy score instead of maximum, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar eval p=3score/NRF1_GM12878/Predictions.tsv n=3score/negatives/Predictions.tsv c=true outdir=4eval<br />
<br />
=== Motif scores ===<br />
<br />
'''Motif scores''' computes features based on motif scores of a given motif model scanning sub-sequences along the genome. Motif scores are aggregated in bins of the specified width as maximum score and log of the average exponential score (i.e., average log-likelihood in case of statistical models). The motif model may be provided as PWMs in HOCOMOCO or PFMs in Jaspar format, or as Dimont motif models in XML format. For more complex motif models like Slim models, the current implementation uses several indexes to speed up the scanning process. However, computation of these indexes is rather memory-consuming and often not reasonable for simple PWM models. Hence, a low-memory variant of the tool is available, which is typically only slightly slower for PWM models but substantially slower for Slim models. Output is provided as a gzipped file ''Motif_scores.tsv.gz'' containing the columns chromosome, start position, maximum score, and average score. This output file together with a protocol of the tool run is saved to the specified output directory.<br />
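<br />
For PFM input, counts need to be turned into scores before scanning. One common conversion into log-odds scores is sketched below; the pseudocount, the uniform background of 0.25, and the function name <code>pfm_to_log_odds</code> are assumptions for illustration, and the conversion applied by the tool itself is not documented here.<br />

```python
import math


def pfm_to_log_odds(pfm, pseudocount=1.0, background=0.25):
    """pfm: dict mapping each symbol to its list of counts per motif position."""
    symbols = sorted(pfm)
    length = len(pfm[symbols[0]])
    pwm = []
    for i in range(length):
        # the pseudocount is added once per symbol, so the column total grows accordingly
        total = sum(pfm[s][i] for s in symbols) + pseudocount * len(symbols)
        pwm.append({
            s: math.log((pfm[s][i] + pseudocount) / total / background)
            for s in symbols
        })
    return pwm


# Toy count matrix of width 2 (made-up counts)
toy_pfm = {"A": [8, 0], "C": [0, 0], "G": [0, 8], "T": [0, 0]}
print(pfm_to_log_odds(toy_pfm)[0])
```

The pseudocount keeps symbols with zero counts from receiving a score of negative infinity.<br />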
<br />
''Motif scores'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar motif<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Motif model (The motif model in Dimont, HOCOMOCO, or Jaspar format, range={Dimont, HOCOMOCO, Jaspar}, default = Dimont)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;Dimont&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>Dimont motif (Dimont motif model description)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;HOCOMOCO&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">h</font></td><br />
<td>HOCOMOCO PWM (PWM from the HOCOMOCO database)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;Jaspar&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">j</font></td><br />
<td>Jaspar PFM (PFM in Jaspar format)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>Genome (Genome as FastA file)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>FAI of genome (FastA index file of the genome)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bin width (The width of the genomic bins considered)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">l</font></td><br />
<td>Low-memory mode (Use slower mode with a smaller memory footprint, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>INT</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar motif d=2train/Motif_1/SlimDimont_1.xml g=0data/genomes/HepG2_converted_genome.unmasked.fa.gz f=0data/genomes/HepG2_converted_genome.unmasked.fa.fai outdir=7scores b=50<br />
<br />
=== Quick Prediction Tool ===<br />
<br />
'''Quick Prediction Tool''' predicts binding sites of a transcription factor based on a motif model and is also suited for genome-wide predictions. The motif model is provided as the XML output of (Slim) Dimont. <br />
<br />
The tool outputs a list of predictions including, for every prediction, the ID of the sequence (e.g., chromosome) containing the binding site, position and strand of the matching sub-sequence, its score according to the model, the sub-sequence itself (in strand orientation according to the model), and a p-value from a normal distribution fitted to the score distribution of the provided negative examples or a sub-sample of the input data (parameter &quot;Background sample&quot;).<br />
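<br />
How a p-value and a prediction threshold can be derived from a normal distribution fitted to background scores is sketched below in Python. This is an illustration under the assumption that the background scores are given as a plain list of numbers; the function names are hypothetical, and the actual tool may fit and evaluate the distribution differently.<br />

```python
import math
from statistics import NormalDist, mean, stdev


def fit_background(negative_scores):
    """Fit a normal distribution to the scores of background sub-sequences."""
    return NormalDist(mean(negative_scores), stdev(negative_scores))


def p_value(score, background):
    """Upper-tail probability of observing at least this score under the background."""
    z = (score - background.mean) / background.stdev
    # erfc keeps precision for very small p-values, unlike 1 - cdf
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def threshold_for_level(background, significance_level):
    """Score above which the upper-tail probability drops below the given level."""
    return background.inv_cdf(1.0 - significance_level)
```

With a significance level of <code>1e-5</code>, as in the example call further below, only sub-sequences scoring above <code>threshold_for_level(background, 1e-5)</code> would be reported in this sketch.<br />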
<br />
If you experience problems using Quick Prediction Tool, please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
''Quick Prediction Tool'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar quickpred<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>Dimont model (The model returned by Dimont (in XML format))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Sequences (The sequences (e.g., a genome) to scan for binding sites)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Background sample (The sequences for determining the prediction threshold. Either a sub-sample of the input sequences or a dedicated background data set., range={sub-sample, background sequences}, default = sub-sample)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;sub-sample&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;background sequences&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bs</font></td><br />
<td>Background sequences (The sequences (e.g., a genome) for determining the prediction threshold)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>Threshold specification (The way of defining the prediction threshold. Either by explicitly defining a significance level or by specifying the number of expected sites, range={significance level, number of sites}, default = significance level)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;significance level&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sl</font></td><br />
<td>Significance level (The significance level for determining the prediction threshold, valid range = [0.0, 1.0E-4], default = 1.0E-6)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;number of sites&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Number of sites (The number of expected binding sites for determining the prediction threshold, valid range = [1, 1000000], default = 10000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar quickpred d=2train/Motif_1/SlimDimont_1.xml s=0data/genomes/HepG2_converted_genome.unmasked.fa.gz sl=1e-5 outdir=6predict<br />
<br />
=== Methylation Sensitivity ===<br />
<br />
'''Methylation Sensitivity''' determines average methylation sensitivity profiles for CpG dinucleotides converted to MpG, CpH, and MpH. As input, it needs a model XML as generated by &quot;Methyl SlimDimont&quot;, and a prediction file as output from the corresponding training run.<br />
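<br />
The three converted variants of a CpG dinucleotide can be generated as in the following Python sketch. It uses the symbols <code>M</code> and <code>H</code> of the methylation-aware alphabet <code>ACGTMH</code>; the function name <code>cpg_variants</code> is hypothetical. How the tool derives a sensitivity profile from the scores of these variants is not described here, so the sketch stops at generating them.<br />

```python
def cpg_variants(seq, pos):
    """Return the MpG, CpH, and MpH variants of the CpG starting at index pos.

    Hypothetical helper for illustration; seq is expected in upper case.
    """
    if seq[pos:pos + 2] != "CG":
        raise ValueError("no CpG dinucleotide at position %d" % pos)
    variants = {}
    for name in ("MpG", "CpH", "MpH"):
        # name[0] and name[2] are the two symbols of the converted dinucleotide
        variants[name] = seq[:pos] + name[0] + name[2] + seq[pos + 2:]
    return variants
```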
<br />
Optionally, Methylation Sensitivity also generates per-sequence methylation sensitivity profiles for the MpH context.<br />
<br />
If you experience problems using Methylation Sensitivity, please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Methylation Sensitivity'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar msens<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Model (The XML file containing the model)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Predictions (The file containing the predictions from the training run)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>Verbose (Output MpH sensitivity profile for every input sequence, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar msens m=2train/Motif_1/SlimDimont_1.xml p=2train/Motif_1/Predictions_for_motif_1.tsv outdir=5msens</div>Grauhttps://www.jstacs.de/index.php?title=MeDeMo&diff=1083MeDeMo2020-03-19T13:14:07Z<p>Grau: /* Quick Prediction Tool */</p>
<hr />
<div>Accurate models describing the binding specificity of transcription factors (TFs) are essential for a better understanding of transcriptional regulation. Aside from chromatin accessibility and sequence specificity, several studies suggested that DNA methylation influences TF binding in both activating and repressive ways. However, currently available TF motif inference and TF binding site prediction approaches do not adequately incorporate DNA methylation.<br />
<br />
We present MeDeMo (Methylation and Dependencies in Motifs), a novel framework for TF motif discovery and TFBS prediction that incorporates DNA methylation by extending [[Slim]] models. We show that dependencies between nucleotides, captured by MeDeMo, are essential to represent DNA methylation and that MeDeMo achieves superior prediction performance compared to related approaches. The inferred TF motifs are highly interpretable and can provide new insights into the relation between DNA methylation and TF binding.<br />
<br />
<br />
== Download ==<br />
<br />
MeDeMo is available as<br />
* [http://www.jstacs.de/downloads/MeDeMo-1.0.jar command line interface] version and<br />
* graphical user interface version: [http://www.jstacs.de/downloads/MeDeMoGUI-1.0.jar JAR file] (requires Java >= 1.8 and JavaFX), [http://www.jstacs.de/downloads/MeDeMoGUI-1.0.exe Windows installer], [http://www.jstacs.de/downloads/MeDeMoGUI-1.0.dmg Mac DMG].<br />
<br />
Source code is available from the [https://github.com/Jstacs/Jstacs Jstacs github page] in package <code>projects.methyl</code>.<br />
<br />
<br />
== Tools ==<br />
<br />
The description of tools and tool parameters refers to the command line version, but the same parameters are also present in the GUI version. Additional help may be requested in the GUI version by clicking on the "?" button.<br />
<br />
<br />
=== Data Extractor ===<br />
<br />
'''Data Extractor''' prepares an annotated FastA file as required by Dimont from a genome (in FastA format, including methylated variants) and a tabular file (e.g., BED, GTF, narrowPeak,...). The regions specified in the tabular file are used to determine the center of the extracted sequences. All extracted sequences have the same length as specified by parameter &quot;Width&quot;.<br />
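<br />
The idea of cutting fixed-width regions around a peak center can be sketched in Python as follows. The sketch assumes that the center is the midpoint between start and end position and that regions are clipped at the chromosome boundaries; both are assumptions for illustration, and the function names are hypothetical.<br />

```python
def centered_window(start, end, width):
    """Return (window_start, window_end) of a fixed-width region around a peak.

    The midpoint of start and end is used as the center (an assumption).
    """
    center = (start + end) // 2
    window_start = center - width // 2
    return window_start, window_start + width


def extract(chromosome_seq, start, end, width):
    """Cut the window out of a chromosome string, clipping at its boundaries."""
    window_start, window_end = centered_window(start, end, width)
    return chromosome_seq[max(window_start, 0):min(window_end, len(chromosome_seq))]
```

For a peak spanning positions 100 to 200 and a width of 50, this sketch yields the region from 125 to 175.<br />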
<br />
In case of ChIP data, the center position could for instance be the peak summit.<br />
An annotated FastA file for ChIP-seq data comprising sequences of length 100 centered around the peak summit might look like:<br />
<br />
> peak: 50; signal: 515<br />
ggccatgtgtatttttttaaatttccac...<br />
> peak: 50; signal: 199<br />
GGTCCCCTGGGAGGATGGGGACGTGCTG...<br />
...<br />
<br />
where the center is given as 50 for the first two sequences, and the confidence amounts to 515 and 199, respectively.<br />
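<br />
For downstream scripts, the annotations in such a file can be read with a few lines of Python. The following is a minimal sketch that assumes the <code>key: value; key: value</code> layout shown above; the function names are hypothetical and all values are returned as strings.<br />

```python
def parse_annotation(header):
    """Split a FastA comment like '> peak: 50; signal: 515' into a dict of strings."""
    fields = {}
    for part in header.lstrip(">").split(";"):
        if ":" in part:
            key, value = part.split(":", 1)
            fields[key.strip()] = value.strip()
    return fields


def read_annotated_fasta(lines):
    """Yield (annotation dict, sequence) pairs from an iterable of text lines."""
    annotation, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if annotation is not None:
                yield annotation, "".join(chunks)
            annotation, chunks = parse_annotation(line), []
        elif line:
            chunks.append(line)
    if annotation is not None:
        yield annotation, "".join(chunks)
```

Passing an open file handle, e.g. <code>read_annotated_fasta(open(path))</code>, yields one pair per sequence, from which the peak position and signal can be converted to numbers as needed.<br />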
<br />
<br />
If you experience problems using Data Extractor, please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Data Extractor'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar extract<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>Genome (The FastA containing all chromosome sequences, may be gzipped)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Peaks (The file containing the peaks in tabular format)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>Chromosome column (The column of the peaks file containing the chromosome, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Start column (The column of the peaks file containing the start position relative to the chromosome start, default = 2)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pp</font></td><br />
<td>Peak position (How the peak is specified, range={Peak center, End of peak}, default = End of peak)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;Peak center&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">cc</font></td><br />
<td>Center column (The column of the peaks file containing the peak center relative to the start position)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;End of peak&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">e</font></td><br />
<td>End column (The column of the peaks file containing the end position relative to the chromosome start, default = 3)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>Width (The fixed width of all extracted regions, valid range = [1, 10000], default = 1000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sc</font></td><br />
<td>Statistics column (The column of the peaks file containing the peak statistic or a similar measure of confidence, default = 7)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar extract g=0data/genomes/HepG2_converted_genome.unmasked.fa.gz p=0data/HepG2/NRF1_ENCFF313RFR_train1.bed outdir=1extracted/NRF1_HepG2_train1<br />
<br />
=== Methyl SlimDimont ===<br />
<br />
'''Methyl SlimDimont''' is a tool for de-novo motif discovery from DNA sequences including extended, e.g., methylation-aware alphabets.<br />
<br />
Input sequences must be supplied in an annotated FastA format as generated by the Data Extractor tool.<br />
Input sequences may also be obtained from other sources. In this case, the annotation of each sequence needs to provide a value that reflects the confidence that this sequence is bound by the factor of interest.<br />
Such confidences may be peak statistics (e.g., number of fragments under a peak) for ChIP data or signal intensities for PBM data. In addition, you need to provide an anchor position within the sequence. <br />
In case of ChIP data, this anchor position could for instance be the peak summit.<br />
An annotated FastA file for ChIP-seq data comprising sequences of length 100 centered around the peak summit could look like:<br />
<br />
> peak: 50; signal: 515<br />
ggccatgtgtatttttttaaatttccac...<br />
> peak: 50; signal: 199<br />
GGTCCCCTGGGAGGATGGGGACGTGCTG...<br />
...<br />
<br />
where the anchor point is given as 50 for the first two sequences, and the confidence amounts to 515 and 199, respectively.<br />
The FastA comment may contain additional annotations of the format <code>key1 : value1; key2: value2;...</code>.<br />
<br />
Accordingly, you would need to set the parameter &quot;Position tag&quot; to <code>peak</code> and the parameter &quot;Value tag&quot; to <code>signal</code> for the input file (default values).<br />
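The parameter &quot;Alphabet&quot; specifies the symbols of the (extended) alphabet and their complementary symbols. The default is the methylation-aware alphabet <code>ACGTMH,TGCAHM</code>; a standard DNA alphabet is specified as <code>ACGT,TGCA</code>.<br />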
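<br />
The alphabet specification lists the symbols first and their complements second, in the same order. A minimal Python sketch of how such a specification translates into a reverse complement is given below; the function names are hypothetical, and the sketch expects upper-case sequences.<br />

```python
def complement_map(alphabet_spec):
    """Turn a specification like 'ACGTMH,TGCAHM' into a symbol-to-complement dict."""
    forward, complement = alphabet_spec.split(",")
    if len(forward) != len(complement):
        raise ValueError("both halves of the alphabet must have the same length")
    return dict(zip(forward, complement))


def reverse_complement(seq, alphabet_spec="ACGTMH,TGCAHM"):
    """Reverse complement of seq under the given (extended) alphabet."""
    comp = complement_map(alphabet_spec)
    return "".join(comp[symbol] for symbol in reversed(seq))
```

With the methylation-aware alphabet, <code>M</code> and <code>H</code> are complementary to each other, so the reverse complement of <code>ACMG</code> is <code>CHGT</code>.<br />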
<br />
For the standard deviation of the position prior, the initial motif length and the number of pre-optimization runs, we provide default values that worked well in our studies on ChIP and PBM data. <br />
However, you may want to adjust these parameters to meet your prior information.<br />
<br />
The parameter &quot;Markov order of the motif model&quot; sets the order of the inhomogeneous Markov model used for modeling the motif. If this parameter is set to <code>0</code>, you obtain a position weight matrix (PWM) model. <br />
If it is set to <code>1</code>, you obtain a weight array matrix (WAM) model. You can set the order of the motif model to at most <code>3</code>.<br />
<br />
The parameter &quot;Markov order of the background model&quot; sets the order of the homogeneous Markov model used for modeling positions not covered by a motif. <br />
If this parameter is set to <code>-1</code>, you obtain a uniform distribution, which worked well for ChIP data. For PBM data, orders of up to <code>4</code> resulted in an increased prediction performance in our case studies. The maximum allowed value is <code>5</code>.<br />
<br />
The parameter &quot;Weighting factor&quot; defines the proportion of sequences that you expect to be bound by the targeted factor with high confidence. For ChIP data, the default value of <code>0.2</code> typically works well. <br />
For PBM data, containing a large number of unspecific probes, this parameter should be set to a lower value, e.g. <code>0.01</code>.<br />
<br />
The &quot;Equivalent sample size&quot; reflects the strength of the influence of the prior on the model parameters, where higher values smooth out the parameters to a greater extent.<br />
<br />
The parameter &quot;Delete BSs from profile&quot; defines whether binding sites (BSs) of already discovered motifs should be deleted, i.e., &quot;blanked out&quot;, from the sequence before searching for further motifs.<br />
<br />
You can also install this web application within your local Galaxy server. Instructions can be found on the Dimont page of Jstacs. <br />
There you can also download a command line version of Dimont.<br />
<br />
If you experience problems using Methyl SlimDimont, please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Methyl SlimDimont'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar slimdimont<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>Alphabet (Characters of the alphabet as a string of unseparated characters, first listing the symbols in forward orientation and then their complement in the same order. For instance, a methylation-aware alphabet would be specified as ACGTMH,TGCAHM and a standard DNA alphabet as ACGT,TGCA, default = ACGTMH,TGCAHM)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>Input file (The file name of the file containing the input sequences in annotated FastA format as generated by the Data Extractor tool)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Background sample (Background sample containing negative examples, may be di-nucleotide shuffled input sequences, range={background file, shuffled input}, default = shuffled input)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;background file&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bf</font></td><br />
<td>Background file (The file name of the file containing background sequences in annotated FastA format., OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;shuffled input&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Position tag (The tag for the position information in the FastA-annotation of the input file, default = peak)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>Value tag (The tag for the value information in the FastA-annotation of the input file, default = signal)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Standard deviation (The standard deviation of the position distribution centered at the position specified by the position tag, valid range = [1.0, 10000.0], default = 75.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>Weighting factor (The value for weighting the data, between 0 and 1, valid range = [0.0, 1.0], default = 0.2)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Starts</font></td><br />
<td>Starts (The number of pre-optimization runs., valid range = [1, 100], default = 20)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">imw</font></td><br />
<td>Initial motif width (The motif width that is used initially, may be adjusted during optimization., valid range = [1, 50], default = 20)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Model type (The type of the motif model; a PWM model corresponds to a Markov model of order 0., range={LSlim model, Markov model}, default = LSlim model)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;LSlim model&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">md</font></td><br />
<td>Maximum distance (The maximum distance considered in the LSlim model, valid range = [1, 2147483647], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;Markov model&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">o</font></td><br />
<td>Order (The order of the Markov model, valid range = [0, 5], default = 0)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">moobm</font></td><br />
<td>Markov order of background model (The Markov order of the model for the background sequence, -1 defines uniform distribution., valid range = [-1, 5], default = -1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">e</font></td><br />
<td>Equivalent sample size (Reflects the strength of the prior on the model parameters., valid range = [0.0, Infinity], default = 4.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>Delete BSs from profile (A switch for deleting binding site positions of discovered motifs from the profile before searching for further motifs., default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">afs</font></td><br />
<td>Adjust for shifts (Adjust for shifts of the motif., default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>INT</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar slimdimont i=1extracted/NRF1_HepG2_train1/Extracted_sequences.fasta m="Markov model" outdir=2train threads=8<br />
<br />
=== Sequence Scoring ===<br />
<br />
'''Sequence Scoring''' scans a set of input sequences (e.g., sequences under ChIP-seq peaks) for a given motif model (provided as XML as output by &quot;Methyl SlimDimont&quot;) and provides, for each sequence, i) the start position and strand of the best motif match, ii) the corresponding maximum score, iii) the log-sum occupancy score, iv) the matching sequence, and v) the ID (FastA header) of the sequence.<br />
<br />
The main purpose of this tool is to determine per-sequence scores for classification, for instance, distinguishing bound from unbound sequences.<br />
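<br />
The difference between the maximum score and a log-sum score can be illustrated with a short Python sketch. It assumes that the log-sum occupancy score is the logarithm of the summed exponentiated match scores of one sequence, which is an interpretation for illustration and not taken from the tool's source code; the function name is hypothetical.<br />

```python
import math


def max_and_log_sum(scores):
    """Return (maximum, log of summed exponentiated scores) for one sequence.

    scores are assumed to be log-scale scores of all candidate motif
    matches (both strands) within that sequence.
    """
    best = max(scores)
    # log-sum-exp, shifted by the maximum for numerical stability
    log_sum = best + math.log(sum(math.exp(s - best) for s in scores))
    return best, log_sum
```

Under this assumption, a sequence with several moderately good matches can obtain a higher log-sum score than a sequence with one slightly better match, whereas the maximum score only reflects the single best match.<br />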
<br />
If you experience problems using Sequence Scoring, please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Sequence Scoring'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar score<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>Input sequences (Input sequences in FastA format)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Model (Model XML as output by Methyl SlimDimont)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar score i=1extracted/NRF1_GM12878_test1/Extracted_sequences.fasta m=2train/Motif_1/SlimDimont_1.xml outdir=3score/NRF1_GM12878<br />
<br />
=== Evaluate Scoring ===<br />
<br />
'''Evaluate Scoring''' computes the area under the ROC curve and under the precision-recall curve based on the scoring of a positive and a negative set of sequences. Optionally, the curves may also be drawn.<br />
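<br />
Both measures can be approximated from two lists of scores, as in the following Python sketch. The area under the ROC curve is computed as the fraction of positive/negative pairs that are ranked correctly; the area under the precision-recall curve is approximated by average precision, which may differ from the interpolation used by the tool. The function names are hypothetical.<br />

```python
def auc_roc(pos, neg):
    """Fraction of (positive, negative) pairs ranked correctly; ties count one half."""
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(pos, neg):
    """Mean precision at the rank of each positive (an approximation of AUC-PR)."""
    ranked = sorted([(s, 1) for s in pos] + [(s, 0) for s in neg],
                    key=lambda item: -item[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank
    return total / len(pos)
```

The pairwise loop is quadratic in the number of sequences and is meant for illustration only; for large test sets, a rank-based computation is preferable.<br />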
<br />
''Evaluate Scoring'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar eval<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Positives (Output of "Sequence Scoring" for positive test sequences.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Negatives (Output of "Sequence Scoring" for negative test sequences.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>Curves (Also compute and draw curves, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>Use sum-occupancy (Use log-sum occupancy score instead of maximum, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar eval p=3score/NRF1_GM12878/Predictions.tsv n=3score/negatives/Predictions.tsv c=true outdir=4eval<br />
<br />
=== Motif scores ===<br />
<br />
'''Motif scores''' computes features based on motif scores of a given motif model scanning sub-sequences along the genome. Motif scores are aggregated in bins of the specified width as maximum score and log of the average exponential score (i.e., average log-likelihood in case of statistical models). The motif model may be provided as PWMs in HOCOMOCO or PFMs in Jaspar format, or as Dimont motif models in XML format. For more complex motif models like Slim models, the current implementation uses several indexes to speed up the scanning process. However, computation of these indexes is rather memory-consuming and often not reasonable for simple PWM models. Hence, a low-memory variant of the tool is available, which is typically only slightly slower for PWM models but substantially slower for Slim models. Output is provided as a gzipped file ''Motif_scores.tsv.gz'' containing the columns chromosome, start position, maximum score, and average score. This output file together with a protocol of the tool run is saved to the specified output directory.<br />
<br />
''Motif scores'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar motif<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Motif model (The motif model in Dimont, HOCOMOCO, or Jaspar format, range={Dimont, HOCOMOCO, Jaspar}, default = Dimont)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;Dimont&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>Dimont motif (Dimont motif model description)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;HOCOMOCO&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">h</font></td><br />
<td>HOCOMOCO PWM (PWM from the HOCOMOCO database)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;Jaspar&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">j</font></td><br />
<td>Jaspar PFM (PFM in Jaspar format)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>Genome (Genome as FastA file)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>FAI of genome (FastA index file of the genome)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bin width (The width of the genomic bins considered)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">l</font></td><br />
<td>Low-memory mode (Use slower mode with a smaller memory footprint, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>INT</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar motif d=2train/Motif_1/SlimDimont_1.xml g=0data/genomes/HepG2_converted_genome.unmasked.fa.gz f=0data/genomes/HepG2_converted_genome.unmasked.fa.fai outdir=7scores b=50<br />
<br />
=== Quick Prediction Tool ===<br />
<br />
'''Quick Prediction Tool''' predicts binding sites of a transcription factor based on a motif model and is also suited for genome-wide predictions. The motif model is provided as the XML output of (Slim) Dimont. <br />
<br />
The tool outputs a list of predictions including, for every prediction, the ID of the sequence (e.g., chromosome) containing the binding site, position and strand of the matching sub-sequence, its score according to the model, the sub-sequence itself (in strand orientation according to the model), and a p-value from a normal distribution fitted to the score distribution of the provided negative examples or a sub-sample of the input data (parameter &quot;Background sample&quot;).<br />
<br />
If you experience problems using Quick Prediction Tool, please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
''Quick Prediction Tool'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar quickpred<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>Dimont model (The model returned by Dimont (in XML format))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Sequences (The sequences (e.g., a genome) to scan for binding sites)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Background sample (The sequences for determining the prediction threshold. Either a sub-sample of the input sequences or a dedicated background data set., range={sub-sample, background sequences}, default = sub-sample)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;sub-sample&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;background sequences&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bs</font></td><br />
<td>Background sequences (The sequences (e.g., a genome) for determining the prediction threshold)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>Threshold specification (The way of defining the prediction threshold. Either by explicitly defining a significance level or by specifying the number of expected sites, range={significance level, number of sites}, default = significance level)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;significance level&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sl</font></td><br />
<td>Significance level (The significance level for determining the prediction threshold, valid range = [0.0, 1.0E-4], default = 1.0E-6)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;number of sites&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Number of sites (The number of expected binding sites for determining the prediction threshold, valid range = [1, 1000000], default = 10000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar quickpred d=2train/Motif_1/SlimDimont_1.xml s=0data/genomes/HepG2_converted_genome.unmasked.fa.gz sl=1e-5 outdir=6predict<br />
<br />
=== Methylation Sensitivity ===<br />
<br />
'''Methylation Sensitivity''' determines average methylation sensitivity profiles for CpG dinucleotides converted to MpG, CpH, and MpH. As input, it needs a model XML as generated by &quot;Methyl SlimDimont&quot;, and a prediction file as output from the corresponding training run.<br />
<br />
Optionally, Methylation Sensitivity also generates per-sequence methylation sensitivity profiles for the MpH context.<br />
<br />
If you experience problems using Methylation Sensitivity, please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Methylation Sensitivity'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar msens<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Model (The XML file containing the model)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Predictions (The file containing the predictions from the training run)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>Verbose (Output MpH sensitivity profile for every input sequence, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar msens</div>Grauhttps://www.jstacs.de/index.php?title=MeDeMo&diff=1082MeDeMo2020-03-19T13:13:41Z<p>Grau: /* Motif scores */</p>
<hr />
<div>Accurate models describing the binding specificity of transcription factors (TFs) are essential for a better understanding of transcriptional regulation. Aside from chromatin accessibility and sequence specificity, several studies suggested that DNA methylation influences TF binding in both activating and repressive ways. However, currently available TF motif inference and TF binding site prediction approaches do not adequately incorporate DNA methylation.<br />
<br />
We present MeDeMo (Methylation and Dependencies in Motifs), a novel framework for TF motif discovery and TFBS prediction that incorporates DNA methylation by extending [[Slim]] models. We show that dependencies between nucleotides, as captured by MeDeMo, are essential to represent DNA methylation and that MeDeMo achieves superior prediction performance compared to related approaches. The inferred TF motifs are highly interpretable and can provide new insights into the relation between DNA methylation and TF binding.<br />
<br />
<br />
== Download ==<br />
<br />
MeDeMo is available as<br />
* [http://www.jstacs.de/downloads/MeDeMo-1.0.jar command line interface] version and<br />
* graphical user interface version: [http://www.jstacs.de/downloads/MeDeMoGUI-1.0.jar JAR file] (requires Java >= 1.8 and JavaFX), [http://www.jstacs.de/downloads/MeDeMoGUI-1.0.exe Windows installer], [http://www.jstacs.de/downloads/MeDeMoGUI-1.0.dmg Mac DMG].<br />
<br />
Source code is available from the [https://github.com/Jstacs/Jstacs Jstacs github page] in package <code>projects.methyl</code>.<br />
<br />
<br />
== Tools ==<br />
<br />
The description of tools and tool parameters refers to the command line version, but the same parameters are also present in the GUI version. Additional help may be requested in the GUI version by clicking on the "?" button.<br />
<br />
<br />
=== Data Extractor ===<br />
<br />
'''Data Extractor''' prepares an annotated FastA file as required by Dimont from a genome (in FastA format, including methylated variants) and a tabular file (e.g., BED, GTF, narrowPeak,...). The regions specified in the tabular file are used to determine the center of the extracted sequences. All extracted sequences have the same length as specified by parameter &quot;Width&quot;.<br />
<br />
In case of ChIP data, the center position could for instance be the peak summit.<br />
An annotated FastA file for ChIP-seq data comprising sequences of length 100 centered around the peak summit might look like:<br />
<br />
> peak: 50; signal: 515<br />
ggccatgtgtatttttttaaatttccac...<br />
> peak: 50; signal: 199<br />
GGTCCCCTGGGAGGATGGGGACGTGCTG...<br />
...<br />
<br />
where the center is given as 50 for the first two sequences, and the confidence amounts to 515 and 199, respectively.<br />
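<br />
The annotation format shown above can be read with a few lines of code. The following is a minimal sketch, not part of MeDeMo: the function names <code>parse_annotation</code> and <code>read_annotated_fasta</code> are hypothetical, and the tags <code>peak</code> and <code>signal</code> are simply those used in the example.<br />

```python
def parse_annotation(header):
    """Parse a FastA comment like '> peak: 50; signal: 515' into a dict."""
    text = header.lstrip(">").strip()
    annotation = {}
    for item in text.split(";"):
        if ":" not in item:
            continue  # skip empty or malformed entries
        key, value = item.split(":", 1)
        annotation[key.strip()] = value.strip()
    return annotation


def read_annotated_fasta(lines, position_tag="peak", value_tag="signal"):
    """Return a list of (position, value, sequence) tuples, one per record."""
    records = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new record starts; the sequence is collected from later lines.
            records.append([parse_annotation(line), ""])
        elif line and records:
            records[-1][1] += line
    return [(float(a[position_tag]), float(a[value_tag]), seq)
            for a, seq in records]


example = [
    "> peak: 50; signal: 515",
    "ggccatgtgtatttttttaaatttccac",
    "> peak: 50; signal: 199",
    "GGTCCCCTGGGAGGATGGGGACGTGCTG",
]
for position, value, sequence in read_annotated_fasta(example):
    print(position, value, sequence)
```

Sequences are kept exactly as written in the file, including lower-case symbols.<br />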
<br />
<br />
If you experience problems using Data Extractor, please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Data Extractor'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar extract<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>Genome (The FastA containing all chromosome sequences, may be gzipped)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Peaks (The file containing the peaks in tabular format)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>Chromosome column (The column of the peaks file containing the chromosome, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Start column (The column of the peaks file containing the start position relative to the chromosome start, default = 2)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pp</font></td><br />
<td>Peak position (How the peak is specified, range={Peak center, End of peak}, default = End of peak)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;Peak center&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">cc</font></td><br />
<td>Center column (The column of the peaks file containing the peak center relative to the start position)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;End of peak&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">e</font></td><br />
<td>End column (The column of the peaks file containing the end position relative to the chromosome start, default = 3)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>Width (The fixed width of all extracted regions, valid range = [1, 10000], default = 1000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sc</font></td><br />
<td>Statistics column (The column of the peaks file containing the peak statistic or a similar measure of confidence, default = 7)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar extract g=0data/genomes/HepG2_converted_genome.unmasked.fa.gz p=0data/HepG2/NRF1_ENCFF313RFR_train1.bed outdir=1extracted/NRF1_HepG2_train1<br />
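<br />
How a fixed-width region could be derived from one line of the peaks file is sketched below. This is an illustration under stated assumptions, not the MeDeMo implementation: the function <code>extraction_window</code> is hypothetical, coordinates are assumed to be 0-based with an exclusive end, and for the selection &quot;End of peak&quot; the center is assumed to be the midpoint between start and end.<br />

```python
def extraction_window(start, end, width, center_offset=None):
    """Return (window_start, window_end) of a fixed-width region.

    Assumptions (not taken from the MeDeMo source): coordinates are
    0-based, `end` is exclusive, and the window is clipped at position 0.
    """
    if center_offset is None:
        # "End of peak" selection: assume the midpoint of start and end.
        center = (start + end) // 2
    else:
        # "Peak center" selection: the center is relative to the start.
        center = start + center_offset
    window_start = max(0, center - width // 2)
    return window_start, window_start + width


# One BED-like line using the default columns of the parameter table
# (1 = chromosome, 2 = start, 3 = end; column numbers are 1-based).
fields = "chr1\t1000\t1400".split("\t")
chrom, start, end = fields[0], int(fields[1]), int(fields[2])
print(chrom, extraction_window(start, end, width=100))
```

Every window has the same length, which matches the statement that all extracted sequences share the length given by parameter &quot;Width&quot;.<br />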
<br />
=== Methyl SlimDimont ===<br />
<br />
'''Methyl SlimDimont''' is a tool for de-novo motif discovery from DNA sequences including extended, e.g., methylation-aware alphabets.<br />
<br />
Input sequences must be supplied in an annotated FastA format as generated by the Data Extractor tool.<br />
Input sequences may also be obtained from other sources. In this case, the annotation of each sequence needs to provide a value that reflects the confidence that this sequence is bound by the factor of interest.<br />
Such confidences may be peak statistics (e.g., number of fragments under a peak) for ChIP data or signal intensities for PBM data. In addition, you need to provide an anchor position within the sequence. <br />
In case of ChIP data, this anchor position could for instance be the peak summit.<br />
An annotated FastA file for ChIP-seq data comprising sequences of length 100 centered around the peak summit could look like:<br />
<br />
> peak: 50; signal: 515<br />
ggccatgtgtatttttttaaatttccac...<br />
> peak: 50; signal: 199<br />
GGTCCCCTGGGAGGATGGGGACGTGCTG...<br />
...<br />
<br />
where the anchor point is given as 50 for the first two sequences, and the confidence amounts to 515 and 199, respectively.<br />
The FastA comment may contain additional annotations of the format <code>key1 : value1; key2: value2;...</code>.<br />
<br />
Accordingly, you would need to set the parameter &quot;Position tag&quot; to <code>peak</code> and the parameter &quot;Value tag&quot; to <code>signal</code> for the input file (default values).<br />
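The parameter &quot;Alphabet&quot; specifies the symbols of the (extended) alphabet and their complementary symbols. According to the parameter list of the tool, the default is the methylation-aware alphabet <code>ACGTMH,TGCAHM</code>; the standard DNA alphabet is specified as <code>ACGT,TGCA</code>.<br />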
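<br />
The alphabet string lists the forward symbols first and then, after the comma, their complements in the same order. A minimal sketch of how such a string maps to a complement table is given below; the function names are hypothetical and the code is not taken from MeDeMo.<br />

```python
def complement_map(alphabet_spec):
    """Turn 'ACGTMH,TGCAHM' into {'A': 'T', 'C': 'G', ..., 'H': 'M'}."""
    forward, complement = alphabet_spec.split(",")
    if len(forward) != len(complement):
        raise ValueError("both halves of the alphabet must have equal length")
    return dict(zip(forward, complement))


def reverse_complement(sequence, alphabet_spec="ACGTMH,TGCAHM"):
    """Reverse complement of an upper-case sequence over the given alphabet."""
    comp = complement_map(alphabet_spec)
    return "".join(comp[symbol] for symbol in reversed(sequence))


print(complement_map("ACGTMH,TGCAHM"))
print(reverse_complement("MG"))  # MpG on one strand reads CpH on the other
```

The sketch expects upper-case symbols; lower-case input would need to be converted first.<br />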
<br />
For the standard deviation of the position prior, the initial motif length and the number of pre-optimization runs, we provide default values that worked well in our studies on ChIP and PBM data. <br />
However, you may want to adjust these parameters to meet your prior information.<br />
<br />
The parameter &quot;Markov order of the motif model&quot; sets the order of the inhomogeneous Markov model used for modeling the motif. If this parameter is set to <code>0</code>, you obtain a position weight matrix (PWM) model. <br />
If it is set to <code>1</code>, you obtain a weight array matrix (WAM) model. You can set the order of the motif model to at most <code>3</code>.<br />
<br />
The parameter &quot;Markov order of the background model&quot; sets the order of the homogeneous Markov model used for modeling positions not covered by a motif. <br />
If this parameter is set to <code>-1</code>, you obtain a uniform distribution, which worked well for ChIP data. For PBM data, orders of up to <code>4</code> resulted in an increased prediction performance in our case studies. The maximum allowed value is <code>5</code>.<br />
<br />
The parameter &quot;Weighting factor&quot; defines the proportion of sequences that you expect to be bound by the targeted factor with high confidence. For ChIP data, the default value of <code>0.2</code> typically works well. <br />
For PBM data, containing a large number of unspecific probes, this parameter should be set to a lower value, e.g. <code>0.01</code>.<br />
<br />
The &quot;Equivalent sample size&quot; reflects the strength of the influence of the prior on the model parameters, where higher values smooth out the parameters to a greater extent.<br />
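<br />
The smoothing effect of an equivalent sample size can be illustrated with simple pseudocounts. The sketch below is an analogy only: it assumes that the equivalent sample size is spread uniformly over the symbols of a single distribution estimated from counts, whereas the prior actually used by Methyl SlimDimont may be defined differently. The function <code>smoothed_probabilities</code> is hypothetical.<br />

```python
def smoothed_probabilities(counts, ess):
    """Estimate symbol probabilities from counts plus uniform pseudocounts.

    `ess` (equivalent sample size) is distributed evenly over all symbols;
    this uniform split is an assumption made for illustration.
    """
    total = sum(counts.values()) + ess
    pseudo = ess / len(counts)
    return {symbol: (count + pseudo) / total for symbol, count in counts.items()}


counts = {"A": 8, "C": 0, "G": 2, "T": 0}
for ess in (0.0, 4.0, 40.0):
    probs = smoothed_probabilities(counts, ess)
    print(ess, {symbol: round(p, 3) for symbol, p in probs.items()})
```

With larger values the estimates move towards the uniform distribution, which is the sense in which higher values smooth out the parameters.<br />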
<br />
The parameter &quot;Delete BSs from profile&quot; defines if BSs of already discovered motifs should be deleted, i.e., &quot;blanked out&quot;, from the sequence before searching for further motifs.<br />
<br />
You can also install this web-application within your local Galaxy server. Instructions can be found at the Dimont page of Jstacs. <br />
There you can also download a command line version of Dimont.<br />
<br />
If you experience problems using Methyl SlimDimont, please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Methyl SlimDimont'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar slimdimont<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>Alphabet (Characters of the alphabet as a string of unseparated characters, first listing the symbols in forward orientation and then their complement in the same order. For instance, a methylation-aware alphabet would be specified as ACGTMH,TGCAHM and a standard DNA alphabet as ACGT,TGCA, default = ACGTMH,TGCAHM)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>Input file (The file name of the file containing the input sequences in annotated FastA format as generated by the Data Extractor tool)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Background sample (Background sample containing negative examples, may be di-nucleotide shuffled input sequences, range={background file, shuffled input}, default = shuffled input)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;background file&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bf</font></td><br />
<td>Background file (The file name of the file containing background sequences in annotated FastA format., OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;shuffled input&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Position tag (The tag for the position information in the FastA-annotation of the input file, default = peak)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>Value tag (The tag for the value information in the FastA-annotation of the input file, default = signal)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Standard deviation (The standard deviation of the position distribution centered at the position specified by the position tag, valid range = [1.0, 10000.0], default = 75.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>Weighting factor (The value for weighting the data, between 0 and 1, valid range = [0.0, 1.0], default = 0.2)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Starts</font></td><br />
<td>Starts (The number of pre-optimization runs., valid range = [1, 100], default = 20)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">imw</font></td><br />
<td>Initial motif width (The motif width that is used initially, may be adjusted during optimization., valid range = [1, 50], default = 20)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Model type (The type of the motif model; a PWM model corresponds to a Markov model of order 0., range={LSlim model, Markov model}, default = LSlim model)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;LSlim model&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">md</font></td><br />
<td>Maximum distance (The maximum distance considered in the LSlim model, valid range = [1, 2147483647], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;Markov model&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">o</font></td><br />
<td>Order (The order of the Markov model, valid range = [0, 5], default = 0)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">moobm</font></td><br />
<td>Markov order of background model (The Markov order of the model for the background sequence and the background sequence, -1 defines uniform distribution., valid range = [-1, 5], default = -1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">e</font></td><br />
<td>Equivalent sample size (Reflects the strength of the prior on the model parameters., valid range = [0.0, Infinity], default = 4.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>Delete BSs from profile (A switch for deleting binding site positions of discovered motifs from the profile before searching for further motifs., default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">afs</font></td><br />
<td>Adjust for shifts (Adjust for shifts of the motif., default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>INT</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar slimdimont i=1extracted/NRF1_HepG2_train1/Extracted_sequences.fasta m="Markov model" outdir=2train threads=8<br />
<br />
=== Sequence Scoring ===<br />
<br />
'''Sequence Scoring''' scans a set of input sequences (e.g., sequences under ChIP-seq peaks) for a given motif model (provided as XML as output by &quot;Methyl SlimDimont&quot;) and provides, per sequence, i) the start position and strand of the best motif match, ii) the corresponding maximum score, iii) the log-sum occupancy score, iv) the matching sequence, and v) the ID (FastA header) of the sequence.<br />
<br />
The main purpose of this tool is to determine per-sequence scores for classification, for instance, distinguishing bound from unbound sequences.<br />
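<br />
The two per-sequence scores can be sketched for the simplest case of a PWM. The code below is an illustration, not the MeDeMo implementation: it assumes a log-probability PWM over the standard DNA alphabet, scans both strands, and takes the log-sum occupancy score to be the logarithm of the summed exponentiated scores of all positions and strands. All names are hypothetical.<br />

```python
import math

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    return "".join(COMPLEMENT[s] for s in reversed(seq))


def window_score(pwm, window):
    """Sum of per-position log-scores; pwm is a list of dicts, one per column."""
    return sum(column[symbol] for column, symbol in zip(pwm, window))


def score_sequence(pwm, seq):
    """Return ((max_score, start, strand, site), log_sum_occupancy)."""
    width = len(pwm)
    if len(seq) < width:
        raise ValueError("sequence is shorter than the motif")
    best, scores = None, []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        for strand, site in (("+", window), ("-", reverse_complement(window))):
            score = window_score(pwm, site)
            scores.append(score)
            if best is None or score > best[0]:
                best = (score, start, strand, site)
    # Numerically stable log-sum-exp over all positions and both strands.
    top = max(scores)
    log_sum = top + math.log(sum(math.exp(s - top) for s in scores))
    return best, log_sum


# Toy PWM of width 2 that prefers "AC".
pwm = [
    {s: math.log(0.7 if s == "A" else 0.1) for s in "ACGT"},
    {s: math.log(0.7 if s == "C" else 0.1) for s in "ACGT"},
]
best, log_sum = score_sequence(pwm, "TTACG")
print(best, log_sum)
```

The maximum score reflects only the single best match, while the log-sum score also rewards sequences with several weaker matches.<br />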
<br />
If you experience problems using Sequence Scoring, please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Sequence Scoring'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar score<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>Input sequences (Input sequences in FastA format)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Model (Model XML as output by Methyl SlimDimont)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar score i=1extracted/NRF1_GM12878_test1/Extracted_sequences.fasta m=2train/Motif_1/SlimDimont_1.xml outdir=3score/NRF1_GM12878<br />
<br />
=== Evaluate Scoring ===<br />
<br />
'''Evaluate Scoring''' computes the area under the ROC curve and under the precision-recall curve based on the scoring of a positive and a negative set of sequences. Optionally, the curves may also be drawn.<br />
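<br />
Both performance measures can be computed from two lists of scores. The following is a minimal sketch with hypothetical function names; the area under the precision-recall curve is computed step-wise (average precision), which is an assumption and may differ from the interpolation used by the tool.<br />

```python
def auc_roc(pos, neg):
    """Probability that a positive scores higher than a negative (ties count 0.5)."""
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def auc_pr(pos, neg):
    """Step-wise area under the precision-recall curve (average precision)."""
    labelled = sorted([(s, 1) for s in pos] + [(s, 0) for s in neg],
                      key=lambda item: -item[0])
    tp = fp = 0
    area = 0.0
    i = 0
    while i < len(labelled):
        # Treat all sequences sharing one score as a single threshold step.
        j, tp_step = i, 0
        while j < len(labelled) and labelled[j][0] == labelled[i][0]:
            tp_step += labelled[j][1]
            j += 1
        tp += tp_step
        fp += (j - i) - tp_step
        precision = tp / (tp + fp)
        area += precision * tp_step / len(pos)  # recall gain times precision
        i = j
    return area


positives = [0.9, 0.8, 0.4]
negatives = [0.7, 0.3, 0.2]
print(auc_roc(positives, negatives), auc_pr(positives, negatives))
```

The pairwise ROC computation is quadratic in the number of sequences and is meant for illustration only.<br />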
<br />
''Evaluate Scoring'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar eval<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Positives (Output of "Sequence Scoring" for positive test sequences.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Negatives (Output of "Sequence Scoring" for negative test sequences.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>Curves (Also compute and draw curves, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>Use sum-occupancy (Use log-sum occupancy score instead of maximum, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar eval p=3score/NRF1_GM12878/Predictions.tsv n=3score/negatives/Predictions.tsv c=true outdir=4eval<br />
<br />
=== Motif scores ===<br />
<br />
'''Motif scores''' computes features based on motif scores of a given motif model scanning sub-sequences along the genome. Motif scores are aggregated in bins of the specified width as maximum score and log of the average exponential score (i.e., average log-likelihood in case of statistical models). The motif model may be provided as PWMs in HOCOMOCO or PFMs in Jaspar format, or as Dimont motif models in XML format. For more complex motif models like Slim models, the current implementation uses several indexes to speed-up the scanning process. However, computation of these indexes is rather memory-consuming and often not reasonable for simple PWM models. Hence, a low-memory variant of the tool is available, which is typically only slightly slower for PWM models but substantially slower for Slim models. Output is provided as a gzipped file ''Motif_scores.tsv.gz'' containing columns chromosome, start position, maximum and average score. This output file together with a protocol of the tool run is saved to the specified output directory.<br />
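<br />
The two aggregates per bin can be sketched as follows. This is an illustration under assumptions, not the MeDeMo code: sub-sequences are assigned to bins by their start position, strands are ignored, the last bin may be shorter than the bin width, and <code>aggregate_bins</code> is a hypothetical name.<br />

```python
import math


def aggregate_bins(scores, bin_width):
    """Return one (bin_start, max_score, log_mean_exp) tuple per bin.

    scores[i] is the log-score of the sub-sequence starting at position i.
    """
    rows = []
    for bin_start in range(0, len(scores), bin_width):
        chunk = scores[bin_start:bin_start + bin_width]
        top = max(chunk)
        # Log of the average exponential score, computed stably.
        mean_exp = sum(math.exp(s - top) for s in chunk) / len(chunk)
        rows.append((bin_start, top, top + math.log(mean_exp)))
    return rows


scores = [math.log(p) for p in (0.1, 0.3, 0.2, 0.4)]
for row in aggregate_bins(scores, bin_width=2):
    print(row)
```

For log-likelihood scores, the second aggregate equals the logarithm of the mean likelihood within the bin.<br />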
<br />
''Motif scores'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar motif<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Motif model (The motif model in Dimont, HOCOMOCO, or Jaspar format, range={Dimont, HOCOMOCO, Jaspar}, default = Dimont)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;Dimont&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>Dimont motif (Dimont motif model description)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;HOCOMOCO&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">h</font></td><br />
<td>HOCOMOCO PWM (PWM from the HOCOMOCO database)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;Jaspar&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">j</font></td><br />
<td>Jaspar PFM (PFM in Jaspar format)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>Genome (Genome as FastA file)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>FAI of genome (FastA index file of the genome)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bin width (The width of the genomic bins considered)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">l</font></td><br />
<td>Low-memory mode (Use slower mode with a smaller memory footprint, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>INT</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar motif d=2train/Motif_1/SlimDimont_1.xml g=0data/genomes/HepG2_converted_genome.unmasked.fa.gz f=0data/genomes/HepG2_converted_genome.unmasked.fa.fai outdir=7scores b=50<br />
<br />
=== Quick Prediction Tool ===<br />
<br />
'''Quick Prediction Tool''' predicts binding sites of a transcription factor based on a motif model and is also suited for genome-wide predictions. The motif model is provided as the XML output of (Slim) Dimont. <br />
<br />
The tool outputs a list of predictions including, for every prediction, the ID of the sequence (e.g., chromosome) containing the binding site, position and strand of the matching sub-sequence, its score according to the model, the sub-sequence itself (in strand orientation according to the model), and a p-value from a normal distribution fitted to the score distribution of the provided negative examples or a sub-sample of the input data (parameter &quot;Background sample&quot;).<br />
<br />
If you experience problems using Quick Prediction Tool, please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
''Quick Prediction Tool'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar quickpred<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>Dimont model (The model returned by Dimont (in XML format))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Sequences (The sequences (e.g., a genome) to scan for binding sites)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Background sample (The sequences for determining the prediction threshold. Either a sub-sample of the input sequences or a dedicated background data set., range={sub-sample, background sequences}, default = sub-sample)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;sub-sample&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;background sequences&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bs</font></td><br />
<td>Background sequences (The sequences (e.g., a genome) for determining the prediction threshold)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>Threshold specification (The way of defining the prediction threshold. Either by explicitly defining a significance level or by specifying the number of expected sites, range={significance level, number of sites}, default = significance level)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;significance level&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sl</font></td><br />
<td>Significance level (The significance level for determining the prediction threshold, valid range = [0.0, 1.0E-4], default = 1.0E-6)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;number of sites&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Number of sites (The number of expected binding sites for determining the prediction threshold, valid range = [1, 1000000], default = 10000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar quickpred<br />
<br />
<br />
=== Methylation Sensitivity ===<br />
<br />
'''Methylation Sensitivity''' determines average methylation sensitivity profiles for CpG dinucleotides converted to MpG, CpH, and MpH. As input, it needs a model XML as generated by &quot;Methyl SlimDimont&quot;, and a prediction file as output from the corresponding training run.<br />
<br />
Optionally, Methylation Sensitivity also generates per-sequence methylation sensitivity profiles for the MpH context.<br />
<br />
If you experience problems using Methylation Sensitivity, please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Methylation Sensitivity'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar msens<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Model (The XML file containing the model)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Predictions (The file containing the predictions from the training run)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>Verbose (Output MpH sensitivity profile for every input sequence, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar msens</div>Grauhttps://www.jstacs.de/index.php?title=MeDeMo&diff=1081MeDeMo2020-03-19T13:10:17Z<p>Grau: /* Evaluate Scoring */</p>
<hr />
<div>Accurate models describing the binding specificity of transcription factors (TFs) are essential for a better understanding of transcriptional regulation. Aside from chromatin accessibility and sequence specificity, several studies suggested that DNA methylation influences TF binding in both activating and repressive ways. However, currently available TF motif inference and TF binding site prediction approaches do not adequately incorporate DNA methylation.<br />
<br />
We present MeDeMo (Methylation and Dependencies in Motifs), a novel framework for TF motif discovery and TFBS prediction that incorporates DNA methylation by extending [[Slim]] models. We show that dependencies between nucleotides, as captured by MeDeMo, are essential to represent DNA methylation and that MeDeMo achieves superior prediction performance compared to related approaches. The inferred TF motifs are highly interpretable and can provide new insights into the relation between DNA methylation and TF binding.<br />
<br />
<br />
== Download ==<br />
<br />
MeDeMo is available as<br />
* [http://www.jstacs.de/downloads/MeDeMo-1.0.jar command line interface] version and<br />
* graphical user interface version: [http://www.jstacs.de/downloads/MeDeMoGUI-1.0.jar JAR file] (requires Java >= 1.8 and JavaFX), [http://www.jstacs.de/downloads/MeDeMoGUI-1.0.exe Windows installer], [http://www.jstacs.de/downloads/MeDeMoGUI-1.0.dmg Mac DMG].<br />
<br />
Source code is available from the [https://github.com/Jstacs/Jstacs Jstacs github page] in package <code>projects.methyl</code>.<br />
<br />
<br />
== Tools ==<br />
<br />
The description of tools and tool parameters refers to the command line version, but the same parameters are also present in the GUI version. Additional help may be requested in the GUI version by clicking on the "?" button.<br />
<br />
<br />
=== Data Extractor ===<br />
<br />
'''Data Extractor''' prepares an annotated FastA file as required by Dimont from a genome (in FastA format, including methylated variants) and a tabular file (e.g., BED, GTF, narrowPeak,...). The regions specified in the tabular file are used to determine the center of the extracted sequences. All extracted sequences have the same length as specified by parameter &quot;Width&quot;.<br />
<br />
In case of ChIP data, the center position could for instance be the peak summit.<br />
An annotated FastA file for ChIP-seq data comprising sequences of length 100 centered around the peak summit might look like:<br />
<br />
> peak: 50; signal: 515<br />
ggccatgtgtatttttttaaatttccac...<br />
> peak: 50; signal: 199<br />
GGTCCCCTGGGAGGATGGGGACGTGCTG...<br />
...<br />
<br />
where the center is given as 50 for the first two sequences, and the confidence amounts to 515 and 199, respectively.<br />
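
The extraction step described above can be illustrated with a minimal Python sketch. It is not the code used by Data Extractor: the function names are hypothetical, and it assumes that the peak center is the midpoint of the start and end columns and that the window is placed so that the center falls at offset <code>width // 2</code>. The rounding and boundary handling of the actual tool may differ.

```python
def centered_window(start, end, width):
    """Return (window_start, window_end, anchor) for a fixed-width window.

    Assumption (not taken from the tool): the peak center is the midpoint
    of [start, end), and the window is clamped at the chromosome start.
    """
    center = (start + end) // 2
    window_start = max(0, center - width // 2)
    window_end = window_start + width
    # Position of the peak center relative to the extracted sequence.
    anchor = center - window_start
    return window_start, window_end, anchor


def annotated_header(anchor, signal):
    """Format a FastA comment line in the 'peak: ...; signal: ...' style."""
    return f"> peak: {anchor}; signal: {signal}"


if __name__ == "__main__":
    ws, we, anchor = centered_window(1000, 1200, 100)
    print(ws, we, annotated_header(anchor, 515))
```

For a peak spanning positions 1000 to 1200 and a width of 100, the sketch yields the window 1050 to 1150 with the anchor at 50, matching the header style shown in the example file.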
<br />
<br />
If you experience problems using Data Extractor, please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Data Extractor'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar extract<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>Genome (The FastA containing all chromosome sequences, may be gzipped)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Peaks (The file containing the peaks in tabular format)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>Chromosome column (The column of the peaks file containing the chromosome, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Start column (The column of the peaks file containing the start position relative to the chromosome start, default = 2)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pp</font></td><br />
<td>Peak position (How the peak position is specified, range={Peak center, End of peak}, default = End of peak)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;Peak center&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">cc</font></td><br />
<td>Center column (The column of the peaks file containing the peak center relative to the start position)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;End of peak&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">e</font></td><br />
<td>End column (The column of the peaks file containing the end position relative to the chromosome start, default = 3)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>Width (The fixed width of all extracted regions, valid range = [1, 10000], default = 1000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sc</font></td><br />
<td>Statistics column (The column of the peaks file containing the peak statistic or a similar measure of confidence, default = 7)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar extract g=0data/genomes/HepG2_converted_genome.unmasked.fa.gz p=0data/HepG2/NRF1_ENCFF313RFR_train1.bed outdir=1extracted/NRF1_HepG2_train1<br />
<br />
=== Methyl SlimDimont ===<br />
<br />
'''Methyl SlimDimont''' is a tool for de-novo motif discovery from DNA sequences including extended, e.g., methylation-aware alphabets.<br />
<br />
Input sequences must be supplied in an annotated FastA format as generated by the Data Extractor tool.<br />
Input sequences may also be obtained from other sources. In this case, the annotation of each sequence needs to provide a value that reflects the confidence that this sequence is bound by the factor of interest.<br />
Such confidences may be peak statistics (e.g., number of fragments under a peak) for ChIP data or signal intensities for PBM data. In addition, you need to provide an anchor position within the sequence. <br />
In case of ChIP data, this anchor position could for instance be the peak summit.<br />
An annotated FastA file for ChIP-seq data comprising sequences of length 100 centered around the peak summit could look like:<br />
<br />
> peak: 50; signal: 515<br />
ggccatgtgtatttttttaaatttccac...<br />
> peak: 50; signal: 199<br />
GGTCCCCTGGGAGGATGGGGACGTGCTG...<br />
...<br />
<br />
where the anchor point is given as 50 for the first two sequences, and the confidence amounts to 515 and 199, respectively.<br />
The FastA comment may contain additional annotations of the format <code>key1 : value1; key2: value2;...</code>.<br />
<br />
Accordingly, you would need to set the parameter &quot;Position tag&quot; to <code>peak</code> and the parameter &quot;Value tag&quot; to <code>signal</code> for the input file (default values).<br />
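
To make the annotation format concrete, the following Python sketch splits a FastA comment line of the form <code>key1: value1; key2: value2</code> into its fields. It is only an illustration under the assumption that keys and values contain neither <code>;</code> nor <code>:</code>; the function name is hypothetical and the parser used by Methyl SlimDimont may behave differently.

```python
def parse_annotation(header):
    """Split an annotated FastA comment line into a key/value dictionary.

    Assumption: fields are separated by ';' and each field has the form
    'key: value'. Values are returned as strings.
    """
    text = header.lstrip(">").strip()
    fields = {}
    for part in text.split(";"):
        if ":" not in part:
            continue  # skip empty trailing fields
        key, value = part.split(":", 1)
        fields[key.strip()] = value.strip()
    return fields


if __name__ == "__main__":
    fields = parse_annotation("> peak: 50; signal: 515")
    # With the default tags, the anchor and the confidence would be read as:
    print(int(fields["peak"]), float(fields["signal"]))
```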
The parameter Alphabet specifies the symbols of the (extended) alphabet and their complementary symbols. Default is standard DNA alphabet.<br />
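
How such an alphabet specification pairs symbols with their complements can be sketched in Python as follows. The function names are hypothetical and the sketch assumes upper-case sequences; it only illustrates the pairing implied by a specification such as <code>ACGTMH,TGCAHM</code> and is not taken from the MeDeMo source code.

```python
def complement_map(alphabet_spec):
    """Build a symbol -> complement mapping from a spec like 'ACGTMH,TGCAHM'."""
    forward, complement = alphabet_spec.split(",")
    if len(forward) != len(complement):
        raise ValueError("both halves of the alphabet must have equal length")
    return dict(zip(forward, complement))


def reverse_complement(seq, alphabet_spec="ACGTMH,TGCAHM"):
    """Reverse complement of an upper-case sequence over the given alphabet."""
    comp = complement_map(alphabet_spec)
    return "".join(comp[symbol] for symbol in reversed(seq))


if __name__ == "__main__":
    # M is complemented by H and vice versa, as given in the specification.
    print(reverse_complement("ACMG"))
```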
<br />
For the standard deviation of the position prior, the initial motif length and the number of pre-optimization runs, we provide default values that worked well in our studies on ChIP and PBM data. <br />
However, you may want to adjust these parameters to meet your prior information.<br />
<br />
The parameter &quot;Markov order of the motif model&quot; sets the order of the inhomogeneous Markov model used for modeling the motif. If this parameter is set to <code>0</code>, you obtain a position weight matrix (PWM) model. <br />
If it is set to <code>1</code>, you obtain a weight array model (WAM). You can set the order of the motif model to at most <code>3</code>.<br />
<br />
The parameter &quot;Markov order of the background model&quot; sets the order of the homogeneous Markov model used for modeling positions not covered by a motif. <br />
If this parameter is set to <code>-1</code>, you obtain a uniform distribution, which worked well for ChIP data. For PBM data, orders of up to <code>4</code> resulted in an increased prediction performance in our case studies. The maximum allowed value is <code>5</code>.<br />
<br />
The parameter &quot;Weighting factor&quot; defines the proportion of sequences that you expect to be bound by the targeted factor with high confidence. For ChIP data, the default value of <code>0.2</code> typically works well. <br />
For PBM data, containing a large number of unspecific probes, this parameter should be set to a lower value, e.g. <code>0.01</code>.<br />
<br />
The &quot;Equivalent sample size&quot; reflects the strength of the influence of the prior on the model parameters, where higher values smooth out the parameters to a greater extent.<br />
<br />
The parameter &quot;Delete BSs from profile&quot; defines if BSs of already discovered motifs should be deleted, i.e., &quot;blanked out&quot;, from the sequence before searching for further motifs.<br />
<br />
You can also install this web-application within your local Galaxy server. Instructions can be found at the [[Dimont]] page of Jstacs. <br />
There you can also download a command line version of Dimont.<br />
<br />
If you experience problems using Methyl SlimDimont, please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Methyl SlimDimont'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar slimdimont<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>Alphabet (Characters of the alphabet as a string of unseparated characters, first listing the symbols in forward orientation and then their complement in the same order. For instance, a methylation-aware alphabet would be specified as ACGTMH,TGCAHM and a standard DNA alphabet as ACGT,TGCA, default = ACGTMH,TGCAHM)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>Input file (The file name of the file containing the input sequences in annotated FastA format as generated by the Data Extractor tool)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Background sample (Background sample containing negative examples, may be di-nucleotide shuffled input sequences, range={background file, shuffled input}, default = shuffled input)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;background file&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bf</font></td><br />
<td>Background file (The file name of the file containing background sequences in annotated FastA format., OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;shuffled input&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Position tag (The tag for the position information in the FastA-annotation of the input file, default = peak)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>Value tag (The tag for the value information in the FastA-annotation of the input file, default = signal)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Standard deviation (The standard deviation of the position distribution centered at the position specified by the position tag, valid range = [1.0, 10000.0], default = 75.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>Weighting factor (The value for weighting the data, between 0 and 1, valid range = [0.0, 1.0], default = 0.2)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Starts</font></td><br />
<td>Starts (The number of pre-optimization runs., valid range = [1, 100], default = 20)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">imw</font></td><br />
<td>Initial motif width (The motif width that is used initially, may be adjusted during optimization., valid range = [1, 50], default = 20)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Model type (The type of the motif model; a PWM model corresponds to a Markov model of order 0., range={LSlim model, Markov model}, default = LSlim model)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;LSlim model&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">md</font></td><br />
<td>Maximum distance (The maximum distance considered in the LSlim model, valid range = [1, 2147483647], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;Markov model&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">o</font></td><br />
<td>Order (The order of the Markov model, valid range = [0, 5], default = 0)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">moobm</font></td><br />
<td>Markov order of background model (The Markov order of the model for the background sequence, -1 defines a uniform distribution., valid range = [-1, 5], default = -1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">e</font></td><br />
<td>Equivalent sample size (Reflects the strength of the prior on the model parameters., valid range = [0.0, Infinity], default = 4.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>Delete BSs from profile (A switch for deleting binding site positions of discovered motifs from the profile before searching for further motifs., default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">afs</font></td><br />
<td>Adjust for shifts (Adjust for shifts of the motif., default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>INT</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar slimdimont i=1extracted/NRF1_HepG2_train1/Extracted_sequences.fasta m="Markov model" outdir=2train threads=8<br />
<br />
=== Sequence Scoring ===<br />
<br />
'''Sequence Scoring''' scans a set of input sequences (e.g., sequences under ChIP-seq peaks) for a given motif model (provided as XML as output by &quot;Methyl SlimDimont&quot;) and provides, for each sequence, i) the start position and strand of the best motif match, ii) the corresponding maximum score, iii) the log-sum occupancy score, iv) the matching sequence, and v) the ID (FastA header) of the sequence.<br />
<br />
The purpose of this tool mainly is to determine per-sequence scores for classification, for instance, distinguishing bound from unbound sequences.<br />
<br />
If you experience problems using Sequence Scoring, please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Sequence Scoring'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar score<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>Input sequences (Input sequences in FastA format)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Model (Model XML as output by Methyl SlimDimont)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar score i=1extracted/NRF1_GM12878_test1/Extracted_sequences.fasta m=2train/Motif_1/SlimDimont_1.xml outdir=3score/NRF1_GM12878<br />
<br />
=== Evaluate Scoring ===<br />
<br />
'''Evaluate Scoring''' computes the area under the ROC curve and under the precision-recall curve based on the scoring of a positive and a negative set of sequences. Optionally, the curves may also be drawn.<br />
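
As an illustration of the first of these measures, the area under the ROC curve equals the probability that a randomly chosen positive sequence scores higher than a randomly chosen negative one, with ties counted as one half. The Python sketch below computes it by direct pairwise comparison. The function name is hypothetical, and this is a textbook formulation, not the implementation used by Evaluate Scoring.

```python
def auc_roc(positive_scores, negative_scores):
    """Area under the ROC curve via pairwise comparison of scores.

    Counts the fraction of (positive, negative) pairs in which the positive
    scores higher; ties contribute 0.5. Quadratic in the number of scores,
    which is acceptable for a small illustration.
    """
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


if __name__ == "__main__":
    print(auc_roc([1.0, 3.0], [2.0, 0.0]))
```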
<br />
''Evaluate Scoring'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar eval<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Positives (Output of "Sequence Scoring" for positive test sequences.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Negatives (Output of "Sequence Scoring" for negative test sequences.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>Curves (Also compute and draw curves, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>Use sum-occupancy (Use log-sum occupancy score instead of maximum, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar eval p=3score/NRF1_GM12878/Predictions.tsv n=3score/negatives/Predictions.tsv c=true outdir=4eval<br />
<br />
=== Motif scores ===<br />
<br />
'''Motif scores''' computes features based on motif scores of a given motif model scanning sub-sequences along the genome. Motif scores are aggregated in bins of the specified width as maximum score and log of the average exponential score (i.e., average log-likelihood in case of statistical models). The motif model may be provided as PWMs in HOCOMOCO or PFMs in Jaspar format, or as Dimont motif models in XML format. For more complex motif models like Slim models, the current implementation uses several indexes to speed up the scanning process. However, computation of these indexes is rather memory-consuming and often not reasonable for simple PWM models. Hence, a low-memory variant of the tool is available, which is typically only slightly slower for PWM models but substantially slower for Slim models. Output is provided as a gzipped file ''Motif_scores.tsv.gz'' containing columns chromosome, start position, maximum and average score. This output file together with a protocol of the tool run is saved to the specified output directory.<br />
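
The two per-bin aggregates can be sketched in Python as follows. This is an illustration only: the function name is hypothetical, and it assumes one score per start position, assigned to the bin containing that position. How the tool treats strands and matches that cross bin boundaries is not described here.

```python
import math


def aggregate_bins(scores, bin_width):
    """Aggregate per-position scores into (bin_start, max, log_avg_exp) tuples.

    log_avg_exp is log(mean(exp(score))) over the bin, computed with the
    usual max-shift to avoid overflow for large scores.
    """
    result = []
    for start in range(0, len(scores), bin_width):
        chunk = scores[start:start + bin_width]
        maximum = max(chunk)
        shifted_sum = sum(math.exp(s - maximum) for s in chunk)
        log_avg_exp = maximum + math.log(shifted_sum / len(chunk))
        result.append((start, maximum, log_avg_exp))
    return result


if __name__ == "__main__":
    for row in aggregate_bins([-3.2, -1.5, -7.0, -0.4, -2.2, -9.1], 3):
        print(row)
```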
<br />
''Motif scores'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar motif<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Motif model (The motif model in Dimont, HOCOMOCO, or Jaspar format, range={Dimont, HOCOMOCO, Jaspar}, default = Dimont)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;Dimont&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>Dimont motif (Dimont motif model description)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;HOCOMOCO&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">h</font></td><br />
<td>HOCOMOCO PWM (PWM from the HOCOMOCO database)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;Jaspar&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">j</font></td><br />
<td>Jaspar PFM (PFM in Jaspar format)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>Genome (Genome as FastA file)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>FAI of genome (FastA index file of the genome)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bin width (The width of the genomic bins considered)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">l</font></td><br />
<td>Low-memory mode (Use slower mode with a smaller memory footprint, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>INT</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar motif<br />
<br />
<br />
=== Quick Prediction Tool ===<br />
<br />
'''Quick Prediction Tool''' predicts binding sites of a transcription factor based on a motif model and is also suited for genome-wide predictions. The motif model is provided as the XML output of (Slim) Dimont. <br />
<br />
The tool outputs a list of predictions including, for every prediction, the ID of the sequence (e.g., chromosome) containing the binding site, position and strand of the matching sub-sequence, its score according to the model, the sub-sequence itself (in strand orientation according to the model), and a p-value from a normal distribution fitted to the score distribution of the provided negative examples or a sub-sample of the input data (parameter &quot;Background sample&quot;).<br />
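
The idea of deriving p-values and a score threshold from a normal distribution fitted to background scores can be sketched in Python with the standard library. The function names are hypothetical, and the sketch assumes a one-sided upper-tail p-value with mean and standard deviation estimated from the background sample; the fitting procedure of Quick Prediction Tool itself may differ.

```python
from statistics import NormalDist


def normal_p_value(score, background_scores):
    """Upper-tail p-value of a score under a normal fit to background scores."""
    dist = NormalDist.from_samples(background_scores)
    return 1.0 - dist.cdf(score)


def threshold_for_level(background_scores, significance_level):
    """Score above which the fitted normal has the given upper-tail mass."""
    dist = NormalDist.from_samples(background_scores)
    return dist.inv_cdf(1.0 - significance_level)


if __name__ == "__main__":
    background = [0.0, 1.0, 2.0, 3.0, 4.0]
    print(normal_p_value(3.5, background))
    print(threshold_for_level(background, 0.01))
```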
<br />
If you experience problems using Quick Prediction Tool, please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
''Quick Prediction Tool'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar quickpred<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>Dimont model (The model returned by Dimont (in XML format))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Sequences (The sequences (e.g., a genome) to scan for binding sites)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Background sample (The sequences for determining the prediction threshold. Either a sub-sample of the input sequences or a dedicated background data set., range={sub-sample, background sequences}, default = sub-sample)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;sub-sample&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;background sequences&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bs</font></td><br />
<td>Background sequences (The sequences (e.g., a genome) for determining the prediction threshold)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>Threshold specification (The way of defining the prediction threshold. Either by explicitly defining a significance level or by specifying the number of expected sites, range={significance level, number of sites}, default = significance level)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;significance level&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sl</font></td><br />
<td>Significance level (The significance level for determining the prediction threshold, valid range = [0.0, 1.0E-4], default = 1.0E-6)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;number of sites&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Number of sites (The number of expected binding sites for determining the prediction threshold, valid range = [1, 1000000], default = 10000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar quickpred<br />
<br />
<br />
=== Methylation Sensitivity ===<br />
<br />
'''Methylation Sensitivity''' determines average methylation sensitivity profiles for CpG dinucleotides converted to MpG, CpH, and MpH. As input, it needs a model XML as generated by &quot;Methyl SlimDimont&quot;, and a prediction file as output from the corresponding training run.<br />
<br />
Optionally, Methylation Sensitivity also generates per-sequence methylation sensitivity profiles for the MpH context.<br />
<br />
If you experience problems using Methylation Sensitivity, please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Methylation Sensitivity'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar msens<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Model (The XML file containing the model)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Predictions (The file containing the predictions from the training run)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>Verbose (Output MpH sensitivity profile for every input sequence, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar msens</div>Grauhttps://www.jstacs.de/index.php?title=MeDeMo&diff=1080MeDeMo2020-03-19T13:09:47Z<p>Grau: /* Sequence Scoring */</p>
<hr />
<div>Accurate models describing the binding specificity of transcription factors (TFs) are essential for a better understanding of transcriptional regulation. Aside from chromatin accessibility and sequence specificity, several studies suggested that DNA methylation influences TF binding in both activating and repressive ways. However, currently available TF motif inference and TF binding site prediction approaches do not adequately incorporate DNA methylation.<br />
<br />
We present MeDeMo (Methylation and Dependencies in Motifs), a novel framework for TF motif discovery and TFBS prediction that incorporates DNA methylation by extending [[Slim]] models. We show that dependencies between nucleotides, captured by MeDeMo, are essential to represent DNA methylation and that MeDeMo achieves superior prediction performance compared to related approaches. The inferred TF motifs are highly interpretable and can provide new insights into the relation between DNA methylation and TF binding.<br />
<br />
<br />
== Download ==<br />
<br />
MeDeMo is available as<br />
* [http://www.jstacs.de/downloads/MeDeMo-1.0.jar command line interface] version and<br />
* graphical user interface version: [http://www.jstacs.de/downloads/MeDeMoGUI-1.0.jar JAR file] (requires Java >= 1.8 and JavaFX), [http://www.jstacs.de/downloads/MeDeMoGUI-1.0.exe Windows installer], [http://www.jstacs.de/downloads/MeDeMoGUI-1.0.dmg Mac DMG].<br />
<br />
Source code is available from the [https://github.com/Jstacs/Jstacs Jstacs github page] in package <code>projects.methyl</code>.<br />
<br />
<br />
== Tools ==<br />
<br />
The description of tools and tool parameters refers to the command line version, but the same parameters are also present in the GUI version. Additional help may be requested in the GUI version by clicking on the "?" button.<br />
<br />
<br />
=== Data Extractor ===<br />
<br />
'''Data Extractor''' prepares an annotated FastA file as required by Dimont from a genome (in FastA format, including methylated variants) and a tabular file (e.g., BED, GTF, narrowPeak,...). The regions specified in the tabular file are used to determine the center of the extracted sequences. All extracted sequences have the same length as specified by parameter &quot;Width&quot;.<br />
<br />
In case of ChIP data, the center position could for instance be the peak summit.<br />
An annotated FastA file for ChIP-seq data comprising sequences of length 100 centered around the peak summit might look like:<br />
<br />
> peak: 50; signal: 515<br />
ggccatgtgtatttttttaaatttccac...<br />
> peak: 50; signal: 199<br />
GGTCCCCTGGGAGGATGGGGACGTGCTG...<br />
...<br />
<br />
where the center is given as 50 for the first two sequences, and the confidence amounts to 515 and 199, respectively.<br />
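
How a fixed-width region around a peak might be derived can be sketched as follows. This is only an illustration and not the Data Extractor implementation: the function name <code>centered_window</code>, the use of the midpoint between start and end as the center, and the 0-based, half-open coordinates are all assumptions.

```python
def centered_window(start, end, width):
    """Return (window_start, window_end) of a fixed-width window
    centered at the midpoint of the region [start, end).

    Assumptions (hypothetical, not taken from the tool): coordinates
    are 0-based and half-open, and chromosome boundaries are ignored.
    """
    center = (start + end) // 2
    window_start = center - width // 2
    return window_start, window_start + width


# A region spanning 1000..1400, extracted with the documented default width of 1000
window = centered_window(1000, 1400, 1000)
print(window)
```

Every window has exactly the requested width regardless of the length of the original region, which matches the statement that all extracted sequences have the same length.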
<br />
<br />
If you experience problems using Data Extractor, please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Data Extractor'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar extract<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>Genome (The FastA containing all chromosome sequences, may be gzipped)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Peaks (The file containing the peaks in tabular format)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>Chromosome column (The column of the peaks file containing the chromosome, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Start column (The column of the peaks file containing the start position relative to the chromosome start, default = 2)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pp</font></td><br />
<td>Peak position (How the peak position is specified, range={Peak center, End of peak}, default = End of peak)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;Peak center&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">cc</font></td><br />
<td>Center column (The column of the peaks file containing the peak center relative to the start position)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;End of peak&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">e</font></td><br />
<td>End column (The column of the peaks file containing the end position relative to the chromosome start, default = 3)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>Width (The fixed width of all extracted regions, valid range = [1, 10000], default = 1000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sc</font></td><br />
<td>Statistics column (The column of the peaks file containing the peak statistic or a similar measure of confidence, default = 7)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar extract g=0data/genomes/HepG2_converted_genome.unmasked.fa.gz p=0data/HepG2/NRF1_ENCFF313RFR_train1.bed outdir=1extracted/NRF1_HepG2_train1<br />
<br />
=== Methyl SlimDimont ===<br />
<br />
'''Methyl SlimDimont''' is a tool for de-novo motif discovery from DNA sequences including extended, e.g., methylation-aware alphabets.<br />
<br />
Input sequences must be supplied in an annotated FastA format as generated by the Data Extractor tool.<br />
Input sequences may also be obtained from other sources. In this case, the annotation of each sequence needs to provide a value that reflects the confidence that this sequence is bound by the factor of interest.<br />
Such confidences may be peak statistics (e.g., number of fragments under a peak) for ChIP data or signal intensities for PBM data. In addition, you need to provide an anchor position within the sequence. <br />
In case of ChIP data, this anchor position could for instance be the peak summit.<br />
An annotated FastA file for ChIP-seq data comprising sequences of length 100 centered around the peak summit could look like:<br />
<br />
> peak: 50; signal: 515<br />
ggccatgtgtatttttttaaatttccac...<br />
> peak: 50; signal: 199<br />
GGTCCCCTGGGAGGATGGGGACGTGCTG...<br />
...<br />
<br />
where the anchor point is given as 50 for the first two sequences, and the confidence amounts to 515 and 199, respectively.<br />
The FastA comment may contain additional annotations of the format <code>key1 : value1; key2: value2;...</code>.<br />
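
The annotation format described above can be read with a few lines of code. The following is a minimal sketch for users who prepare or check such files themselves; the function name <code>parse_annotation</code> is hypothetical and the code is not part of MeDeMo.

```python
def parse_annotation(header):
    """Parse a FastA comment such as '> peak: 50; signal: 515'
    into a dict mapping annotation keys to string values."""
    text = header.lstrip(">").strip()
    annotation = {}
    for entry in text.split(";"):
        if ":" not in entry:
            continue  # skip empty entries, e.g. after a trailing ';'
        key, value = entry.split(":", 1)
        annotation[key.strip()] = value.strip()
    return annotation


ann = parse_annotation("> peak: 50; signal: 515")
position = int(ann["peak"])        # anchor position (tag "peak")
confidence = float(ann["signal"])  # confidence value (tag "signal")
```

Additional annotations are simply kept as further dictionary entries, so a file with extra keys can still be inspected in the same way.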
<br />
Accordingly, you would need to set the parameter &quot;Position tag&quot; to <code>peak</code> and the parameter &quot;Value tag&quot; to <code>signal</code> for the input file (default values).<br />
The parameter Alphabet specifies the symbols of the (extended) alphabet and their complementary symbols. Default is standard DNA alphabet.<br />
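
To illustrate how such an alphabet specification pairs each symbol with its complement, the following sketch builds the mapping and derives a reverse complement. The function names are hypothetical, and the code only mirrors the documented format (forward symbols, a comma, then the complements in the same order); it is not the MeDeMo implementation.

```python
def complement_map(alphabet_spec):
    """Turn a specification such as 'ACGTMH,TGCAHM' into a dict
    mapping each symbol to its complementary symbol."""
    forward, complement = alphabet_spec.split(",")
    if len(forward) != len(complement):
        raise ValueError("both halves of the alphabet must have equal length")
    return dict(zip(forward, complement))


def reverse_complement(sequence, alphabet_spec):
    """Reverse the sequence and replace each symbol by its complement."""
    mapping = complement_map(alphabet_spec)
    return "".join(mapping[symbol] for symbol in reversed(sequence))


print(complement_map("ACGT,TGCA"))
print(reverse_complement("ACMG", "ACGTMH,TGCAHM"))
```

With the methylation-aware specification, <code>M</code> and <code>H</code> are complements of each other, so a strand switch keeps the methylation information.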
<br />
For the standard deviation of the position prior, the initial motif length and the number of pre-optimization runs, we provide default values that worked well in our studies on ChIP and PBM data. <br />
However, you may want to adjust these parameters to meet your prior information.<br />
<br />
The parameter &quot;Markov order of the motif model&quot; sets the order of the inhomogeneous Markov model used for modeling the motif. If this parameter is set to <code>0</code>, you obtain a position weight matrix (PWM) model. <br />
If it is set to <code>1</code>, you obtain a weight array matrix (WAM) model. You can set the order of the motif model to at most <code>3</code>.<br />
<br />
The parameter &quot;Markov order of the background model&quot; sets the order of the homogeneous Markov model used for modeling positions not covered by a motif. <br />
If this parameter is set to <code>-1</code>, you obtain a uniform distribution, which worked well for ChIP data. For PBM data, orders of up to <code>4</code> resulted in an increased prediction performance in our case studies. The maximum allowed value is <code>5</code>.<br />
<br />
The parameter &quot;Weighting factor&quot; defines the proportion of sequences that you expect to be bound by the targeted factor with high confidence. For ChIP data, the default value of <code>0.2</code> typically works well. <br />
For PBM data, containing a large number of unspecific probes, this parameter should be set to a lower value, e.g. <code>0.01</code>.<br />
<br />
The &quot;Equivalent sample size&quot; reflects the strength of the influence of the prior on the model parameters, where higher values smooth out the parameters to a greater extent.<br />
<br />
The parameter &quot;Delete BSs from profile&quot; defines if BSs of already discovered motifs should be deleted, i.e., &quot;blanked out&quot;, from the sequence before searching for further motifs.<br />
<br />
You can also install this web-application within your local Galaxy server. Instructions can be found at the Dimont page of Jstacs. <br />
There you can also download a command line version of Dimont.<br />
<br />
If you experience problems using Methyl SlimDimont, please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Methyl SlimDimont'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar slimdimont<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>Alphabet (Characters of the alphabet as a string of unseparated characters, first listing the symbols in forward orientation and then their complement in the same order. For instance, a methylation-aware alphabet would be specified as ACGTMH,TGCAHM and a standard DNA alphabet as ACGT,TGCA, default = ACGTMH,TGCAHM)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>Input file (The file name of the file containing the input sequences in annotated FastA format as generated by the Data Extractor tool)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Background sample (Background sample containing negative examples, may be di-nucleotide shuffled input sequences, range={background file, shuffled input}, default = shuffled input)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;background file&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bf</font></td><br />
<td>Background file (The file name of the file containing background sequences in annotated FastA format., OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;shuffled input&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Position tag (The tag for the position information in the FastA-annotation of the input file, default = peak)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>Value tag (The tag for the value information in the FastA-annotation of the input file, default = signal)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Standard deviation (The standard deviation of the position distribution centered at the position specified by the position tag, valid range = [1.0, 10000.0], default = 75.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>Weighting factor (The value for weighting the data, between 0 and 1, valid range = [0.0, 1.0], default = 0.2)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Starts</font></td><br />
<td>Starts (The number of pre-optimization runs., valid range = [1, 100], default = 20)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">imw</font></td><br />
<td>Initial motif width (The motif width that is used initially, may be adjusted during optimization., valid range = [1, 50], default = 20)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Model type (The type of the motif model; a PWM model corresponds to a Markov model of order 0., range={LSlim model, Markov model}, default = LSlim model)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;LSlim model&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">md</font></td><br />
<td>Maximum distance (The maximum distance considered in the LSlim model, valid range = [1, 2147483647], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;Markov model&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">o</font></td><br />
<td>Order (The order of the Markov model, valid range = [0, 5], default = 0)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">moobm</font></td><br />
<td>Markov order of background model (The Markov order of the model for the background sequence, -1 defines uniform distribution., valid range = [-1, 5], default = -1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">e</font></td><br />
<td>Equivalent sample size (Reflects the strength of the prior on the model parameters., valid range = [0.0, Infinity], default = 4.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>Delete BSs from profile (A switch for deleting binding site positions of discovered motifs from the profile before searching for further motifs., default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">afs</font></td><br />
<td>Adjust for shifts (Adjust for shifts of the motif., default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>INT</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar slimdimont i=1extracted/NRF1_HepG2_train1/Extracted_sequences.fasta m="Markov model" outdir=2train threads=8<br />
<br />
=== Sequence Scoring ===<br />
<br />
'''Sequence Scoring''' scans a set of input sequences (e.g., sequences under ChIP-seq peaks) for a given motif model (provided as XML as output by &quot;Methyl SlimDimont&quot;) and provides, per sequence, i) the start position and strand of the best motif match, ii) the corresponding maximum score, iii) the log-sum occupancy score, iv) the matching sequence, and v) the ID (FastA header) of the sequence.<br />
<br />
The main purpose of this tool is to determine per-sequence scores for classification, for instance, distinguishing bound from unbound sequences.<br />
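
The two per-sequence scores can be illustrated with a small sketch. It assumes that the log-sum occupancy score is the logarithm of the summed exponentiated match scores over all positions and strands; this interpretation, the function name <code>summarize_matches</code>, and the input layout are assumptions and may differ from the actual implementation.

```python
import math


def summarize_matches(matches):
    """matches: list of (start, strand, log_score) tuples, one per
    candidate motif position and strand of a single sequence.

    Returns (start, strand, max_score, log_sum_score).
    """
    best = max(matches, key=lambda m: m[2])
    max_score = best[2]
    # Numerically stable log(sum(exp(score))) over all candidate matches
    log_sum = max_score + math.log(
        sum(math.exp(score - max_score) for _, _, score in matches)
    )
    return best[0], best[1], max_score, log_sum


matches = [(0, "+", -2.0), (0, "-", -5.0), (1, "+", -2.0)]
start, strand, max_score, log_sum = summarize_matches(matches)
```

The maximum score reflects only the single best match, whereas the log-sum score also rewards sequences with several good matches, which is why both may be useful for classification.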
<br />
If you experience problems using Sequence Scoring, please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Sequence Scoring'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar score<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>Input sequences (Input sequences in FastA format)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Model (Model XML as output by Methyl SlimDimont)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar score i=1extracted/NRF1_GM12878_test1/Extracted_sequences.fasta m=2train/Motif_1/SlimDimont_1.xml outdir=3score/NRF1_GM12878<br />
<br />
=== Evaluate Scoring ===<br />
<br />
'''Evaluate Scoring''' computes the area under the ROC curve and under the precision-recall curve based on the scoring of a positive and a negative set of sequences. Optionally, the curves may also be drawn.<br />
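
For orientation, the area under the ROC curve equals the probability that a randomly chosen positive sequence scores higher than a randomly chosen negative one, counting ties as one half. The sketch below computes it directly from this definition; it is a plain illustration with a hypothetical function name, not the routine used by Evaluate Scoring.

```python
def auc_roc(positives, negatives):
    """Area under the ROC curve from two lists of scores, computed as
    the fraction of (positive, negative) pairs ranked correctly.
    Ties count as half a correctly ranked pair."""
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


print(auc_roc([3.0, 2.0], [1.0, 0.0]))
```

The pairwise loop is quadratic in the number of sequences and only meant for small examples; rank-based formulations give the same value more efficiently.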
<br />
''Evaluate Scoring'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar eval<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Positives (Output of "Sequence Scoring" for positive test sequences.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Negatives (Output of "Sequence Scoring" for negative test sequences.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>Curves (Also compute and draw curves, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>Use sum-occupancy (Use log-sum occupancy score instead of maximum, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar eval<br />
<br />
<br />
=== Motif scores ===<br />
<br />
'''Motif scores''' computes features based on motif scores of a given motif model scanning sub-sequences along the genome. Motif scores are aggregated in bins of the specified width as maximum score and log of the average exponential score (i.e., average log-likelihood in case of statistical models). The motif model may be provided as PWMs in HOCOMOCO or PFMs in Jaspar format, or as Dimont motif models in XML format. For more complex motif models like Slim models, the current implementation uses several indexes to speed up the scanning process. However, computation of these indexes is rather memory-consuming and often not reasonable for simple PWM models. Hence, a low-memory variant of the tool is available, which is typically only slightly slower for PWM models but substantially slower for Slim models. Output is provided as a gzipped file ''Motif_scores.tsv.gz'' containing columns chromosome, start position, maximum and average score. This output file together with a protocol of the tool run is saved to the specified output directory.<br />
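
The aggregation into bins can be sketched as follows, assuming per-position log-scores are already available as a list. The function name <code>aggregate_bins</code> and the handling of the last, possibly shorter bin are assumptions made for illustration only.

```python
import math


def aggregate_bins(scores, bin_width):
    """Aggregate per-position log-scores into consecutive bins.

    Returns one (bin_start, max_score, log_mean_exp) tuple per bin,
    where log_mean_exp is the log of the average exponentiated score.
    """
    features = []
    for start in range(0, len(scores), bin_width):
        chunk = scores[start:start + bin_width]
        max_score = max(chunk)
        # Stable computation of log(mean(exp(score))) within the bin
        mean_exp = sum(math.exp(s - max_score) for s in chunk) / len(chunk)
        features.append((start, max_score, max_score + math.log(mean_exp)))
    return features


print(aggregate_bins([0.0, 0.0, 1.0], 2))
```

Subtracting the bin maximum before exponentiating avoids numerical underflow for strongly negative log-scores, which are common for motif models.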
<br />
''Motif scores'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar motif<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Motif model (The motif model in Dimont, HOCOMOCO, or Jaspar format, range={Dimont, HOCOMOCO, Jaspar}, default = Dimont)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;Dimont&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>Dimont motif (Dimont motif model description)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;HOCOMOCO&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">h</font></td><br />
<td>HOCOMOCO PWM (PWM from the HOCOMOCO database)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;Jaspar&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">j</font></td><br />
<td>Jaspar PFM (PFM in Jaspar format)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>Genome (Genome as FastA file)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>FAI of genome (FastA index file of the genome)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Bin width (The width of the genomic bins considered)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">l</font></td><br />
<td>Low-memory mode (Use slower mode with a smaller memory footprint, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>INT</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar motif<br />
<br />
<br />
=== Quick Prediction Tool ===<br />
<br />
'''Quick Prediction Tool''' predicts binding sites of a transcription factor based on a motif model and is also suited for genome-wide predictions. The motif model is provided as the XML output of (Slim) Dimont. <br />
<br />
The tool outputs a list of predictions including, for every prediction, the ID of the sequence (e.g., chromosome) containing the binding site, position and strand of the matching sub-sequence, its score according to the model, the sub-sequence itself (in strand orientation according to the model), and a p-value from a normal distribution fitted to the score distribution of the provided negative examples or a sub-sample of the input data (parameter &quot;Background sample&quot;).<br />
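
The idea of deriving p-values and a prediction threshold from a fitted normal distribution can be sketched as follows. The sketch assumes an upper-tail p-value and a fit by sample mean and sample standard deviation; these details and the function names are assumptions, not a description of the tool's internals.

```python
from statistics import NormalDist, mean, stdev


def fit_background(background_scores):
    """Fit a normal distribution to the scores of background sequences."""
    return NormalDist(mean(background_scores), stdev(background_scores))


def p_value(score, fitted):
    """Upper-tail probability of observing at least this score."""
    return 1.0 - fitted.cdf(score)


def score_threshold(significance_level, fitted):
    """Score above which the p-value falls below the significance level."""
    return fitted.inv_cdf(1.0 - significance_level)


fitted = fit_background([-1.0, 0.0, 1.0])
print(p_value(0.0, fitted))
print(score_threshold(1.0e-6, fitted))
```

A smaller significance level moves the threshold further into the upper tail of the background distribution and therefore yields fewer predicted sites.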
<br />
If you experience problems using Quick Prediction Tool, please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
''Quick Prediction Tool'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar quickpred<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>Dimont model (The model returned by Dimont (in XML format))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Sequences (The sequences (e.g., a genome) to scan for binding sites)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>Background sample (The sequences for determining the prediction threshold. Either a sub-sample of the input sequences or a dedicated background data set., range={sub-sample, background sequences}, default = sub-sample)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;sub-sample&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;background sequences&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">bs</font></td><br />
<td>Background sequences (The sequences (e.g., a genome) for determining the prediction threshold)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>Threshold specification (The way of defining the prediction threshold. Either by explicitly defining a significance level or by specifying the number of expected sites, range={significance level, number of sites}, default = significance level)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;significance level&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sl</font></td><br />
<td>Significance level (The significance level for determining the prediction threshold, valid range = [0.0, 1.0E-4], default = 1.0E-6)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;number of sites&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>Number of sites (The number of expected binding sites for determining the prediction threshold, valid range = [1, 1000000], default = 10000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar quickpred<br />
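
A call with explicit parameters might be assembled as in the following Python sketch. It assumes that parameters are passed as key=value pairs after the tool name, as in other Jstacs command line tools, and that the default threshold specification (significance level) is used. The file names are hypothetical; only the parameter keys and the valid range of the significance level are taken from the parameter table.

```python
def quickpred_command(model, sequences, significance=1.0e-6, outdir=".",
                      jar="MeDeMo-1.0.jar"):
    """Build an argument list for the Quick Prediction Tool.

    Assumption: parameters are given as key=value pairs after the tool name.
    """
    # The parameter table lists [0.0, 1.0E-4] as the valid range for "sl".
    if not 0.0 <= significance <= 1.0e-4:
        raise ValueError("significance level must lie in [0.0, 1.0E-4]")
    return [
        "java", "-jar", jar, "quickpred",
        f"d={model}",          # Dimont model in XML format
        f"s={sequences}",      # sequences to scan for binding sites
        f"sl={significance}",  # significance level for the prediction threshold
        f"outdir={outdir}",    # output directory
    ]


# Hypothetical file names, for illustration only.
command = quickpred_command("model.xml", "genome.fa", significance=1.0e-5)
print(" ".join(command))
```

The resulting list could be passed to <code>subprocess.run</code> to start the tool from a pipeline script.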
<br />
<br />
=== Methylation Sensitivity ===<br />
<br />
'''Methylation Sensitivity''' determines average methylation sensitivity profiles for CpG dinucleotides converted to MpG, CpH, and MpH. As input, it needs a model XML file as generated by &quot;Methyl SlimDimont&quot; and the prediction file output by the corresponding training run.<br />
<br />
Optionally, Methylation Sensitivity also generates per-sequence methylation sensitivity profiles for the MpH context.<br />
<br />
If you experience problems using Methylation Sensitivity, please [mailto:grau@informatik.uni-halle.de contact] us.<br />
<br />
<br />
<br />
''Methylation Sensitivity'' may be called with<br />
<br />
java -jar MeDeMo-1.0.jar msens<br />
<br />
and has the following parameters:
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>Model (The XML file containing the model)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>Predictions (The file containing the predictions from the training run)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>Verbose (Output MpH sensitivity profile for every input sequence, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar MeDeMo-1.0.jar msens</div>Grau