GeMoSeq
Latest revision as of 16:08, 18 November 2025
GeMoSeq reconstructs genes and transcript models from mapped RNA-seq reads (in coordinate-sorted BAM format) and reports these in GFF format.
It is intended as a companion for the homology-based gene prediction program GeMoMa.
In a typical workflow, predictions of transcript models may be obtained from GeMoSeq for a collection of BAM files individually and subsequently merged using the GeMoMa Annotation Filter (GAF). Optionally, homology-based gene prediction may be performed using GeMoMa and the resulting GFF files may be merged using the Merge tool of GeMoSeq.
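The first step of this workflow, one GeMoSeq run per BAM file, could be scripted roughly as follows. This is only a sketch: the paths `genome.fa`, `bams` and `gemoseq_out` as well as the helper names `run` and `gemoseq_for_bam` are placeholders chosen for illustration, and the script only prints the commands unless `RUN=1` is set.

```shell
# Dry-run helper: prints the command unless RUN=1 is set in the environment.
run() {
    if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$@"; fi
}

JAR="GeMoSeq-1.2.3.jar"
GENOME="genome.fa"      # placeholder: genome sequence as FastA
BAM_DIR="bams"          # placeholder: directory of coordinate-sorted BAM files
OUT_ROOT="gemoseq_out"  # placeholder: one sub-directory per library is created here

# Run gemoseq for one BAM file, writing to its own output directory.
gemoseq_for_bam() {
    bam="$1"
    sample="$(basename "$bam" .bam)"
    if [ "${RUN:-0}" = "1" ]; then mkdir -p "$OUT_ROOT/$sample"; fi
    run java -jar "$JAR" gemoseq "g=$GENOME" "m=$bam" "outdir=$OUT_ROOT/$sample"
}

for bam in "$BAM_DIR"/*.bam; do
    [ -e "$bam" ] || continue   # nothing to do if the glob matched no file
    gemoseq_for_bam "$bam"
done
# The per-library GFF files can then be merged with GAF. Its parameters are
# not listed on this page; they are printed by: java -jar GeMoSeq-1.2.3.jar GAF
```

Writing each library to a separate output directory keeps the per-library predictions apart until they are merged.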
Command line tool
GeMoSeq is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.
GeMoSeq and auxiliary tools are packaged in one runnable JAR (http://www.jstacs.de/downloads/GeMoSeq-1.2.3.jar) that may be run from the command line with
java -jar GeMoSeq-1.2.3.jar
which lists the available tools and usage information:
Available tools:
    gemoseq - GeMoSeq
    predictCDS - Predict CDS from GFF
    GAF - GeMoMa Annotation Filter
    Analyzer - Analyzer
    merge - Merge
Syntax: java -jar GeMoSeq-1.2.3.jar <toolname> [<parameter=value> ...]
Further info about the tools is given with
    java -jar GeMoSeq-1.2.3.jar <toolname> info
For tests of individual tools:
    java -jar GeMoSeq-1.2.3.jar <toolname> test [<verbose>]
Tool parameters are listed with
    java -jar GeMoSeq-1.2.3.jar <toolname>
You get a list of the tool parameters by calling GeMoSeq-1.2.3.jar with the corresponding tool name, e.g.,
java -jar GeMoSeq-1.2.3.jar gemoseq
The meaning of the individual tool parameters is described below. For convenience, we also include the GeMoMa tools Analyzer and GAF.
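As a small convenience, the `info` call described above can be looped over all tool names from the "Available tools" listing. The helper `run` is a hypothetical dry-run wrapper that only prints the commands unless `RUN=1` is set.

```shell
# Dry-run helper: prints the command unless RUN=1 is set in the environment.
run() {
    if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$@"; fi
}

JAR="GeMoSeq-1.2.3.jar"

# Tool names as printed under "Available tools".
for tool in gemoseq predictCDS GAF Analyzer merge; do
    run java -jar "$JAR" "$tool" info
done
```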
Source code
The source code of GeMoSeq is available from the Jstacs GitHub repository (https://github.com/Jstacs/Jstacs/tree/master/projects/gemoseq).
Examples
We give examples for applying GeMoSeq to a single sequencing library and for a larger-scale, integrated genome annotation together with GeMoMa on a separate wiki page.
GeMoSeq
Prediction of transcript models using GeMoSeq.
GeMoSeq may be called with
java -jar GeMoSeq-1.2.3.jar gemoseq
and has the following parameters
| name | comment | type |
| g | Genome (Genome sequence as FastA, type = fa,fna,fasta) | FILE |
| m | Mapped reads (Mapped Reads in BAM format, coordinate sorted, type = bam) | FILE |
| s | Stranded (Library strandedness, range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED) | STRING |
| l | Longest intron length (Length of the longest intron reported, default = 100000) | INT |
| sil | Shortest intron length (Length of the shortest intron considered, default = 10) | INT |
| lr | Long reads (Long-read mode, default = false) | BOOLEAN |
| mnor | Minimum number of reads (Minimum number of reads required for an edge in the read graph, default = 1.0) | DOUBLE |
| mfor | Minimum fraction of reads (Minimum fraction of reads relative to adjacent exons that must support an intron in the enumeration, default = 0.01) | DOUBLE |
| mnoir | Minimum number of intron reads (Minimum number of reads required for an intron, default = 1.0) | DOUBLE |
| mfoir | Minimum fraction of intron reads (Minimum fraction of reads relative to adjacent exons for an intron to be considered, default = 0.01) | DOUBLE |
| p | Percent explained (Percent of abundance that must be explained by transcript models after quantification, default = 0.9) | DOUBLE |
| mrpg | Minimum reads per gene (Minimum abundance required for a gene to be reported, default = 40.0) | DOUBLE |
| mrpt | Minimum reads per transcript (Minimum abundance required for a transcript to be reported, default = 20.0) | DOUBLE |
| pa | Percent abundance (Minimum relative abundance required for a transcript to be reported, default = 0.05) | DOUBLE |
| sf | Successive fraction (Factor of the drop in abundance between successive transcript models, default = 20.0) | DOUBLE |
| mrl | Maximum region length (Maximum length of a region considered before it is split, default = 750000) | INT |
| mrc | Maximum region coverage (Maximum coverage in a region before reads are down-sampled, valid range = [0.0, Infinity], default = 100.0) | DOUBLE |
| mfgl | Maximum filled gap length (Maximum length of a gap filled by dummy reads, default = 50) | INT |
| q | Quality filter (Minimum mapping quality required for a read to be considered, default = 40) | INT |
| mpl | Minimum protein length (Minimum length of protein in AA, default = 70) | INT |
| gp | Gene prefix (Prefix to add to all gene names, default = G) | STRING |
| gnwc | Gene names with chromosome (If true, gene names will be constructed as <Gene prefix><chr>.<geneNumber>. Gene numbers will be assigned successively across all chromosomes, default = false) | BOOLEAN |
| outdir | The output directory, defaults to the current working directory (.) | STRING |
| threads | The number of threads used for the tool, defaults to 1. Currently, the I/O of GeMoSeq runs on a single thread and runtime is limited by I/O performance. Hence, running GeMoSeq with a large number of threads is not recommended. On our infrastructure, 6 threads gave the best performance. | INT |
Example:
java -jar GeMoSeq-1.2.3.jar gemoseq g=<Genome> m=<Mapped_reads>
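Optional parameters from the table are appended in the same `name=value` form. The sketch below shows one call for a stranded short-read library and one for long reads. All file and directory names are placeholders, the choice of `FR_FIRST_STRAND` is an assumption about the library, and writing booleans as `lr=true` is assumed from the `parameter=value` syntax rather than confirmed here. The helper `run` only prints the commands unless `RUN=1` is set.

```shell
# Dry-run helper: prints the command unless RUN=1 is set in the environment.
run() {
    if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$@"; fi
}

JAR="GeMoSeq-1.2.3.jar"

# Stranded short-read library, 6 threads (file names are placeholders).
run java -jar "$JAR" gemoseq g=genome.fa m=short_reads.bam \
    s=FR_FIRST_STRAND threads=6 outdir=short_out

# Long-read library: switch on long-read mode (boolean syntax is an assumption).
run java -jar "$JAR" gemoseq g=genome.fa m=long_reads.bam \
    lr=true outdir=long_out
```

The thread count of 6 follows the recommendation in the `threads` row. The best value on other hardware may differ.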
Predict CDS from GFF
Predict CDS from GFF may be called with
java -jar GeMoSeq-1.2.3.jar predictCDS
and has the following parameters
| name | comment | type |
| g | Genome (Genome sequence as FastA, type = fa,fna,fasta) | FILE |
| p | predicted annotation ("GFF or GTF file containing the predicted annotation", type = gff,gff3,gff.gz,gff3.gz,gtf,gtf.gz) | FILE |
| m | Minimum protein length (Minimum length of protein in AA, default = 70) | INT |
| outdir | The output directory, defaults to the current working directory (.) | STRING |
Example:
java -jar GeMoSeq-1.2.3.jar predictCDS g=<Genome> p=<predicted_annotation>
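A slightly fuller call might raise the minimum protein length and set an output directory. The input names `genome.fa` and `transcripts.gtf.gz`, the value `m=100` and the directory `cds_out` are illustrative only. The helper `run` prints the command unless `RUN=1` is set.

```shell
# Dry-run helper: prints the command unless RUN=1 is set in the environment.
run() {
    if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$@"; fi
}

JAR="GeMoSeq-1.2.3.jar"

# Predict CDS for an existing (gzipped GTF) annotation, keeping only
# proteins of at least 100 AA instead of the default 70.
run java -jar "$JAR" predictCDS g=genome.fa p=transcripts.gtf.gz \
    m=100 outdir=cds_out
```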
Merge
Merge may be called with
java -jar GeMoSeq-1.2.3.jar merge
and has the following parameters
| name | comment | type |
| g | GeMoMa (GeMoMa predictions, type = gff,gff3) | FILE |
| GeMoSeq | GeMoSeq (GeMoSeq predictions, type = gff,gff3) | FILE |
| m | Mode (range={intersect, union, intermediate, annotate}, default = intersect) | STRING |
| GeMoSeq-strict | GeMoSeq-strict (GeMoSeq predictions with strict settings, type = gff,gff3); mode-dependent parameter | FILE |
| l | Low-confidence (include low-confidence predictions, default = true); mode-dependent parameter, used in "annotate" mode | BOOLEAN |
| outdir | The output directory, defaults to the current working directory (.) | STRING |
Example:
java -jar GeMoSeq-1.2.3.jar merge g=<GeMoMa> GeMoSeq=<GeMoSeq>
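The mode is selected with `m=<mode>`. The sketch below spells out the default mode and an "annotate" run. It assumes that "annotate" mode takes the `GeMoSeq-strict` file and the `l` flag; this mapping of parameters to modes is inferred and should be checked against the output of `java -jar GeMoSeq-1.2.3.jar merge`. All file and directory names are placeholders, and the helper `run` prints the commands unless `RUN=1` is set.

```shell
# Dry-run helper: prints the command unless RUN=1 is set in the environment.
run() {
    if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$@"; fi
}

JAR="GeMoSeq-1.2.3.jar"

# Default mode, stated explicitly (input file names are placeholders).
run java -jar "$JAR" merge g=gemoma.gff GeMoSeq=gemoseq.gff \
    m=intersect outdir=merged_intersect

# "annotate" mode; assumed to take the strict predictions and the l flag.
run java -jar "$JAR" merge g=gemoma.gff GeMoSeq=gemoseq.gff m=annotate \
    GeMoSeq-strict=gemoseq_strict.gff l=false outdir=merged_annotate
```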
Version history
- Version 1.2.3 (2025/11/11): renamed the tool to GeMoSeq and improved prediction from long-read data
- Version 1.2.1 (2025/05/28, http://www.jstacs.de/downloads/GeMoRNA-1.2.1.jar): improved handling of exceptions in multi-thread mode
- Version 1.2 (2025/05/12, http://www.jstacs.de/downloads/GeMoRNA-1.2.jar): changes in the following tools
  - gemorna: fixed a problem where (incomplete) CDS would be predicted in transcripts without any proper stop codon
- Version 1.1 (2025/04/15, http://www.jstacs.de/downloads/GeMoRNA-1.1.jar): changes in the following tools
  - merge: added a flag that controls whether low-confidence predictions are included in "annotate" mode
  - gemorna: allow providing a custom prefix for gene names and including the chromosome in the gene names
- Version 1.0 (http://www.jstacs.de/downloads/GeMoRNA-1.0.jar): initial version of GeMoRNA