GeMoSeq: Difference between revisions

Latest revision as of 16:08, 18 November 2025

GeMoSeq reconstructs genes and transcript models from mapped RNA-seq reads (in coordinate-sorted BAM format) and reports these in GFF format.

It is intended as a companion for the homology-based gene prediction program GeMoMa.

In a typical workflow, predictions of transcript models may be obtained from GeMoSeq for a collection of BAM files individually and subsequently merged using the GeMoMa Annotation Filter (GAF). Optionally, homology-based gene prediction may be performed using GeMoMa and the resulting GFF files may be merged using the Merge tool of GeMoSeq.

Command line tool

GeMoSeq is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.

GeMoSeq and auxiliary tools are packaged in one runnable JAR that may be run from the command line with

java -jar GeMoSeq-1.2.3.jar

which lists the available tools and usage information:

Available tools:

	gemoseq - GeMoSeq
	predictCDS - Predict CDS from GFF
	GAF - GeMoMa Annotation Filter
	Analyzer - Analyzer
	merge - Merge

Syntax: java -jar GeMoSeq-1.2.3.jar <toolname> [<parameter=value> ...]

Further info about the tools is given with
	java -jar GeMoSeq-1.2.3.jar <toolname> info

For tests of individual tools:
	java -jar GeMoSeq-1.2.3.jar <toolname> test [<verbose>]

Tool parameters are listed with
	java -jar GeMoSeq-1.2.3.jar <toolname>

To list the parameters of a tool, call GeMoSeq-1.2.3.jar with the corresponding tool name, e.g.,

java -jar GeMoSeq-1.2.3.jar gemoseq

The meaning of the individual tool parameters is described below. For convenience, we also include the GeMoMa tools Analyzer and GAF.

Source code

The source code of GeMoSeq is available from the Jstacs GitHub repository.

Examples

We give examples for applying GeMoSeq to a single sequencing library and for a larger-scale, integrated genome annotation together with GeMoMa on a separate wiki page.

GeMoSeq

Prediction of transcript models using GeMoSeq.

GeMoSeq may be called with

java -jar GeMoSeq-1.2.3.jar gemoseq

and has the following parameters

name	comment	type

g	Genome (Genome sequence as FastA, type = fa,fna,fasta)	FILE
m	Mapped reads (Mapped Reads in BAM format, coordinate sorted, type = bam)	FILE
s	Stranded (Library strandedness, range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)	STRING
l	Longest intron length (Length of the longest intron reported, default = 100000)	INT
sil	Shortest intron length (Length of the shortest intron considered, default = 10)	INT
lr	Long reads (Long-read mode, default = false)	BOOLEAN
mnor	Minimum number of reads (Minimum number of reads required for an edge in the read graph, default = 1.0)	DOUBLE
mfor	Minimum fraction of reads (Minimum fraction of reads relative to adjacent exons that must support an intron in the enumeration, default = 0.01)	DOUBLE
mnoir	Minimum number of intron reads (Minimum number of reads required for an intron, default = 1.0)	DOUBLE
mfoir	Minimum fraction of intron reads (Minimum fraction of reads relative to adjacent exons for an intron to be considered, default = 0.01)	DOUBLE
p	Percent explained (Percent of abundance that must be explained by transcript models after quantification, default = 0.9)	DOUBLE
mrpg	Minimum reads per gene (Minimum abundance required for a gene to be reported, default = 40.0)	DOUBLE
mrpt	Minimum reads per transcript (Minimum abundance required for a transcript to be reported, default = 20.0)	DOUBLE
pa	Percent abundance (Minimum relative abundance required for a transcript to be reported, default = 0.05)	DOUBLE
sf	Successive fraction (Factor of the drop in abundance between successive transcript models, default = 20.0)	DOUBLE
mrl	Maximum region length (Maximum length of a region considered before it is split, default = 750000)	INT
mrc	Maximum region coverage (Maximum coverage in a region before reads are down-sampled, valid range = [0.0, Infinity], default = 100.0)	DOUBLE
mfgl	Maximum filled gap length (Maximum length of a gap filled by dummy reads, default = 50)	INT
q	Quality filter (Minimum mapping quality required for a read to be considered, default = 40)	INT
mpl	Minimum protein length (Minimum length of protein in AA, default = 70)	INT
gp	Gene prefix (Prefix to add to all gene names, default = G)	STRING
gnwc	Gene names with chromosome (If true, gene names will be constructed as <Gene prefix><chr>.<geneNumber>. Gene numbers will be assigned successively across all chromosomes, default = false)	BOOLEAN
outdir	The output directory, defaults to the current working directory (.)	STRING
threads	The number of threads used for the tool, defaults to 1. Currently, I/O of GeMoSeq runs on a single thread and runtime is limited by I/O performance. Hence, running GeMoSeq with a large number of threads is not recommended. On our infrastructure, 6 threads has been the sweet spot.	INT

Example:

java -jar GeMoSeq-1.2.3.jar gemoseq g=<Genome> m=<Mapped_reads>
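
For a collection of BAM files, the call above can be repeated once per library, each with its own output directory. The following is a minimal sketch, not an official wrapper: the directory names (bams, gemoseq_out), the genome file name, and the helper gemoseq_cmd are assumptions made for illustration, and only parameters from the table above (g, m, outdir, threads) are used. By default the script only prints the commands; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Sketch: one GeMoSeq run per coordinate-sorted BAM file.
# JAR, GENOME, BAM_DIR and the output layout are illustrative assumptions.
JAR="${JAR:-GeMoSeq-1.2.3.jar}"
GENOME="${GENOME:-genome.fa}"
BAM_DIR="${BAM_DIR:-bams}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

# Print (or run) the gemoseq call for a single BAM file.
gemoseq_cmd() {
    bam="$1"
    lib=$(basename "$bam" .bam)        # library name taken from the file name
    set -- java -jar "$JAR" gemoseq "g=$GENOME" "m=$bam" \
        "outdir=gemoseq_out/$lib" threads=6
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        # Creating the directory up front is a precaution, not a documented requirement.
        mkdir -p "gemoseq_out/$lib" && "$@"
    fi
}

# Process every BAM file found in BAM_DIR; skip the unexpanded pattern if none exist.
for f in "$BAM_DIR"/*.bam; do
    [ -e "$f" ] || continue
    gemoseq_cmd "$f"
done
```

The value threads=6 follows the remark in the parameter table and may need adjusting on other infrastructure.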

Predict CDS from GFF

Predict CDS from GFF may be called with

java -jar GeMoSeq-1.2.3.jar predictCDS

and has the following parameters

name	comment	type

g	Genome (Genome sequence as FastA, type = fa,fna,fasta)	FILE
p	predicted annotation ("GFF or GTF file containing the predicted annotation", type = gff,gff3,gff.gz,gff3.gz,gtf,gtf.gz)	FILE
m	Minimum protein length (Minimum length of protein in AA, default = 70)	INT
outdir	The output directory, defaults to the current working directory (.)	STRING

Example:

java -jar GeMoSeq-1.2.3.jar predictCDS g=<Genome> p=<predicted_annotation>
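
The annotation types listed for parameter p can be checked before the tool is called. The sketch below assumes that the listed types correspond to file name extensions; the helper names is_supported_annotation and predict_cds_cmd are hypothetical, and the function only prints the resulting command rather than running it.

```shell
#!/bin/sh
# Sketch: validate the annotation file name, then build a predictCDS call.
# JAR and both helper functions are illustrative assumptions.
JAR="${JAR:-GeMoSeq-1.2.3.jar}"

# Succeed if the file name ends in one of the types listed for parameter p.
is_supported_annotation() {
    case "$1" in
        *.gff|*.gff3|*.gff.gz|*.gff3.gz|*.gtf|*.gtf.gz) return 0 ;;
        *) return 1 ;;
    esac
}

# Print the predictCDS command for a genome, an annotation file, and an
# optional minimum protein length (the table lists 70 as the default).
predict_cds_cmd() {
    genome="$1"
    annotation="$2"
    min_len="${3:-70}"
    if ! is_supported_annotation "$annotation"; then
        echo "unsupported annotation file type: $annotation" >&2
        return 1
    fi
    echo "java -jar $JAR predictCDS g=$genome p=$annotation m=$min_len"
}
```

For example, predict_cds_cmd genome.fa predictions.gff3.gz 100 prints a call that requires proteins of at least 100 AA.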

Merge

Merge may be called with

java -jar GeMoSeq-1.2.3.jar merge

and has the following parameters

name	comment	type

g	GeMoMa (GeMoMa predictions, type = gff,gff3)	FILE
GeMoSeq	GeMoSeq (GeMoSeq predictions, type = gff,gff3)	FILE
m	Mode (range={intersect, union, intermediate, annotate}, default = intersect)	STRING
	No parameters for selection "intersect"
	No parameters for selection "union"
	Parameters for selection "intermediate":
	GeMoMa-strict	GeMoMa-strict (GeMoMa predictions with strict settings, type = gff,gff3)	FILE
	GeMoSeq-strict	GeMoSeq-strict (GeMoSeq predictions with strict settings, type = gff,gff3)	FILE
	Parameters for selection "annotate":
	GeMoMa-strict	GeMoMa-strict (GeMoMa predictions with strict settings, type = gff,gff3)	FILE
	GeMoSeq-strict	GeMoSeq-strict (GeMoSeq predictions with strict settings, type = gff,gff3)	FILE
	l	Low-confidence (include low-confidence predictions, default = true)	BOOLEAN
outdir	The output directory, defaults to the current working directory (.)	STRING

Example:

java -jar GeMoSeq-1.2.3.jar merge g=<GeMoMa> GeMoSeq=<GeMoSeq>
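
The "intermediate" and "annotate" modes take the additional strict prediction files, and "annotate" also accepts the l flag. A minimal sketch of an "annotate" call, using only the parameter names listed for Merge; all file names, the output directory merged, and the helper merge_annotate_cmd are hypothetical, and the command is only printed, not executed.

```shell
#!/bin/sh
# Sketch: build a Merge call in "annotate" mode.
# JAR, the file names and the output directory are illustrative assumptions.
JAR="${JAR:-GeMoSeq-1.2.3.jar}"

# Arguments: GeMoMa predictions, GeMoSeq predictions, GeMoMa strict,
# GeMoSeq strict, and whether to include low-confidence predictions
# (the parameter list gives true as the default).
merge_annotate_cmd() {
    echo "java -jar $JAR merge g=$1 GeMoSeq=$2 m=annotate" \
        "GeMoMa-strict=$3 GeMoSeq-strict=$4 l=${5:-true} outdir=merged"
}

# Example: exclude low-confidence predictions from the merged annotation.
merge_annotate_cmd gemoma.gff gemoseq.gff gemoma_strict.gff gemoseq_strict.gff false
```

Replacing m=annotate with m=intermediate and dropping l gives the corresponding "intermediate" call, since l is only listed for "annotate".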

Version history

Version 1.2.3 (2025/11/11): Renamed the tool to GeMoSeq and improved prediction from long-read data
Version 1.2.1 (2025/05/28): improved handling of exceptions in multi-thread mode
Version 1.2 (2025/05/12): changes in the following tools
- gemorna: fixed a problem where (incomplete) CDS would be predicted in transcripts without any proper stop codon
Version 1.1 (2025/04/15): changes in the following tools
- merge: added a flag controlling whether low-confidence predictions are included in "annotate" mode
- gemorna: added options to provide a custom prefix for gene names and to include the chromosome in the gene names
Version 1.0: initial version of GeMoRNA


@@ Line 1: / Line 1: @@
-== Tools ==
+GeMoSeq reconstructs genes and transcript models from mapped RNA-seq reads (in coordinate-sorted BAM format) and reports these in GFF format.
-=== GeMoRNA ===
+It is intended as a companion for the homology-based gene prediction program [[GeMoMa]].
+In a typical workflow, predictions of transcript models may be obtained from GeMoSeq for a collection of BAM files individually and subsequently merged using the [[GeMoMa]] Annotation Filter (GAF). Optionally, homology-based gene prediction may be performed using [[GeMoMa]] and the resulting GFF files may be merged using the [[#Merge|Merge]] tool of GeMoSeq.
-''GeMoRNA'' may be called with
+== Command line tool ==
-  java -jar GeMoRNA-1.0.jar gemorna
+''GeMoSeq is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.''
+GeMoSeq and auxiliary tools are packaged in one [http://www.jstacs.de/downloads/GeMoSeq-1.2.3.jar runnable JAR] that may be run from the command line with
+ java -jar GeMoSeq-1.2.3.jar
+which lists the tools available and usage information
+ Available tools:
+ 	gemoseq - GeMoSeq
+ 	predictCDS - Predict CDS from GFF
+ 	GAF - GeMoMa Annotation Filter
+ 	Analyzer - Analyzer
+ 	merge - Merge
+ Syntax: java -jar GeMoSeq-1.2.3.jar <toolname> [<parameter=value> ...]
+ Further info about the tools is given with
+ 	java -jar GeMoSeq-1.2.3.jar <toolname> info
+ For tests of individual tools:
+ 	java -jar GeMoSeq-1.2.3.jar <toolname> test [<verbose>]
+ Tool parameters are listed with
+ 	java -jar GeMoSeq-1.2.3.jar <toolname>
+You get a list of the tool parameters by calling GeMoSeq-1.2.3.jar with the corresponding tool name, e.g.,
+  java -jar GeMoSeq-1.2.3.jar gemoseq
+The meaning of the individual tool parameters is described below.
+For convenience, we also include the [[GeMoMa]] tools Analyzer and GAF.
+== Source code ==
+The source code of GeMoSeq is available from the [https://github.com/Jstacs/Jstacs/tree/master/projects/gemoseq Jstacs GitHub repository].
+== Examples ==
+We give examples for applying GeMoSeq to a single sequencing library and for a larger-scale, integrated genome annotation together with GeMoMa [[GeMoSeq-Examples|on a separate wiki page]].
+== GeMoSeq ==
+Prediction of transcript models using GeMoSeq.
+''GeMoSeq'' may be called with
+ java -jar GeMoSeq-1.2.3.jar gemoseq
 and has the following parameters
@@ Line 97: / Line 147: @@
 <td>Maximum region length (Maximum length of a region considered before it is split, default = 750000)</td>
 <td style="width:100px;">INT</td>
+</tr>
+<tr style="vertical-align:top">
+<td><font color="green">mrc</font></td>
+<td>Maximum region coverage (Maximum coverage in a region before reads are down-sampled, valid range = [0.0, Infinity], default = 100.0)</td>
+<td style="width:100px;">DOUBLE</td>
 </tr>
 <tr style="vertical-align:top">
@@ Line 112: / Line 167: @@
 <td>Minimum protein length (Minimum length of protein in AA, default = 70)</td>
 <td style="width:100px;">INT</td>
+</tr>
+<tr style="vertical-align:top">
+<td><font color="green">gp</font></td>
+<td>Gene prefix (Prefix to add to all gene names, default = G)</td>
+<td style="width:100px;">STRING</td>
+</tr>
+<tr style="vertical-align:top">
+<td><font color="green">gnwc</font></td>
+<td>Gene names with chromosome (If true, gene names will be constructed as <Gene prefix><chr>.<geneNumber>. Gene numbers will be assigned successively across all chromosomes., default = false)</td>
+<td style="width:100px;">BOOLEAN</td>
 </tr>
 <tr style="vertical-align:top">
@@ Line 120: / Line 185: @@
 <tr style="vertical-align:top">
 <td><font color="green">threads</font></td>
-<td>The number of threads used for the tool, defaults to 1</td>
+<td>The number of threads used for the tool, defaults to 1. Currently, I/O of GeMoSeq runs on a single thread and runtime is limited by I/O performance. Hence, running GeMoSeq with a large number of threads is not recommended. On our infrastructure, a number of 6 threads has been the sweet spot.</td>
 <td>INT</td>
 </tr>
@@ Line 127: / Line 192: @@
 '''Example:'''
-  java -jar GeMoRNA-1.0.jar gemorna g=&lt;Genome&gt; m=&lt;Mapped_reads&gt;
+  java -jar GeMoSeq-1.2.3.jar gemoseq g=&lt;Genome&gt; m=&lt;Mapped_reads&gt;
+== Predict CDS from GFF ==
-=== Predict CDS from GFF ===
@@ Line 136: / Line 200: @@
 ''Predict CDS from GFF'' may be called with
-  java -jar GeMoRNA-1.0.jar predictCDS
+  java -jar GeMoSeq-1.2.3.jar predictCDS
 and has the following parameters
@@ Line 171: / Line 235: @@
 '''Example:'''
-  java -jar GeMoRNA-1.0.jar predictCDS g=&lt;Genome&gt; p=&lt;predicted_annotation&gt;
+  java -jar GeMoSeq-1.2.3.jar predictCDS g=&lt;Genome&gt; p=&lt;predicted_annotation&gt;
-=== Analyzer ===
-This tools allows to compare true annotation with predicted annotation as it is frequently done in benchmark studies. Furthermore, it can return a detailed table comparing true annotation and predicted annotation which might help to identify systematical errors or biases in the predictions. Hence, this tool might help to detect weaknesses of the prediction algorithm.
-True and predicted transcripts are evaluated based on nucleotide F1 measure. For each predicted transcript, the true transcript with highest nucleotide F1 measure is listed. A negative value in a F1 measure column indicates that there is a predicted transcript that matches the true transcript with a F1 measure value that is the absolute value of this entry, but there is another true transcript that matches this predicted transcript with an even better F1. Also true and predicted transcripts are listed that do not overlap with any transcript from the predicted and true annotation, respectively. The table contains the attributes of the true and the predicted annotation besides some additional columns allowing to easily filter interesting examples and to do statistics.
-The evaluation can be based on CDS (default) or exon features. The tool also reports sensitivity and precision for the categories gene and transcript.
-For more information please visit http://www.jstacs.de/index.php/GeMoMa
-If you have any questions, comments or bugs, please check FAQs on our homepage, our github page https://github.com/Jstacs/Jstacs/labels/GeMoMa or contact jens.keilwagen@julius-kuehn.de
-''Analyzer'' may be called with
- java -jar GeMoRNA-1.0.jar Analyzer
-and has the following parameters
-<table border=0 cellpadding=10 align="center" width="100%">
-<tr>
-<td>name</td>
-<td>comment</td>
-<td>type</td>
-</tr>
-<tr><td colspan=3><hr></td></tr>
-<tr style="vertical-align:top">
-<td><font color="green">t</font></td>
-<td>truth (the true annotation, type = gff,gff3,gtf,gff.gz,gff3.gz,gtf.gz)</td>
-<td style="width:100px;">FILE</td>
-</tr>
-<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr>
-<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
-<tr style="vertical-align:top">
-<td><font color="green">n</font></td>
-<td>name (can be used to distinguish different predictions, OPTIONAL)</td>
-<td style="width:100px;">STRING</td>
-</tr>
-<tr style="vertical-align:top">
-<td><font color="green">p</font></td>
-<td>predicted annotation (GFF/GTF file containing the predicted annotation, type = gff,gff3,gtf,gff.gz,gff3.gz,gtf.gz)</td>
-<td style="width:100px;">FILE</td>
-</tr>
-</table>
-</td></tr>
-<tr style="vertical-align:top">
-<td><font color="green">c</font></td>
-<td>CDS (if true CDS features are used otherwise exon features, default = true)</td>
-<td style="width:100px;">BOOLEAN</td>
-</tr>
-<tr style="vertical-align:top">
-<td><font color="green">o</font></td>
-<td>only introns (if true only intron borders (=splice sites) are evaluated, default = false)</td>
-<td style="width:100px;">BOOLEAN</td>
-</tr>
-<tr style="vertical-align:top">
-<td><font color="green">w</font></td>
-<td>write (write detailed table comparing the true and the predicted annotation, range={NO, YES}, default = NO)</td>
-<td style="width:100px;">STRING</td></tr>
-<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
-<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr>
-<tr><td colspan=3><b>Parameters for selection &quot;YES&quot;:</b></td></tr>
-<tr style="vertical-align:top">
-<td><font color="green">ca</font></td>
-<td>common attributes (Only gff attributes of mRNAs are included in the result table, that can be found in the given portion of all mRNAs. Attributes and their portion are handled independently for truth and prediction. This parameter allows to choose between a more informative table or compact table., valid range = [0.0, 1.0], default = 0.5)</td>
-<td style="width:100px;">DOUBLE</td>
-</tr>
-</table></td></tr>
-<tr style="vertical-align:top">
-<td><font color="green">r</font></td>
-<td>reliable (additionally evaluate sensitivity for reliable transcripts, range={NO, YES}, default = NO)</td>
-<td style="width:100px;">STRING</td></tr>
-<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
-<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr>
-<tr><td colspan=3><b>Parameters for selection &quot;YES&quot;:</b></td></tr>
-<tr style="vertical-align:top">
-<td><font color="green">f</font></td>
-<td>filter (A filter for deciding which transcript from the truth are reliable or not. The filter is applied to the GFF attributes of the truth. You probably need to run AnnotationEvidence on the truth GFF. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*'), no premature stop codons (nps==0), RNA-seq coverage (tpc==1) and intron evidence (isNaN(tie) or tie==1)., default = start=='M' and stop=='*' and nps==0 and (tpc==1 and (isNaN(tie) or tie==1)), OPTIONAL)</td>
-<td style="width:100px;">STRING</td>
-</tr>
-</table></td></tr>
-<tr style="vertical-align:top">
-<td><font color="green">outdir</font></td>
-<td>The output directory, defaults to the current working directory (.)</td>
-<td>STRING</td>
-</tr>
-</table>
-'''Example:'''
- java -jar GeMoRNA-1.0.jar Analyzer t=&lt;truth&gt; p=&lt;predicted_annotation&gt;
-=== Merge ===
+== Merge ==
@@ Line 271: / Line 244: @@
 ''Merge'' may be called with
-  java -jar GeMoRNA-1.0.jar merge
+  java -jar GeMoSeq-1.2.3.jar merge
 and has the following parameters
@@ Line 288: / Line 261: @@
 </tr>
 <tr style="vertical-align:top">
-<td><font color="green">GeMoRNA</font></td>
+<td><font color="green">GeMoSeq</font></td>
-<td>GeMoRNA (GeMoRNA predictions, type = gff,gff3)</td>
+<td>GeMoSeq (GeMoSeq predictions, type = gff,gff3)</td>
 <td style="width:100px;">FILE</td>
 </tr>
@@ Line 306: / Line 279: @@
 </tr>
 <tr style="vertical-align:top">
-<td><font color="green">GeMoRNA-strict</font></td>
+<td><font color="green">GeMoSeq-strict</font></td>
-<td>GeMoRNA-strict (GeMoRNA predictions with strict settings, type = gff,gff3)</td>
+<td>GeMoSeq-strict (GeMoSeq predictions with strict settings, type = gff,gff3)</td>
 <td style="width:100px;">FILE</td>
 </tr>
@@ Line 317: / Line 290: @@
 </tr>
 <tr style="vertical-align:top">
-<td><font color="green">GeMoRNA-strict</font></td>
+<td><font color="green">GeMoSeq-strict</font></td>
-<td>GeMoRNA-strict (GeMoRNA predictions with strict settings, type = gff,gff3)</td>
+<td>GeMoSeq-strict (GeMoSeq predictions with strict settings, type = gff,gff3)</td>
 <td style="width:100px;">FILE</td>
+</tr>
+<tr style="vertical-align:top">
+<td><font color="green">l</font></td>
+<td>Low-confidence (include low-confidence predictions, default = true)</td>
+<td style="width:100px;">BOOLEAN</td>
 </tr>
 </table></td></tr>
@@ Line 331: / Line 309: @@
 '''Example:'''
-  java -jar GeMoRNA-1.0.jar merge g=&lt;GeMoMa&gt; GeMoRNA=&lt;GeMoRNA&gt;
+  java -jar GeMoSeq-1.2.3.jar merge g=&lt;GeMoMa&gt; GeMoSeq=&lt;GeMoSeq&gt;
+== Version history ==
+* Version 1.2.3 (2025/11/11): Renamed the tool to GeMoSeq and improved prediction from long-read data
+* [http://www.jstacs.de/downloads/GeMoRNA-1.2.1.jar Version 1.2.1] (2025/05/28): improved handling of exceptions in multi-thread mode
+* [http://www.jstacs.de/downloads/GeMoRNA-1.2.jar Version 1.2] (2025/05/12): changes in the following tools
+** gemorna: fixed a problem where (incomplete) CDS would be predicted in transcripts without any proper stop codon
+* [http://www.jstacs.de/downloads/GeMoRNA-1.1.jar Version 1.1] (2025/04/15): changes in the following tools
+** merge: include flag if low-confidence predictions will be included in "annotate" mode
+** gemorna: allow to provide custom prefix for gene names and to include the chromosome into the gene names
+* [http://www.jstacs.de/downloads/GeMoRNA-1.0.jar Version 1.0]: initial version of GeMoRNA