Frequently asked questions:
===========================

1) Why does the Extractor not return a single CDS-part, protein, ...?
Please check whether the names of your contigs/chromosomes in your annotation (gff) and genome file (fasta) are identical. The fasta comments should at best only contain the contig/chromosome name. (Since GeMoMa 1.4, comments, which contain the contig/chromosome name and some additional information separated by a space, are also fine.) In addition, check the statistics that are given by the Extractor. It lists how many genes have been read and how many genes have been removed for different reasons. One common problem is that some annotation files do not include the stop codon in the CDS annotation.

2) How can I force GeMoMa to make more predictions?
There are several parameters affecting the number of predictions. The most prominent are the number of predictions (p) and the contig threshold (ct).
For each reference transcript/CDS, GeMoMa initially makes a preliminary prediction and uses this prediction to determine whether a contig is promising and should be used to determine the final predictions. You may decrease ct and increase p to have more contigs in the final prediction. Increasing the number of predictions allows GeMoMa to output more predictions that have been computed. Decreasing the contig threshold allows to increase the number of predictions that are (internally) computed. Increasing p to a very large number without decreasing ct does not help.

3) Running GeMoMa on a single contig of my assembly yield thousands of weird predictions. What went wrong?
By default, GeMoMa is not build to be run on a single contig. GeMoMa tries to make predictions for all given reference CDS in the given target sequence(s). If the given target sequence is only a fraction of the complete target genome/assembly, GeMoMa will produce weird predictions as it does not filter for the quality of the predictions internally. There are two options to handle this:
	* Use a list of gene models that you expect to be located on this contig (cf. parameter "selected").
	* Filter the predictions using GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>).

4) Is it mandatory to use RNA-seq data?
No, GeMoMa is able to make predictions with and without RNA-seq evidence.

5) Is it possible to use multiple reference organisms?
It is possible to use multiple reference organisms for GeMoMa. Just run GeMoMa on each reference organism separately. Finally, you can employ GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to combine these annotations.

6) Why do some reference genes not lead to a prediction in the target genome?
Please first check whether your reference genes have been discarded by the Extractor (cf. assignment file).
If the genes have been discarded, there are two possibilities:
	* The CDS might be redundant, i.e. the coding exons are identical to those of another transcript. In this case, only one CDS is further evaluated.
	* There might be something wrong with your reference genes, e.g., missing start codon, missing stop codon, premature stop codon, ambiguous nucleotides, ... and you should check the options of Extractor or the annotation.
If the reference genes passed the Extractor, there are several possible explanations for this behavior. The two most prominent are:
	* GeMoMa stopped the prediction of a reference genes since it does not return a result within the given time (cf. parameter "timeout").
	* GeMoMa simply did not find a prediction matching the remaining quality criteria.

7) What does "partial gene model" mean in the context of GeMoMa?
We called a gene model partial if it does not contain an initial start codon and a final stop codon. However, this does not mean that the gene model is located at or close to the border of a chromosome or contig.

8) For two different reference transcripts, the predictions of GeMoMa overlap or are identical. What should I do with those?
GeMoMa makes the predictions for each reference transcript independently. Hence, it can occur that some of predictions of different reference transcripts overlap or are identical especially in gene families. Typically, you might like to filter or rank these predictions. We have implemented GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to do this automatically. However, you can also do it by hand using the GFF attributes. Using RNA-seq data in GeMoMa yields additional fields in the annotation that can be used, e.g., average coverage (avgCov).

9) A lot of transcripts have been filtered out by the Extractor. What can I do?
There are several reasons for removing transcripts by the Extractor. At least in two cases you can try to get more transcripts by setting specific parameter values. First, if the transcript contains ambiguous nucleotides, please test the parameter "Ambiguity". Second, sometimes we received GFFs which contain wrong phases for CDS entries (e.g., 0 for all CDS entries in the phase column of the GFF). Since version 1.3.2, we provide the option "r" which stands for repair. If r=true is chosen, the Extractor tries to infer all phases for transcripts that show an error and would be filtered out.

10) Is GeMoMa able to predict pseudo-genes/ncRNA?
No, currently not.

11) My RNA-seq data indicates there is an additional intron in a transcript, but GeMoMa does not predict this. Or vice versa, GeMoMa predicts an intron that is not supported by RNA-seq data. What's the reason?
GeMoMa is mainly based on the assumptions of amino acid and intron position conservation between reference and target species. Hence, GeMoMa tries to predict a gene model with similar exon-intron-structure in the target species and does not stick too much with RNA-seq data. Although intron position conservation can be observed in most cases, sometimes new introns evolve or others vanish. For this reasons, GeMoMa also allows for the inclusion or exclusion of introns adding some additional costs (cf. GeMoMa parameter intron-loss-gain-penalty). However, the behaviour of GeMoMa depends on the parameters settings (especially intron-loss-gain-penalty, sm (substitution matrix), go (gap opening), ge (gap extension)) and the length of the missed/additional intron. Nevertheless, such cases can only occur if the additional/missed intron has a length that can be divided by 3 preserving the reading frame.
Since the available RNA-seq data only reflects a fraction of tissues/environmental conditions/..., missing RNA-seq evidence does not necessarily mean that the predictions is wrong.

12) My RNA-seq data indicates two alternative, highly overlapping introns. Interestingly, GeMoMa does not take the intron that is more abundant. Why?
GeMoMa reads the introns from the input file using some filter (cf. GeMoMa parameter r (reads)). All introns that pass the filter are used and treated equally. Hence, GeMoMa uses the intron that matches the expectation of intron position and amino acid conservation compared to the reference transcript.

13) Does GeMoMa predict multiple transcripts per gene?
GeMoMa in principle allows to predict multiple transcripts per gene, if corresponding transcripts are given in the reference species or if multiple reference species are used. 

14) GeMoMa failed with java.lang.OutOfMemoryError. What can I do?
Whenever you see a java.lang.OutOfMemoryError, you should rerun the program with Java virtual machine (VM) options. More specifically you should set: -Xms the initially used RAM, e.g. to 5Gb (Xms5G), and -Xmx the maximally used RAM, e.g. to 50Gb (-Xmx50G). GeMoMa often needs more memory if you have a large genome and if youre providing a large coverage file (extracted from RNA-seq data). If you dont have a compute node with enough memory, you can run GeMoMa without coverage, which will return the same predictions, but does not include all statistics. Another point could be the protein alignment, if you use the optional parameter "query protein" (below version 1.7) or "protein alignment" (since version 1.7). Again you can run GeMoMa without protein alignment, which will return the same predictions, but less statistics. 

15) I need to specify the genetic code for my organisms. What is the expected format?
The genetic code is given in a two column tab-delimited table, where the first column is the one letter code of the amino acid and the second column is a comma-separated list of triplets. As we are working on genomic DNA, GeMoMa expects the bases A, C, G, and T, and not U (as expected in mRNA). Here is the link to the default genetic code, which might be used as template:
https://github.com/Jstacs/Jstacs/blob/master/projects/gemoma/test_data/genetic_code.txt
Alternative genetic codes are described here using the RNA alphabet:
https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi
The genetic code might be specified for a reference organism in the module Extractor or for a target organism in the module GeMoMa.

16) I like to accelerate GeMoMa. What can I do?
You can use several threads for the computation. If you run the GeMoMaPipeline you just have to select threads=<your_number>.
In addition, you can change the search algorithm that is used in GeMoMa. Tblastn is used by default as search algorithm in GeMoMa (for historical reasons, until version 1.6.4). However, tblastn can be replaced by mmseqs which is typically much faster. If you run the GeMoMaPipeline you just have to select tblastn=false, which is default since version 1.7. However, changing the search algorithm can also effect the results. We try to minimize these effect using specific parameters for the search algorithms.
If you modify other parameters, you will probably receive results that differ to a larger extend from those received using the default parameters.   

17) Is there a way to use GeMoMa to search a single CDS or protein sequence against a genome and return the predicted gene model (CDS fasta, protein fasta, GFF) similar to exonerate?
There at least two ways to do this. If you use GeMoMaPipeline you can 
(A)	Use the parameter selected to select specific gene models (=transcripts/proteins) from the annotation instead of using all or
(B)	Use s=pre-extracted, use a fasta file with the proteins for the parameter cds-parts and leave assignment unset.
Using one of these options you can either look for a single or few transcripts/proteins either with (A) or without (B) intron-position conservation. In addition, you can use RNA-seq data to improve the predictions, which should be not possible with exonerate.

18) Can I determine synteny based on GeMoMa predictions?
Yes, since version 1.7 we provide the module SyntenyChecker and a R script that can be used for this purpose. It exploits the fact the the reference gene and the alternative are known. Hence no alignment is need at this point and synteny can be determined quite fast. 

19) How, can I add additional attributes to the annotation?
Additional attribute, e.g., functional annotation from InterProScan, can be added to the structural gene annotation using the module AddAttribute, which has been included since version 1.7. Such additional attributes might be used in GAF for filtering and sorting and can also be displayed in genome browsers like IGV or WebApollo. 

20) Can structural gene annotation provided by GeMoMa be submitted to NCBI?
Yes, NCBI allows to submit structural gene annotation in GFF format (https://www.ncbi.nlm.nih.gov/genbank/genomes_gff/). If you run GeMoMaPipeline or AnnotationFinalizer, the GFF should be valid for conversion. 

21) Running GeMoMaPipeline throws an exception. Can I restart GeMoMaPipeline using intermediate results?
Yes, since version 1.7 we have a new parameter in GeMoMaPipeline called restart. 
If you want to restart the last broken GeMoMaPipeline run, you have to execute GeMoMaPipeline with the same command line as before and add restart=true.
If necessary, you can also slightly change the other parameters. However, if the parameters differ too much from those used before, GeMoMaPipeline will decide to perform a new independent run.
A restart of GeMoMaPipeline is particularly useful if the time-consuming search (tblastn or mmseqs) was successful, since this can save runtime. 

For any further questions or comments please contact jens.keilwagen@julius-kuehn.de 