Frequently asked questions:
===========================

1) Why does the Extractor not return a single CDS-part, protein, ...?
First, please check whether the names of your contigs/chromosomes in your annotation (GFF) and genome file (FASTA) are identical. Ideally, the FASTA comments should contain only the contig/chromosome name. (Since GeMoMa 1.4, comments that contain the contig/chromosome name followed by additional information separated by a space are also fine.)
Second, please check whether you have a valid GFF/GTF file. Valid GFF files should have a valid "ID" or "Parent" entry in the attributes column. Valid GTF files should have a valid "gene_id" and "transcript_id" entry.
Finally, please check the statistics that are given by the Extractor. It lists how many genes have been read and how many genes have been removed for different reasons. One common problem is that some annotation files do not include the stop codon in the CDS annotation.
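
The first check above can be sketched in a few lines of Python. This is only an illustration and not part of GeMoMa: the function names are made up, and it assumes that the contig name is the first whitespace-delimited token of each FASTA comment and the first column of each annotation line.

```python
import io


def fasta_contig_names(handle):
    """Return the first whitespace-delimited token of every FASTA comment line."""
    names = set()
    for line in handle:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                names.add(tokens[0])
    return names


def gff_contig_names(handle):
    """Return the values of the first column of all annotation lines."""
    names = set()
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip comments, directives and empty lines
        names.add(line.rstrip("\n").split("\t", 1)[0])
    return names


def missing_contigs(fasta_handle, gff_handle):
    """Contig names used in the annotation but absent from the genome file."""
    return sorted(gff_contig_names(gff_handle) - fasta_contig_names(fasta_handle))


# Toy data, made up for illustration only.
fasta = io.StringIO(">chr1 some description\nACGT\n>chr2\nACGT\n")
gff = io.StringIO(
    "##gff-version 3\n"
    "chr1\tsrc\tCDS\t1\t3\t.\t+\t0\tParent=t1\n"
    "Chr3\tsrc\tCDS\t1\t3\t.\t+\t0\tParent=t2\n"
)
print(missing_contigs(fasta, gff))  # ['Chr3']
```

Any name that is reported here occurs in the annotation but has no matching sequence, so transcripts on that contig cannot be extracted.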

2) How can I force GeMoMa to make more predictions?
There are several parameters affecting the number of predictions. The most prominent are the number of predictions (p) and the contig threshold (ct).
For each reference transcript/CDS, GeMoMa initially makes a preliminary prediction and uses this prediction to determine whether a contig is promising and should be used to determine the final predictions. You may decrease ct and increase p to have more contigs in the final prediction. Decreasing the contig threshold increases the number of predictions that are (internally) computed. Increasing the number of predictions allows GeMoMa to output more of the predictions that have been computed. Increasing p to a very large number without decreasing ct does not help, because p only limits how many of the computed predictions are output.
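
The interplay of the two parameters can be illustrated with a toy model. This is not GeMoMa's actual algorithm: the function toy_select, the made-up scores, and in particular the assumption that a contig counts as promising if its preliminary score reaches ct times the best score are hypothetical and only serve to show why p alone is not sufficient.

```python
def toy_select(preliminary_scores, ct, p):
    """Toy model: keep contigs scoring at least ct * best, then report at most p."""
    if not preliminary_scores:
        return []
    best = max(preliminary_scores.values())
    # Step 1 (assumed role of ct): decide which contigs are promising.
    promising = [c for c, s in preliminary_scores.items() if s >= ct * best]
    # Step 2 (role of p): limit how many of the computed predictions are output.
    promising.sort(key=lambda c: preliminary_scores[c], reverse=True)
    return promising[:p]


scores = {"contigA": 100, "contigB": 60, "contigC": 30}  # made-up scores
print(toy_select(scores, ct=0.5, p=1))    # ['contigA']
print(toy_select(scores, ct=0.5, p=100))  # ['contigA', 'contigB']
print(toy_select(scores, ct=0.2, p=100))  # ['contigA', 'contigB', 'contigC']
```

In this toy model, raising p from 1 to 100 adds only the one contig that already passed the threshold; the third contig appears only after ct has been lowered.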

3) Running GeMoMa on a single contig of my assembly yields thousands of weird predictions. What went wrong?
By default, GeMoMa is not built to be run on a single contig. GeMoMa tries to make predictions for all given reference CDS in the given target sequence(s). If the given target sequence is only a fraction of the complete target genome/assembly, GeMoMa will produce weird predictions, as it does not filter for the quality of the predictions internally. There are two options to handle this:
	* Use a list of gene models that you expect to be located on this contig (cf. parameter "selected").
	* Filter the predictions using GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>).

4) Is it mandatory to use RNA-seq data?
No, GeMoMa is able to make predictions with and without RNA-seq evidence.

5) Is it possible to use multiple reference organisms?
It is possible to use multiple reference organisms for GeMoMa. Just run GeMoMa on each reference organism separately. Finally, you can employ GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to combine these annotations.

6) Why do some reference genes not lead to a prediction in the target genome?
Please first check whether your reference genes have been discarded by the Extractor (cf. assignment file).
If the genes have been discarded, there are two possibilities:
	* The CDS might be redundant, i.e. the coding exons are identical to those of another transcript. In this case, only one CDS is further evaluated.
	* There might be something wrong with your reference genes, e.g., missing start codon, missing stop codon, premature stop codon, ambiguous nucleotides, ... In this case, you should check the options of the Extractor or the annotation.
If the reference genes passed the Extractor, there are several possible explanations for this behavior. The two most prominent are:
	* GeMoMa stopped the prediction for a reference gene because it did not return a result within the given time (cf. parameter "timeout").
	* GeMoMa simply did not find a prediction matching the remaining quality criteria.

7) For two different reference transcripts, the predictions of GeMoMa overlap or are identical. What should I do with those?
GeMoMa makes the predictions for each reference transcript independently. Hence, some predictions of different reference transcripts may overlap or be identical, especially in gene families. Typically, you might like to filter or rank these predictions. We have implemented GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to do this automatically. However, you can also do it by hand using the GFF attributes. Using RNA-seq data in GeMoMa yields additional fields in the annotation that can be used, e.g., average coverage (avgCov).
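
Ranking by hand could look like the following Python sketch. It is not a replacement for GAF. It assumes that the attributes column uses the key=value;key=value layout and that each prediction line carries an "ID" attribute; the feature type "mRNA" and the toy lines are made up for illustration, so please adapt them to your own output.

```python
def parse_attributes(column9):
    """Parse an attributes column of the form key=value;key=value into a dict."""
    attributes = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attributes[key.strip()] = value.strip()
    return attributes


def rank_by_attribute(gff_lines, feature_type, key):
    """Return (ID, value) pairs of the given feature type, highest value first."""
    ranked = []
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and empty lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 9 or columns[2] != feature_type:
            continue
        attributes = parse_attributes(columns[8])
        if key in attributes:
            ranked.append((attributes.get("ID", "?"), float(attributes[key])))
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked


# Toy lines, made up for illustration only.
lines = [
    "chr1\tGeMoMa\tmRNA\t100\t900\t.\t+\t.\tID=pred_1;avgCov=3.5",
    "chr1\tGeMoMa\tmRNA\t120\t900\t.\t+\t.\tID=pred_2;avgCov=12.0",
]
print(rank_by_attribute(lines, "mRNA", "avgCov"))  # [('pred_2', 12.0), ('pred_1', 3.5)]
```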

8) A lot of transcripts have been filtered out by the Extractor. What can I do?
There are several reasons why the Extractor removes transcripts. In at least two cases, you can try to retain more transcripts by setting specific parameter values. First, if the transcript contains ambiguous nucleotides, please test the parameter "Ambiguity". Second, we have sometimes received GFFs that contain wrong phases for CDS entries (e.g., 0 for all CDS entries in the phase column of the GFF). Since version 1.3.2, we provide the option "r", which stands for repair. If r=true is chosen, the Extractor tries to infer all phases for transcripts that show an error and would otherwise be filtered out.
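
To illustrate what inferring phases means, the following Python sketch recomputes the phase column from the lengths of the CDS parts, following the GFF3 definition of phase (the number of bases to skip at the start of a CDS part to reach the next complete codon). It is not the Extractor's repair code: the function infer_phases is made up, and it assumes a complete CDS whose first part starts with the first base of the start codon.

```python
def infer_phases(cds_parts, strand):
    """Infer GFF3 phases for the CDS parts of one transcript.

    cds_parts: list of (start, end) tuples, 1-based and inclusive.
    Returns (start, end, phase) tuples in transcription order.
    """
    # On the minus strand, transcription runs from high to low coordinates.
    ordered = sorted(cds_parts, reverse=(strand == "-"))
    phase = 0  # assumption: the CDS starts with a complete codon
    result = []
    for start, end in ordered:
        result.append((start, end, phase))
        length = end - start + 1
        # Bases left over after the last complete codon of this part
        # determine how many bases the next part must contribute.
        phase = (3 - (length - phase) % 3) % 3
    return result


print(infer_phases([(1, 10), (21, 30), (41, 44)], "+"))
# [(1, 10, 0), (21, 30, 2), (41, 44, 1)]
print(infer_phases([(1, 4), (21, 30), (41, 50)], "-"))
# [(41, 50, 0), (21, 30, 2), (1, 4, 1)]
```

A file with phase 0 for all CDS entries would disagree with these values for the second and third part, which is the kind of error described above.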

9) Is GeMoMa able to predict pseudo-genes/ncRNA?
No, currently not.

10) My RNA-seq data indicates there is an additional intron in a transcript, but GeMoMa does not predict this. Or vice versa, GeMoMa predicts an intron that is not supported by RNA-seq data. What's the reason?
GeMoMa is mainly based on the assumptions of amino acid and intron position conservation between reference and target species. Hence, GeMoMa tries to predict a gene model with a similar exon-intron structure in the target species and does not rely too strictly on RNA-seq data. Although intron position conservation can be observed in most cases, sometimes new introns evolve or others vanish. For this reason, GeMoMa also allows for the inclusion or exclusion of introns at some additional cost (cf. GeMoMa parameter intron-loss-gain-penalty). However, the behaviour of GeMoMa depends on the parameter settings (especially intron-loss-gain-penalty, sm (substitution matrix), go (gap opening), ge (gap extension)) and the length of the missed/additional intron. Nevertheless, such cases can only occur if the additional/missed intron has a length that is divisible by 3, which preserves the reading frame.
Since the available RNA-seq data only reflects a fraction of tissues/environmental conditions/..., missing RNA-seq evidence does not necessarily mean that the prediction is wrong.
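
The length condition mentioned above can be checked with a small Python sketch. The function name and the coordinates are made up for illustration; the check only tests the length and says nothing about the costs that GeMoMa assigns.

```python
def preserves_reading_frame(intron_start, intron_end):
    """True if the intron length (1-based, inclusive coordinates) is divisible by 3."""
    return (intron_end - intron_start + 1) % 3 == 0


print(preserves_reading_frame(101, 190))  # length 90 -> True
print(preserves_reading_frame(101, 188))  # length 88 -> False
```

Only for introns of the first kind does the downstream part of the CDS stay in the same reading frame regardless of whether the intron is predicted or not.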

11) My RNA-seq data indicates two alternative, highly overlapping introns. Interestingly, GeMoMa does not take the intron that is more abundant. Why?
GeMoMa reads the introns from the input file using a filter (cf. GeMoMa parameter r (reads)). All introns that pass the filter are used and treated equally, regardless of their abundance. Hence, GeMoMa uses the intron that best matches the expectation of intron position and amino acid conservation compared to the reference transcript.


For any further questions or comments please contact jens.keilwagen@julius-kuehn.de 