Computational methods for gene finding in genomic DNA significantly accelerate the analysis of experimental data by providing almost instant insight into its biological meaning. However, the danger of errors contaminating databases is growing as more computationally annotated genomes become available and as computer-annotated genes and proteins are used to annotate new genomes. The complete C. elegans genome was annotated with the aid of the computer program GeneFinder (P. Green and L. Hillier, unpublished). Although we assume that the accuracy of GeneFinder is high, we sought a more precise evaluation of this method as part of our project to create experimentally verified training and test sets of genes for several eukaryotic genomes. This line of work began in 1996 with a project that combined GeneMark [1] predictions with their experimental verification to determine the exact exon-intron structure of C. elegans
unc-89 [2]. Another goal of our project has been to improve the accuracy of the gene finding method GeneMark.hmm [3], which was shown to be highly accurate for the eukaryotic genome of Arabidopsis thaliana [4]. To assess the accuracy of gene prediction methods, we have generated a database of experimentally verified genes by matching genomic DNA with recently sequenced mRNA sequences available in GenBank. For C. elegans, this new set is used for comparison with the original annotation (GeneFinder) as well as for training and testing GeneMark.hmm. The results will be given in our presentation.
[1] Borodovsky, M. and McIninch, J. 1993. GeneMark: gene prediction of both DNA strands. Computers & Chemistry 17:123-133.
[2] Benian, G., Tinley, T., Tang, X., and Borodovsky, M. 1996. The Caenorhabditis elegans gene
unc-89, required for muscle M-line assembly, encodes a giant modular protein composed of Ig and signal transduction domains. Journal of Cell Biology 6:835-848.
[3] Lukashin, A. and Borodovsky, M. 1998. GeneMark.hmm: new solutions for gene finding. Nucleic Acids Research 26:1107-1115.
[4] Pavy, N., Rombauts, S., Dehais, P., Mathe, C., Ramana, D.V.V., Leroy, P., and Rouze, P. 1999. Evaluation of gene prediction software using a genomic data set: application to Arabidopsis thaliana sequences. Bioinformatics 15:887-899.