As more worm genes are sequenced, considerable variability in the characteristics of both exons and introns at the DNA level is becoming apparent. Features of interest include:

Length. Lengths of both exons and introns range from a few tens of bases to around 10 kb. Short exons are very common. While the peak of the intron size distribution is still at 50-52 nucleotides, introns longer than 1 kb are now fairly common.
Base composition. Exons are richer in C and G than introns, but the average difference in composition between exons and introns is decreasing as more sequences become available. Long introns often have islands of relatively high C+G content. The dinucleotides AA and TT are, however, much more common than CC or GG in introns.
Codon usage. Codon preferences differ significantly between gene families. There is a general trend toward less asymmetrical codon preferences in weakly expressed gene families; however, there are also strong codon preference reversals in some gene families, e.g. glp-1/lin-12.
Sites. 5' splice-site sequences differ significantly between long and short introns. This variation, first observed in C. elegans, has now been noted in Drosophila and plants, and may be ubiquitous. 3' splice-site sequences appear not to vary with intron length.
Complexity. Regulatory genes appear to contain larger numbers of potential splice sites, and to have less asymmetric base compositions between exons and introns, than highly expressed structural-protein genes. The efficiency of correct splice-site selection may be affected by these features.

The variability in these features significantly increases the difficulty of gene identification by sequence analysis. We have used C. elegans sequences extensively for testing our gm automated DNA sequence analysis software. A number of new functions have been incorporated into gm v. 2.0 to provide greater accuracy of predictions in the face of compositional and site variations between genes. These include more accurate site-identification methods, which combine consensus-matrix and compositional methods, and flexible multinucleotide compositional measures. Functions have also been added that allow automated use of partial cDNA data for genomic sequence analysis. The performance of gm v. 2.0 is currently being evaluated on a large set of worm genes.
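
The measures discussed above can be illustrated with a short Python sketch. It is not the gm implementation, whose internals are not described here. It only shows, under stated assumptions, how C+G content and dinucleotide counts might be computed and how a consensus-matrix score might be combined with a compositional term when scoring a candidate 5' splice site. The function names, the toy frequency matrix, the window size, and the weighting are all hypothetical choices made for illustration.

```python
import math
from collections import Counter

# Hypothetical per-position base frequencies for a 4-base donor-site motif.
# These numbers are invented for illustration, not measured from worm genes.
TOY_DONOR_MATRIX = [
    {"A": 0.10, "C": 0.10, "G": 0.70, "T": 0.10},
    {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
    {"A": 0.60, "C": 0.10, "G": 0.20, "T": 0.10},
    {"A": 0.60, "C": 0.10, "G": 0.10, "T": 0.20},
]


def gc_fraction(seq):
    """Return the fraction of C and G bases in seq (0.0 if seq is empty)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("C") + seq.count("G")) / len(seq)


def dinucleotide_counts(seq):
    """Count overlapping dinucleotides, so 'AAA' contains AA twice."""
    seq = seq.upper()
    return Counter(seq[i:i + 2] for i in range(len(seq) - 1))


def matrix_log_odds(site, matrix, background=0.25):
    """Sum per-position log2 odds of site against a uniform background."""
    if len(site) != len(matrix):
        raise ValueError("site length must match matrix length")
    return sum(
        math.log2(matrix[i][base] / background)
        for i, base in enumerate(site.upper())
    )


def combined_site_score(seq, pos, matrix, window=20, weight=1.0):
    """Matrix score at pos plus a weighted compositional term.

    The compositional term is the C+G fraction of the upstream window
    (putative exon) minus that of the downstream window (putative intron).
    The simple additive combination is an assumption of this sketch.
    """
    end = pos + len(matrix)
    upstream = seq[max(0, pos - window):pos]
    downstream = seq[end:end + window]
    composition = gc_fraction(upstream) - gc_fraction(downstream)
    return matrix_log_odds(seq[pos:end], matrix) + weight * composition


if __name__ == "__main__":
    # Toy sequence: C+G-rich "exon", candidate site, A+T-rich "intron".
    toy = "GCGCGCGCGC" + "GTAA" + "ATATATATAT"
    print(gc_fraction(toy))
    print(dinucleotide_counts("ATATATATAT").most_common(2))
    print(combined_site_score(toy, 10, TOY_DONOR_MATRIX, window=10))
```

In a sketch like this, a candidate site flanked by a C+G-rich upstream region and an A+T-rich downstream region scores higher than the matrix alone would give. When exon and intron compositions are similar, as reported above for regulatory genes, the compositional term contributes little, which is one way to see why such genes are harder to analyze.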