[International Worm Meeting, 2021]
Evaluating and interpreting large data sets requires tools that identify statistically enriched characteristics of the data and visualize the results. We developed WormCat as a web-based tool to find enriched gene categories in RNA sequencing (RNA-seq) data and to produce graphs that allow intuitive comparison of multiple data sets. The tool uses a set of nested categories to annotate nearly all C. elegans genes. Annotation is based first on physiological function; if that cannot be assigned, the gene is assigned a category based on molecular function or cellular location. Genes with no predicted function are placed in an "Unassigned" category. This contrasts with commonly used Gene Ontology (GO) servers, which typically exclude the roughly 30% of C. elegans genes that lack GO annotations. Excluding them alters category enrichment statistics, because genes of unknown function are removed while functionally annotated genes are retained. We previously found that WormCat identified enriched gene sets not predicted by commonly used GO servers. In WormCat 2.0, we have enhanced the capabilities of the web-based tool, updated the annotation list, and included an annotation list specific to protein-coding genes for proteomics searches. We have also performed validation on additional data sets, comparing inclusion and exclusion of the "Unassigned" genes. First, we found that although the categories with the highest enrichment values were unchanged by excluding the "Unassigned" genes, enrichments with scores closer to the significance threshold were lost. Enrichment tools such as WormCat are used to predict gene sets of interest for further experimental analysis. Thus, including all genes in the hypothesis space, regardless of our ability to functionally annotate them, is important for obtaining the most appropriate enrichment scores.
Finally, in an analysis of published tissue-specific RNA-seq data sets, we found that enrichment of "Unassigned" genes was not uniformly distributed among tissues. This suggests that identifying these gene sets through enrichment scoring may stimulate exploration of their function.
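
The effect of including or excluding "Unassigned" genes on an enrichment score can be illustrated with a small calculation. The sketch below is not WormCat's code. It assumes a one-sided hypergeometric test (equivalent to a one-sided Fisher's exact test), which the abstract does not specify. The function name `enrichment_p` and all gene counts are hypothetical and chosen only for illustration.

```python
from math import comb


def enrichment_p(k: int, n: int, K: int, N: int) -> float:
    """Upper-tail hypergeometric probability P(X >= k).

    N: genes in the background (the hypothesis space)
    K: background genes that belong to the category
    n: genes in the regulated set
    k: regulated genes that belong to the category
    """
    total = comb(N, n)
    upper = min(n, K)
    # Exact integer sum, converted to a float only at the final division.
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total


# Hypothetical counts (assumptions, not taken from any real data set).
GENOME = 20_000              # all genes in the background
UNASSIGNED = 6_000           # genes with no predicted function (30%)
CATEGORY = 200               # genes in one annotated category
REGULATED = 500              # genes in the regulated set
REGULATED_UNASSIGNED = 50    # regulated genes that are "Unassigned"
HITS = 10                    # regulated genes that fall in the category

# Background and regulated set keep every gene.
p_all = enrichment_p(HITS, REGULATED, CATEGORY, GENOME)

# "Unassigned" genes dropped from both the background and the regulated set.
p_annotated = enrichment_p(
    HITS,
    REGULATED - REGULATED_UNASSIGNED,
    CATEGORY,
    GENOME - UNASSIGNED,
)

print(f"all genes included:   p = {p_all:.4g}")
print(f"'Unassigned' removed: p = {p_annotated:.4g}")
```

With these invented counts, removing the "Unassigned" genes raises the expected number of category hits from 5 to about 6.4, so the same 10 observed hits give a larger p-value. The direction and size of the shift depend on how many "Unassigned" genes the regulated set contains, so this toy case shows only that the choice of background can move a borderline category across a significance threshold.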