-
[IEEE/ACM Trans Comput Biol Bioinform, 2017]
De novo genome assembly describes the process of reconstructing an unknown genome from a large collection of short (or long) reads sequenced from the genome. A single run of a Next-Generation Sequencing (NGS) technology can produce billions of short reads, making genome assembly computationally demanding (both in terms of memory and time). One of the major computational steps in modern short read assemblers involves the construction and use of a string data structure called the de Bruijn graph. In fact, a majority of short read assemblers build the complete de Bruijn graph for the set of input reads, and subsequently traverse and prune low-quality edges, in order to generate genomic "contigs", the output of assembly. These steps of graph construction and traversal contribute to well over 90% of the runtime and memory. In this paper, we present a fast algorithm, FastEtch, that uses sketching to build an approximate version of the de Bruijn graph for the purpose of generating an assembly. The algorithm uses Count-Min sketch, which is a probabilistic data structure for streaming data sets. The result is an approximate de Bruijn graph that stores information pertaining only to a selected subset of nodes that are most likely to contribute to the contig generation step. In addition, edges are not stored; instead, the fraction of edges that contributes to contig generation is detected on the fly. This approximate approach is intended to significantly improve performance (both execution time and memory footprint) whilst possibly compromising on the output assembly quality. We present two main versions of the assembler: one that generates an assembly in which each contig represents a contiguous genomic region from one strand of the DNA, and another that generates an assembly in which the contigs can straddle either of the two strands of the DNA. For further scalability, we have implemented a multi-threaded parallel code. Experimental results using our algorithm conducted on E. coli, Yeast, C. elegans and Human (Chr2 and Chr2+3) genomes show that our method yields one of the best time-memory-quality tradeoffs, when compared against many state-of-the-art genome assemblers.
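To make the sketching idea in the abstract above concrete, here is a minimal Python sketch of a Count-Min sketch used to pick frequent k-mers as graph nodes, with edges detected on the fly rather than stored. It is not FastEtch's implementation: the class and function names, the table dimensions, the salted BLAKE2b hashing, the `min_count` threshold, and the two-pass structure are all assumptions chosen for illustration.

```python
import hashlib


class CountMinSketch:
    """Fixed-size counter table; estimates can overcount but never undercount."""

    def __init__(self, width=2048, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One salted hash per row stands in for independent hash functions.
        digest = hashlib.blake2b(
            item.encode(), digest_size=8, salt=row.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += 1

    def estimate(self, item):
        # The minimum over rows is the least collision-inflated counter.
        return min(
            self.table[row][self._index(item, row)] for row in range(self.depth)
        )


def kmers(read, k):
    """All overlapping substrings of length k in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def select_nodes(reads, k, min_count, sketch):
    """Count every k-mer in the sketch, then keep those estimated frequent.

    Assumption: a second pass over the reads is acceptable here; a streaming
    implementation would decide membership differently.
    """
    for read in reads:
        for kmer in kmers(read, k):
            sketch.add(kmer)
    return {
        kmer
        for read in reads
        for kmer in kmers(read, k)
        if sketch.estimate(kmer) >= min_count
    }


def successors(kmer, nodes):
    """Detect outgoing edges on the fly: try each one-base extension."""
    return [kmer[1:] + base for base in "ACGT" if kmer[1:] + base in nodes]
```

For example, `select_nodes(["ACGTACGT", "CGTACGTA"], 4, 2, CountMinSketch())` keeps the four k-mers that occur at least twice, and `successors("ACGT", nodes)` then finds `CGTA` without any stored edge list. The memory cost of the sketch is fixed by `width * depth` regardless of how many distinct k-mers the reads contain, which is the trade the abstract describes.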
-
[Bioinformatics, 2015]
We introduce FinisherSC, a repeat-aware and scalable tool for upgrading de novo assembly using long reads. Experiments with real data suggest that FinisherSC can provide longer and higher quality contigs than existing tools while maintaining high concordance. AVAILABILITY AND IMPLEMENTATION: The tool and data are available and will be maintained at http://kakitone.github.io/finishingTool/ CONTACT: dntse@stanford.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
-
Ogura Y, Okuno M, Kohara Y, Hayashi T, Toshimoto K, Fujiyama A, Maruyama H, Harada M, Nagayasu E, Noguchi H, Itoh T, Yabana M, Kajitani R, Toyoda A
[Genome Res, 2014]
Although many de novo genome assembly projects have recently been conducted using high-throughput sequencers, assembling highly heterozygous diploid genomes is a substantial challenge due to the increased complexity of the de Bruijn graph structure predominantly used. To address the increasing demand for sequencing of nonmodel and/or wild-type samples, in most cases inbred lines or fosmid-based hierarchical sequencing methods are used to overcome such problems. However, these methods are costly and time consuming, forfeiting the advantages of massive parallel sequencing. Here, we describe a novel de novo assembler, Platanus, that can effectively manage high-throughput data from heterozygous samples. Platanus assembles DNA fragments (reads) into contigs by constructing de Bruijn graphs with automatically optimized k-mer sizes followed by the scaffolding of contigs based on paired-end information. The complicated graph structures that result from the heterozygosity are simplified during not only the contig assembly step but also the scaffolding step. We evaluated the assembly results on eukaryotic samples with various levels of heterozygosity. Compared with other assemblers, Platanus yields assembly results that have a larger scaffold NG50 length without any accompanying loss of accuracy in both simulated and real data. In addition, Platanus recorded the largest scaffold NG50 values for two of the three low-heterozygosity species used in the de novo assembly contest, Assemblathon 2. Platanus therefore provides a novel and efficient approach for the assembly of gigabase-sized highly heterozygous genomes and is an attractive alternative to the existing assemblers designed for genomes of lower heterozygosity.
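The NG50 metric cited in the abstract above can be computed with a few lines of Python. This is a minimal sketch under the usual definition: the length of the shortest sequence in the smallest set of longest sequences that together cover half of the estimated genome size. N50 is the same calculation with the total assembly length in place of the genome size. The function name `nx50` and its parameters are hypothetical.

```python
def nx50(lengths, genome_size=None):
    """Return N50 of the given lengths, or NG50 when genome_size is supplied.

    Returns 0 if the sequences never reach half of the reference size,
    which can happen for NG50 when the assembly is much smaller than the genome.
    """
    reference = genome_size if genome_size is not None else sum(lengths)
    target = reference / 2
    running = 0
    # Walk from the longest sequence down, accumulating covered bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length
    return 0
```

With scaffold lengths `[80, 70, 50, 40, 30, 20]`, `nx50(lengths)` gives the N50 and `nx50(lengths, genome_size=400)` gives the NG50. Because NG50 is normalized by a common genome size, it allows assemblies of different total length to be compared directly.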
-
[Genome Res, 2017]
Advances in long read single molecule sequencing have opened new possibilities for 'benchtop' whole genome sequencing. The Oxford Nanopore Technologies MinION is a portable device that uses nanopore technology that can directly sequence DNA molecules. MinION single molecule long sequence reads are well suited for de novo assembly of complex genomes as they facilitate the construction of highly contiguous physical genome maps, obviating the need for labor-intensive physical genome mapping. Long sequence reads can also be used to delineate complex chromosomal rearrangements, such as those that occur in tumour cells, that can confound analysis using short reads. Here, we assessed MinION long read-derived sequences for feasibility concerning: 1) the de novo assembly of a large complex genome and 2) the elucidation of complex rearrangements. The genomes of two Caenorhabditis elegans strains, a wild type strain and a strain containing two complex rearrangements, were sequenced with MinION. Up to 42-fold coverage was obtained from a single flowcell and the best pooled data assembly produced a highly contiguous wild type C. elegans genome containing 48 contigs (N50 contig length = 3.99 Mb) covering >99% of the 100,286,401 base reference genome. Further, the MinION-derived genome assembly expanded the C. elegans reference genome by >2 Mb due to a more accurate determination of repetitive sequence elements, and assembled the complete genomes of two co-extracted bacteria. MinION long read sequence data also facilitated the elucidation of complex rearrangements in a mutagenized strain. The sequence accuracy of the MinION long read contigs (~98%) was improved using Illumina-derived sequence data to polish the final genome assembly to 99.8% nucleotide accuracy when compared to the reference assembly.
-
[Nucleic Acids Res, 2004]
Pristionchus pacificus is a free-living nematode of the Diplogastridae family and was recently developed as a satellite system in evolutionary developmental biology. AppaDB, a P. pacificus database, was created (http://appadb.eb.tuebingen.mpg.de) to integrate the genomic data of P. pacificus, comprising the physical map, genetic linkage map, EST and BAC end sequence and hybridization data. This developing database serves as a repository to search and find any information regarding physical contigs or genetic markers required for mapping of mutants. Additionally, it provides a platform for the Caenorhabditis elegans community to compare nematode genetic data in an evolutionary perspective.
-
[Genome Res, 2019]
Long-read sequencing technologies have contributed greatly to comparative genomics among species and can also be applied to study genomics within a species. In this study, to determine how substantial genomic changes are generated and tolerated within a species, we sequenced a C. elegans strain, CB4856, which is one of the most genetically divergent strains compared to the N2 reference strain. For this comparison, we used the Pacific Biosciences (PacBio) RSII platform (80x, N50 read length 11.8 kb) and generated a de novo genome assembly to the level of pseudochromosomes containing 76 contigs (N50 contig = 2.8 Mb). We identified structural variations that affected as many as 2694 genes, most of which are at chromosome arms. Subtelomeric regions contained the most extensive genomic rearrangements, which even created new subtelomeres in some cases. The subtelomere structure of Chromosome VR implies that ancestral telomere damage was repaired by alternative lengthening of telomeres even in the presence of a functional telomerase gene and that a new subtelomere was formed by break-induced replication. Our study demonstrates that substantial genomic changes including structural variations and new subtelomeres can be tolerated within a species, and that these changes may accumulate genetic diversity within a species.
-
[Theor Appl Genet, 2002]
An insertion-sequence of prokaryotic origin was detected in a genomic clone obtained from a Phaseolus vulgaris bacterial artificial chromosome (BAC) library. This BAC clone, characterized as part of a contig constructed near a virus resistance gene, exhibited restriction fragment length polymorphism with an overlapping clone of the contig. Restriction analysis of DNA obtained from individual colonies of the stock culture indicated the presence of a mixed population of wild-type and insertional mutants. Sequence analysis of both members of the population revealed the presence of IS10R, an insertion-sequence from Escherichia coli. A BLAST search for IS10-like sequences detected unexpected homologies with a large number of eukaryotic sequences from Homo sapiens, Arabidopsis thaliana, Drosophila melanogaster and Caenorhabditis elegans. Southern analysis of a random sample of BAC clones failed to detect IS10 in the BAC DNA. However, prolonged sub-culturing of a set of 15 clones resulted in transposition into the BAC DNA. Eventually, all cultures acquired a 2.3-kb fragment that hybridized strongly with IS10. Sequence analysis revealed the presence of a preferred site for transposition in the BAC vector. These results indicate that a large number, if not all, of the BAC libraries from different organisms are contaminated with IS10R. The source of this element has been identified as the DH10B strain of E. coli used as the host for BAC libraries.
-
[Pathog Dis, 2017]
The draft genome assembly of the Wolbachia endosymbiont of Wuchereria bancrofti (wWb) consists of 1,060,850 bp in 100 contigs and contains 961 ORFs, with a single copy of the 5S rRNA, 16S rRNA, and 23S rRNA and each of the 34 tRNA genes. Phylogenetic core genome analyses show wWb to cluster with other strains in supergroup D of the Wolbachia phylogeny, while being most closely related to the Wolbachia endosymbiont of Brugia malayi strain TRS (wBm). The wWb and wBm genomes share 779 orthologous clusters with wWb having 101 unclustered genes and wBm having 23 unclustered genes. The higher number of unclustered genes in the wWb genome likely reflects the fragmentation of the draft genome.
-
Buchberg AM, Ikeda J, Minoguchi S, Takahashi Y, Honjo T, Habu S, Kurooka H, Moriwaki K, Shisa H, Kato K, Osawa N
[Genomics, 1997]
In a yeast artificial chromosome contig close to the nude locus on mouse chromosome 11, we identified a novel gene, nucleoredoxin, that encodes a protein with similarity to the active site of thioredoxins. Nucleoredoxin is conserved between mammalian species, and two homologous genes were found in Caenorhabditis elegans. The nucleoredoxin transcripts are expressed in all adult tissues examined, but restricted to the nervous system and the limb buds in Day 10.5-11.5 embryos. The nucleoredoxin protein is predominantly localized in the nucleus of cells transfected with the nucleoredoxin expression construct. Since the bacterially expressed protein of nucleoredoxin showed oxidoreductase activity of the insulin disulfide bonds with kinetics similar to that of thioredoxin, it may be a redox regulator of the nuclear proteins, such as transcription factors.
-
[Infect Genet Evol, 2010]
Trichostrongylus colubriformis (Strongylida), a small intestinal nematode of small ruminants, is a major cause of production and economic losses in many countries. The aims of the present study were to define the transcriptome of the adult stage of T. colubriformis, using 454 sequencing technology and bioinformatic analyses, and to predict the main pathways that key groups of molecules are linked to in this nematode. A total of 21,259 contigs were assembled from the sequence data produced from a normalized cDNA library; 7876 of these contigs had known orthologues in the free-living nematode Caenorhabditis elegans, and encoded, amongst others, proteins with 'transthyretin-like' (8.8%), 'RNA recognition' (8.4%) and 'metridin-like ShK toxin' (7.6%) motifs. Bioinformatic analyses inferred that relatively high proportions of the C. elegans homologues are involved in biological pathways linked to 'peptidases' (4%), 'ribosome' (3.6%) and 'oxidative phosphorylation' (3%). Highly represented were peptides predicted to be associated with the nervous system, digestion of host proteins or inhibition of host proteases. Probabilistic functional gene networking of the complement of C. elegans orthologues (n=2126) assigned significance to particular subsets of molecules, such as protein kinases and serine/threonine phosphatases. The present study represents the first, comprehensive insight into the transcriptome of adult T. colubriformis, which provides a foundation for fundamental studies of the molecular biology and biochemistry of this parasitic nematode as well as prospects for identifying targets for novel nematocides. Future investigations should focus on comparing the transcriptomes of different developmental stages, both genders and various tissues of this parasitic nematode for the prediction of essential genes/gene products that are specific to nematodes.