-
[
J Am Soc Mass Spectrom,
2015]
De novo sequencing software has been widely used in proteomics to sequence new peptides from tandem mass spectrometry data. This study presents a new software tool, Novor, to greatly improve both the speed and accuracy of today's peptide de novo sequencing analyses. To improve the accuracy, Novor's scoring functions are based on two large decision trees built by machine learning from a peptide spectral library of more than 300,000 spectra. Important knowledge about peptide fragmentation is extracted automatically from the library and incorporated into the scoring functions. The decision tree model also enables efficient score calculation and contributes to the speed improvement. To further improve the speed, a two-stage algorithmic approach, namely dynamic programming and refinement, is used. The software program was also carefully optimized. On the testing datasets, Novor sequenced 7%-37% more correct residues than the state-of-the-art de novo sequencing tool, PEAKS, while being an order of magnitude faster. Novor can de novo sequence more than 300 MS/MS spectra per second on a laptop computer. This speed surpasses the acquisition speed of today's mass spectrometers and, therefore, opens a new possibility to de novo sequence in real time while the spectrometer is acquiring the spectral data.
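The dynamic programming stage mentioned in the abstract above can be illustrated with a toy example. The sketch below is not Novor's algorithm: it assumes integer (nominal) residue masses, a five-letter alphabet, and a placeholder score of one point per matched prefix-mass peak, whereas Novor scores with learned decision trees and follows the dynamic programming pass with a refinement stage. The function name `sequence_by_dp` and the `RESIDUE_MASS` table are hypothetical.

```python
from typing import Dict, Iterable, List, Optional

# Nominal (integer) residue masses in Da for a small illustrative alphabet.
# Assumption: a real tool uses all 20 residues and accurate masses.
RESIDUE_MASS: Dict[str, int] = {"G": 57, "A": 71, "S": 87, "V": 99, "K": 128}


def sequence_by_dp(prefix_peaks: Iterable[int], total_mass: int) -> Optional[str]:
    """Return the residue string whose prefix masses hit the most peaks.

    best[m] is the best score of any residue chain with total mass m;
    back[m] remembers the last residue of that chain for backtracking.
    """
    peaks = set(prefix_peaks)
    unreachable = float("-inf")
    best: List[float] = [unreachable] * (total_mass + 1)
    back: List[Optional[str]] = [None] * (total_mass + 1)
    best[0] = 0

    for mass in range(1, total_mass + 1):
        # Placeholder scoring: one point when a peak sits at this prefix mass.
        reward = 1 if mass in peaks else 0
        for residue, weight in RESIDUE_MASS.items():
            if weight <= mass and best[mass - weight] != unreachable:
                candidate = best[mass - weight] + reward
                if candidate > best[mass]:
                    best[mass] = candidate
                    back[mass] = residue

    if best[total_mass] == unreachable:
        return None  # no residue chain adds up to the requested mass

    # Walk the back-pointers from the full mass down to zero.
    residues: List[str] = []
    mass = total_mass
    while mass > 0:
        residue = back[mass]
        residues.append(residue)
        mass -= RESIDUE_MASS[residue]
    return "".join(reversed(residues))


if __name__ == "__main__":
    # Peaks at 57 (G) and 128 (G+A); the full peptide weighs 215 (G+A+S).
    print(sequence_by_dp([57, 128], 215))
```

Each mass is visited once and each residue is tried once per mass, so the running time is proportional to the peptide mass times the alphabet size, which is one reason dynamic programming suits a fast first pass.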
-
[
PLoS Genet,
2017]
Density-Enhanced Phosphatase-1 (DEP-1) de-phosphorylates various growth factor receptors and adhesion proteins to regulate cell proliferation, adhesion and migration. Moreover, dep-1/scc1 mutations have been detected in various types of human cancers, indicating a broad tumor suppressor activity. During C. elegans development, DEP-1 mediates binary cell fate decisions by negatively regulating EGFR signaling. Using a substrate-trapping DEP-1 mutant in a proteomics approach, we have identified the C. elegans β-integrin subunit PAT-3 as a specific DEP-1 substrate. DEP-1 selectively de-phosphorylates tyrosine 792 in the membrane-proximal NPXY motif to promote integrin activation via talin recruitment. The non-phosphorylatable β-integrin mutant pat-3(Y792F) partially suppresses the hyperactive EGFR signaling phenotype caused by loss of dep-1 function. Thus, DEP-1 attenuates EGFR signaling in part by de-phosphorylating Y792 in the β-integrin cytoplasmic tail, besides the direct de-phosphorylation of the EGFR. Furthermore, in vivo FRAP analysis indicates that the β-integrin/talin complex attenuates EGFR signaling by restricting receptor mobility on the basolateral plasma membrane. We propose that DEP-1 regulates EGFR signaling via two parallel mechanisms, by direct receptor de-phosphorylation and by restricting receptor mobility through β-integrin activation.
-
Stegmann APA, Bonati MT, Panis B, Smith-Hicks C, Lemke JR, Pepler A, Wilson C, Iascone M, McWalter K, Brasington C, Allen W, Di Donato N, Platzer K, Ramos L, Edwards SL, Jamra R, Gamble CN, Mandel H, Stobe P, Mahida S, Marquardt T, Demmer LA, Miller KG, Falik-Zaccai T, Pinz H, Hellenbroich Y, Sticht H, Kok F, Cho MT, Stumpel CTRM, Shinde DN, Angione KM
[
Am J Hum Genet,
2018]
Using exome sequencing, we have identified de novo variants in MAPK8IP3 in 13 unrelated individuals presenting with an overlapping phenotype of mild to severe intellectual disability. The de novo variants comprise six missense variants, three of which are recurrent, and three truncating variants. Brain anomalies such as perisylvian polymicrogyria, cerebral or cerebellar atrophy, and hypoplasia of the corpus callosum were consistent among individuals harboring recurrent de novo missense variants. MAPK8IP3 has been shown to be involved in the retrograde axonal-transport machinery, but many of its specific functions are yet to be elucidated. Using the CRISPR-Cas9 system to target six conserved amino acid positions in Caenorhabditis elegans, we found that two of the six investigated human alterations led to a significantly elevated density of axonal lysosomes, and five variants were associated with adverse locomotion. Reverse-engineering normalized the observed adverse effects back to wild-type levels. Combining genetic, phenotypic, and functional findings, as well as the significant enrichment of de novo variants in MAPK8IP3 within our total cohort of 27,232 individuals who underwent exome sequencing, we implicate de novo variants in MAPK8IP3 as a cause of a neurodevelopmental disorder with intellectual disability and variable brain anomalies.
-
[
BMC Bioinformatics,
2015]
BACKGROUND: The volume of data generated by next-generation sequencing (NGS) technologies is now a major concern for both data storage and transmission. This has triggered the need for more efficient methods than general-purpose compression tools, such as the widely used gzip. RESULTS: We present a novel reference-free method for compressing data from high-throughput sequencing technologies. Our approach, implemented in the software LEON, employs techniques derived from existing assembly principles. The method is based on a reference probabilistic de Bruijn graph, built de novo from the set of reads and stored in a Bloom filter. Each read is encoded as a path in this graph, by memorizing an anchoring k-mer and a list of bifurcations. The same probabilistic de Bruijn graph is used to perform a lossy transformation of the quality scores, which yields higher compression rates without losing pertinent information for downstream analyses. CONCLUSIONS: LEON was run on various real sequencing datasets (whole genome, exome, RNA-seq or metagenomics). In all cases, LEON showed higher overall compression ratios than state-of-the-art compression software. On a C. elegans whole genome sequencing dataset, LEON divided the original file size by more than 20. LEON is open source software, distributed under the GNU Affero GPL license, available for download at
http://gatb.inria.fr/software/leon/.
-
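The read encoding described in the LEON abstract above (an anchoring k-mer plus a list of bifurcations in a Bloom-filter-backed de Bruijn graph) can be sketched roughly as follows. This is an illustration under simplifying assumptions, not LEON's implementation: every k-mer of every read is inserted into the filter, the anchor is simply the first k-mer of the read, and sequencing errors and quality scores are ignored. All class and function names are hypothetical.

```python
import hashlib
from typing import Iterable, Iterator, List, Tuple


class BloomFilter:
    """Minimal Bloom filter: false positives are possible, false negatives are not."""

    def __init__(self, n_bits: int = 1 << 16, n_hashes: int = 3) -> None:
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8 + 1)

    def _positions(self, item: str) -> Iterator[int]:
        # Derive several hash positions by prefixing the item with a seed.
        for seed in range(self.n_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos >> 3] & (1 << (pos & 7)) for pos in self._positions(item))


def build_graph(reads: Iterable[str], k: int) -> BloomFilter:
    """Insert every k-mer of every read; graph edges stay implicit."""
    graph = BloomFilter()
    for read in reads:
        for i in range(len(read) - k + 1):
            graph.add(read[i:i + k])
    return graph


def successors(kmer: str, graph: BloomFilter) -> List[str]:
    """Bases that extend kmer to another k-mer the filter reports as present."""
    return [base for base in "ACGT" if kmer[1:] + base in graph]


def encode_read(read: str, k: int, graph: BloomFilter) -> Tuple[str, int, str]:
    """Keep the anchor k-mer, the read length, and bases chosen at bifurcations."""
    anchor, kmer, branches = read[:k], read[:k], []
    for base in read[k:]:
        if len(successors(kmer, graph)) != 1:
            branches.append(base)  # ambiguous node: the choice must be stored
        kmer = kmer[1:] + base
    return anchor, len(read), "".join(branches)


def decode_read(anchor: str, length: int, branches: str, graph: BloomFilter) -> str:
    """Replay the path, consuming a stored base only where the graph branches."""
    out, kmer, stored = list(anchor), anchor, iter(branches)
    while len(out) < length:
        options = successors(kmer, graph)
        base = options[0] if len(options) == 1 else next(stored)
        out.append(base)
        kmer = kmer[1:] + base
    return "".join(out)


if __name__ == "__main__":
    reads = ["ACGTACGGA", "ACGTTCGGA"]
    graph = build_graph(reads, k=4)
    for read in reads:
        code = encode_read(read, 4, graph)
        print(code, decode_read(*code, graph) == read)
```

Because the encoder and decoder query the same filter, a false positive only costs an extra stored base; it cannot make the decoded read differ from the original.
-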
[
Sci China Life Sci,
2019]
Orphan genes that lack detectable homologues in other lineages could contribute to a variety of biological functions. However, their origins and functional mechanisms remain largely unknown. Herein, through a comprehensive and systematic computational pipeline, we identified 893 orphan genes in the lineage of C. elegans, of which only a small fraction (0.9%) were derived from transposable elements. Six new protein-coding genes that originated de novo from non-coding DNA sequences in the genome of C. elegans were also identified. The authenticity and functionality of these orphan genes and de novo genes are supported by three lines of evidence: transcriptional data, in silico proteomic data, and fixation status data in wild populations. Orphan genes and de novo genes exhibited simple gene structures, such as short protein length and fewer exons, and were frequently X-linked. RNA-seq data analysis showed that these orphan genes are enriched for expression in embryonic development and in the gonad, and their potential function in early development was further supported by gene ontology enrichment analysis. Meanwhile, de novo genes were found to be significantly expressed in the gonad, and functional enrichment analysis of genes co-expressed with these de novo genes suggested that they may be involved in signal transduction pathways and metabolic processes. Our results present the first systematic evidence on the evolution of orphan genes and the de novo origin of genes in nematodes and their impacts on functional and phenotypic evolution, and thus could shed new light on the importance of these new genes.
-
[
Nat Commun,
2023]
High-quality genome assembly has wide applications in genetics and medical studies. However, it is still very challenging to achieve gap-free chromosome-scale assemblies using current workflows for long-read platforms. Here we report on GALA (Gap-free long-read Assembly tool), a computational framework for chromosome-based sequencing data separation and de novo assembly implemented through a multi-layer graph that identifies discordances within preliminary assemblies and partitions the data into chromosome-scale scaffolding groups. The subsequent independent assembly of each scaffolding group generates a gap-free assembly likely free from the mis-assembly errors which usually hamper existing workflows. This flexible framework also allows us to integrate data from various technologies, such as Hi-C, genetic maps, and even motif analyses to generate gap-free chromosome-scale assemblies. As a proof of principle we de novo assemble the C. elegans genome using combined PacBio and Nanopore sequencing data and a rice cultivar genome using Nanopore sequencing data from publicly available datasets. We also demonstrate the proposed method's applicability with a gap-free assembly of the human genome using PacBio high-fidelity (HiFi) long reads. Thus, our method enables straightforward assembly of genomes with multiple data sources and overcomes barriers that at present restrict the application of de novo genome assembly technology.
-
[
Vet Parasitol,
2008]
Strongyloides sp. (Nematoda) are widespread small intestinal parasites of vertebrates that can form a facultative free-living generation. Most authors have considered all Strongyloides of farm ruminants to belong to the same species, namely Strongyloides papillosus (Wedl, 1856). Here we show that, at least in southern Germany, the predominant Strongyloides found in cattle and the Strongyloides found in sheep belong to separate, genetically isolated populations. While we did find mixed infections in cattle, one form clearly dominated. This variety, in turn, was never found in sheep, indicating that the two forms have different host preferences. We also present molecular tools for distinguishing the two varieties, and an analysis of their phylogenetic relationship with the human parasite Strongyloides stercoralis and the major laboratory model species Strongyloides ratti. Based on our findings we propose that Strongyloides from sheep and the predominant Strongyloides from cattle should be considered separate species, as had already been proposed by [Brumpt, E., 1921. Recherches sur le determinisme des sexes et de l'evolution des Anguillules parasites (Strongyloides). Comptes rendu hebdomadaires des seances et memoires de la Societe de Biologie et de ses filiales 85, 149-152], a proposal that was largely ignored by later authors. For nomenclature, we follow [Brumpt, E., 1921. Recherches sur le determinisme des sexes et de l'evolution des Anguillules parasites (Strongyloides). Comptes rendu hebdomadaires des seances et memoires de la Societe de Biologie et de ses filiales 85, 149-152] and use the name S. papillosus for the Strongyloides of sheep and the name Strongyloides vituli for the predominant Strongyloides of cattle.
-
[
IEEE/ACM Trans Comput Biol Bioinform,
2017]
De novo genome assembly describes the process of reconstructing an unknown genome from a large collection of short (or long) reads sequenced from the genome. A single run of a Next-Generation Sequencing (NGS) technology can produce billions of short reads, making genome assembly computationally demanding (both in terms of memory and time). One of the major computational steps in modern-day short read assemblers involves the construction and use of a string data structure called the de Bruijn graph. In fact, a majority of short read assemblers build the complete de Bruijn graph for the set of input reads, and subsequently traverse and prune low-quality edges, in order to generate genomic "contigs", the output of assembly. These steps of graph construction and traversal contribute to well over 90% of the runtime and memory. In this paper, we present a fast algorithm, FastEtch, that uses sketching to build an approximate version of the de Bruijn graph for the purpose of generating an assembly. The algorithm uses Count-Min sketch, which is a probabilistic data structure for streaming data sets. The result is an approximate de Bruijn graph that stores information pertaining only to a selected subset of nodes that are most likely to contribute to the contig generation step. In addition, edges are not stored; instead, the fraction that contributes to contig generation is detected on-the-fly. This approximate approach is intended to significantly improve performance (both execution time and memory footprint) whilst possibly compromising on the output assembly quality. We present two main versions of the assembler: one that generates an assembly where each contig represents a contiguous genomic region from one strand of the DNA, and another that generates an assembly where the contigs can straddle either of the two strands of the DNA. For further scalability, we have implemented a multi-threaded parallel code. Experimental results using our algorithm conducted on E. coli, Yeast, C. elegans and Human (Chr2 and Chr2+3) genomes show that our method yields one of the best time-memory-quality tradeoffs, when compared against many state-of-the-art genome assemblers.
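To make the Count-Min sketch idea in the FastEtch abstract above concrete, here is a minimal sketch of counting k-mers in fixed memory and keeping only those that look frequent. It is an illustration, not FastEtch's code: the table dimensions, the two-pass structure, and the plain count threshold used to select nodes are assumptions, and all names are hypothetical.

```python
import hashlib
from typing import Iterable, Iterator, List, Set


class CountMinSketch:
    """Fixed-size counter table; estimates never fall below the true count."""

    def __init__(self, width: int = 1024, depth: int = 4) -> None:
        self.width = width
        self.depth = depth
        self.table: List[List[int]] = [[0] * width for _ in range(depth)]

    def _index(self, item: str, row: int) -> int:
        # One independent-looking hash per row, derived by prefixing the row id.
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item: str) -> int:
        # Collisions only inflate counters, so the row minimum is the tightest bound.
        return min(self.table[row][self._index(item, row)] for row in range(self.depth))


def kmers(seq: str, k: int) -> Iterator[str]:
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def frequent_kmers(reads: Iterable[str], k: int, min_count: int) -> Set[str]:
    """Select candidate graph nodes whose estimated count reaches min_count.

    Assumption: a second pass over the reads is used here for clarity;
    a streaming tool would make this decision while reading.
    """
    reads = list(reads)
    sketch = CountMinSketch()
    for read in reads:
        for kmer in kmers(read, k):
            sketch.add(kmer)
    return {kmer for read in reads for kmer in kmers(read, k)
            if sketch.estimate(kmer) >= min_count}


if __name__ == "__main__":
    reads = ["ACGTACGT", "ACGTTT"]
    print(sorted(frequent_kmers(reads, k=4, min_count=3)))
```

Memory use depends only on `width` and `depth`, not on the number of distinct k-mers, which is the property that makes this kind of sketch attractive for large read sets; the cost is that some rare k-mers may be overestimated and retained.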
-
Harada M, Maruyama H, Toyoda A, Ogura Y, Noguchi H, Hayashi T, Kajitani R, Fujiyama A, Kohara Y, Nagayasu E, Okuno M, Yabana M, Toshimoto K, Itoh T
[
Genome Res,
2014]
Although many de novo genome assembly projects have recently been conducted using high-throughput sequencers, assembling highly heterozygous diploid genomes is a substantial challenge due to the increased complexity of the de Bruijn graph structure predominantly used. To address the increasing demand for sequencing of nonmodel and/or wild-type samples, in most cases inbred lines or fosmid-based hierarchical sequencing methods are used to overcome such problems. However, these methods are costly and time consuming, forfeiting the advantages of massive parallel sequencing. Here, we describe a novel de novo assembler, Platanus, that can effectively manage high-throughput data from heterozygous samples. Platanus assembles DNA fragments (reads) into contigs by constructing de Bruijn graphs with automatically optimized k-mer sizes followed by the scaffolding of contigs based on paired-end information. The complicated graph structures that result from the heterozygosity are simplified during not only the contig assembly step but also the scaffolding step. We evaluated the assembly results on eukaryotic samples with various levels of heterozygosity. Compared with other assemblers, Platanus yields assembly results that have a larger scaffold NG50 length without any accompanying loss of accuracy in both simulated and real data. In addition, Platanus recorded the largest scaffold NG50 values for two of the three low-heterozygosity species used in the de novo assembly contest, Assemblathon 2. Platanus therefore provides a novel and efficient approach for the assembly of gigabase-sized highly heterozygous genomes and is an attractive alternative to the existing assemblers designed for genomes of lower heterozygosity.
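The scaffold NG50 metric used in the Platanus abstract above can be computed with a few lines. The sketch below uses the common definition (the length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least half of the estimated genome size); returning 0 when the assembly covers less than half the genome is an assumed convention, and the function name is hypothetical.

```python
from typing import Iterable


def ng50(lengths: Iterable[int], genome_size: int) -> int:
    """Return NG50 for the given scaffold lengths and estimated genome size.

    Unlike N50, the 50% threshold is taken from the genome size rather than
    from the total assembly length, so assemblies of different total size
    can be compared on equal terms.
    """
    covered = 0
    for length in sorted(lengths, reverse=True):
        covered += length
        if covered * 2 >= genome_size:
            return length
    return 0  # assumed convention: the assembly covers under half the genome


if __name__ == "__main__":
    scaffolds = [50, 40, 30, 20, 10]
    print(ng50(scaffolds, genome_size=200))             # 30
    print(ng50(scaffolds, genome_size=sum(scaffolds)))  # 40, which equals N50
```

Passing the total assembly length as `genome_size` reduces the calculation to the ordinary N50, which is why the second call gives a larger value than the first.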
-
[
Curr Biol,
2011]
DNA injected into the Caenorhabditis elegans germline forms extrachromosomal arrays that segregate during cell division [1, 2]. The mechanisms underlying array formation and segregation are not known. Here, we show that extrachromosomal arrays form de novo centromeres at high frequency, providing unique access to a process that occurs with extremely low frequency in other systems [3-8]. De novo centromerized arrays recruit centromeric chromatin and kinetochore proteins and autonomously segregate on the spindle. Live imaging following DNA injection revealed that arrays form after oocyte fertilization via homologous recombination and nonhomologous end-joining. Individual arrays gradually transition from passive inheritance to active segregation during the early embryonic divisions. The heterochromatin protein 1 (HP1) family proteins HPL-1 and HPL-2 are dispensable for de novo centromerization even though arrays become strongly enriched for the heterochromatin-associated H3K9me3 modification over time. Partial inhibition of HP1 family proteins accelerates the acquisition of segregation competence. In addition to reporting the first direct visualization of new centromere formation in living cells, these findings reveal that naked DNA rapidly builds de novo centromeres in C. elegans embryos in an HP1-independent manner and suggest that, rather than being a prerequisite, HP1-dependent heterochromatin antagonizes de novo centromerization.