Enhancing the foundation of genomic research


The key to understanding heredity, disease, and evolution lies in the genome, which is encoded in nucleotides (i.e., the bases A, T, G, and C). DNA sequencers can read these nucleotides, but doing so both accurately and at scale is challenging because base pairs are so physically small. To unlock the secrets hidden within the genome, we must be able to assemble a reference genome that is as close to perfect as possible.

Errors in an assembly can limit the methods used to identify genes and proteins, and can cause later diagnostic processes to miss disease-causing variants. In genome assembly, the same genome is sequenced many times, which allows errors to be corrected iteratively. Still, because the human genome contains about 3 billion nucleotides, even a small error rate can produce a large total number of errors and limit the derived genome’s utility.
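To make the scale concrete, the arithmetic can be sketched in a few lines of Python. This is only an illustration: the function name `expected_errors` and the per-base error rates are hypothetical and are not figures from the paper. Only the 3 billion nucleotide genome size comes from the text above.

```python
# Approximate human genome size in nucleotides (figure quoted in the text).
GENOME_LENGTH = 3_000_000_000


def expected_errors(error_rate: float, genome_length: int = GENOME_LENGTH) -> float:
    """Return the expected number of erroneous bases for a per-base error rate."""
    return error_rate * genome_length


# Hypothetical per-base error rates, chosen only for illustration.
for rate in (1e-4, 1e-5, 1e-6):
    print(f"error rate {rate:g} -> about {expected_errors(rate):,.0f} errors")
```

Even at one error per million bases, an assembly of this size would still carry roughly 3,000 errors, which is why further reductions in error rate matter.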

To continue improving the resources for genome assembly, we introduce DeepPolisher, an open-source method for polishing genome assemblies that we developed in collaboration with the UC Santa Cruz Genomics Institute. In our recent paper, “Highly accurate assembly polishing with DeepPolisher”, published in Genome Research, we describe how this pipeline extends existing methods to improve the accuracy of a genome assembly. DeepPolisher reduces the number of errors in the assembly by 50% and the number of insertion or deletion (“indel”) errors by 70%. The indel reduction is especially important because indel errors interfere with the identification of genes.
