New Algorithm Supercharges Long Read DNA Assembly

Apr 1, 2019

By Joshua Baxt

Researchers at UC San Diego have discovered a better way to assemble long DNA reads into genomes. Called Flye, the new algorithm is faster and more accurate than existing assembly methods. Given the increasing popularity and potential benefits of long reads, this new approach could dramatically accelerate a wide range of sequencing projects. The study was published in the journal Nature Biotechnology.

“Fragment assembly is not unlike putting together a puzzle from a billion pieces, and it is often viewed as one of the most difficult algorithmic problems in bioinformatics,” said Pavel Pevzner, Ronald R. Taylor Professor of Computer Science and senior author on the paper. “Genomes such as ours are very repetitive, like a jigsaw with many identically shaped pieces, adding extra complexity to the problem – try assembling a puzzle where half the pieces represent blue sky.”

The most prevalent sequencing approaches, popularized by Illumina, rely on short reads, which are around 200 nucleotides long. Pevzner and colleagues developed a short-read assembly algorithm, called SPAdes, which is widely used in both research and industry.

However, short reads have an inherent limitation: they generate fragmented assemblies because the reads are too short to bridge genomic repeats. As a result, each chromosome is assembled into thousands of fragments rather than a contiguous string in the four-letter nucleotide alphabet. This fragmentation limits biomedical applications.

In recent years, companies like Pacific Biosciences and Oxford Nanopore have advanced long reads, which are better at assembling genomes into entire chromosomes.

“People are gravitating to long reads,” said Pevzner. “It’s a more expensive technology, but it provides more contiguous assembly. Our SPAdes tool dominates short read assembly, but long reads are a new entrance for us.”

While long reads are 100 times longer than short ones, they are also 100 times less accurate, turning assembly into a difficult puzzle and slowing down research.

Pavel-Pevzner-and-Mikhail-Kolmogorov-(4-of-6)_0.jpg

To solve the long read problem, Pevzner’s team returned to their roots. Twenty years ago, the group introduced de Bruijn graphs for short-read assembly. These are based on the “Seven Bridges of Konigsberg” puzzle, which asks participants to find a path through the middle age city of Konigsberg while walking across each bridge only once. Pevzner modeled genome assembly as a giant city, in which each read represents a bridge and a genome represents a path visiting each bridge.

any were dubious De Bruijn graphs could be applied to long reads because they are error-prone. The most widely-used long read assembler, called Canu and developed by the National Institutes of Health, does not use de Bruijn graphs, but rather an upgraded version of an algorithm developed 20 years ago for the human genome assembly.

In the paper, Pevzner compares Flye to five other assemblers, including Canu, on genomes ranging from bacteria to human. Throughout the study, Flye consistently outperformed the other assemblers, improving both speed and accuracy.

“Our de Brujin graph assembler improves on the existing approaches for long reads in contiguity, accuracy and speed,” said UC San Diego graduate student, Mikhail Kolmogorov, first author of the study. “Flye is an order of magnitude faster than Canu. The speed is particularly important in an era when the cost of computing time to assemble a genome is measured in thousands of dollars and may soon exceed the cost of generating the experimental data.”

Other authors on the study were Jeffrey Yuan and Yu Lin.