New Genome Assembler Simplifies Viral Sequencing

Jun 1, 2020
Professor Pavel Pevzner

Computer scientists at UC San Diego and Saint Petersburg State University have developed a new approach to genome assembly that will help scientists identify new viruses in complex samples. The technique, called metaviralSPAdes, allows researchers and clinicians to delineate a single viral genome, even when it is mixed with thousands of other viruses and bacteria. The study was recently published in the journal Bioinformatics.

“When a new virus emerges, biologists rush to reconstruct its genome, a prerequisite for future diagnostics and vaccine development,” said Pavel Pevzner, Ronald R. Taylor Professor of Computer Science in the Computer Science and Engineering Department at UC San Diego and senior author on the paper. “The challenge with viral sequencing is that a sample from a patient, like the saliva from the COVID-19 patient used to assemble the first SARS-CoV-2 genome, contains genomes from many other viruses. There may also be hundreds, or even thousands, of bacterial genomes. This background noise can make it difficult to identify the viral genomes among them.”

When genomic sequencers “read” a genome, the process is not linear, like reading a book. Instead, these instruments break the genome into small fragments and read each snippet separately, and software then assembles the snippets into a complete genome. However, patient samples can contain many viruses and bacteria. Metagenome (multiple genome) assemblies can include snippets from all of them, muddying the results.
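To make the read-and-assemble idea concrete, here is a minimal Python sketch. It is a toy illustration only, not the method used by SPAdes or metaviralSPAdes (SPAdes is a de Bruijn graph assembler); it swaps in a much simpler greedy overlap-merging approach. The example sequence, the read length, and the function names `overlap` and `greedy_assemble` are all assumptions made up for this illustration.

```python
def overlap(a, b, min_len=3):
    """Return the length of the longest suffix of `a` that is a prefix of `b`.

    Overlaps shorter than `min_len` are ignored and reported as 0.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Toy assembler: repeatedly merge the pair of reads with the largest overlap.

    Returns the list of contigs left when no pair overlaps by `min_len` or more.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        # Discard any contig fully contained in another one.
        contigs = [c for c in contigs
                   if not any(c != o and c in o for o in contigs)]
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]
    return contigs


# Hypothetical 14-letter "genome", cut into overlapping 6-letter reads.
genome = "ATGCGTACGTTAGC"
reads = [genome[i:i + 6] for i in range(0, len(genome) - 5, 2)]
contigs = greedy_assemble(reads)
print(contigs)
```

In a real metagenomic sample, reads from thousands of genomes are mixed together, and repeated or shared sequences create misleading overlaps, which is why a simple greedy strategy like this one does not scale to the problem described here.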

“Imagine buying a bunch of puzzles, mixing the pieces together and trying to assemble them all,” said Pevzner. “That’s the problem scientists face when trying to decode a new viral genome after taking a sample from a patient that got sick with a previously unknown disease.”

The challenge for bioinformaticians is to ensure the final genomic assembly includes the unknown virus they’re trying to identify, making sure this sequence is not lost among the other genomes. This can be particularly challenging when sequencing a new pathogen, like the novel coronavirus, without a reference genome.

The new algorithm excels at metavirome assembly: identifying viral snippets hidden among much longer bacterial sequences and stitching them together into a complete genome. Once that step is complete, scientists can more accurately identify the pathogen.

MetaviralSPAdes builds on years of genome assembly work, starting with the original SPAdes algorithm, which Pevzner developed jointly with students at Saint Petersburg State University while on sabbatical in Russia in 2012.

The Pevzner lab has been continuously refining this approach, developing several variations over the years to improve genome assembly. As a result, SPAdes is now the most widely used assembler in the world and has been cited in nearly 9,000 papers.

While metaviralSPAdes came a little late to assemble the coronavirus sequence, it will be quite useful for detecting – and hopefully helping to mitigate – future pathogenic viruses. Pevzner foresees intensive surveillance efforts, as scientists seek out potentially dangerous viruses in animals before they cross into humans.

“The COVID-19 pandemic is a wake-up call for biologists studying viral transmission from animals to humans,” said Pevzner. “We must recognize the importance of ongoing viral surveillance, such as collecting samples from animals and studying this huge repertoire of viruses before they move to humans and trigger the next outbreak.”