Genomic sequencing has become a critical technique to investigate the SARS-CoV-2 virus. Groups like Nextstrain are gathering thousands of sequences to illuminate where the virus is coming from and where it’s going. Insights from this work can show officials how an outbreak is spreading through their community and inform public health decisions.
But to get there, scientists have to analyze huge datasets, slowing the work. Researchers have been forced to use workarounds, such as subsampling, which analyzes only a portion of the available sequences.
To overcome this logjam, CSE Assistant Teaching Professor Niema Moshiri and Professor Tajana Rosing, the John J. and Susan M. Fratamico Endowed Chair in the Jacobs School of Engineering, recently received a $200,000 National Science Foundation RAPID grant. Their joint project, “Real-time phylogenetic inference and transmission cluster analysis of COVID-19,” will explore software and hardware to accelerate genomic analyses and keep pace with the sequencing data.
“The United States has more than 18,000 sequences,” said Moshiri. “But even then, we have to subsample because we don't have enough computing power to handle all of them. This motivated Tajana and me to work on this effort because this real-time data might help public health officials get actionable information.”
Two Bottlenecks
To compare viral sequences, the genomes must first be aligned so that corresponding positions line up, a computationally intensive step. Doubling the number of sequences quadruples the running time.
Moshiri realized they could accelerate this step by individually aligning samples against a coronavirus reference genome and then merging the resulting alignments. “I developed a software tool that, without the need for any special hardware, speeds it up to require only minutes, rather than days, to run the full data set of 80,000 coronavirus samples,” said Moshiri.
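The idea in the paragraph above can be illustrated with a toy sketch. This is not Moshiri's tool; it is a minimal illustration that assumes a simple Needleman-Wunsch pairwise aligner, and the function names (global_align, project_onto_reference, reference_guided_msa) and scoring values are hypothetical. Each sample is aligned to the reference on its own, then every alignment is projected onto the reference's coordinates so all rows have the same length and can be stacked. For n samples this takes n independent alignments, whereas a method that compared every pair would need n(n-1)/2 comparisons (about 3.2 billion for 80,000 samples).

```python
def global_align(ref, query, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment (toy scoring, assumed values).

    Returns (aligned_ref, aligned_query), with '-' marking gaps.
    """
    rows, cols = len(ref) + 1, len(query) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if ref[i - 1] == query[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_ref, out_query = [], []
    i, j = len(ref), len(query)
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and ref[i - 1] == query[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_ref.append(ref[i - 1])
            out_query.append(query[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_ref.append(ref[i - 1])   # base missing from the sample
            out_query.append("-")
            i -= 1
        else:
            out_ref.append("-")          # extra base in the sample
            out_query.append(query[j - 1])
            j -= 1
    return "".join(reversed(out_ref)), "".join(reversed(out_query))


def project_onto_reference(aligned_ref, aligned_query):
    """Keep only columns where the reference has a base.

    Insertions relative to the reference are dropped, so every projected
    sample has exactly the reference's length.
    """
    return "".join(q for r, q in zip(aligned_ref, aligned_query) if r != "-")


def reference_guided_msa(ref, samples):
    """Align each sample to the reference independently, then merge.

    The loop body does not depend on any other sample, so the work grows
    linearly with the number of samples and could be run in parallel.
    """
    return {name: project_onto_reference(*global_align(ref, seq))
            for name, seq in samples.items()}


# Made-up toy sequences, far shorter than a real viral genome.
reference = "ACGTACGT"
samples = {
    "identical": "ACGTACGT",
    "deletion": "ACGACGT",     # one base missing relative to the reference
    "insertion": "ACGTTACGT",  # one extra base relative to the reference
}
msa = reference_guided_msa(reference, samples)
for name, row in msa.items():
    print(f"{name:10s} {row}")
```

Dropping insertions is a deliberate simplification in this sketch: it keeps every row in the reference's coordinate system at the cost of discarding bases the reference lacks. A pure-Python aligner like this would also be far too slow for genomes of realistic length; it is meant only to show the structure of the approach.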
Inferring an evolutionary tree is even more complex, sometimes taking weeks to compute. Rosing is working on customized hardware to speed up the process, testing different graphics processing units (GPUs), field-programmable gate arrays (FPGAs) and other custom solutions.
“We have been doing acceleration studies on different types of hardware,” said Rosing. “We’ve also built specialized chips that do processing in memory. If you have a workflow that spends a lot of time moving data from storage, that tends to be pretty easy to accelerate by adding a data analysis capability directly to memory cells.”
The team is also exploring machine learning solutions: deep neural networks, hyperdimensional computing, reinforcement learning and recommender systems. They hope these approaches can reduce analysis time from days to a few moments.
“This means I could take a handheld genomic sequencer, do a little swab anywhere, run it on my chip and get an answer in seconds,” said Rosing. “That completely changes how we detect and analyze the spread of disease, because we can bring it directly to the source instead of waiting for it to spread like crazy, and then trying to do something about it.”
This combination of software and hardware expertise is proving synergistic, with each side contributing to the overall solution. “It’s like Tajana’s lab is providing cheat codes for the algorithm world,” said Moshiri. “If my ideas don't scale the way they need to, her lab can design a custom part to make it work.”
As important as this work is for addressing COVID-19, it will become even more helpful during the next outbreak, when material from thousands of nose swabs could be sequenced, generating an avalanche of data. In addition, Moshiri and Rosing’s research is inspiring larger efforts to prepare for this future.
“The work Niema and I have done has caught the ear of DARPA and the semiconductor industry,” said Rosing. “I just gave a talk on this at the big DARPA Electronics Resurgence Initiative summit.”