Is Perfect de novo DNA Assembly Possible?

Gene Myers

Director & Founding Chair of Systems Biology,
Max Planck Institute for Molecular Cell Biology and Genetics

Monday, October 21, 2019 @ 11:00am
Room 1202, CSE building

 

Abstract:

We are about to enter an era of DNA sequencing in which one can produce a de novo, reference-quality genome of any living species for 1,000 euros. This ability will revolutionize ecology, evolution, and conservation science and effectively mark the beginning of a new exploration of the natural world.

The technological driver is the advent of long-read sequencers such as the PacBio Sequel and the Oxford Nanopore PromethION. Long reads make assembly easier, and one sees corresponding improvements in the continuity of the results, but the underlying algorithms are effectively the same as those first developed 20 years ago, and repeats at the scale of the read length are still an issue. Indeed, truly better assembly requires finding all artifacts in the reads and resolving repeat families, topics that I don’t think have received sufficient attention and that are particularly critical for long reads.

Therefore, we are developing algorithms that carefully analyze a long-read shotgun data set before assembly in an attempt to perfect and haplotype-phase the reads beforehand. This has proved particularly difficult in the face of an 11-13% sequencing error rate. But using a circular consensus protocol, one can effectively start with reads that have only a 0.5% error rate but are only 15 kbp long, versus the 30-40 kbp read length possible at the higher error rate. Solutions to the problems of perfecting reads and resolving repeats are still required, but are substantially easier. An interesting question is: which kind of data is better? And can one assemble the data perfectly?

In addition to long reads, technologies such as long-molecule restriction maps, droplet micro-well read “clouds”, and Hi-C cross-linked paired reads can be used to disambiguate complex regions of a genome. A long-term research goal that I believe is achievable is to derive a cost-effective protocol involving these technologies and associated computer algorithms that can produce perfect, telomere-to-telomere, haplotype reconstructions of every chromosome pair in a genome, de novo.

Bio:

In 2012 Gene Myers joined a growing group of computational biologists in Dresden as the founding director of a new Systems Biology Center built as part of an extension of the Max-Planck Institute of Molecular Cell Biology and Genetics (MPI-CBG). His group focuses on engineering specialized light microscopes for cell biology and analyzing the imagery produced by such microscopes. Previously, Gene had been a group leader at the HHMI Janelia Farm Research Campus (JFRC) since its inception in 2005. Gene came to the JFRC from UC Berkeley, where he was on the faculty of Computer Science from 2003 to 2005. From 1998 to 2002 he was the Vice President of Informatics Research at Celera Genomics, where he and his team determined the sequences of the Drosophila, Human, and Mouse genomes using the whole genome shotgun technique that he advocated in 1996. Prior to that, Gene was on the faculty of the University of Arizona for 17 years, and he received his Ph.D. in Computer Science from the University of Colorado in 1981.

His research interests include the design and analysis of algorithms for problems in computational molecular biology, image analysis of bioimages, and light microscopy with a focus on building models of the cell and cellular systems from imaging data. He is best known for the development of BLAST, the most widely used tool in bioinformatics, and for the paired-end whole genome shotgun sequencing protocol and the assembler he developed at Celera that delivered the fly, human, and mouse genomes in a three-year period. He has also written many seminal papers on the theory of sequence comparison.

He was awarded the IEEE 3rd Millennium Achievement Award in 2000, the Newcomb Cleveland Best Paper in Science award in 2001, and the ACM Kanellakis Prize in 2002. He was voted the most influential person in bioinformatics in 2001 by Genome Technology Magazine and was elected to the National Academy of Engineering in 2003. In 2004 he won the International Max-Planck Research Prize, and in 2006 Gene was inducted into the Leopoldina, the German Academy of Science, and awarded an honorary doctorate by ETH Zurich. In 2013, Gene was selected to give the Linnaeus Lecture at Uppsala University, and he was awarded the ISCB senior scientist award in 2014. He was elected to EMBO in 2016 and will receive the Royal Society’s Milner award later this year.