In a paper that brings scientists measurably closer to assembling the entire human genome, UC San Diego Department of Computer Science and Engineering Professor Pavel Pevzner has outlined an algorithm, called centroFlye, that uses long, error-prone DNA reads to assemble centromeres, the DNA that connects chromosome arms. This is the first time an accurate centromere sequence has been automatically assembled. The paper was co-authored with graduate student Andrey Bzikadze and published this week in Nature Biotechnology.
Though quite comprehensive, the first draft of the human genome had many missing sequences. Centromeres were the largest of these gaps. Working with data produced by the Telomere-to-Telomere (T2T) Consortium, Pevzner and Bzikadze have developed an approach that could close these gaps.
In addition to their Nature Biotechnology work, Bzikadze and Pevzner contributed to a landmark paper, also published this week, in the journal Nature that reported the first-ever complete assembly of a human chromosome. CentroFlye played an important role in this work.
“Human centromeres have remained the dark matter of the human genome, evading all attempts to sequence them since the Human Genome Project was completed,” says Pevzner, the Ronald R. Taylor Professor of Computer Science and senior author on the paper. “This is the first automated way to assemble centromeres. Now, we have to generate the first gapless assembly of the human genome.”
Centromeres make up around 3 percent of the human genome and are thought to play important roles in human health. However, without accurate sequences, it remains challenging to precisely assess how they contribute to disease.
“Centromeres are associated with various diseases, including cancer, and maybe there are more, but we know so little about them,” says Bzikadze. “These assemblies will allow us to systematically study variations in centromeres and their associations with disease.”
To uncover the secrets of human centromere DNA sequences, Pevzner modified his long-read assembly algorithm, called Flye. Based on the Seven Bridges of Königsberg puzzle, in which participants traverse the city while walking across each bridge only once, Flye models genome assembly as a walk through a large city: each read is a bridge, and the genome is a path that crosses every bridge. However, centromere sequences are highly repetitive, a challenge that dogged previous efforts.
“They’re like a jigsaw puzzle on steroids where most of the puzzle is a blue sky with some clouds,” says Pevzner. “How do you assemble blue sky?”
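For readers who want to see the bridge analogy in code, here is a minimal sketch of the textbook idea behind it: break a sequence into overlapping k-mers, treat each k-mer as a one-way bridge between its prefix and its suffix, and look for a path that crosses every bridge exactly once (an Eulerian path, found here with Hierholzer's algorithm). This is a toy illustration on error-free input, not the Flye or centroFlye implementation; the function names `eulerian_path` and `assemble` and the example sequence are made up for this sketch.

```python
from collections import defaultdict


def eulerian_path(edges):
    """Return a list of nodes that walks every directed edge exactly once.

    Assumes such a path exists (hypothetical helper for illustration).
    """
    graph = defaultdict(list)
    out_deg = defaultdict(int)
    in_deg = defaultdict(int)
    for u, v in edges:
        graph[u].append(v)
        out_deg[u] += 1
        in_deg[v] += 1

    # Start at the node with one more outgoing than incoming edge, if any.
    start = edges[0][0]
    for node in list(graph):
        if out_deg[node] - in_deg[node] == 1:
            start = node
            break

    # Hierholzer's algorithm: follow unused edges, backtrack when stuck.
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def assemble(kmers):
    """Spell out a sequence that uses every k-mer exactly once."""
    # Each k-mer is a "bridge" from its (k-1)-prefix to its (k-1)-suffix.
    edges = [(kmer[:-1], kmer[1:]) for kmer in kmers]
    path = eulerian_path(edges)
    return path[0] + "".join(node[-1] for node in path[1:])


genome = "ATGGCGTGCA"  # made-up example sequence
k = 3
kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
print(assemble(kmers))
```

Even in this tiny example, the repeated fragment "TG" means more than one valid path exists, so the output uses all the same k-mers as the input but need not match it letter for letter. That ambiguity is the small-scale version of the problem that repeats pose for assembly.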
Although the centromeres in each human chromosome are different, they are all formed by segments (called higher-order repeats or HORs), which can be repeated thousands of times with little variation.
Since HORs are so repetitive, almost all short centromere substrings (called k-mers for strings of length k) are repeated many times, turning the assembly into a computational nightmare, not unlike assembling a puzzle with just 16 pieces where each frog appears just four times (figure below). However, Bzikadze and Pevzner found rare k-mers in the centromere (i.e., k-mers that appear only once) that become anchors for assembly.
These rare k-mers are like small, wispy clouds in that flawless blue sky, providing small bits of contrast, which the researchers used to guide the assembly algorithm.
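The notion of a rare k-mer can be shown with a short sketch: count every substring of length k and keep those that occur exactly once. The example below uses a made-up 8-letter repeat unit copied four times, with a single letter changed in the third copy. It assumes an error-free sequence, whereas centroFlye has to work from error-prone reads, so this illustrates only the definition, not the published method. The function name `unique_kmers` is hypothetical.

```python
from collections import Counter


def unique_kmers(sequence, k):
    """Map each k-mer that occurs exactly once to its position in the sequence."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return {kmer: sequence.find(kmer) for kmer, n in counts.items() if n == 1}


unit = "ACGTTGCA"                            # made-up repeat unit
toy = unit * 2 + "ACGATGCA" + unit           # third copy carries one changed letter
anchors = unique_kmers(toy, k=4)
print(sorted(anchors.items(), key=lambda item: item[1]))
# → [('ACGA', 16), ('CGAT', 17), ('GATG', 18), ('ATGC', 19)]
```

Only the four k-mers that overlap the changed letter are unique; every other k-mer appears in at least two copies of the repeat and so says nothing about where it belongs.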
In addition to providing new insights into disease, sequencing centromeres could deliver a wealth of information about human biology, such as illuminating how these structures evolved, how they are maintained and why they have such different sequences, even between the 23 human chromosomes.
“Centromeres hold many biological secrets that are waiting to be solved,” says Pevzner. “The time has come, 20 years after the completion of the Human Genome Project, to read the complete genome.”