Human Pangenome Reference Consortium Releases Data from 30 Genomes

March 3, 2021

NEW YORK – The Human Pangenome Reference Consortium (HPRC) is marking a year of progress by releasing data from 30 genomes assembled so far.

The genomes, available through repositories in the US, Europe, and Japan, include data from Pacific Biosciences' HiFi reads, Oxford Nanopore Technologies' ultra-long reads, and Bionano Genomics' optical mapping. Hi-C data for all 30 samples, generated with Dovetail Genomics' Omni-C kits, will be released this month. In addition, single-cell template strand sequencing data for haplotype phasing are available for seven samples, as are Illumina NGS data from 60 parents used in trio phasing. The group also plans to include 12 additional genomes from collaborators at Washington University in St. Louis, the National Human Genome Research Institute, and the University of California, Santa Cruz.

At this year's virtual Advances in Genome Biology and Technology meeting, Karen Miga, a research scientist at UCSC, said the consortium was using these data to create "incredibly high-quality phased assemblies" with Hifiasm, a new de novo assembly algorithm published last month in Nature Methods by researchers led by Harvard University's Heng Li. "We found tremendous success in not only continuity or N50 and phase blocks but also in the quality of these assemblies themselves," Miga said.

A benchmark genome had 519 contigs with an NG50 of 43 Mb, phase-block NG50 of 18 Mb, a Q54 score, and a heterozygous SNP sensitivity of 99.3 percent. Overall, the diploid assemblies of the 30 genomes had an N50 between 18 and 59 Mb and Q scores between 50 and 56, she said.
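For readers less familiar with these metrics, the short Python sketch below shows how N50 and NG50 are conventionally computed and how a Phred-scaled Q score maps to a per-base error rate. It is an illustration only, not the consortium's pipeline: the function names and the toy contig lengths are hypothetical and do not come from HPRC data.

```python
def n50(lengths, total=None):
    """Return the contig length at which the running sum of contigs,
    sorted longest first, reaches half of `total`.

    With total=None the assembly's own size is used (N50).
    Passing an estimated genome size instead gives NG50.
    """
    if total is None:
        total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # the contigs cover less than half of the target size


def phred_to_error_rate(q):
    """Convert a Phred-scaled quality score to an expected per-base error rate."""
    return 10 ** (-q / 10)


# Hypothetical contig lengths in Mb, chosen only to make the arithmetic easy to follow.
toy_contigs = [50, 40, 30, 20, 10]

toy_n50 = n50(toy_contigs)              # half of the 150 Mb assembly is 75 Mb
toy_ng50 = n50(toy_contigs, total=200)  # half of an assumed 200 Mb genome is 100 Mb
q50_error = phred_to_error_rate(50)     # one expected error per 100,000 bases
```

Because NG50 is measured against the expected genome size rather than the assembly's own length, it allows assemblies of different total sizes to be compared on the same footing. On the Phred scale, each additional 10 points corresponds to a tenfold lower error rate.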

Using these assemblies, the consortium is developing new methods for automating quality control protocols and scaffolding, Miga added. She also previewed methods, now being developed as part of the Telomere-to-Telomere (T2T) Consortium, that will be coming to the pan-genome project.

Launched in 2019 with $29.5 million in funding from the National Human Genome Research Institute (NHGRI), the pan-genome project seeks to present a more complete representation of human genomes, capturing the diversity of variants that exist in the population. Among its goals is to produce hundreds of high-quality human genome assemblies, covering each chromosome from telomere to telomere.

Its aims and participants often overlap with those of the T2T Consortium, which Miga co-leads with NHGRI bioinformatician Adam Phillippy.

Finding ways to map and assemble regions of the genome that have so far been intractable, such as centromeres and highly repetitive regions, has been a key aim of these projects. Miga noted that the HPRC is reliant on recent advances in long-read sequencing technology, namely PacBio's HiFi reads and Oxford Nanopore Technologies' ultra-long reads.

"Our team is reaching 35X to 40X coverage of greater than Q20 HiFi reads in the range of 18 to 20 kb for this particular project," she said. HiFi output for all but one sample topped 100 Gb.

The consortium is also obtaining approximately 6X coverage in Oxford Nanopore ultra-long reads of 100 kb or more, which make up about 10 percent of all the nanopore reads; Hi-C coverage of 60X; and Bionano optical maps with an N50 of about 250 kb at about 100X coverage.

The data release includes 60 Illumina NGS parental datasets used in trio phasing, at 30X coverage using 150 bp paired-end sequencing.
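The coverage figures above follow from simple arithmetic: sequencing depth is the total number of bases sequenced divided by the size of the genome. The Python sketch below illustrates the relationship. The 3.1 Gb genome size is an assumed round figure for a haploid human genome, and the function names are hypothetical; neither comes from the consortium.

```python
import math

# Assumed approximate haploid human genome size in base pairs (not an HPRC figure).
GENOME_SIZE_BP = 3_100_000_000


def coverage(total_bases, genome_size=GENOME_SIZE_BP):
    """Average sequencing depth: total bases sequenced divided by genome size."""
    return total_bases / genome_size


def bases_needed(target_coverage, genome_size=GENOME_SIZE_BP):
    """Total bases required to reach a target average depth."""
    return target_coverage * genome_size


def reads_needed(target_coverage, mean_read_length, genome_size=GENOME_SIZE_BP):
    """Number of individual reads of a given mean length needed for a target depth."""
    return math.ceil(bases_needed(target_coverage, genome_size) / mean_read_length)


hifi_depth_at_100_gb = coverage(100_000_000_000)  # depth implied by 100 Gb of output
hifi_bases_for_35x = bases_needed(35)             # bases required for 35X
illumina_reads_for_30x = reads_needed(30, 150)    # individual 150 bp reads for 30X
```

Under that assumed genome size, 100 Gb of HiFi output works out to roughly 32X, so samples reaching 35X to 40X would need somewhat more than 100 Gb each. For paired-end data, the number of read pairs is half the number of individual reads.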

The data can be accessed through repositories with the National Center for Biotechnology Information, the European Bioinformatics Institute's European Nucleotide Archive (ENA), and the DNA Data Bank of Japan (DDBJ).

Miga also shared progress from the T2T Consortium. "A lot of the technologies we have been developing are ones that are going to be brought over" into the work on the human pan-genome, she said. In September 2020, the group released the complete sequence of a hydatidiform mole genome that featured zero unlocalized or unplaced contigs. It had a Q70 score and introduced between 100 Mb and 190 Mb of new sequence compared to the GRCh38 reference genome.

In addition to chromosome assemblies using HiFi data and Oxford Nanopore ultra-long reads, the consortium is producing the first high-resolution maps of all acrocentric chromosome short arms as well as every pericentric and centromeric region in the genome. Fluorescence in situ hybridization is being integrated as "a nice companion orthogonal method to show copy number," Miga said. Group members are also making progress on mapping genomic rearrangements and segmental duplications, identifying new repeats, and even finding genes buried in centromeric regions.

"We're not yet to the finish line," Miga cautioned. The recently released T2T genome is essentially haploid and there exists a "real technological barrier to reach the next milestone of a diploid T2T genome," she said, not to mention the difficulty of doing hundreds of those for the human pan-genome effort.

More info at: https://www.genomeweb.com/sequencing/human-pangenome-reference-consortium-releases-data-30-genomes#.YEr_Jp0zY2w
