TECHNOLOGY

We work with industry partners to improve high-resolution genome sequencing workflows and computational pipelines at low cost.

 
 
A major challenge for saving species and conducting high-quality genomic research is that most current genome assemblies contain hundreds of thousands to millions of errors.

Parts of some genes are missing, some genes are incorrectly assembled, and others are absent from the assemblies altogether even though pieces of them are found in the raw sequence reads. Because of these fragmented, error-prone assemblies, researchers have had to clone, re-sequence, and correct individual genes. In some cases, the gene structures are too complex for this to work. In many other instances, investigators do not even know that they are working with incorrect gene sequences and structures, which affects many scientific findings and slows scientific progress. Thus, high-quality, error-free genome assemblies and annotations are necessary.

To reduce these errors, we worked with industry partners to develop unprecedented high-resolution genome sequencing methods at significantly lower costs than current, less robust technologies. From 2015 to 2018, our industry and academic partners have included the major sequencing and assembly companies (e.g., Illumina, Pacific Biosciences, Oxford Nanopore, Bionano Genomics, 10X Genomics, NRGene, Dovetail Genomics, Phase Genomics, Arima Genomics), sequencing centers (BGI, Broad Institute, Sanger Institute, Washington University Genome Center), public genome archive and annotation centers (NCBI, Ensembl, UCSC), and experts in academia and government (NIH, NSF). Together we have tested, improved, and generated new approaches for producing the highest-quality, near error-free, 3rd generation reference genome assemblies achievable at the lowest possible cost. For the first time, we tested each of these technologies on the same animals, a bird (hummingbird) and a mammal (goat or human), so that our analyses were not hampered by the common problem of multiple variables changing simultaneously.

EXAMPLE 1

IMAGE A is a representation of 1st and 2nd generation genome sequencing methods; the puzzle pieces are smaller to represent the short reads, and the holes represent missing information. IMAGE B is a representation of the 3rd generation technologies that we are using for the VGP, where the puzzle pieces are larger to represent long reads.

EXAMPLE 2

IMAGE A corresponds to 1st and 2nd generation genome sequencing methods; the short reads from these methods are compared to the height of a human. IMAGE B corresponds to the 3rd generation technologies that we are using for the VGP; the long reads produced with these technologies are compared to the height of the Empire State Building and to a distance halfway to the moon.

Algorithm Development

For our VGP genomes, the G10K set a minimum genome quality metric: a contig N50 of 1 million bp (1 Mb), a scaffold N50 of 10 Mb, 90% of the genome assembled into chromosomes confirmed by 2 independent sources, a base-call quality of at least QV40 (no more than 1 nucleotide error in 10,000 bp), and haplotype phasing. We call this the 3.4.2.QV40 PHASED METRIC, where the first three numbers correspond to the contig N50 and the scaffold N50 (as base-10 exponents in kilobases: 10^3 kb = 1 Mb and 10^4 kb = 10 Mb) and to the level of chromosomal assembly. We will sequence the heterogametic sex (when it exists) so that both sex chromosomes can be assembled for each species.
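The numeric thresholds above can be illustrated with a short Python sketch. It is not the VGP pipeline: the function names (`n50`, `qv_to_error_rate`, `error_rate_to_qv`, `meets_numeric_thresholds`) are hypothetical, N50 is computed with the common definition (the length at which sequences that long or longer cover at least half of the assembled bases), and QV is treated as a Phred-scaled value, QV = -10 * log10(error rate). Phasing and the requirement for confirmation by 2 independent sources are not checked.

```python
import math


def n50(lengths):
    """Return the N50 of a list of sequence lengths (in bp).

    N50 is the largest length L such that sequences of length >= L
    together contain at least half of all assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must be non-empty")


def qv_to_error_rate(qv):
    """Convert a Phred-scaled quality value to a per-base error rate."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate):
    """Convert a per-base error rate to a Phred-scaled quality value."""
    return -10 * math.log10(error_rate)


def meets_numeric_thresholds(contig_lengths, scaffold_lengths,
                             fraction_in_chromosomes, qv):
    """Check only the numeric thresholds stated in the text (assumed inputs).

    Phasing and independent confirmation of chromosomes are out of scope.
    """
    return (
        n50(contig_lengths) >= 1_000_000            # contig N50 >= 1 Mb
        and n50(scaffold_lengths) >= 10_000_000     # scaffold N50 >= 10 Mb
        and fraction_in_chromosomes >= 0.90         # >= 90% in chromosomes
        and qv >= 40                                # <= 1 error per 10,000 bp
    )
```

For example, `qv_to_error_rate(40)` corresponds to one error per 10,000 bases, which matches the QV40 threshold in the text; a higher QV means fewer errors, so the threshold is a minimum.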

As of January 2018, of the more than 335 vertebrate genomes in the public NCBI database, only 9 met the 3.4.2.QV40 thresholds of our metric. Of these 9, 7 were produced by members of our G10K group using the above approaches. The other 2 are human and mouse, which were brought to this quality level only after billions (human) and millions (mouse) of dollars were spent on continuous correction of these assemblies. However, as of that date none of these 9 genomes had been phased, and they still had errors related to haplotype collapsing.


Annotation and Alignment

We will use transcriptome data from all species, generated using RNA-Seq short reads or Iso-Seq RNA long reads. The Iso-Seq reads require no assembly and dramatically improve the annotation of gene structure. We will give highest priority to generating RNA transcripts from whole brain and from gonads (testes and ovaries), as brain and gonadal tissues have been empirically found to have the highest transcriptome diversity for genome annotation. These data will also be useful for studies of brain function and sex differences across species.

For whole-genome alignments, we are working with Ensembl and UCSC to implement a reference-free approach, currently the Cactus algorithm. This approach does not limit cross-species gene annotations to only one or two reference species (currently human or mouse). Instead, it allows multiway species alignments, which facilitate new gene discovery and a greater understanding of what is unique to humans and to each vertebrate lineage.
