TECHNOLOGY

We work with industry partners to improve high-resolution genome sequencing workflows and computational pipelines at low cost.

 

A major challenge for saving species and conducting high-quality genomic research is that most current genome assemblies have hundreds of thousands to millions of errors in them.

Parts of some genes are missing, some genes are incorrectly assembled, and others are absent from the assemblies altogether even though pieces of them are found in the raw sequence reads. Because of these fragmented, error-prone assemblies, researchers have had to clone, re-sequence, and correct individual genes. In some cases, the gene structures are too complex to resolve this way. In many other instances, investigators do not even know that they are working with incorrect gene sequences and structures, which affects many scientific findings and slows scientific progress. Thus, high-quality, error-free genome assemblies and annotations are necessary.

To reduce these errors, we worked with industry partners to develop high-resolution genome sequencing methods at significantly lower cost than current, less robust technologies. From 2015 to 2018, our industry and academic partners included the major sequencing and assembly companies (e.g., Illumina, Pacific Biosciences, Oxford Nanopore, Bionano Genomics, 10X Genomics, NRGene, Dovetail Genomics, Phase Genomics, Arima Genomics), sequencing centers (BGI, Broad Institute, Sanger Institute, Washington University Genome Center), public genome archive and annotation centers (NCBI, Ensembl, UCSC), and experts in academia and government (NIH, NSF). Together we tested, improved, and generated new approaches for producing the highest-quality, near error-free, 3rd generation reference genome assemblies at the lowest possible cost. For the first time, we tested each of these technologies on the same animals, a bird (hummingbird) and a mammal (goat or human), so that our analyses were not confounded by multiple variables changing simultaneously.

 

EXAMPLE 1


IMAGE A is a representation of 1st and 2nd generation genome sequencing methods; puzzle pieces are smaller to represent the short reads, and the holes represent missing information.

IMAGE B is a representation of the 3rd generation technologies that we are using for the VGP, where puzzle pieces are larger to represent long reads.

EXAMPLE 2


IMAGE A corresponds to 1st and 2nd generation genome sequencing methods; the short reads from these methods are represented by the height of a human.

IMAGE B corresponds to the 3rd generation technologies that we are using for the VGP; the long reads produced with these technologies are represented by the height of the Empire State Building and by the distance halfway to the moon.

 

Algorithm Development

For our VGP genomes, the G10K set a minimum genome quality metric: a contig N50 of 1 million bp (1 Mb), a scaffold N50 of 10 Mb, 90% of the genome assembled into chromosomes confirmed by 2 independent sources, a base-call accuracy of at least QV40 (no more than 1 nucleotide error in 10,000 bp), and haplotype phasing. We call this the 3.4.2.QV40 PHASED METRIC, where the first three numbers are the exponents of the contig N50, the scaffold N50, and the level of chromosomal assembly (for the N50 values, 1 Mb is 10^3 kb and 10 Mb is 10^4 kb). We will sequence the heterogametic sex (when it exists) so that both sex chromosomes can be assembled for each species.
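The thresholds above can be illustrated with a short Python sketch. It is not part of any VGP pipeline; the function names (`n50`, `qv_to_error_rate`, `meets_vgp_minimum`) and the input values are hypothetical and chosen only for illustration. It assumes the usual definitions: N50 is the length of the sequence at which the length-sorted running total first reaches half of the assembled bases, and a Phred-style QV corresponds to an error rate of 10^(-QV/10).

```python
def n50(lengths):
    """Return the N50 of a list of contig or scaffold lengths (in bp).

    Sequences are sorted from longest to shortest; the N50 is the length
    of the sequence at which the running total first reaches half of the
    total assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("lengths must be non-empty")


def qv_to_error_rate(qv):
    """Convert a Phred-style quality value to a per-base error rate."""
    return 10 ** (-qv / 10)


def meets_vgp_minimum(contig_n50, scaffold_n50, fraction_in_chromosomes,
                      qv, phased):
    """Check the minimum thresholds listed in the text (hypothetical helper)."""
    return (
        contig_n50 >= 1_000_000            # contig N50 of at least 1 Mb
        and scaffold_n50 >= 10_000_000     # scaffold N50 of at least 10 Mb
        and fraction_in_chromosomes >= 0.90  # 90% assigned to chromosomes
        and qv >= 40                       # at most 1 error in 10,000 bp
        and phased                         # haplotype phased
    )


# Toy example: total is 20, half is 10; 6 + 5 = 11 reaches half at length 5.
print(n50([2, 3, 4, 5, 6]))  # 5
```

For instance, `meets_vgp_minimum(1_200_000, 15_000_000, 0.93, 41, True)` would pass under these assumptions, while the same assembly with a contig N50 of 800,000 bp would not. The check that chromosome assignments are confirmed by 2 independent sources is not modeled here.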

As of 2018, the following genomes are the ones approaching the VGP genome quality metric:

2016 

1. Capra hircus (goat)
2. Oreochromis niloticus (Nile tilapia)
3. Gorilla gorilla gorilla (western lowland gorilla) 
4. Sus scrofa (pig)
5-7. Oryzias latipes (Japanese medaka)
8. Astyanax mexicanus (Mexican tetra)
9. Ovis aries (sheep) 

 2017

10. Felis catus (domestic cat)
11. Xiphophorus maculatus (southern platyfish) 
12. Sus scrofa (pig) – cross breed 

 2018 

13. Equus caballus (horse)
14. Pongo abelii (Sumatran orangutan)
15. Pan troglodytes (chimpanzee)
16. Gallus gallus (chicken)
17. Maylandia zebra (zebra mbuna)
18. Bos taurus (cattle)
19. Amphiprion percula (orange clownfish) 

 * List excludes human and mouse genomes

The human and mouse genomes were only brought to this quality level after billions (human) and millions (mouse) of dollars were spent on continuous correction of these assemblies. However, neither of these genomes has been phased, and both have errors related to haplotype collapsing.


Annotation and Alignment

We will use transcriptome data from all species, generated using RNA-Seq short reads or Iso-Seq RNA long reads. The Iso-Seq reads require no assembly and dramatically improve gene structure annotation. We will give highest priority to generating RNA transcripts from whole brain and from gonads (testes and ovaries), as brain and gonadal tissues have been empirically found to have the highest transcriptome diversity for genome annotation. These data will also be useful for studies of brain function and sex differences across species.

For whole-genome alignments, we are working with Ensembl and UCSC to implement a reference-free approach, currently the Cactus algorithm. This approach does not limit cross-species gene annotations to one or two reference species (currently human or mouse). Instead, it allows multi-way species alignments, which facilitate new gene discovery and a greater understanding of what is unique to humans and to each vertebrate lineage.

 