Limited Embargo Data Use Policy for the Vertebrate Genomes Project (VGP)

Revised Embargo Policy Implemented May 1st, 2024. Last updated July 8th, 2024.

Overview of the May 2024 update: The Vertebrate Genomes Project (VGP) data use policy was first implemented and posted live in January 2018. Since then, some governments have adopted more restrictive policies on sharing unpublished scientific data with specific companies or institutions, policies in the genomics community have changed, and there is a need for the data to become usable by the scientific community sooner. Further, some of the wording of the original VGP Data Use Policy was unclear. Key changes are: making data freely available for unrestricted use earlier (a 1-year embargo replacing the 2-year embargo by the data producers); combining the internal and external data use policies into one policy document for all; and setting a clearer process for determining and identifying when a genome is released from embargo. These changes were approved by the G10K-VGP Council and take effect for all new genomes started on or after May 1st, 2024. Genomes started before this date are grandfathered under the previous policy dated Jan 9, 2018.

The goal of the VGP is to generate at least one high-quality, error-free, near-gapless, chromosome-level, haplotype-phased, and annotated reference genome assembly for each of the ~70,000 extant vertebrate species, and to use those genomes to address fundamental questions in biology, disease, and conservation. The VGP was made official in February 2017, born out of the G10K project effort, which began in 2009. The VGP deposits raw and assembled genome and transcriptome data publicly before publishing on those genomes. To support fair and productive use of these data, the G10K Council has developed the following data use policy, which is consistent with those of widely used public annotation and genome databases (Ensembl, NCBI, and UCSC), and has been enforced by journal editors (e.g. Nature, Science). We ask all users to respect and follow this data use policy. The policy follows standards in genomics, and is in part adopted from a previous (prior to 2017) Sanger Genome Institute Policy.

VGP Data Use Policy

Before publishing on them, the VGP releases raw reads, assembled genomes, transcriptome sequence data, and annotations as a service to the research community. These data releases occur through the public archives, such as GenBank at NCBI, the European Nucleotide Archive (ENA) and ArrayExpress at EMBL-EBI, the DNA Data Bank of Japan (DDBJ), and the GenomeArk, the latter a public Amazon Web Services (AWS) S3 Bucket dedicated as a working space and home for high-quality reference genomes of the VGP and affiliated projects (such as the Earth BioGenome Project (EBP)). The VGP leadership encourages others to use these data, but expects them to respect the right of data producers, sample providers, and collaborators to first presentation (including journal publications, pre-prints such as in bioRxiv, public conference talks, and press releases) of genome-wide analyses, including for phylogenetic and evolutionary studies. All such persons with the right of first presentation are free to publish on and release the genome for unrestricted use at any time after the data are completed. For those not responsible for producing or sponsoring the data, the exceptions to the policy during the embargo period are analyses of either a single locus or a single gene family across species, or of a maximum of 5 gene loci or gene families within a species, and use of a genome as a reference for mapping reads from independent studies.

Timeline of Embargo

The timeline of the embargo period is 1 year from the release date of the annotated genome assembly in the public archives (e.g. NCBI, ENA and DDBJ) or publication of the genome by the data producers in under 1 year, whichever comes first; this timeline supersedes the prior 2-year embargo period. If the genome has not been annotated in any of the public archives within the 1-year embargo period after the genome was deposited, the embargo release date automatically defaults to the date that the unannotated genome was deposited. The simplest way to find the dates is by first going to the VGP BioProject page (https://www.ncbi.nlm.nih.gov/bioproject/489243) and navigating from there to the specific species and then the associated genome assembly accession pages. For example, on the primary assembly page for the Anna’s hummingbird, the unannotated genome was deposited on May 19, 2019, and the annotated version was deposited on August 5, 2019 (https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_003957555.2/). This means that under a 1-year policy, the genome would have been released from embargo on August 5, 2020. In actual practice, this genome was released under the 2-year embargo and published in 2021 in a special issue of Nature on the initiation of the VGP (Rhie et al. 2021). The specific embargo or lack of embargo will be stated in the data BioProject description page for each genome. At the time that the embargo is released due to publication, the genome assembly will also be updated with the citation in the public archives. In all cases, before or after public release, the relevant persons or publications responsible for the genome should be cited in studies using the genomes. Exceptions to the 1-year embargo are affiliated projects that have an embargo of less than 1 year, immediate data note publications, or no embargo at all (e.g. the Darwin Tree of Life project, Revive & Restore sponsored genomes, and Colossal BioSciences sponsored genomes). Affiliated projects that have embargos longer than 1 year will not be included under the VGP umbrella until after public release from embargo. A full list of the genome projects that have contributed thus far is at the end of this document.
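As an informal illustration only, the timeline rules above can be sketched in Python. This is not an official VGP tool: the function and parameter names (`embargo_release_date`, `add_one_year`, `unannotated_deposit`, `annotated_release`, `producer_publication`) are hypothetical. The sketch assumes that "1 year" means the same calendar date in the following year, and it assumes that when no annotation appears within 1 year of deposit, the 1-year clock is counted from the unannotated deposit date. The BioProject description page for each genome remains the authoritative source for its embargo status.

```python
from datetime import date
from typing import Optional


def add_one_year(d: date) -> date:
    """Return the same calendar day one year later (Feb 29 maps to Feb 28)."""
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        # Only reached for Feb 29 when the following year is not a leap year.
        return d.replace(year=d.year + 1, day=28)


def embargo_release_date(
    unannotated_deposit: date,
    annotated_release: Optional[date] = None,
    producer_publication: Optional[date] = None,
) -> date:
    """Estimate the embargo release date under the 1-year policy (a sketch)."""
    deposit_deadline = add_one_year(unannotated_deposit)

    if annotated_release is not None and annotated_release <= deposit_deadline:
        # Annotated within 1 year of deposit: 1 year from the annotated release.
        release = add_one_year(annotated_release)
    else:
        # ASSUMPTION: with no annotation within 1 year of deposit, the clock
        # is counted from the unannotated deposit date instead.
        release = deposit_deadline

    # Publication by the data producers ends the embargo early.
    if producer_publication is not None and producer_publication < release:
        release = producer_publication

    return release


# Anna's hummingbird dates quoted in the policy text.
print(embargo_release_date(date(2019, 5, 19), date(2019, 8, 5)))  # 2020-08-05
```

The example call reproduces the August 5, 2020 date given in the text for the Anna’s hummingbird under a 1-year policy. Affiliated projects with shorter or no embargo are not modeled here.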

Data Sharing by VGP members

VGP members are required to share the data they generate as widely and effectively as possible, with the following considerations:

  • Access – Primary data and final assemblies should be deposited in accessible repositories. Dataset accession numbers must be included in any publications.

  • Rights of Data Providers – Researchers should be appropriately credited for their contribution to data generation and analyses.

To support fair and productive use of this resource within the consortium:

  1. The Chair will strive to ensure that all VGP members are eligible to receive company discounts on reagents, compute, and other resources negotiated under the VGP umbrella.

  2. All VGP members have pre-publication access to all genomes generated by and/or for the VGP, regardless of source of the data. All VGP members are encouraged to participate collaboratively in analyses whenever possible.

  3. Unpublished data from another member’s VGP-contributed genomes cannot be used for publication within the 1-year embargo by other VGP members without the permission of the leadership responsible for those VGP-affiliated genomes.

  4. VGP members working on competing projects within the VGP must inform the VGP Chair, to help minimize conflicts of interest and manage the greatest possible impact of the VGP.

  5. Each VGP member is expected to follow the embargo policy as it pertains to data that they did not generate.

  6. VGP members are discouraged from publishing multi-genome papers before specific Phases are completed, such as Phase 1 (planned publication list below), to help maintain the high impact of those Phased publications and to help raise funds from that impact for subsequent Phases. Exceptions require approval by the VGP Council and the Chair.

Planned studies for Phase 1 of the VGP:

The VGP consortium intends to use the genomic data that it produces for multiple studies. The studies planned for the ordinal Phase 1 of the VGP (~270 ordinal lineages) include:

  1. Genome-scale family tree of vertebrates

  2. Comparative genomics of specialized traits in each vertebrate lineage

  3. Comparative genomics of convergent traits (e.g. vocal learning, flight, loss of limbs, and aquatic / terrestrial adaptations)

  4. Developing universal vertebrate gene orthology and nomenclature

  5. Deciphering vertebrate chromosomal genome evolution

  6. Reconstruction of the common ancestor genomes of all vertebrates and of key vertebrate clades (e.g. mammals, birds, reptiles, amphibians, teleost fish, jawed vertebrates, and tetrapods)

  7. Evolution of nucleotides to chromosomes of the human genome

  8. Genetics of why some lineages are more disease resistant than others

  9. Conservation genomics of endangered species sequenced

  10. The genomes of all remaining Kakapo parrots on the planet

  11. Genetic signatures of domestication across vertebrates

  12. Genetics of sex determination and sex chromosome evolution among vertebrates

  13. Brain cell type evolution and homologies using genomics and transcriptomics

  14. 3-Dimensional genome structure across vertebrates

  15. Consequences of the evolutionary battle between transposons and host factors

This list will be periodically updated as data and publications from the VGP are generated. It will be considered completed when the Phase 1 publications are done, with release of the associated genomes.

VGP Data Use Contact Inquiries

For inquiries, including on this Data Use Policy, referencing VGP data, or joining the VGP, contact the VGP Chair, currently Erich D. Jarvis, ejarvis@rockefeller.edu, copying when appropriate the individuals responsible for the genome(s) or questions of interest. This includes persons who are interested in using the VGP data for the above studies.

Members of the G10K Council can be found at https://genome10k.soe.ucsc.edu/leadership.

Projects that have contributed genomes to the VGP

In the first iteration of the VGP Data Use Policy in 2018, besides the VGP proper, two projects had formed and contributed genomes to the VGP (Bat1K and Sanger 25 Genomes Anniversary). As of May 1st, 2024, the number of contributing, affiliated projects is 24. The list is below, and will be updated as more projects contribute.

Phylogenetic based

Earth BioGenome Project (EBP)

Global Invertebrate Genomics Alliance (GIGA)

Bird 10,000 (B10K)

Bat1K

Paratus Sciences Bat Project

Cetaceans Genome Project

 

Geographic based

Darwin Tree of Life (DToL)

European Reference Genome Atlas (ERGA)

Catalan Initiative for the Earth Biogenome Project (CBP)

African BioGenomes Project (AfricaBP)

Minderoo Foundation: Australia and New Zealand Aquatic Vertebrates

AmaZOOmics: Brazilian Biodiversity

California Conservation Genomes Project

Earth BioGenome Project - Colombia

Canadian Biogenome Project

 

Conservation based

Revive & Restore

Colossal BioSciences

 

Other project based

Sanger 25 Genomes Project

Telomere-To-Telomere Consortium

Human Pangenome Project

Allen Institute for Brain Science

Chan-Zuckerberg Initiative

Tabula Madagascar

HHMI COVID-19 project


Data Use policies for genomes sequenced between January 9th, 2018 and June 30th, 2024