
Submitting Sequence Data for a dbGaP project

Introduction

The database of Genotypes and Phenotypes (dbGaP) was developed to archive and distribute the data and results from studies investigating the interaction of genotype and phenotype in humans.

Find all dbGaP studies in SRA: "cluster dbgap"[Properties].

Genomic sequence data can potentially identify a donor even after de-identification, so dbGaP data access is restricted to:

Researchers who request datasets for specific research uses
Institutional signing officials from the PI's home organization who certify and submit such requests
NIH staff who review and process requests

Researchers can apply for access to dbGaP data at the Authorized Access Portal.

DO NOT submit sequence data for a dbGaP study through the SRA Submission Portal.

All submissions that require controlled access must be submitted through dbGaP. The consent status of the human subjects in your study must be established before data transfer. If patients have not explicitly consented to the public release of their genomic data, the data will be archived behind a controlled-access firewall. If you are unsure whether your patients' data should be stored in the authorized-access archive, contact your institution's IRB to determine where the data should be deposited. Most large-scale human sequence data funded by NIH is subject to the NIH GDS policy and requires institutional certification of consent and deposition of data to a controlled-access repository such as dbGaP.

Submission Overview

Register a study and subjects with dbGaP

An interactive overview of dbGaP submission can be found here.

The dbGaP submission documentation package can be downloaded here.

All questions pertaining to this stage of submission should be directed to dbgap-sp-help@ncbi.nlm.nih.gov.

Submit sequencing metadata through the dbGaP submission portal

You will receive an email with an attached submission spreadsheet once your study and samples have been registered in dbGaP.

The sample column of the sequencing metadata spreadsheet will be pre-filled with your dbGaP identifiers (the study's accession and sample IDs). You will need to complete the spreadsheet with required technical details and the names and MD5 checksums of the sequence files you will be uploading.

Description of the columns in the spreadsheet

Name (Column) - Instructions

phs_accession (A) - This will be filled in for you in the spreadsheet you receive. It will contain the phs accession of the dbGaP study without the study version numbers.
sample_ID (B) - The sample IDs will be filled in when you receive the spreadsheet. If you need to submit more than one library per sample, copy and paste the same sample name into a new row. Be careful not to edit or change the sample names. The sample IDs must match the sample IDs submitted in the Subject Sample Mapping (SSM) dataset. Remove any sample IDs that will not have a new sequencing submission in this version.
library_ID (C) - The library_ID is the unique identifier for the sequencing library. Each library_ID must be unique within the submission and across all submissions for the same dbGaP study. The value can be an internal identifier, or simply the sample name repeated if you have no separate identifier for the library. If you are submitting more than one sequencing library per sample, make sure the library names do not repeat.
title/short description (D) - The title should be treated as a name or title that will help a user briefly identify what data was in the sequencing library and should be no longer than a single sentence.
library_strategy (E) - [Controlled Vocabulary] The library strategy must be selected from the list of possible values. The values are available as a drop-down menu in the spreadsheet, and the column title links to the Terms sheet, where each option is described in more detail. Users search on this field, so choose the closest option and use the design description to detail any nuances the list does not cover.
library_source (F) - [Controlled Vocabulary] The library source must be selected from the list of possible values. The values are available as a drop-down menu in the spreadsheet, and the column title links to the Terms sheet, where each option is described in more detail. Users search on this field, so choose the closest option and use the design description to detail any nuances the list does not cover.
library_selection (G) - [Controlled Vocabulary] The library selection must be selected from the list of possible values. The values are available as a drop-down menu in the spreadsheet, and the column title links to the Terms sheet, where each option is described in more detail. Users search on this field, so choose the closest option and use the design description to detail any nuances the list does not cover.
library_layout (H) - [Controlled Vocabulary] Select either 'single' or 'paired' from the list.
platform (I) - [Controlled Vocabulary] Select the sequencing platform manufacturer from the list of possible platforms. You must select this before selecting the instrument model.
instrument_model (J) - [Controlled Vocabulary] Select the instrument model for the platform. You must select a platform first; the drop-down menu is then populated with the possible models for that platform.
design_description (K) - The design description should be treated like a materials and methods description explaining how this library was prepared and sequenced. Provide it as single-line text without newlines or special characters, and make it long enough (at least 3 sentences; minimally 150 characters) that a user can understand what the sequencing data files will contain. For any kits used, include the kit name, version, and part number if available. Avoid including information like the sequencing platform unless it is necessary to describe unique features of the library or process.
reference_genome_assembly (or accession) (L) - [Aligned Data Only] The reference genome used in the alignment. Do not include anything here if submitting unaligned files like FASTQ. Only the base reference genome is needed in most cases and only use a single name or accession. For example, "GRCh38" is the preferred way to enter the assembly, while "GRCh38/hg38" will likely cause delays in processing.
alignment_software (M) - [Aligned Data Only] Provide the alignment software that was used to generate the alignment in the data. Please include the software version if known.
filetype (N/Q) - [Controlled Vocabulary] Select the filetype from the list of options for the data being submitted.
filename (O/R) - The exact name of the file that will be uploaded. Include all extensions, but do not include the full or relative path information on your storage.
MD5_checksum (P/S) - The MD5 hash of the file contents, used to verify that the upload process did not introduce any errors.
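The uniqueness and length rules in the column descriptions above can be checked before the spreadsheet is submitted. The sketch below assumes the sheet has been exported as tab-delimited text with a single header row and with columns in the order listed (library_ID in column C, design_description in column K). check_metadata_tsv is a hypothetical helper name, not part of any dbGaP tool.

```shell
# Hypothetical helper: check a tab-delimited export of the metadata sheet.
# Assumes row 1 is the header and that column 3 is library_ID and
# column 11 is design_description, as in the list above.
check_metadata_tsv() {
  awk -F'\t' '
    NR == 1 { next }                 # skip the header row
    {
      if (seen[$3]++)                # this library_ID appeared in an earlier row
        printf "row %d: duplicate library_ID %s\n", NR, $3
      if (length($11) < 150)         # below the stated 150-character minimum
        printf "row %d: design_description has %d characters\n", NR, length($11)
    }
  ' "$1"
}

# Usage (file name is a placeholder):
#   check_metadata_tsv sequencing_metadata.tsv
```

The function prints one line per problem and prints nothing when both checks pass. It only checks uniqueness within one file, so library IDs from earlier submissions to the same study would need to be compared separately.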
The spreadsheet contains space for two files, each with a filetype, filename, and MD5 checksum required. BAM submissions will typically have only a single file per library. Paired FASTQ data will typically have two files but sometimes will have more than two FASTQ files per sequencing library. Please split pairs of FASTQ files into subsets that are 250 GB or less when uncompressed. In those cases, additional columns of filetype, filename, and MD5 checksum can be added using the same column titles.

SRA: Transfer sequence files to the protected SRA account

If your sequence metadata spreadsheet is successfully loaded, you will receive email instructions to upload the sequence data.
If your sequence metadata spreadsheet cannot be validated, you will receive email instructions to correct the file and re-upload it.

Upload the data files only to the upload account provided to you after completing the sequence metadata spreadsheet.

The Aspera upload account is: asp-dbgap@gap-submit.ncbi.nlm.nih.gov

Do NOT upload dbGaP sequence files to subasp@upload.ncbi.nlm.nih.gov or to any FTP address. Files uploaded to either location will not be transferred to a controlled access archive and will not correctly link to your dbGaP samples.

If submitting FASTQ, please split pairs of FASTQ files into subsets that are 250 GB or less when uncompressed.
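For gzip-compressed FASTQ files, the uncompressed size can be measured before upload. The sketch below is one way to do that; fastq_within_limit is a hypothetical helper name, and treating 250 GB as 250 x 10^9 bytes is an assumption, since the limit is not defined more precisely here.

```shell
# Hypothetical helper: succeed when a gzip-compressed file is at or under
# a size limit once uncompressed. The default limit is 250 * 10^9 bytes,
# which is an assumed reading of "250 GB".
fastq_within_limit() {
  local file=$1 limit=${2:-250000000000} bytes
  # Count the decompressed bytes directly. gzip -l is avoided because its
  # stored size field wraps around for inputs larger than 4 GiB.
  bytes=$(( $(gzip -dc "$file" | wc -c) ))
  [ "$bytes" -le "$limit" ]
}

# Usage (file name is a placeholder):
#   fastq_within_limit sample_R1.fastq.gz || echo "split this file first"
```

Decompressing a large file this way takes time, and the check does not verify that the archive itself is intact.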

Guidelines for using this account:

The account accepts data files (BAM, FASTQ, CRAM, etc.) but not XML submission files.
UDP transfer must currently be enabled for the following IP ranges: 130.14.*.* and 165.112.*.*. Allowing these large ranges avoids transfer issues when NCBI adds subnets to handle increased usage.
Download Aspera to run ascp: https://www.ibm.com/aspera/connect/

The command-line utility for Aspera transfers is ascp. Before running it, set the ASPERA_SCP_PASS environment variable:

Bash (macOS): export ASPERA_SCP_PASS=743128bf-3bf3-45b5-ab14-4602c67f2950   
Windows:      set ASPERA_SCP_PASS=743128bf-3bf3-45b5-ab14-4602c67f2950

An example ascp command:

ascp -i "[private-key-file]" -Q -l 200m -k 1 [file(s) to transfer] asp-dbgap@gap-submit.ncbi.nlm.nih.gov:[directory]

Where:

[private-key-file] is the full and absolute file path for 'aspera_tokenauth_id_rsa'. Depending on your computer operating system, Aspera Connect version, or where Aspera Connect is installed, the private-key-file path may look like the following examples: /opt/aspera/etc/aspera_tokenauth_id_rsa or "c:\Program Files (x86)\Aspera\Aspera Connect\etc\aspera_tokenauth_id_rsa" (quotes are required since the path has spaces)
[directory] is either test or protected
- test directory: Please direct your uploads to the test directory until you are confident your transmission command will work as intended.
- protected directory is for sending data to be processed by the SRA processing pipeline.
The -l flag sets the maximum transfer rate. Start with 100m; you may be able to go up to 1000m for optimum transfer rates.

Users uploading a large number of files should loop over the files individually rather than use wildcards in the ascp command.

An example of upload loops:

Bash (macOS):

for F in ./*.bam
do
  ascp -i "[private-key-file]" -Q -l 200m -k 1 "$F" asp-dbgap@gap-submit.ncbi.nlm.nih.gov:[directory]
done
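The Bash loop above can be extended to report which files failed, so that only those are retried. This is a sketch under the assumption that ascp exits with a non-zero status when a transfer fails; upload_each is a hypothetical helper name, and the key path and directory are the same placeholders described earlier.

```shell
# Hypothetical helper: upload files one at a time and report failures.
# Arguments: private key file, target directory (test or protected), files.
upload_each() {
  local key=$1 dest=$2 f failed=0
  shift 2
  for f in "$@"; do
    # Same flags as the example command; each file gets its own ascp call.
    if ! ascp -i "$key" -Q -l 200m -k 1 "$f" \
        "asp-dbgap@gap-submit.ncbi.nlm.nih.gov:$dest"; then
      echo "FAILED: $f" >&2
      failed=1
    fi
  done
  # Non-zero when at least one file did not transfer.
  return "$failed"
}

# Usage (key path is a placeholder):
#   upload_each /opt/aspera/etc/aspera_tokenauth_id_rsa test ./*.bam
```

Running it against the test directory first matches the guidance above about confirming the command before sending data to protected.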

Windows:

FOR %f IN (*.bam) DO C:\install\directory\ascp.exe -i "[private-key-file]" -l 200m -k 1 "%f" asp-dbgap@gap-submit.ncbi.nlm.nih.gov:[directory]

Confirm data receipt

Once all files and metadata have been uploaded, please confirm with your SRA Curator that the SRA portion of your dbGaP submission is complete. The curator can provide a report of the files that were loaded. A nightly per-sample report is also available on the dbGaP website. To view it, replace phs000000 in the address below with your study accession.

https://www.ncbi.nlm.nih.gov/gap/sstr/report/phs000000
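The address substitution can be scripted when several studies are being tracked. In this sketch, report_url is a hypothetical helper name, and the check for "phs" followed by six digits with no version suffix is an assumption based on the placeholder accession shown above.

```shell
# Hypothetical helper: build the nightly sample-report URL for a study.
report_url() {
  local acc=$1
  # Assumed format: "phs" plus six digits, without a version suffix.
  case $acc in
    phs[0-9][0-9][0-9][0-9][0-9][0-9]) ;;
    *)
      echo "expected an accession like phs000000, got: $acc" >&2
      return 1
      ;;
  esac
  printf 'https://www.ncbi.nlm.nih.gov/gap/sstr/report/%s\n' "$acc"
}

# Usage (accession is a placeholder):
#   report_url phs000000
```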

XML Submission

XML submissions of sequencing metadata are recommended only for submitters who will regularly submit sequencing data for multiple dbGaP studies. If you would like to set up an account to upload metadata via XML, please notify an SRA Curator, who will assist you. For each study, the submitter will need to upload three XML files packaged together in a tarball:

submission.xml
experiment.xml
run.xml

Submitters will create one entry for each library in the experiment.xml, and an entry for each BAM or production run in the run.xml. These XML files will be stored in a single tar archive and uploaded to an account at NCBI for the submitting center. The XML schemas are available here.
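Packaging the three files can be done with tar. The sketch below checks that all three files are present before creating the archive; package_xml is a hypothetical helper name, and the archive name is a placeholder rather than an NCBI naming requirement.

```shell
# Hypothetical helper: bundle the three XML files into one tar archive.
# Arguments: directory holding the XML files, path of the archive to create.
package_xml() {
  local dir=$1 out=$2 f
  for f in submission.xml experiment.xml run.xml; do
    [ -f "$dir/$f" ] || { echo "missing $dir/$f" >&2; return 1; }
  done
  # -C keeps directory prefixes out of the archive member names.
  tar -cf "$out" -C "$dir" submission.xml experiment.xml run.xml
}

# Usage (paths are placeholders):
#   package_xml ./xml /tmp/study_submission.tar
```

Listing the archive afterwards with tar -tf shows the three file names, which is a quick check before upload.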

Not all possible combinations of XML will be present; please contact sra@ncbi.nlm.nih.gov if you need additional help formatting your XML.

Linking to a Registered dbGaP Study

In the <EXPERIMENT> XML:

<STUDY_REF accession="phs000000"/>

Linking to Registered dbGaP Samples

In the <EXPERIMENT> XML:

<SAMPLE_DESCRIPTOR refcenter="phs000000" refname="submitted_sample_id"/>
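When an experiment.xml is generated from the metadata, the two linking elements can be produced per library. The sketch below only prints the two elements shown above for a given study accession and sample ID; dbgap_refs is a hypothetical helper name, and the rest of the EXPERIMENT entry must still follow the SRA XML schemas.

```shell
# Hypothetical helper: print the study and sample linking elements.
# Arguments: phs accession, submitted sample ID.
# Values are inserted as-is; IDs containing &, <, or " would need XML
# escaping, which this sketch does not do.
dbgap_refs() {
  local phs=$1 sample=$2
  printf '<STUDY_REF accession="%s"/>\n' "$phs"
  printf '<SAMPLE_DESCRIPTOR refcenter="%s" refname="%s"/>\n' "$phs" "$sample"
}

# Usage (values are placeholders):
#   dbgap_refs phs000000 submitted_sample_id
```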

Contact SRA staff

Contact SRA staff for assistance at sra@ncbi.nlm.nih.gov.
