Instructions for downloading and using the SRA Toolkit

Table of Contents

  1. Usage
  2. Examples

Contact: sra-tools@ncbi.nlm.nih.gov

The following guide walks the user through downloading and using the SRA Toolkit. It focuses mostly on Linux users, with an emphasis on working with cSRA files, and covers basic usage only. For the parameters and usage of individual tools in the SRA Toolkit, please refer to the usage files included in the download.

The NCBI SRA Toolkit provides loading and dumping tools, with their respective libraries, for building new runs and accessing existing ones. It may be built with GCC, ICC or Microsoft VC++. However, pre-built executables are available for various platforms, and we highly recommend using the pre-built executables from the SRA software website.

Usage

Make sure you have standard utilities such as tar. Most Linux systems include such tools.
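
A quick way to confirm that the needed utilities are present is to look each one up on the PATH. The following is a minimal sketch; the function name require_tools is hypothetical, and the list of tools in the usage comment is simply the set of commands this guide calls later.

```shell
# Report any required command that is not on PATH.
# Returns 0 when every tool is found, 1 otherwise.
require_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example: utilities used later in this guide.
# require_tools tar wget md5sum perl
```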

Download the SRA Toolkit from the SRA website. Downloading the compiled binaries is your best option. To download the Toolkit:

  1. If you are working in a graphical environment with a web browser, you can download the Toolkit directly from the NCBI-SRA website (http://www.ncbi.nlm.nih.gov/Traces/sra/?view=software) to your desktop.
  2. If you are working at a Linux command prompt, use the wget utility. For example, to get the binaries for 64-bit CentOS Linux:
    wget "http://ftp-private.ncbi.nlm.nih.gov/sra/sdk/2.2.2a/sratoolkit.2.2.2a-centos_linux64.tar.gz"
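
If you download more than one release or platform, the URL can be assembled from its parts. This sketch assumes that other releases follow the same path layout as the example URL above, which this guide does not confirm; toolkit_url is a hypothetical helper name.

```shell
# Build the download URL from a version and a platform label.
# Assumption: other releases use the same layout as the example URL.
toolkit_url() {
    version=$1
    platform=$2
    echo "http://ftp-private.ncbi.nlm.nih.gov/sra/sdk/${version}/sratoolkit.${version}-${platform}.tar.gz"
}

# Usage (needs network access):
# wget "$(toolkit_url 2.2.2a centos_linux64)"
```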

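Check that the MD5 checksum of the downloaded file matches the md5sum posted on the website by running the md5sum command. The commands below use the 2.2.0 file name; substitute the name of the file you actually downloaded: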

md5sum -b sratoolkit.2.2.0-centos_linux64.tar.gz
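
The comparison with the posted value can also be scripted instead of checked by eye. A minimal sketch, assuming GNU md5sum; verify_md5 is a hypothetical helper, and the expected digest must be copied from the download page.

```shell
# Compare a file's MD5 digest with the value posted on the download page.
# Returns 0 on a match, 1 on a mismatch.
verify_md5() {
    file=$1
    expected=$2
    # md5sum -b prints "<digest> *<file>"; keep only the digest.
    actual=$(md5sum -b "$file" | awk '{print $1}')
    if [ "$actual" = "$expected" ]; then
        echo "OK: $file"
    else
        echo "MISMATCH: $file (got $actual)" >&2
        return 1
    fi
}

# Usage, with EXPECTED copied from the website:
# verify_md5 sratoolkit.2.2.0-centos_linux64.tar.gz "$EXPECTED"
```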

Untar the downloaded SRA Toolkit archive with the decompress (z) option:

  1. tar xzf sratoolkit.2.2.0-centos_linux64.tar.gz
  2. You should see a directory with the name of the file minus tar.gz. For example:
    sratoolkit.2.2.0-centos_linux64
  3. Inside the directory you will see three names per tool, in the format:
    1. ToolName (for example, vdb-dump)
    2. ToolName.X (where X is the major version, e.g. 1, 2, …; for example, vdb-dump.2)
    3. ToolName.X.y (where y is the minor version, e.g. 1.2, 4.3, …; for example, vdb-dump.2.2.0)
    You can use the tool name without the version extension; it automatically points to the versioned tool.
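
The unpacking steps above can be sketched as a small function that extracts the archive and confirms that the expected directory appeared. It assumes, as described above, that the archive contains a single top-level directory named after the file minus .tar.gz; unpack_toolkit is a hypothetical name.

```shell
# Unpack a toolkit archive into the current directory and print the
# directory it created. Assumes one top-level directory named after
# the archive minus ".tar.gz".
unpack_toolkit() {
    archive=$1
    tar xzf "$archive" || return 1
    dir=$(basename "$archive" .tar.gz)
    if [ ! -d "$dir" ]; then
        echo "expected directory $dir not found" >&2
        return 1
    fi
    echo "$dir"
}

# Usage:
# cd "$(unpack_toolkit sratoolkit.2.2.0-centos_linux64.tar.gz)"
```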

To demonstrate the toolkit, this guide uses an arbitrarily chosen run, SRR390728 (RNA-Seq (polyA+) analysis of DLBCL cell line HS0798), from the National Cancer Institute Cancer Genome Characterization Initiative (CGCI) Project. It can be downloaded from the NCBI SRA web site.

Before using any of the tools, you should run configuration-assistant.perl to set up your environment. This script checks that you have the basic tools and sets up the configuration and directories the tools need. cSRA (alignment reference) files need special configuration, which configuration-assistant.perl will also set up for you.

This special configuration includes creating a directory for the reference files that cSRA files depend on and downloading those reference files. In addition, the script will guide you in setting up the password (key) file for encrypting and decrypting files.

  1. Run the configuration-assistant.perl command:
    • In Linux: ./configuration-assistant.perl
    • In Windows: perl ./configuration-assistant.perl
  2. When prompted "Do you want to create configuration?", type "y" or "yes".
  3. When prompted "Specify installation directory for reference objects .[/home/USERNAME/ncbi/refseq]:" you can either take the default or if you plan to share the alignment reference data you can specify a common directory for all users. To accept the default directory press Enter.
  4. When prompted "Do you want to update VDB password?" type "y" to set one up or "n" to skip this step. You can run configuration-assistant.perl again later to set up or change a password.
  5. When prompted "Would you like to test cSRA files for remote reference dependencies?", you can enter "y" or "yes" to test, but only if you have already downloaded cSRA files to test with. If you enter a cSRA file name, the script downloads that file's references and then prompts you again, until you press Enter to skip.
  6. You only need to run configuration-assistant.perl once to create the reference file directory, but you need to run it again each time you add a new cSRA file to your collection. On later runs you can pass one or more cSRA file names to configuration-assistant.perl as parameters.
    For example:
    configuration-assistant.perl SRR390728.sra
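
Because the script accepts several file names at once, a whole directory of runs can be handed over in a single call. A minimal sketch, assuming your cSRA files end in .sra and sit in one directory; fetch_references and the ASSISTANT variable are hypothetical names introduced for illustration.

```shell
# Script to call; override ASSISTANT if it lives elsewhere.
ASSISTANT=${ASSISTANT:-./configuration-assistant.perl}

# Pass every .sra file in a directory to the assistant in one call.
fetch_references() {
    dir=${1:-.}
    # Collect the matching files into the positional parameters.
    set -- "$dir"/*.sra
    # With no matches the pattern stays literal, so test the first entry.
    if [ ! -e "$1" ]; then
        echo "no .sra files in $dir" >&2
        return 1
    fi
    "$ASSISTANT" "$@"
}

# Usage:
# fetch_references /path/to/my/runs
```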

Download your cSRA file from the SRA site using Aspera or FTP. Go to the SRA web site and search for the sequence to download. Once you find the downloadable sequence file, click on it and the Aspera download will start automatically.

After downloading your cSRA file, you need to download related alignment reference file(s) by running the configuration-assistant.perl again. For example: configuration-assistant.perl SRR390728.sra

If you do not download all the proper references for an aligned SRA file, the dumpers will display error messages alerting you that you need to download the correct references. For example, running vdb-dump SRR390728.sra -R 1, which should dump row 1 of the run info, instead reports the error "Reference sequence … was not found". This message tells you that the correct references were not downloaded.

You may also run configuration-assistant.perl without a file name and it will prompt you for the cSRA file name. When prompted to "Enter cSRA file", enter the downloaded cSRA file name with the relative path to the current directory.

You should look for a message that says "All 1 references were checked (1 downloaded)". If the operation fails, you will see an error message accordingly.
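
In a script, the success message can be checked automatically. This sketch generalizes the single message quoted above to any count, which is an assumption about the script's output rather than documented behavior; references_ok is a hypothetical name.

```shell
# Succeed only if the text on standard input reports that all
# references were checked. The pattern is an assumption generalized
# from the one message quoted in this guide.
references_ok() {
    grep -Eq 'All [0-9]+ references were checked \([0-9]+ downloaded\)'
}

# Usage:
# ./configuration-assistant.perl SRR390728.sra 2>&1 | references_ok \
#     || echo "reference download may have failed" >&2
```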

You can now run the Toolkit's primary tools for working with SRA and cSRA files, such as sam-dump, fastq-dump or vdb-dump, against the cSRA file.

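If you are new to Unix or are using the Windows PowerShell environment, note that the current directory is not searched for commands by default. Either add "." to your path while working in the sub-directory where you un-tarred the SRA Toolkit, or prefix each tool name with "./" (for example, ./vdb-dump).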
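
An alternative on Linux is to put the toolkit directory itself on the PATH, so the tools can be run from anywhere. A minimal sketch; add_to_path is a hypothetical helper, and the directory in the usage comment assumes you unpacked the 2.2.0 archive in your home directory.

```shell
# Prepend a directory to PATH unless it is already listed.
add_to_path() {
    dir=$1
    case ":$PATH:" in
        *":$dir:"*)
            ;;  # already present, nothing to do
        *)
            PATH="$dir:$PATH"
            export PATH
            ;;
    esac
}

# Usage:
# add_to_path "$HOME/sratoolkit.2.2.0-centos_linux64"
```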

If you fail to download the correct alignment reference files, you will see error messages when you run tools such as sam-dump or vdb-dump. A common error message from vdb-dump is "Table not found while opening table within alignment module". It means you need to run configuration-assistant.perl to get the alignment reference file.

Now you can view the information in the cSRA file or dump it to another format. For example, run the following to get the basic information for row 1 of the PRIMARY_ALIGNMENT table:

./vdb-dump SRR390728.sra -T PRIMARY_ALIGNMENT -R 1

The following dumpers do not support cSRA:

  1. sff-dump
  2. abi-dump
  3. illumina-dump

Use sam-dump to convert cSRA to SAM format. For more information about the SAM format and what the various fields mean, refer to the online SAMtools documentation, provided by the SAMtools working group, which is not affiliated with NCBI. The following are helpful tips about the NCBI sam-dump software:

  1. sam-dump will NOT write the unaligned data by default. To get the unaligned data in addition to the other data, use the "-u" parameter.
  2. Use "--aligned-region" to write only the named aligned regions.
  3. Use the "-r" parameter to reconstruct the references in the header if needed. Please keep in mind that this is an expensive operation and should only be done if you believe the header needs to be regenerated, which only applies if the references submitted along with the sequence are not correct or complete.
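
To illustrate how these options combine on one command line, the sketch below only prints the sam-dump command it would run, so you can inspect it before executing anything. The helper name sam_dump_cmd and its argument order are hypothetical; whether a region and unaligned output are useful together depends on your data.

```shell
# Print a sam-dump command line.
# Arguments: <file> [region] [include unaligned: yes|no]
sam_dump_cmd() {
    file=$1
    region=$2
    unaligned=${3:-no}
    cmd="sam-dump"
    if [ "$unaligned" = yes ]; then
        cmd="$cmd -u"                        # also write unaligned data
    fi
    if [ -n "$region" ]; then
        cmd="$cmd --aligned-region $region"  # restrict to a named region
    fi
    echo "$cmd $file"
}

# Usage:
# sam_dump_cmd SRR390728.sra "" yes
```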

You can convert the cSRA file to FASTQ format using the fastq-dump program. Running fastq-dump name-of-your-csra-file.csra will produce a file with the same name as the cSRA file and the extension .fastq.

cSRA files contain at least three "tables": REFERENCE, PRIMARY_ALIGNMENT, and SEQUENCE. They may optionally contain a SECONDARY_ALIGNMENT table. When using fastq-dump, you can dump sequence data from any of these. By default, fastq-dump works on the SEQUENCE table.

sam-dump joins these tables together to recreate the SAM source data. sam-dump supports slicing through the --aligned-region parameter.

Conversion to BAM requires an additional step: converting the SAM file produced by sam-dump to BAM using samtools, available from the SAMtools website (http://samtools.sourceforge.net/).

New command line options have been added to fastq-dump specifically for cSRA:

  1. --table <table-name>
  2. --aligned
  3. --unaligned
  4. --aligned-region <name[:from-to]>
  5. --matepair-distance <from-to|unknown>

Examples

This will create a SAM file called my_sam.sam with the alignments from the cSRA file:

sam-dump SRR390728.sra >my_sam.sam

This will create a BAM file from a cSRA file, including the unaligned data:

sam-dump -u SRR390728.sra | samtools view -Sb -o some.bam -

Example for sam-dump to convert cSRA to BAM:

sam-dump SRR390728.sra | ~user/samtools-0.1.12a/samtools view -S -b - > some.bam

Example for slicing across multiple files

sam-dump --aligned-region 20:1000000-1100000 SRR390728.sra > TestSlice1.txt
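
The region argument follows the name[:from-to] form. When regions come from a list or a user prompt, it can help to split them into their parts first. A minimal sketch; parse_region is a hypothetical helper, and it assumes the reference name itself contains no colon.

```shell
# Split a region of the form name[:from-to] into its parts.
# Prints "name from to"; from and to are "-" when no range is given.
# Assumes the reference name contains no ":".
parse_region() {
    region=$1
    name=${region%%:*}
    case $region in
        *:*-*)
            range=${region#*:}
            from=${range%%-*}
            to=${range#*-}
            ;;
        *)
            from=-
            to=-
            ;;
    esac
    echo "$name $from $to"
}

# Usage:
# parse_region 20:1000000-1100000
```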

Examples for fastq-dump:

This will dump all the raw sequences from SRR390728.sra. This is exactly analogous to using fastq-dump on an ordinary SRA object:

fastq-dump SRR390728.sra

This will dump all the sequences from primary alignments within SRR390728.sra:

fastq-dump SRR390728.sra --table PRIMARY_ALIGNMENT

Encryption and decryption of files

The SRA Toolkit includes encryption and decryption tools named vdb-encrypt and vdb-decrypt. The vdb-encrypt tool uses the password you set previously by running configuration-assistant.perl as the key to encrypt a named file. To encrypt our example file, SRR390728.sra, run:

vdb-encrypt SRR390728.sra

If you run vdb-encrypt (or vdb-decrypt) with only a file name, as above, the encrypted (or decrypted) file will replace the existing file without warning. You can write the result to a new file instead, as follows:

vdb-encrypt SRR390728.sra SRR390728Encrypted.sra
vdb-decrypt SRR390728Encrypted.sra SRR390728Decrypted.sra
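
Since the tools replace files without warning, a small wrapper can add a guard against overwriting an existing destination. This is a sketch only; encrypt_copy and the VDB_ENCRYPT variable are hypothetical names, and the wrapper relies on the two-argument form of vdb-encrypt shown above.

```shell
# Tool to call; override VDB_ENCRYPT if it is not on your PATH.
VDB_ENCRYPT=${VDB_ENCRYPT:-vdb-encrypt}

# Encrypt src into dst, refusing to overwrite an existing dst.
encrypt_copy() {
    src=$1
    dst=$2
    if [ ! -f "$src" ]; then
        echo "no such file: $src" >&2
        return 1
    fi
    if [ -e "$dst" ]; then
        echo "refusing to overwrite $dst" >&2
        return 1
    fi
    "$VDB_ENCRYPT" "$src" "$dst"
}

# Usage:
# encrypt_copy SRR390728.sra SRR390728Encrypted.sra
```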

Running vdb-decrypt is actually unnecessary in this example, and it will not decrypt the encrypted file. SRA Toolkit tools such as vdb-dump decrypt on the fly as they dump the data, so there is no point in decrypting an SRA file. In the above example, vdb-decrypt reads the file and, once it recognizes it as an SRA file, simply exits without warning. You can force the decryption by issuing the following command:

vdb-decrypt --force SRR390728Encrypted.sra

You can also encrypt and decrypt an entire directory. Both tools use the password (key) stored in the configuration file that was set up by running configuration-assistant.perl. If you encrypt a file and then change the password (key), decryption of that file will fail.