Reference Sequences
SRA objects that contain read placements in addition to raw reads require reference sequences in order to be interpreted. The SRA objects containing these reference sequences must be installed in the local environment before data can be dumped from the SRA objects that depend on them.
Getting from SRA
Users can either copy all referenced objects (<5Gb) or retrieve only the specific ones needed for extracting data:
- The SRA reference FTP site contains all referenced objects.
- The configuration-assistant.perl script (included in the SRA Toolkit distribution) can automatically obtain the reference sequences needed for a given SRA object from NCBI (internet access required).
Submitting to SRA
To process BAM data, SRA needs to know which reference sequences were used in the alignment. BAM files identify their references by a reference name and an optional assembly name. SRA can recognize the following combinations:
- INSDC accession.version (e.g. CM000663.1). No assembly name is needed in this case.
- a sequence name from a known assembly in the NCBI Assembly database
- the name of a sequence in a FASTA file provided as part of the submission
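
As a rough pre-submission check, the three cases above can be sketched in Python. The sketch reads the @SQ lines of a SAM-format header (for example, the text of a BAM header), where the SN tag holds the reference name and the AS tag holds the optional assembly name. The regular expression is only an approximation of the accession.version form, and the function names, category labels, and order of the checks are illustrative assumptions; this is not the procedure SRA itself uses, and no lookup against the NCBI Assembly database is performed.

```python
import re

# Approximate shape of an INSDC accession.version such as CM000663.1.
# Assumption: this pattern is a loose heuristic, not the archive's actual rule.
ACCESSION_VERSION = re.compile(r"^[A-Z]{1,6}\d{5,10}\.\d+$")


def parse_sq_lines(header_text):
    """Return (reference name, assembly name or None) for each @SQ header line."""
    refs = []
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        tags = {}
        for field in line.split("\t")[1:]:
            key, _, value = field.partition(":")
            tags[key] = value
        if "SN" in tags:
            refs.append((tags["SN"], tags.get("AS")))
    return refs


def classify_reference(name, assembly, fasta_names=frozenset()):
    """Guess which of the three cases a reference falls under (hypothetical labels)."""
    if ACCESSION_VERSION.match(name):
        return "accession"   # accession.version; no assembly name needed
    if assembly:
        return "assembly"    # would still need to be a known NCBI assembly
    if name in fasta_names:
        return "fasta"       # sequence supplied in the submitted FASTA file
    return "unresolved"


if __name__ == "__main__":
    header = (
        "@HD\tVN:1.6\n"
        "@SQ\tSN:CM000663.1\tLN:248956422\n"
        "@SQ\tSN:chr2\tLN:242193529\tAS:GRCh38\n"
        "@SQ\tSN:contig7\tLN:1000\n"
    )
    for name, assembly in parse_sq_lines(header):
        print(name, classify_reference(name, assembly, fasta_names={"contig7"}))
```

A reference reported as "unresolved" has neither an accession-like name, an assembly name, nor a matching FASTA entry, so it would likely need a FASTA file supplied with the submission.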