Scientific Supercomputing at the NIH

SOLiD

The Applied Biosystems SOLiD System comes with a comprehensive software suite. The SOLiD Software Suite provides tools for processing and analyzing data generated on the SOLiD Analyzer. It supports multiple applications, can be integrated with custom analysis pipelines, and performs both primary (image acquisition and quality control) and secondary (alignment to a reference genome, base calling, and SNP identification) analysis of fragment and mate-paired experiments.

Currently, the following SOLiD software tools are installed on Biowulf and Helix. Feel free to contact staff@helix.nih.gov if you are interested in tools that are not listed.

[WTP] [Corona]

Whole Transcriptome Analysis Pipeline (WTP)

A tool to map transcriptome reads to a reference genome, count tags for exons and genes, and view the data in the UCSC Genome Browser.

Location of WTP on Helix: /usr/local/solid/ab_wtp

Version: 1.1

Sample Session

Sample data can be copied from /usr/local/solid/ab_wtp_v1.0_testData.tar.

- Log in to biowulf.nih.gov.

- Create a test directory (e.g., /data/username/solid/test1) and unpack the sample data into it, for example:
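A minimal sketch of this step; it assumes the tar file unpacks its contents into the current directory (you can inspect it first with tar tf):

$ mkdir -p /data/username/solid/test1
$ cd /data/username/solid/test1
$ tar xf /usr/local/solid/ab_wtp_v1.0_testData.tar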

- Add java64 to your path and set the JAVA_HOME environment variable.

csh or tcsh users should add the following lines to the end of their /home/username/.cshrc file and then source the file:

set path=( /usr/local/java64/jdk/bin ${path} )

setenv JAVA_HOME /usr/local/java64/jdk

% source /home/username/.cshrc

bash/ksh/sh users should insert the following at the end of their .bashrc file and then source it:

export PATH=/usr/local/java64/jdk/bin:$PATH

export JAVA_HOME=/usr/local/java64/jdk

$ source /home/username/.bashrc
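To confirm that the 64-bit JDK is now being picked up, you can check which java is on your path and that JAVA_HOME is set (standard shell and Java commands, shown here only as a quick sanity check):

$ which java
$ echo $JAVA_HOME
$ java -version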

- Create a configuration file for SOLiD. You will need to modify it for your own data; the example below can be obtained from /usr/local/solid/test1/config.txt:

------------ Content of config.txt -------------------------

# biowulf cluster uses PBS scheduler
SCHEDULING_ENVIRONMENT pbs

# MAX_MEMORY_PER_JOB_IN_BYTES determines the size of the input data chunks and therefore the total number of jobs for each analysis. On the dual-core nodes, which have the largest /scratch areas (> 130 GB) of all node types, 3.9e9 has been found to be a good setting, while 4.9e9 results in some jobs failing.
NAME_OF_QUEUE norm
MAX_MEMORY_PER_JOB_IN_BYTES 3.9e9
MEMORY_REQUIREMENT_ADJUSTMENT_FACTOR 1

# Settings for PBS. Since the WTP analysis requires a 64-bit node, > 4 GB of memory, and a large working space, request dual-core nodes, which have 8 GB of memory and > 100 GB of /scratch area on each node.
SCHEDULER_RESOURCE_REQUIREMENTS nodes=1:dc

RUN_READ_SPLITTING true
RUN_READ_FILTERING true
RUN_REFERENCE_PARTITIONING true
RUN_MAPPING true
RUN_EXTENSION true
RUN_MERGE true

# FILTERING_MODE must be one of: OFF, ONE_OR_MORE, BOTH
FILTERING_MODE ONE_OR_MORE

COMPRESS_INTERMEDIATE_FILES false
DELETE_INTERMEDIATE_FILES false

LENGTH_OF_READS 50
MASK_OF_READS 46..50
MAX_MAPPING_LOCATIONS_ALLOWED_BEFORE_NOT_REPORTING_READ 10
VALID_ADJACENT_MISMATCHES_COUNT_AS_ONE_FOR_MAPPING_AND_EXTENSION true
MATCHES_TO_IUPACS_VALID_FOR_MAPPING_AND_EXTENSION false
LENGTH_OF_FIRST_PART_OF_READ 25
LENGTH_OF_LAST_PART_OF_READ 30
MAX_MISMATCHES_IN_FIRST_PART_OF_READ_FOR_MAPPING 2
MAX_MISMATCHES_IN_LAST_PART_OF_READ_FOR_MAPPING 2
MAX_MISMATCHES_IN_READ_PARTS_FOR_READ_FILTERING 2
MIN_ALIGNMENT_SCORE_FOR_REPORTING_ALIGNMENT 24
MIN_GAP_IN_ALIGNMENT_SCORE_TO_SECOND_BEST_ALIGNMENT_FOR_UNIQUENESS 4

FILE_FILTER_SEQUENCES_FASTA /data/username/solid/wttest/human_filter_reference.fasta
FILE_REFERENCE_FASTA /data/username/solid/wttest/reference.fa
FILE_FULL_LENGTH_READS_CSFASTA /data/username/solid/wttest/input/maqc.brain.first100kReads.csfasta
FOLDER_FOR_TEMPORARY_FILES_ON_COMPUTE_NODES /scratch/
FOLDER_FOR_OUTPUT_FILES /data/username/solid/wttest/out

------------------- End of config.txt ---------------------
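To start from the provided example, copy it into your test directory and edit the FILE_* and FOLDER_* entries to point to your own data. Creating the output directory ahead of time is a reasonable precaution (it is an assumption here that WTP does not create it for you):

$ cd /data/username/solid/test1
$ cp /usr/local/solid/test1/config.txt .
$ mkdir -p /data/username/solid/wttest/out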

- Create a script file under the test directory (/data/username/solid/test1). An example script file:

------------ Content of scriptfile -------------------

#!/bin/bash
#
# this file is scriptfile

#PBS -N solid
#PBS -k oe
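# (-N sets the job name; -k oe keeps the job's standard output and error files, which are delivered to your home directory)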

/usr/local/solid/ab_wtp_v1.1/bin/split_read_mapper.sh /data/username/solid/test1/config.txt

------------------- End of scriptfile ------------------------------

- Submit your job to the cluster from Biowulf, requesting dual-core nodes:

biowulf% qsub -l nodes=1:dc scriptfile
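Once submitted, the initial job and the per-stage jobs that the pipeline submits on your behalf can be monitored with the usual PBS commands, for example:

biowulf% qstat -u username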

Note that the reference sequence file will be split based on its size and the MAX_MEMORY_PER_JOB_IN_BYTES setting described above. The example reference above is small and will not be split, whereas a 3.8 GB reference may be split into 15 smaller files and run on 15 different nodes for each of the mapping, extension, and merging stages of the WTP analysis.

Documentation

http://solidsoftwaretools.com/gf/project/transcriptome/docman/