Biowulf at the NIH
MapSplice on Biowulf

Accurate mapping of RNA-seq reads for splice junction discovery.

MapSplice was developed at the University of Kentucky Bioinformatics Lab.

Certain environment variables must be set before running MapSplice. The easiest way to do this is with the modules commands, i.e. 'module load mapsplice', as in the example below.

biowulf% module avail mapsplice

-------------------- /usr/local/Modules/3.2.9/modulefiles ---------------------
mapsplice/1.15.2

biowulf% module load mapsplice

biowulf% module list
Currently Loaded Modulefiles:
  1) mapsplice/1.15.2

Submitting a single batch job
The 'module load mapsplice' command sets an environment variable called MSBIN, which can be used in the python commands as in the examples below.

1. You will either need to copy a MapSplice configuration file and edit it for your own needs, or put all the options on the command line. There are several sample configuration files in /usr/local/mapsplice/. Copy one of them and edit the appropriate sections to define the input files, reference genome, Bowtie indexes, etc.

cp /usr/local/mapsplice/paired.cfg /data/user/mydir/Run1.cfg

2. Create a batch script file similar to the one below:

#!/bin/bash
# This file is YourOwnFileName
#
#PBS -N yourownfilename
#PBS -m be
#PBS -k oe

module load mapsplice
cd /data/user/mydir
python $MSBIN/mapsplice_segments.py   Run1.cfg

3. Submit the script using the 'qsub' command on Biowulf, e.g.

[user@biowulf]$ qsub -l nodes=1 /data/username/theScriptFileAbove
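If you run the same analysis against several configuration files, the batch script can be generated from a template instead of edited by hand. A minimal sketch, assuming the directory layout used above (the config name and paths are placeholders):

```shell
# Generate a MapSplice batch script for a given config file.
# 'Run1.cfg' and /data/user/mydir are illustrative; adjust for your own data.
cfg=Run1.cfg
cat > mapsplice_job.sh <<EOF
#!/bin/bash
#PBS -N mapsplice_${cfg%.cfg}
#PBS -m be
#PBS -k oe

module load mapsplice
cd /data/user/mydir
python \$MSBIN/mapsplice_segments.py ${cfg}
EOF
```

The generated mapsplice_job.sh is then submitted with 'qsub -l nodes=1 mapsplice_job.sh' as shown above. Note that \$MSBIN is escaped in the here-document so that it is expanded at job runtime, after 'module load mapsplice' has set it.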

Submitting a swarm of jobs

Using the 'swarm' utility, one can submit many jobs to the cluster to run concurrently.

Set up a swarm command file (eg /data/username/cmdfile). Here is a sample file:

cd /data/user/mydir1; python $MSBIN/mapsplice_segments.py Run1.cfg
cd /data/user/mydir1; python $MSBIN/mapsplice_segments.py Run2.cfg
cd /data/user/mydir1; python $MSBIN/mapsplice_segments.py Run3.cfg
   [...]   
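If the runs follow a simple numbered pattern like the one above, the command file itself can be generated with a short loop. A sketch, assuming numbered config files Run1.cfg, Run2.cfg, ... in one directory (adjust the directory and run count to your data):

```shell
# Write one swarm command per numbered MapSplice config file.
# /data/user/mydir1 and the run count (3) are illustrative.
for i in 1 2 3; do
    echo "cd /data/user/mydir1; python \$MSBIN/mapsplice_segments.py Run${i}.cfg"
done > cmdfile
```

The \$MSBIN is escaped so that the literal string $MSBIN lands in cmdfile; swarm expands it when each command runs with the mapsplice module loaded.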

Submit this job with

swarm -f cmdfile --module mapsplice

By default, each line of the command file above will run on one core of a node, using up to 1 GB of memory. The Bowtie stage of MapSplice can run in multi-threaded mode. If you specify more than one thread for Bowtie (with '-X #' or '--threads #' on the command line, or 'threads=#' in the .cfg file), you must tell swarm how many threads each command will use via the '-t #' flag. For example, if you set '--threads 8', you should submit the swarm with:

swarm -t 8 -f cmdfile --module mapsplice

If each command requires more than 1 GB of memory, you must tell swarm the amount of memory required using the '-g #' flag to swarm. For example, if each mapsplice command (a single line in the file above) requires 10 GB of memory and you are running with 8 threads, you would submit the swarm with:

swarm -g 10 -t 8 -f cmdfile --module mapsplice

For more information regarding running swarm, see swarm.html

 

Running an interactive job

Users sometimes need to run jobs interactively. Such jobs should not be run on the Biowulf login node; instead, allocate an interactive node as described below and run the job there.

[user@biowulf] $ qsub -I -l nodes=1
qsub: waiting for job 2236960.biobos to start
qsub: job 2236960.biobos ready

[user@p4]$ cd /data/user/myruns
[user@p4]$ module load mapsplice
[user@p4]$ cd /data/userID/mapsplice/run1
[user@p4]$ python $MSBIN/mapsplice_segments.py -Q fq -o output_path -u file1.fastq -c /fdb/genome/hg19/chr_all.fa -b /usr/local/bowtie-indexes --threads 4 -L 18 2> output.log
[user@p4]$ exit
qsub: job 2236960.biobos completed
[user@biowulf]$

You may add a node property to the qsub command to request a specific type of interactive node. For example, if you need a node with 24 GB of memory to run a job interactively, do this:

[user@biowulf]$ qsub -I -l nodes=1:g24:c16

 

Documentation

http://www.netlab.uky.edu/p/bioinfo/MapSpliceManual