Biowulf at the NIH
R on Biowulf

R (the R Project) is a language and environment for statistical computing and graphics. R is similar to the award-winning S system, which was developed at Bell Laboratories by John Chambers et al. It provides a wide variety of statistical and graphical techniques (linear and nonlinear modelling, statistical tests, time series analysis, classification, clustering, ...) and is highly extensible. R provides an open-source alternative to S.

R is designed as a true computer language, with control-flow constructs for iteration and alternation, and it allows users to add functionality by defining new functions. For computationally intensive tasks, C, C++ and Fortran code can be linked and called at run time.
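As a brief illustration of these language features (the function and data below are made up for illustration, and nothing here is specific to Biowulf):

--------------------------------------------------------------------------
# A user-defined function with control flow: count how many
# values in 'x' fall above a threshold.
count.above <- function(x, threshold = 0) {
    n <- 0
    for (v in x) {            # iteration
        if (v > threshold) {  # alternation
            n <- n + 1
        }
    }
    n
}

count.above(rnorm(100), 1.5)  # count of standard-normal samples above 1.5
--------------------------------------------------------------------------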

Both 32-bit (R) and 64-bit (R64) versions of R are available on Biowulf; use whichever suits your requirements. R can be used on all nodes, but R64 runs only on the x86-64 nodes.

NOTE: R is not a parallel program. It is single-threaded, which means that it can only be run on 1 processor. Single, serial jobs are best run on your desktop machine or on Helix. There are two situations in which it is an advantage to run R on Biowulf:

   1. You need to run many independent R jobs (e.g. simulations) simultaneously, which can be submitted with the swarm utility described below.
   2. You are using an R package such as multicore, Rmpi or snow that parallelizes work across processors or nodes.

For basic information about setting up an R job, see the R documentation listed at the end of this page. Also see the Batch Queuing System in the Biowulf user guide.

Create a script such as the following:

                   script file /home/username/runR
--------------------------------------------------------------------------
#!/bin/tcsh
# This file is runR
#
# -N names the job, -m be sends mail at the beginning and end of the job,
# -k oe keeps the stdout and stderr files
#PBS -N R
#PBS -m be
#PBS -k oe
date

/usr/local/bin/R --vanilla < /data/username/R/Rtest.r > /data/username/R/Rtest.out
--------------------------------------------------------------------------

Submit the script using the 'qsub' command, e.g.

qsub -l nodes=1 /home/username/runR

The swarm program is a convenient way to submit large numbers of jobs. Create a swarm command file containing a single job on each line, e.g.

                 swarm command file /home/username/Rjobs
--------------------------------------------------------------------------
/usr/local/bin/R --vanilla < /data/username/R/R1 > /data/username/R/R1.out
/usr/local/bin/R --vanilla < /data/username/R/R2 > /data/username/R/R2.out
/usr/local/bin/R --vanilla < /data/username/R/R3 > /data/username/R/R3.out
/usr/local/bin/R --vanilla < /data/username/R/R4 > /data/username/R/R4.out
/usr/local/bin/R --vanilla < /data/username/R/R5 > /data/username/R/R5.out
....
--------------------------------------------------------------------------
Submit this by typing:
swarm -f /home/username/Rjobs
Swarm will create the PBS batch scripts and submit the jobs to the system. See the Swarm documentation for more information.

The multicore package has been installed on Biowulf. It provides functions for parallel execution of R code on machines with multiple cores or CPUs. Unlike other parallel-processing methods, all jobs share the full state of R when spawned, so no data or code needs to be initialized. The actual spawning is very fast as well, since no new R instance needs to be started.
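As a minimal sketch of typical multicore usage (the workload below is made up for illustration; mclapply is the package's parallel analogue of lapply):

--------------------------------------------------------------------------
library(multicore)

# Apply a toy simulation function to 100 inputs. Each call runs in a
# forked worker process that shares the parent's R state. mc.cores
# should match the number of processors on the node.
results <- mclapply(1:100,
                    function(i) mean(rnorm(1e6, mean = i)),
                    mc.cores = 4)
--------------------------------------------------------------------------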

On the Biowulf cluster, multicore is used to utilize all the processors on a node for a single R job. Users should be aware that the cluster includes single-core (2 processors per node) and dual-core (4 processors per node) nodes. When using multicore, it is simplest to always assume 4 processors per node and submit only to the dual-core ('dc') nodes.

If you are submitting a swarm of R jobs that each use multicore, each node should run only a single R command, since the multicore parallelization will utilize all the processors on that node. Thus, the swarm command should be:

swarm -n 1 -f myswarmfile -l nodes=1:dc

[multicore documentation]

Rmpi is a wrapper to the LAM implementation of MPI. [Rmpi documentation]
The package snow (Simple Network of Workstations) implements a simple mechanism for using a workstation cluster for "embarrassingly parallel" computations in R. [snow documentation]

Users who wish to use Rmpi or snow will need to add the LAM path to their .cshrc or .bashrc files, as below:

setenv PATH /usr/local/etc:/usr/local/lam/bin:$PATH (for csh or tcsh)
PATH=/usr/local/lam/bin:$PATH (for bash)

To run Rmpi on multiple nodes, LAM must be started on those nodes with the lamboot command before Rmpi is loaded. Any spawned Rmpi slaves must be shut down with mpi.close.Rslaves() or mpi.quit() before exiting R, and lamhalt must be run to shut down LAM before exiting the batch job.

Sample Rmpi batch script:

------- this file is myscript.bat--------------------------
#!/bin/csh
#PBS -j oe

cd $PBS_O_WORKDIR
lamboot $PBS_NODEFILE      # start LAM on the allocated nodes

/usr/local/bin/R --vanilla > myrmpi.out <<EOF
library(Rmpi)
mpi.spawn.Rslaves(nslaves=$np)            # np is set on the qsub command line
mpi.remote.exec(mpi.get.processor.name())
n <- 3
mpi.remote.exec(double, n)                # run double(3) on every slave
mpi.quit()

EOF

lamhalt                    # shut down LAM before the batch job exits
--------------------------------------------------------------

Sample batch script using snow:

------- this file is myscript.bat--------------------------
#!/bin/csh
#PBS -j oe

cd $PBS_O_WORKDIR
lamboot $PBS_NODEFILE      # start LAM on the allocated nodes

/usr/local/bin/R --vanilla > myrmpi.out <<EOF
library(snow)
cl <- makeCluster($np, type = "MPI")      # np is set on the qsub command line
clusterCall(cl, function() Sys.info()[c("nodename","machine")])
clusterCall(cl, runif, $np)
stopCluster(cl)
mpi.quit()
EOF

lamhalt                    # shut down LAM before the batch job exits
--------------------------------------------------------------

Either of the above scripts could be submitted with:

qsub -v np=4 -l nodes=2 myscript.bat
Note that it is entirely up to the user to run the appropriate number of processes for the nodes requested. In the example above, the np variable is set to 4 and exported via the qsub command's -v flag, and this variable is used in the script to spawn 4 slave processes on 2 dual-CPU nodes.

Production runs should be submitted as batch jobs, as above, but for testing purposes an occasional interactive run may be useful.

Sample interactive session with Rmpi (user commands follow the '$' and '>' prompts):

[user@biowulf ~]$ qsub -I -l nodes=2
qsub: waiting for job 136623.biobos to start
qsub: job 136623.biobos ready

[user@p227 ~]$ lamboot $PBS_NODEFILE

LAM 7.1.2/MPI 2 C++/ROMIO - Indiana University

[user@p227 ~]$ lamexec C hostname        # just checking hostnames
p227
p228
[user@p227 ~]$ R

R : Copyright 2006, The R Foundation for Statistical Computing
Version 2.3.1 (2006-06-01)
ISBN 3-900051-07-0

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

[Previously saved workspace restored]

> library(Rmpi)
> mpi.spawn.Rslaves(nslaves=4)
        4 slaves are spawned successfully. 0 failed.
master (rank 0, comm 1) of size 5 is running on: p227
slave1 (rank 1, comm 1) of size 5 is running on: p227
slave2 (rank 2, comm 1) of size 5 is running on: p228
slave3 (rank 3, comm 1) of size 5 is running on: p227
slave4 (rank 4, comm 1) of size 5 is running on: p228
> demo("simplePI")

        demo(simplePI)
        ---- ~~~~~~~~

Type  <Return>   to start :

> simple.pi <- function(n, comm = 1) {
    mpi.bcast.cmd(n <- mpi.bcast(integer(1), type = 1, comm = .comm),
        comm = comm)
    mpi.bcast(as.integer(n), type = 1, comm = comm)
    mpi.bcast.cmd(id <- mpi.comm.rank(.comm), comm = comm)
    mpi.bc .... [TRUNCATED]
> simple.pi(100000)
[1] 3.141593
> mpi.quit()                      #very important
[user@p227 ~]$ lamhalt

LAM 7.1.2/MPI 2 C++/ROMIO - Indiana University

[user@p227 ~]$ exit
logout

qsub: job 136623.biobos completed

Sample interactive session with snow (user commands follow the '$' and '>' prompts):

[user@biowulf ~]$ qsub -I -l nodes=2
qsub: waiting for job 136706.biobos to start
qsub: job 136706.biobos ready

[user@p227 ~]$ lamboot $PBS_NODEFILE

LAM 7.1.2/MPI 2 C++/ROMIO - Indiana University

[user@p227 ~]$ R

R : Copyright 2006, The R Foundation for Statistical Computing
Version 2.3.1 (2006-06-01)
ISBN 3-900051-07-0

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

[Previously saved workspace restored]

> library(snow)
> cl <- makeCluster(4, type = "MPI")
Loading required package: Rmpi
        4 slaves are spawned successfully. 0 failed.
> clusterCall(cl, function() Sys.info()[c("nodename","machine")])
[[1]]
nodename  machine 
  "p227" "x86_64" 

[[2]]
nodename  machine 
  "p228" "x86_64" 

[[3]]
nodename  machine 
  "p227" "x86_64" 

[[4]]
nodename  machine 
  "p228" "x86_64" 

> sum(parApply(cl, matrix(1:100,10), 1, sum))
[1] 5050
> system.time(unlist(clusterApply(cl, splitList(x, length(cl)),
+                                 qbeta, 3.5, 4.1)))
[1] 0.017 0.000 0.022 0.000 0.000
> clusterCall(cl, runif, 3)
[[1]]
[1] 0.01032138 0.62865716 0.62550058

[[2]]
[1] 0.01032138 0.62865716 0.62550058

[[3]]
[1] 0.01032138 0.62865716 0.62550058

[[4]]
[1] 0.01032138 0.62865716 0.62550058

> stopCluster(cl)
[1] 1
> mpi.quit()
[user@p227 ~]$ lamhalt

LAM 7.1.2/MPI 2 C++/ROMIO - Indiana University
[user@p227 ~]$ exit
logout

qsub: job 136706.biobos completed
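Note that in the snow session above, clusterCall(cl, runif, 3) returned the same numbers on every slave: each spawned R process starts from the same default random seed. If your parallel job depends on random numbers, the slaves' streams should be made independent, for example with snow's clusterSetupRNG function. A minimal sketch (assuming the rlecuyer or rsprng package is installed, which clusterSetupRNG requires):

--------------------------------------------------------------------------
library(snow)
cl <- makeCluster(4, type = "MPI")

# Give each slave an independent random-number stream before use.
clusterSetupRNG(cl)
clusterCall(cl, runif, 3)   # results now differ across slaves

stopCluster(cl)
--------------------------------------------------------------------------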