NIH Helix Systems
Steven Fellini
sfellini@nih.gov
CIT
1 Dec 2005
This page is at
http://biowulf.nih.gov/easy.html
The Biowulf Home Page is at
http://biowulf.nih.gov
Helix (SGI)                                     | Biowulf (cluster)
one computer system with CPUs, memory and disks | many systems (nodes)
proprietary hardware and software               | commodity hardware and open software (Linux)
moderate number of CPUs (8-32)                  | 2000+ CPUs
shared memory                                   | distributed memory
large memory (8-32 GB)                          | smaller memory (1-4 GB)
computation on login system                     | computation on computational nodes
system runs several applications simultaneously | node dedicated to one computation
interactive                                     | queuing system (batch)
nodes (2p) | processors                      | memory      | networks
805        | AMD Opteron 2.8, 2.2 & 2.0 GHz | 2 & 4 GB    | Infiniband, Myrinet & Gigabit ethernet
388        | Intel Xeon 2.8 GHz             | 1, 2 & 4 GB | Myrinet, Gigabit ethernet & Fast ethernet
203        | AMD Athlon 1.8 & 1.4 GHz       | 1 & 2 GB    | Myrinet & Fast ethernet
Location           | Type          | Creation             | Backups | Performance | Amount of Space                              | Accessible from (*)
/home              | network (NFS) | with Biowulf account | yes     | high        | 200 MB (quota)                               | B,C
/scratch (nodes)   | local         | created by user      | no      | best        | 6 - 30 GB, dedicated while node is allocated | C
/scratch (biowulf) | network (NFS) | created by user      | no      | low         | 120 GB, shared                               | B,H,N
/data              | network (NFS) | with Biowulf account | yes     | high        | based on quota (48 GB default)               | B,C,H,N
[steve@biobos steve]$ ls -l foo.tmp
-rw-r--r--  1 steve  wheel  2 Mar 11  2005 foo.tmp
[steve@biobos steve]$ rm foo.tmp
[steve@biobos steve]$ ls -l foo.tmp
ls: foo.tmp: No such file or directory
[steve@biobos steve]$ cd .snapshot
[steve@biobos .snapshot]$ ls
_hourly.0   _hourly.3   _nightly.0   _nightly.11  _nightly.2  _nightly.5  _nightly.8  _weekly.1
_hourly.1   _hourly.4   _nightly.1   _nightly.12  _nightly.3  _nightly.6  _nightly.9  _weekly.2
_hourly.2   _hourly.5   _nightly.10  _nightly.13  _nightly.4  _nightly.7  _weekly.0   _weekly.3
[steve@biobos .snapshot]$ cd _nightly.0
[steve@biobos _nightly.0]$ ls -l foo.tmp
-rw-r--r--  1 steve  wheel  2 Mar 11  2005 foo.tmp
[steve@biobos _nightly.0]$ cp foo.tmp /home/steve
Not Suitable:
Phylogenetic/Linkage Analysis
Open a connection to biowulf.nih.gov (or helix.nih.gov)
Change directory to /data/username/
Put your files into that directory.
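The three steps above can be sketched from the command line. This is a hedged sketch, not from the original page: it assumes scp is available on your desktop, and "username" and the FASTA file name are placeholders for your own login and files.

```shell
# Hypothetical staging example for EasyBlast input.
# "username" and "myseqs.fasta" are placeholders -- substitute your own.
USER_NAME="username"
DEST="/data/${USER_NAME}/blast/myseqs"     # directory EasyBlast will read from

# From your desktop, copy the sequence files up with scp. Shown as a dry run
# (echo just prints the command); drop the leading 'echo' to actually transfer.
echo scp myseqs.fasta "${USER_NAME}@biowulf.nih.gov:${DEST}/"
```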
biobos% easyblast

EasyBlast: Blast for large numbers of sequences

Enter the directory which contains your input sequences: /data/username/blast/myseqs
Enter the directory where you want your Blast output to go: /data/username/blast/results

** WARNING: There are already files in /data/username/blast/results
   which will be deleted by this job. **
Continue? (y/n) : y

BLAST programs:
blastn   - nucleotide query sequence against nucleotide database
blastp   - protein query sequence against protein database
blastx   - nucleotide query translated in all 6 reading frames against a protein database
tblastn  - protein query sequence against a nucleotide database translated in all 6 reading frames
tblastx  - 6-frame translations of a nucleotide query sequence against the 6-frame translations
           of a nucleotide database
blastpgp - PSI-BLAST protein query against protein database

Which program do you want to run: blastn

The following nucleotide databases are available:
(or enter your own database with full pathname)
nt            - all nonredundant Genbank+EMBL+DDBJ+PDB (no EST, STS, GSS or HTG)
hs_genome     - human genome assembly (Build 33, 14 Apr 2003)
est_human     - nonredundant Genbank+EMBL+DDBJ EST human sequences
est_mouse     - nonredundant Genbank+EMBL+DDBJ EST mouse sequences
est_others    - nonredundant Genbank+EMBL+DDBJ EST all other organisms
patnt         - from the patent division of Genbank
pdbnt         - from the 3-dimensional structures
htgs          - high throughput genome sequences
ecoli.nt      - ecoli genomic sequences
mito.nt       - mitochondrial sequences
yeast.nt      - yeast (Saccharomyces cerevisiae) genomic sequences
drosoph.nt    - drosophila sequences
hs.fna        - RefSeq human sequences
other_genomic - non-human genomic sequences
mouse_genome  - mouse genome
mouse_masked  - mouse genome, masked

Database to run against: nt
Want a summary file in the output directory? (y/n, default y) : n

http://biowulf.nih.gov/blast.html has a full list of available parameters.
Any additional Blast parameters (e.g. -v 10):

Checking node situation....
Submitting to 20 nodes.
Job number is 709061.biobos
Monitor your job at http://biowulf.nih.gov/cgi-bin/queuemon?709061.biobos
Use the URL that EasyBlast gives you to watch your job run. You can see the full range of Biowulf monitors at http://biowulf.nih.gov/sysmon/
Blue: no load
Green: load ~ 1
Yellow: load ~ 2 (i.e. fully utilized)
Red: load > 2 (problem?)
What to expect:
Contact the helix staff (staff@helix.nih.gov or 4-6248) if you have questions or concerns about your job.
What EasyBlast does
Note: To run against your own database, enter its full pathname. e.g.
Database to run against: /data/username/blast_db/my_own_db

where my_own_db is a Blast database formatted with the formatdb program (available as /usr/local/blast/formatdb).
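A hedged sketch of the formatting step, assuming standard formatdb options: "my_seqs.fasta" and "my_own_db" are placeholder names, and the guard skips the step on machines that don't have formatdb installed at the path given above.

```shell
# Hypothetical example: building a custom nucleotide Blast database.
# File and database names are placeholders; substitute your own.
FORMATDB=/usr/local/blast/formatdb
if [ -x "$FORMATDB" ]; then
    # -i: input FASTA file; -p F: nucleotide, not protein; -n: database name
    "$FORMATDB" -i my_seqs.fasta -p F -n my_own_db
fi
```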
Blastn
Query: 1000 nucleotide EST sequences
Database: NCBI nt nucleotide database (1,431,631 sequences, 1.8 Gb)
Serial Blast runs with GCG's Netblast against the NCBI server: ~18 hrs
Silicon Graphics R14000, 4 processors (nimbus.nih.gov): 5.5 hrs
10 Biowulf p2800 nodes: 33 mins
This program is NOT parallelized on Biowulf. The advantage of using the Biowulf cluster is to run Repeatmasker on a large number of sequences simultaneously, one batch per node.
repeatmasker NM_00110*
repeatmasker NM_00111*
repeatmasker NM_00112*
repeatmasker NM_00113*
repeatmasker NM_00114*
repeatmasker NM_00115*
repeatmasker NM_00116*
repeatmasker NM_00117*
repeatmasker NM_00118*
repeatmasker NM_00119*
swarm -f swarmcmd
#
# this file is cmdfile
#
myprog -param a < infile-a > outfile-a
myprog -param b < infile-b > outfile-b
myprog -param c < infile-c > outfile-c
myprog -param d < infile-d > outfile-d
myprog -param e < infile-e > outfile-e
myprog -param f < infile-f > outfile-f
myprog -param g < infile-g > outfile-g
2. Submit the job via the 'swarm' command.
swarm -f cmdfile
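For long runs, the command file can be generated with a loop instead of typed by hand. A minimal sketch, using the placeholder names from the cmdfile above:

```shell
# Generate a swarm command file, one 'myprog' invocation per parameter.
# "myprog", the parameters, and the file names are placeholders.
CMDFILE=cmdfile
: > "$CMDFILE"                 # start with an empty file
for p in a b c d e f g; do
    echo "myprog -param $p < infile-$p > outfile-$p" >> "$CMDFILE"
done
# then submit with:  swarm -f cmdfile
```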
#!/bin/bash -v
# This file name is my_script
#
#PBS -N run1
#PBS -m be
#PBS -k oe

PATH=/usr/local/mpich/bin:$PATH; export PATH
mpirun -machinefile $PBS_NODEFILE -np $np ght < test.in
qsub -l nodes=1 my_script

You can test the job interactively:
biobos% qsub -I -l nodes=1
qsub: waiting for job 664776.biobos to start
qsub: job 664776.biobos ready

[user@p2 ~]$ cd /data/username/mydir
[user@p2 mydir]$ setenv PATH /usr/local/mpich/bin:$PATH
[user@p2 mydir]$ mpirun -machinefile $PBS_NODEFILE -np 1 /usr/local/bin/ght < test.in

************************************************************************
*                                                                      *
*       GENEHUNTER-TWOLOCUS - A modified version of GENEHUNTER         *
*                            (version 1.3)                             *
*                                                                      *
************************************************************************

Type 'help' or '?' for help.
Can't find help file - detailed help information is not available.
See installation instructions for details.
running on 1 nodes
npl:1> 'photo' is on: file is 'two02.out'
npl:2> Fri Nov 25 13:54:28 2005
npl:3> Single point mode is now 'off'
npl:4> Count recs is now 'off'
npl:5> Haplotype output is now 'off'
npl:6> Unaffected children are now used.
npl:7> Currently analyzing a maximum of 9 bits per pedigree
npl:8> Large pedigrees are now used but trimmed.
npl:9> The current analysis type is 'BOTH'
[...]
[user@p2 /data/user/mydir]$ exit
logout
qsub: job 668223.biobos completed
[susanc@biobos ~]$
You must exit the node after an interactive run!