NASA Center for Computational Sciences

NASA/NCCS LNXI/Discover USER FAQ


Index


Q1: What modules are loaded by default on login?
Q2: Which MPI modules are available?
Q3: How do I get Scali MPI (mpi/scali) to work with Fortran 90?
Q4: How do I use ncpus in PBS?
Q5: How do I control the number of processes used by MPI applications in PBS?
Q6: What does the mpirun option "-q" do?
Q7: The Intel Math Kernel Library, MKL, doesn't seem to work for me.
Q8: Where do I find BLAS, LAPACK, etc.?
Q9: How do I handle undefined references in libcxa?
Q10: What default symbols and tokens should I use on Discover for conditional compilation?
Q11: What's the problem with feupdateenv?
Q12: My job quit with the error "process-0 terminated abnormally" being printed out by mpimon. How can I fix this?
Q13: The permissions I have set with umask don't seem to be respected by mpirun/PBS. How do I get around this?
Q14: Where can I see the stdout/stderr files produced by PBS during the course of a run?
Q15: My application generates a lot of data on stdout, how should I handle this?
Q16: Why does a 1-node, 4-core run using MKL seem to be running 16 threads?
Q17: How should I configure the environment to optimally use swap?
Q18: I am asked for my password a second time, or I am asked to change it, but after I go through the process it doesn't seem to change. What do I do?
Q19: Will the tcsh command line length be increased beyond its current 4K limit?
Q20: Will you install PAPI?
Q21: How do I determine how much data storage (disk space) I am using or have left in my quota-controlled directories?
Q22: When I ssh to a compute node from the login node, I get prompted for a password, but I can't seem to log in. Is something wrong?
Q23: How do I use Etnus's Totalview?
Q24: Is there scratch space like Halem's /scr on Discover?
Q25: I need to run "cvs update" at the beginning of my batch job. How do I execute "cvs update" from batch?
Q26: When I run "mpirun -n <number-of-processes> <myProgram>", why do I get a license check failure?
Q27: What is LDAP?
Q28: How do I list all the PBS queues on Discover?
Q29: How do I get the details of a particular queue?

Q1: What modules are loaded by default on login?

A1. No default modules are loaded upon login. Explicitly load the modules that you need.


Q2: Which MPI modules are available?

A2. Please invoke 'module avail' to determine which MPI implementations are installed and available for testing and production.


Q3: How do I get Scali MPI (mpi/scali) to work with Fortran 90?

A3. At present, Scali MPI does not have a Fortran 90 interface. Applications that contain "use mpi" will need to use "include 'mpif.h'" instead.


Q4: How do I use ncpus in PBS?

A4. For PBS, use "select" and "ncpus". For example, in an interactive batch session:

qsub -V -I -l select=4:ncpus=4,walltime=1:00:00

The above requests 4 nodes and 4 processors per node.

mpirun -np 4 ...

This will launch 4 processes. Do not use the "-npn" option of mpirun. See also Q5 on controlling the number of MPI processes.

The -V option of qsub ensures your environment is exported to the PBS session.


Q5: How do I control the number of processes used by MPI applications in PBS?

A5. To ready a portion of the cluster for an application run, PBS and MPI must work together, and our aim is to have the different MPI implementations show the end user the same behavior. Therefore, please use PBS's "-l select=<number-of-nodes>:ncpus=<cpus-per-node>" to request nodes and CPUs, and mpirun's "-np <number-of-processes>" to set the number of MPI processes to start. Do not use mpirun's -npn option. Summarizing:

# Do this (or the like via qsub)...
#PBS -l select=<number-of-nodes>:ncpus=<cpus-per-node>

...

# And do this...
mpirun -np <number-of-processes> ...

In general, <number-of-processes> can range from 1 to the product <number-of-nodes> x <cpus-per-node>.

For MPI jobs, the process is to request nodes and processors through PBS and to use "mpirun" to launch jobs. In both places, it is possible to optionally indicate that one wants more than one process per node. It is important to recognize that if one exercises this option in both places, one can unintentionally try to oversubscribe each node.
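
As a rough illustration of the arithmetic above, the following bash sketch checks a proposed "mpirun -np" value against a PBS request. The function names max_mpi_procs and np_fits_request are hypothetical helpers of our own, not part of PBS or any MPI implementation, and the values 4 and 4 simply mirror the select=4:ncpus=4 example in Q4.

```shell
#!/bin/bash
# Sketch: sanity-check an MPI process count against a PBS request.
# "nodes" and "ncpus" mirror the two values given to "-l select=...:ncpus=...".
# Both function names are hypothetical helpers, not PBS or MPI commands.

# Print the largest process count the request can hold (nodes x ncpus).
max_mpi_procs() {
    local nodes=$1 ncpus=$2
    echo $(( nodes * ncpus ))
}

# Return 0 if np lies between 1 and nodes x ncpus, and 1 otherwise.
np_fits_request() {
    local np=$1 nodes=$2 ncpus=$3
    [ "$np" -ge 1 ] && [ "$np" -le "$(max_mpi_procs "$nodes" "$ncpus")" ]
}

# Example: a select=4:ncpus=4 request can hold up to 16 processes.
if np_fits_request 16 4 4; then
    echo "np=16 fits select=4:ncpus=4"
fi
```

A check like this could sit near the top of a job script, before the mpirun line, to catch an oversubscribed request early.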

If you require assistance implementing a particular topology of MPI processes, please contact NCCS user support.


Q6: What does the mpirun option "-q" do?

A6. It suppresses default warning messages.


Q7: The Intel Math Kernel Library, MKL, doesn't seem to work for me.

A7. The major version of the MKL module must match the major version of the compiler module, e.g. lib/mkl-9.0.017 for Intel's version 9 compilers (comp/intel-9...).
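
One way to check the match before linking is to compare the major version numbers embedded in the two module names. The sketch below assumes module names of the form <category>/<name>-<major>.<minor>...; the helpers module_major and same_major, and the compiler module name comp/intel-9.1.036, are hypothetical and used only for illustration.

```shell
#!/bin/bash
# Sketch: compare the major versions embedded in two module names.
# Assumes names shaped like "<category>/<name>-<major>.<minor>...".
# The helper names and the compiler module name below are hypothetical.

module_major() {
    local version=${1##*-}     # drop everything up to the last "-", e.g. "9.0.017"
    echo "${version%%.*}"      # keep the text before the first ".", e.g. "9"
}

# Return 0 when both module names carry the same major version.
same_major() {
    [ "$(module_major "$1")" = "$(module_major "$2")" ]
}

if same_major "lib/mkl-9.0.017" "comp/intel-9.1.036"; then
    echo "MKL and compiler major versions match"
fi
```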


Q8: Where do I find BLAS, LAPACK, etc.?

A8. Load the module for Intel's Math Kernel Library (lib/mkl-9...) and link against it.


Q9: How do I handle undefined references in libcxa?

A9. If during link you see:

/usr/local/intel/fce/8.1.032/lib/libcxa.so.5: undefined reference to `_uw_parse_lsda_info'

/usr/local/intel/fce/8.1.032/lib/libcxa.so.5: undefined reference to `_dw2_size_of_encoded_value'

/usr/local/intel/fce/8.1.032/lib/libcxa.so.5: undefined reference to `_dw2_read_encoded_value'

/usr/local/intel/fce/8.1.032/lib/libcxa.so.5: undefined reference to `_ReadULEB'

/usr/local/intel/fce/8.1.032/lib/libcxa.so.5: undefined reference to `_ReadSLEB'

Try adding "-lunwind" to your LDFLAGS.


Q10: What default symbols and tokens should I use on Discover for conditional compilation?

A10. If your code relies on predefined preprocessor symbols such as __ia64__ (for Itanium, for example), then use __x86_64__ on Discover (EM64T). For example, a line like

#if ((defined(__ia64__) || defined(__i386__)) && defined(__linux__) )

can be replaced with

#if ((defined(__ia64__) || defined(__i386__) || defined(__x86_64__)) && defined(__linux__) )


Q11: What's the problem with feupdateenv?

A11. If you see the following warning, add -i_dynamic to your ifort options.

/usr/local/intel/fce/9.1.036/lib/libimf.so: warning: warning: feupdateenv is not implemented and will always fail


Q12: My job quit with the error "process-0 terminated abnormally" being printed out by mpimon. How can I fix this?

A12. If you see

--- mpimon --- Aborting run after process-0 terminated abnormally Childprocess 4308 got signal SIGSEGV(11): segmentation violation ---

while running, try the following.

  1. Reset the stack size limit to a large value; "unlimited" will do. We routinely do this on other systems too.
  2. Have mpimon, which mpirun launches, inherit the newly defined limit by passing -inherit_limits, i.e.

mpirun <mpirun_options> -inherit_limits <other_mpimon_options> <program_name> <program_options>
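
A minimal bash sketch of step 1 follows. The helper name raise_stack_limit is hypothetical, and falling back to the hard limit when "unlimited" is refused is our own assumption rather than site policy. In csh or tcsh the corresponding command is "limit stacksize unlimited".

```shell
#!/bin/bash
# Sketch: raise the stack limit before launching mpirun.
# "raise_stack_limit" is a hypothetical helper name. Falling back to the hard
# limit when "unlimited" is refused is an assumption, not site policy.

raise_stack_limit() {
    ulimit -s unlimited 2>/dev/null || ulimit -s "$(ulimit -H -s)"
}

raise_stack_limit
echo "stack limit now: $(ulimit -s)"

# Then launch with the option from step 2 so that mpimon inherits the limit:
#   mpirun -np 4 -inherit_limits ./my_program    # placeholder program name
```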


Q13: The permissions I have set with umask don't seem to be respected by mpirun/PBS. How do I get around this?

A13. PBS honors umask, but files created from within an mpirun job do not honor it. Run chmod after your mpirun command to ensure that any files created during the run have the permissions you want.
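
The chmod step might look like the bash sketch below. The helper name fix_output_perms, the output directory, and the mode 640 are illustrative assumptions; substitute the directory your run writes to and the permissions you need.

```shell
#!/bin/bash
# Sketch: reset permissions on files written during an mpirun step.
# "fix_output_perms", the directory, and the mode are illustrative assumptions.

fix_output_perms() {
    local dir=$1 mode=$2
    # Apply the mode to regular files only; directories are left alone.
    find "$dir" -type f -exec chmod "$mode" {} +
}

# In a job script this would follow the mpirun line, for example:
#   mpirun -np 4 ./my_program              # placeholder program name
#   fix_output_perms "$PWD/output" 640     # placeholder directory and mode
```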


Q14: Where can I see the stdout/stderr files produced by PBS during the course of a run?

A14. /discover/pbs_spool is a 200 GB GPFS filesystem that serves as a globally visible spool directory. The local spool directory on every compute node is now a symbolic link that points to this global spool directory. You should be able to monitor a job's stdout/stderr by going to this directory and finding the appropriate files by their job IDs. As with the SGIs, users should not edit or remove any files in this directory, or unpredictable things may happen. This directory is for PBS use and can be used to monitor jobs; any non-PBS files that show up there are subject to deletion at any time and without warning. [fr. WMP]

The intermediate output files have names such as <job-number>.<node-of-submission>.OU, for example:

userid@discover01:/discover/pbs_spool> ls

1008.borgmg.OU   1224.borgmg.OU   1249.borgmg.OU   1390.borgmg.OU  1628.borgmg.OU
1036.borgmg.OU   1225.borgmg.OU   1256.borgmg.OU   1396.borgmg.OU  1705.borgmg.OU
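
To follow one of these files while a job runs, something like the bash sketch below may help. The helper name spool_stdout_path is hypothetical; the directory and the ".OU" suffix are taken from the listing above, and the job number and node name are just the first entry in that listing.

```shell
#!/bin/bash
# Sketch: build the spool-file path for a job so it can be watched with tail.
# The directory and ".OU" suffix come from the listing above.
# "spool_stdout_path" is a hypothetical helper name.

spool_stdout_path() {
    local jobnum=$1 node=$2
    echo "/discover/pbs_spool/${jobnum}.${node}.OU"
}

# Read-only monitoring of a running job (never edit or remove the file):
#   tail -f "$(spool_stdout_path 1008 borgmg)"
spool_stdout_path 1008 borgmg
```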


Q15: My application generates a lot of data on stdout, how should I handle this?

A15. Write large text files to /nobackup/<username>/; do not use stdout. The PBS output spool, /discover/pbs_spool, is not set up for I/O performance or for handling large stderr/stdout files. It is expected that small amounts of text-only output will be written here (and moved back to submission directories at the conclusion of a job). If users have large text I/O requirements, they should write directly to a file under /nobackup/<username>/ and not use stdout. [fr. WMP]
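
If an application can only print to stdout, shell redirection is one way to keep its output out of the spool. The bash sketch below assumes a hypothetical helper named run_logged and a placeholder program name; $NOBACKUP is the environment variable described in Q24.

```shell
#!/bin/bash
# Sketch: send a command's stdout and stderr to a log file instead of the PBS spool.
# "run_logged" is a hypothetical helper; the program name below is a placeholder.

run_logged() {
    local logfile=$1
    shift
    # Run the remaining arguments as a command, capturing both streams.
    "$@" > "$logfile" 2>&1
}

# In a job script, for example:
#   run_logged "$NOBACKUP/run.log" mpirun -np 4 ./my_program
```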


Q16: Why does a 1-node, 4-core run using MKL seem to be running 16 threads?

A16. The default value of the environment variable OMP_NUM_THREADS may be set so that each MKL process tries to run a thread on every core; for example, 4 processes that each start 4 threads put 16 threads on 4 cores. Check the value of OMP_NUM_THREADS and set it to 1, using export in bash or setenv in csh/tcsh, as in the following.

export OMP_NUM_THREADS=1

setenv OMP_NUM_THREADS 1


Q17: How should I configure the environment to optimally use swap?

A17. Do not use swap. Node performance does not degrade gracefully while swapping. We recommend configuring applications so that they use less than about 3.2 to 3.5 GB. The compute nodes have 4 GB of physical memory, but some is required by the operating system. For a little extra protection, a user may use limit or ulimit to set limits on virtual memory use so that jobs requesting more than these limits will be terminated. Otherwise, jobs that incur swapping spend one's supercomputer allocation in I/O wait states. In bash, one can invoke

ulimit -v 3145728 -m 3145728

to set the limit to 3 GB (in units of KB), or in csh try the following.

% limit vmemoryuse 3145728

% limit memoryuse 3145728


Q18: I am asked for my password a second time, or I am asked to change it, but after I go through the process it doesn't seem to change. What do I do?

A18. We are studying this problem. In the meantime, once you reach the interactive prompt on Discover after login, try changing your password by invoking passwd at the shell prompt. Invoking passwd manually seems to work around this problem.


Q19: Will the tcsh command line length be increased beyond its current 4K limit?

A19. We have no plans to increase the shells' command line lengths.


Q20: Will you install PAPI?

A20. No.


Q21: How do I determine how much data storage (disk space) I am using or have left in my quota-controlled directories?

A21. Please try the command:

% showquota


Q22: When I ssh to a compute node from the login node, I get prompted for a password, but I can't seem to log in. Is something wrong?

A22. You cannot log in this way; this login mode is disabled. The behavior you see is correct. In contrast, you can ssh from one compute node to another, which you can test in a job in an interactive PBS queue.


Q23: How do I use Etnus's Totalview?

A23. To use Totalview in an interactive batch session, try the following.

  1. Compile your code with the "-g" option to ensure source level debugging.
  2. Set up ssh keys for passwordless connection to the nodes.
  3. Set up the Totalview environment, for example:
       module load tool/tview-8.0.0.0
    
  4. If you are running MPI across more than one node, set the environment variable TVDSVRLAUNCHCMD to ssh.
       export TVDSVRLAUNCHCMD=ssh
    or
       setenv TVDSVRLAUNCHCMD ssh
    
  5. Submit the job with "qsub -V -I ...", so that the DISPLAY environment is passed into the PBS job environment.
  6. There are several ways to launch Totalview.
    • For MPI code using "mpi/scali-5.3", launch Totalview as follows.
         mpirun -tv -np <number-of-processes> <your-executable>
      
      
      The -tv option tells Scali MPI to run under Totalview.
    • For sequential code, run Totalview as follows.
         totalview <your-executable>
      
    • For OpenMP code, set the OMP_NUM_THREADS environment variable to the desired number of threads, four or fewer for our 4-core nodes, and launch as follows.
      
         totalview <your-executable>
      


Q24: Is there scratch space like Halem's /scr on Discover?

A24. Yes. The "nobackup" file system is intended for large amounts of data needed at run time. Files on the "nobackup" file systems are not backed up, but neither are they "skulked" or deleted. The NOBACKUP environment variable holds the location of your nobackup directory, for example:

   cd $NOBACKUP
is equivalent to
   cd /discover/nobackup/<username>

For longer term large-file storage that backs up to tape, please use the DMF system, Dirac. The "/discover/home" directories are backed up regularly.


Q25: I need to run "cvs update" at the beginning of my batch job. How do I execute "cvs update" from batch?

A25. You can run "cvs update" to access files outside of the cluster by running a script in the "datamove" queue. The "datamove" queue is not for compute-intensive work, but you can use PBS's ability to make one job dependent on another, e.g. the qsub option "-W depend=<dependency_list>", so that the compute job runs after the data-mover job finishes. Here is a script that illustrates this idea.

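    #!/bin/bash
    # Dependent script example.
    # File name: dep-example
    # Usage:  ./dep-example
    #
    # Submit the data-mover job and capture the job ID that qsub prints.
    jid=$(qsub -q datamove ./data-mover-script)
    # Submit the compute job so that it starts only if the data-mover job succeeds.
    # qsub options must come before the script name.
    qsub -q general_high -W depend=afterok:${jid} ./compute-driver-script
    # dep-example end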
   

You may need to include your group information on qsub's command line, e.g. "qsub -W group_list=<my-group-id> ...".


Q26: When I run "mpirun -n <number-of-processes> <myProgram>", why do I get a license check failure?

A26. Please use "-np" instead of "-n". Scali MPI's mpirun is a compatibility wrapper for their mpimon. Scali's mpirun supports "-np" for specifying the number of processes, but "-n" gets passed to mpimon where it is interpreted as Scali MPI's "-network" option for network devices. The error looks like the following.

Apr 9 12:27:06: (xhello-0@borga127)(9691) Error: [0] No valid network connection from 0 to 1
- License check for network 6 (dat3pt 1.00) failed: Feature dat3pt, version 1.00 not found in "/opt/scali/etc/license.dat"

Feature dat3pt, version 1.00 not found
- Contact license@scali.com to request or check a license


Q27: What is LDAP?

A27. LDAP (Lightweight Directory Access Protocol) is an application protocol for querying and modifying directory services over TCP/IP. The NCCS uses LDAP to authenticate access to NCCS computing resources. Currently an LDAP password is necessary for access to the NCCS Linux Networx (discover) Cluster, the NCCS SGI Altix 3700 BX2 (palm/explore) Cluster and the NCCS Portal web pages.


Q28: How do I list all the PBS queues on Discover?

A28. Use "qstat -q", for example:

% qstat -q 
   server: borgmg.prv.cube

Queue            Memory CPU Time Walltime Node   Run   Que   Lm  State
---------------- ------ -------- -------- ---- ----- ----- ----  -----
general           xxxgb    --    12:00:00  --      x     0    x  E R
general_small      xxgb    --    12:00:00  --      x     0    x  E R
debug              xxgb    --    01:00:00  --      x     0    x  E R
pproc              xxgb    --    03:00:00  --      x     0    x  E R
datamove             --    --    01:00:00  --      x     0    x  E R
background         xxgb    --    04:00:00  --      x     0    x  E R
visual            xxxgb    --    06:00:00  --      x     0    x  E R
                                               ----- -----


Q29: How do I get the details of a particular queue?

A29. Use "qstat -Qf" followed by the name of the queue, e.g. general:

% qstat -Qf general



NASA Curator: Mason Chang,
NCCS User Services Group (301-286-9120)
NASA Official: Phil Webster, High-Performance
Computing Lead, GSFC Code 606.2