
User Tips

Check back frequently for tips on how to make your experience easier and more efficient.

Help for Debugging Parallel Programs

Occasionally while debugging, it is useful to have all processes print to stderr or stdout while still knowing which line of output came from which process. For HP-MPI and MVAPICH, adding the -l option to your srun command prefixes each line of output with the ID of the process that generated it. For Intel MPI, add -l to the mpirun command instead. Voltaire MPI does not support this. For example, if hello.x is a program that prints "Hello World!" to stdout ("WRITE(6,...)", "WRITE(*,...)" or "PRINT *,..." in Fortran), then running it on four processes with -l would produce the following output:

	0 Hello World!
	1 Hello World!
	2 Hello World!
	3 Hello World!
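Labeled lines from different processes usually arrive interleaved. A minimal sketch of one way to group them by process is shown here; the helper name sort_by_rank and the task count in the usage comment are assumptions for illustration, not part of the Chinook setup.

```shell
# Group labeled output by process number, keeping each process's lines
# in the order they were printed (-s requests a stable sort).
sort_by_rank() {
    sort -s -k1,1n
}

# Hypothetical usage on the cluster (the task count of 4 is an assumption):
#   srun -l -n 4 ./hello.x 2>&1 | sort_by_rank
```

Because the sort is numeric on the first field only, process 10 sorts after process 2 rather than after process 1.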
	
NWfs transfers using sftp

When transferring individual files or tape-archive (.tar) files to the NWfs file system from Chinook, the fastest way is to use the secure file transfer protocol (sftp). The trick is to use your PIN + passcode for the password. For example:

  1. Change directory to the one containing the files to be transferred.
  2. Enter the command: sftp nwfs.emsl.pnl.gov
  3. When asked for your Name, just press the Enter key.
  4. When asked for the password, enter your PIN + SecurID passcode number.

You will then be able to use the standard sftp commands you are used to (typing help gives an abbreviated list of commands).
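The steps above can be sketched as follows, assuming you first bundle a results directory into a single .tar file. The helper name make_transfer_archive and the directory and file names are hypothetical; the sftp part is shown as comments because it needs your PIN and passcode interactively.

```shell
# Bundle a results directory into one .tar file so that a single sftp "put"
# moves everything.
# Arguments: $1 = directory to archive, $2 = name of the .tar file to create.
make_transfer_archive() {
    src_dir=$1
    archive=$2
    tar -cf "$archive" -C "$(dirname "$src_dir")" "$(basename "$src_dir")"
}

# Hypothetical session on a Chinook login node (names are placeholders):
#   make_transfer_archive "$HOME/run42" run42.tar
#   sftp nwfs.emsl.pnl.gov     # press Enter at Name, then type PIN + SecurID passcode
#   sftp> put run42.tar
#   sftp> ls -l
#   sftp> quit
```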

The /home file system is shared by all users of Chinook. Please do not leave old or temporary files in your /home directory, such as output files you have already analyzed, debugging output, scratch files, core files, tarballs or out-of-date executables. Files you need to keep for long periods should be moved to our long-term storage system, NWfs.
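One way to find candidates for cleanup is sketched below. The helper name list_stale_files, the core-file name patterns and the 30-day default are assumptions chosen for illustration; review the list before deleting or archiving anything.

```shell
# List candidate files for cleanup: core files and any file older than a given age.
# Arguments: $1 = directory to scan, $2 = age in days (default 30, an arbitrary choice).
list_stale_files() {
    dir=$1
    days=${2:-30}
    find "$dir" -type f \( -name 'core' -o -name 'core.*' -o -mtime +"$days" \) -print
}

# Hypothetical usage:
#   list_stale_files "$HOME" 60
```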

Monitor output from Chinook Jobs

When running large jobs, either on many processors or for long run times, you should regularly check your output files to make sure that the job is producing useful results. User errors, incorrect input, the wrong binary, mistakes in the job script and, occasionally, a system problem can lead to wrong results or no output at all. If you find a problem with your job, you can stop it using the "canceljob" command followed by the job ID. The job ID is listed in the output of the "showq" command; you can also use the "mschowq", "qstat" or "squeue" commands.

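A quick check of whether an output file is still being written can be sketched like this. The helper name output_is_fresh, the 30-minute default and the file name in the usage comment are assumptions; a quiet file does not always mean a broken job, so treat the result as a prompt to look closer.

```shell
# Succeeds (exit status 0) if the file exists and was modified within the last N minutes.
# Arguments: $1 = output file, $2 = window in minutes (default 30, an arbitrary choice).
output_is_fresh() {
    file=$1
    minutes=${2:-30}
    [ -n "$(find "$file" -type f -mmin -"$minutes" 2>/dev/null)" ]
}

# Hypothetical use from a login node (the file name is a placeholder):
#   output_is_fresh myjob.out 60 || echo "myjob.out has not changed for an hour"
#   tail -20 myjob.out
```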
NWChem, ADF, MOLPRO and GAMESS users

Run your parallel jobs with the scripts provided by the consultants to get the latest installed version with the correct libraries and environment variable settings. All of these scripts have an interactive mode that prompts for the inputs they need (such as the GOLD account number, the number of processors to run on, the filename of an input deck, etc.). Input file formats for the various codes are specified in the corresponding user's manual or web documentation. Each script also has a command-line user interface (clui). Running one of the scripts with only the -help argument will document that command's clui.

Protect your files with long-term storage! It is good for everyone

Keep the /home and /dtemp file systems clean by regularly copying your necessary files to long-term storage. Remember that /dtemp is not a permanent storage space, and may be cleaned periodically. Keeping too many files on /home can clog the file system and make things difficult for everyone. You should periodically remove files you do not need and protect the ones you do by sending them to nwfs.

You can request an account on NWfs if you do not already have one by contacting your PI (or by requesting an account using IOPS if you are an internal user). NWfs is a permanent file system that is backed up daily. If you will need more than 500 GB of space for storing your files, let us know when you request your NWfs account. NWfs is the right place to keep your important files to protect and preserve them.

LUSTRE tip for users not currently using the /dtemp file-system on Chinook

There are several file systems available for use on Chinook. See Chinook Details - File Systems for an overview.

  • As part of your submission script (msub), do you copy files from your home directory or from the AFS file system (/msrc) to /scratch directories before running your job, and then copy output files from /scratch back to your home directory? Or do you read and write files on the /home file system directly from your parallel job?

If this describes your use of Chinook, then the following procedure will help you make more efficient use of Chinook resources.

  • From a login node, copy your input files and executable into a directory in /dtemp.
  • In your submission (msub) script, copy them from /dtemp into /scratch.
  • Move your output files from /scratch back to /dtemp in your submission script.
  • After your job has exited, move short output files you wish to keep from /dtemp to your /home directory and large ones to the archive system, nwfs.

The reason for doing this is that accessing files in /dtemp is much more efficient than accessing files in /home or /msrc on Chinook.
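The staging procedure above can be sketched with two small helpers for a job script. The function names stage_in and stage_out, the directory layout and the executable and file names in the usage comment are all hypothetical; adapt them to your own job.

```shell
# Copy input files (plain files only) from a staging directory into a work directory.
# Arguments: $1 = staging directory (e.g. under /dtemp), $2 = work directory (e.g. under /scratch).
stage_in() {
    mkdir -p "$2" && cp -p "$1"/* "$2"/
}

# Move named output files from the work directory back to the staging directory.
# Arguments: $1 = work directory, $2 = staging directory, remaining = output file names.
stage_out() {
    work=$1
    stage=$2
    shift 2
    for f in "$@"; do
        mv "$work/$f" "$stage/"
    done
}

# Hypothetical use inside an msub script (paths and names are placeholders):
#   stage_in  /dtemp/$USER/run42 /scratch/run42
#   cd /scratch/run42 && srun ./my_code.x < input.dat > output.log
#   stage_out /scratch/run42 /dtemp/$USER/run42 output.log
```

Naming the output files explicitly in stage_out avoids copying the inputs and executable back again.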

LUSTRE tip for those already using the /dtemp file-system on Chinook

The LUSTRE file system is mounted on /dtemp because it is a temporary file system that is not backed up. While it is persistent between jobs, it does get cleaned at intervals. We try to give at least two weeks' notice before these events, but on occasion there may be only one week between notification and cleaning. If you have 6 terabytes (6 trillion bytes) of files in /dtemp, it may be difficult to move them all to an archive in such a short time.

You should regularly move important output files out of /dtemp into the archive (nwfs), both to minimize the effort involved in an upcoming cleaning of the LUSTRE file system and to better share the LUSTRE resource with other users. Never keep development code or your only copy of any data in /dtemp.

These are the most common difficulties with logging in to Chinook:
  1. "I try to log in, but I just keep getting asked for my passcode!"
    • Make sure you are using an SSH client that supports secure shell protocol 2.
    • Configure your SSH client so that the authentication method is "keyboard interactive" only.
    • Use your Chinook user name (not your Hanford ID).
    • The server will lock you out if you try too many times without successfully logging in.
    • If you believe you are locked out, send an email to mscf-consulting@emsl.pnl.gov.
  2. "I set my PIN, but I cannot make it work now!"
    • SecurID® cards generate ONE-TIME-USE passcodes! If you used a six-digit code from your SecurID® card to gain access to Chinook, set your PIN, then used your new PIN with the same six-digit code, your authentication will fail. You have to wait until a new six-digit code is displayed on the SecurID® card before you can log in.
    • The server will lock you out if you try too many times without successfully logging in.
    • If you believe you are locked out, send an email to mscf-consulting@emsl.pnl.gov.
  3. Lost SecurID® cards. Security policy for Chinook usage PROHIBITS PASSWORD OR SecurID® PASSCODE SHARING! Even if the person you are sharing with already has their own account on Chinook, you may NOT SHARE your SecurID® card, your PIN, or your other passwords. Please send an email to mscf-consulting@emsl.pnl.gov and request a replacement SecurID® card.

Is your batch job really running? You can log onto batch nodes and check.

Once a batch job has started on Chinook, the nodes running your job are available for interactive logins using the ssh protocol. Find those node numbers with "showq -u <userID>".

Log on to a node with the command "ssh m###", where ### is the node number.

You can check what has been written to the local /scratch directory: "ls -l /scratch"
This will list the size of the files written.

Check which processes are running: "top"
This will give a long list; type 'n' followed by '12' to see just the top 12 processes.

Check the messages file for that node: "tail -30 /var/log/messages"
If you see messages like "floating point assist", your code is running into very small numbers or NaNs (not-a-number values) that take extra CPU cycles to process. Compiling with the -ftz flag will set very small numbers to zero and speed up your code (this could cause problems if you need to process very small numbers).
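To see how often such messages occur rather than scanning the tail by eye, a count can be sketched as below. The helper name count_fp_assist is hypothetical, and reading /var/log/messages may require permissions you do not have on every node.

```shell
# Count lines mentioning "floating point assist" in a log file.
# Argument: $1 = log file (defaults to /var/log/messages).
count_fp_assist() {
    grep -c 'floating point assist' "${1:-/var/log/messages}"
}

# Hypothetical usage on a compute node:
#   count_fp_assist /var/log/messages
```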

Check memory usage over a time interval: "vmstat 5 6"
The first line shows average values since the node was rebooted; each subsequent line gives a 5-second average (6 lines are generated in total).

When good programs go bad: what to do when you need help

When you have an unexpected problem with your account, or with compiling or running your code on Chinook, the scientific consultants' queue is a good resource for solutions. Just email your questions or problems to mscf-consulting@emsl.pnl.gov. There are a few things you can do to help us give you a quick solution and so make the most of your computer time. When you send an item to the queue because of trouble running code, make sure to include:

  • your username
  • the job ID
  • your GOLD account name
  • any submission script you used
  • the EXACT error(s) you are receiving
  • the name and location of your executable
  • the location of all input and output files associated with the problem job

Also make sure that these files and the directories they are in are readable by users other than you. Use the command "chmod a+r *" from within the directories where the problem code is located, and use "chmod a+rx ./" to allow us to read the directory itself. (If you have subdirectories, their permissions will need to be set in the same way.) You should also put the commands "printenv" and "ldd <your executable>" in your msub script so that, when you have a problem, your output files will show which libraries your code was using and what your run-time environment was.

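When the problem job has subdirectories, the permission changes above can be applied in one step. This is a minimal sketch; the helper name share_job_dir and the executable name in the comments are hypothetical, and the parent directories leading to the job directory must also be searchable by others.

```shell
# Make a job directory, its subdirectories and its files readable by other users.
# "X" adds execute (search) permission only to directories and to files that are
# already executable, so data files do not become executable.
# Argument: $1 = directory holding the problem job.
share_job_dir() {
    chmod -R a+rX "$1"
}

# Lines one might add near the top of an msub script (executable name is a placeholder):
#   printenv
#   ldd ./my_code.x
```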
Tired of waiting in the queue? There may be hope! Getting your jobs to run sooner by backfilling

Jobs are scheduled on Chinook based on the number of processors required as well as the wall-clock time requested. Chinook uses a backfill queue: the top pending job that is "small enough" to fit in an available window will be scheduled. It may often appear that a job requesting resources similar to yours is "skipping over" you in the queue. Your job will only be skipped when a backfill window becomes available that is too short, or contains too few processors, for your job to run. To take advantage of the backfill capability of Chinook, make sure you request only the time you actually need to run your job (plus a little for a safety margin). If you ask for twenty hours when you only need five, your job will wait much longer before it runs.
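Turning a measured run time into a wall-clock request with a small safety margin can be sketched as follows. The helper name walltime_with_margin and the 10% default margin are assumptions, and how the value is passed to the scheduler should be checked against the local msub documentation.

```shell
# Print a wall-clock request (HH:MM:SS) from a measured run time plus a safety margin.
# Arguments: $1 = measured run time in seconds, $2 = margin in percent (default 10, an assumption).
walltime_with_margin() {
    measured=$1
    margin=${2:-10}
    total=$(( measured + measured * margin / 100 ))
    printf '%02d:%02d:%02d\n' $(( total / 3600 )) $(( total % 3600 / 60 )) $(( total % 60 ))
}

# Example: a job that took 5 hours (18000 s) with a 10% margin
#   walltime_with_margin 18000 10    # prints 05:30:00
```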