NERSC Announcements Message Archive
Subject: Potential problems with Seaborg batch jobs
Author: David Turner <dpturner_at_lbl.gov>
Date: 2005-04-05 09:44:01
Greetings Seaborg User,
NERSC has identified and fixed a configuration problem on Seaborg that
may affect batch jobs submitted between 14:00 March 22 and
15:45 April 4. Jobs submitted during this interval could experience
either of the following:
1) Job failure due to insufficient memory for MPI operations.
Two possible error messages are:
ERROR: 0032-171 Communication subsystem error: Memory is exhausted. in MPI_Isend, task 0
ERROR: 0032-113 Out of memory in MPI_Allreduce, task 51
Whether or not a particular program experiences this type of
failure depends on the nature of its MPI operations; not all
MPI codes will encounter this failure.
2) Reading large files via stdin (standard input) produces
unpredictable results. Input files over 1024 bytes in size will
not be read correctly. Depending on the program's logic, this could
result in code failure or, more seriously, incorrect results.
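As a rough aid for spotting failure 1), the sketch below scans the text of a
job's output for the two error codes quoted above. The function name and the
idea of scanning saved output line by line are illustrative assumptions, not a
NERSC-provided tool; other messages may also indicate the same problem.

```python
import re

# The two error codes quoted in this announcement (0032-171 and 0032-113).
MPI_MEMORY_ERROR = re.compile(r"ERROR:\s+0032-(171|113)\b")


def find_mpi_memory_errors(lines):
    """Return the lines that contain either of the two quoted error codes.

    `lines` is any iterable of strings, for example an open output file.
    This helper is hypothetical and only matches the codes listed above.
    """
    return [line.rstrip("\n") for line in lines if MPI_MEMORY_ERROR.search(line)]


# Usage with an in-memory sample; in practice, pass an open job output file.
sample_output = [
    "starting run\n",
    "ERROR: 0032-113 Out of memory in MPI_Allreduce, task 51\n",
]
for hit in find_mpi_memory_errors(sample_output):
    print(hit)
```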
Situation 2) requires immediate user attention. If any batch job that
you submitted during the interval in question ran to completion and
used stdin to read a file larger than 1024 bytes, you should examine
your results very closely; they may not be correct. If you
have any pending batch jobs (status I, NQ, HS, or HU) that were
submitted during this interval, and that expect to use stdin to read
a file larger than 1024 bytes, you should cancel those jobs and
resubmit them.
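The check described above can be sketched as follows. This is a minimal
illustration, not an official tool: the year 2005 is inferred from the date of
this message, the times are assumed to be local NERSC time, and the function
names are hypothetical. You must supply each job's submission time and the
size of the file it reads via stdin yourself.

```python
import os
from datetime import datetime

# Interval from this announcement. The year is inferred from the message
# date, and local NERSC time is assumed (both are assumptions).
WINDOW_START = datetime(2005, 3, 22, 14, 0)
WINDOW_END = datetime(2005, 4, 4, 15, 45)
STDIN_LIMIT_BYTES = 1024  # files over this size were not read correctly


def submitted_in_window(submit_time):
    """True if the job was submitted during the affected interval."""
    return WINDOW_START <= submit_time <= WINDOW_END


def needs_attention(submit_time, stdin_size_bytes):
    """True if a job was submitted in the interval and reads more than
    1024 bytes via stdin, so its results should be checked (if it has
    completed) or it should be cancelled and resubmitted (if pending)."""
    return submitted_in_window(submit_time) and stdin_size_bytes > STDIN_LIMIT_BYTES


def stdin_file_size(path):
    """Size in bytes of the file a job redirects to stdin."""
    return os.path.getsize(path)


# Example: a job submitted on March 25 that reads a 4096-byte input file.
print(needs_attention(datetime(2005, 3, 25, 9, 0), 4096))
```

A job submitted outside the interval, or one whose stdin input is 1024 bytes or
smaller, is not flagged by this check.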
We apologize for the inconvenience this problem causes for our users.
NERSC staff are actively working with IBM to prevent this problem in
the future.
--
Best regards,
David Turner
User Services Group        email: dpturner@lbl.gov
NERSC Division             phone: (510) 486-4027
Lawrence Berkeley Lab      fax:   (510) 486-4316