NERSC Announcements Message Archive
Subject: [franklin-users] memory hardware problems on Franklin
Author: Francesca Verdier <fverdier_at_lbl.gov>
Date: 2008-11-14 15:45:44
Dear Franklin users,
We are experiencing Uncorrectable Memory Errors (UMEs) on some Franklin
nodes that can cause the node to crash. As we discover these errors,
the affected nodes are removed from service. A user whose job is
affected by such an error will see the following type of error message:
"Apid xxxx killed. Received node failed or halted event for nid xxxx"
We will refund jobs affected by this problem, although there may be a
delay of several days before the refund is given.
There have been 67 UMEs on Franklin over the past 2 months (on average
about one per day), although they have not all resulted in node crashes.
We are working with Cray to understand which nodes are affected and to
develop a plan for fixing the problem, which we will communicate to you
once more details are known.
Sincerely,
--
Francesca Verdier                 email: fverdier@lbl.gov
Department Head, NERSC Services   phone: 510-486-7193
_______________________________________________
franklin-users mailing list
franklin-users@nersc.gov