National Energy Research Scientific Computing Center
  A DOE Office of Science User Facility
  at Lawrence Berkeley National Laboratory
 

NERSC Announcements Message Archive


Subject: [franklin-users] memory hardware problems on Franklin
Author: Francesca Verdier <fverdier_at_lbl.gov>
Date: 2008-11-14 15:45:44

Dear Franklin users,

We are experiencing Uncorrectable Memory Errors (UMEs) on some Franklin nodes that can cause the node to crash. As we discover these errors, the affected nodes are removed from service. A user whose job is affected by such an error will see the following type of error message:

"Apid xxxx killed. Received node failed or halted event for nid xxxx"

We will refund jobs affected by this problem (although there may be a delay of several days before the refund is given).

There have been 67 UMEs on Franklin over the past 2 months (on average about one per day), although they have not all resulted in node crashes. We are working with Cray to understand which nodes are affected and to develop a plan for fixing the problem, which we will communicate to you once more details are known.

Sincerely,

--
Francesca Verdier                 email: fverdier@lbl.gov
Department Head, NERSC Services   phone: 510-486-7193

_______________________________________________
franklin-users mailing list
franklin-users@nersc.gov

Page last modified: Fri, 05 Dec 2008 19:17:25 GMT
Page URL: http://www.nersc.gov/nusers/announcements/message_text.php
Web contact: webmaster@nersc.gov
Computing questions: consult@nersc.gov
