NERSC Powering Scientific Discovery Since 1974

2010 PDSF Users Meeting Minutes

  • December 21 PDSF Users Meeting 12/21/10 Attending:  Eric and Jay from NERSC and users Andrei and Jeff P. Cluster status and utilization:  Cluster has been loaded to capacity recently.  STAR is running a lot of jobs, many of them grid-based and submitted from BNL.  ALICE, ATLAS and icecube also running. Outages and Downtimes:  There was an NGF downtime on the 16th, otherwise things have been stable for the most part. Procurements and New Hardware:  Will get more storage for kamland.  It will be an…
  • December 7 PDSF Users Meeting 12/7/10 Attending: Eric, Katie and Jay from NERSC and users Andrei, Yushu, Thomas, Jeff P., Craig, Joanna. Cluster status: Cluster has been full most of the time and is full today.  STAR and ALICE running a steady stream of grid jobs. Outages: Yesterday there were GPFS problems related to the kernel issue on some nodes that had not yet been upgraded.  This prevented interactive logins for a while. Upcoming downtimes: At some point there will be downtime for home and…
  • November 23 PDSF Users Meeting 11/23/10 Attending: Eric, Katie and Jay from NERSC and users Jeff P., Craig and Marjorie. Cluster status: Cluster has been relatively full, primarily STAR, ALICE and ATLAS. Outages: There was a power outage Saturday related to the stormy weather that brought down some nodes. eliza16 had some configuration issues related to a kernel issue. Upcoming downtimes: At some point there will be downtime for home and common replacement. New hardware: 3 new file systems are available…
  • November 9 PDSF Users Meeting 11/9/10 Attending: Eric, Katie and Jay from NERSC and users Jeff P., Craig, Lisa, Jeff A., Cheng Ju, Marjorie and Maxim. Cluster status: Cluster has been full. Up to 1200 cores now. Outages: Yesterday there were GPFS problems related to configuration problems with GPFS on the new nodes. There were also two outages due to security problems. Upcoming downtimes: At some point there will be downtime for home and common replacement. New hardware: eliza 11, 12 and 13 are being…
  • October 26 PDSF Users Meeting 10/26/10 Attending: Eric and Elizabeth from NERSC and users Jeff P., Jeff A., Marjorie, Thomas and Joanna. Cluster status: Cluster has been full for the most part, mostly STAR, ATLAS and ALICE. Outages: There was a security problem which came up late last Friday and required logins to be blocked. It ended up being fairly serious and things didn't start coming back until the next day. There were some comments about users, mostly at CERN, not knowing what the situation was.
  • October 12 PDSF Users Meeting 10/12/10 Attending: Eric, Jay and Katie from NERSC and users Craig, Jeff P., Jeff A., Marjorie and Andrei. Cluster status: Cluster usage is fairly heavy but not filled to capacity for the most part. Outages: There have been problems with slowness which is related to a particular ATLAS user's jobs starting. It's not clear why his jobs are so bad and further testing is needed. pdsff3 went down yesterday which was related to the kernel bug. We were down for a day recently due to a…
  • September 28 PDSF Users Meeting 9/28/10 Attending: Eric and Jay from PDSF and users Ke Han, Joanna, Marjorie, Thomas, Shane. Cluster status: Cluster has been pretty full the last few weeks, mostly STAR and ALICE. There are a lot of astrogfs jobs pending but they have no share so they don't run. Outages: NERSC-wide outage recently due to a security problem. PDSF outage this morning - logins were hanging. The memory problem was fixed but another kernel patch is needed. Upcoming downtimes: Nothing scheduled but…
  • September 14 PDSF Users Meeting 9/14/10 Attending: Eric and Jay from PDSF and users Jeff P., Joanna and Marjorie. Cluster status: Cluster is well utilized, primarily by STAR and ALICE. Discussed ALICE memory requirements of 4GB for now. Outages: Some problems with jobs using up kernel buffers - mainly ALICE - which requires a reboot. The fix has been identified (kernel patch) and is being done. Upcoming downtimes: Nothing scheduled but will do new home and common at some point. New hardware: Getting…
  • August 3 PDSF Users Meeting 8/3/10 Attending: Eric and Jay from PDSF and users Oleksandr and Jeff P. Cluster status: There was the outage last week but recently the cluster has been pretty full with 8 groups running now. Outages: Some problems getting the IDL license server to power up after the outage last week. Also lost a few old nodes that didn't power up. /eliza3 lost 1 of 2 controllers - waiting for a replacement. Rack 9 and 10 have been upgraded to SL5 but there were disk problems and /export…
  • July 20 PDSF Users Meeting 7/20/10 Attending: Eric, Iwona and Jay from PDSF and users Andrei, Jeff P., Tom, Ke Han, Jeff A. Cluster status: Utilization has been moderate. Outages: There were problems with pdsf4 due to heavy io. Upcoming downtimes: The big shutdown is now scheduled to start July 27 at 7am and last through Friday afternoon the 30th. New hardware: Jay is waiting for reply from MS about ATLAS storage. ALICE storage expected to arrive soon. Other Topics: - We hit the inode limit on…
  • July 6 PDSF Users Meeting 7/6/10 Attending: Eric and Jay from PDSF and users Andrei, Craig and Art. Cluster status: Utilization has been fairly heavy lately. Outages: There was a recent outage with /home and ALICE jobs - not really clear what is going on - still debugging. Upcoming downtimes: The big shutdown is now scheduled to start July 27 at 7am and last through Friday afternoon the 30th. New hardware: Jay got quotes from Dell for new nodes. Latest configuration Intel Westmere chips with 6…
  • June 8 PDSF Users Meeting 6/8/10 Attending: Eric and Jay from PDSF and users Andrei and Jeff P. Cluster status: Utilization has been fairly light for the most part although the cluster is full today. Outages: There were some gpfs problems causing 32sl44 problems and STAR login problems. Also user sethzenz was causing some problems on the interactives. Upcoming downtimes: in June/July will have multiple days center-wide outage. We will also need to upgrade gpfs, etc., after the downtime. New…
  • May 25 PDSF Users Meeting 5/25/10 Attending: Eric and Jay from PDSF and users Andrei, Jeff P. and Craig. Cluster status: Cluster has been full recently. Outages: None. Upcoming downtimes: in June/July will have multiple days center-wide outage. New hardware: Still planning another procurement for this fiscal…
  • May 11 PDSF Users Meeting 5/11/10 Attending: Eric and Jay from PDSF and users Andrei and Craig. Cluster status: Utilization has been fairly light. Outages: There was a problem when a user caused SGE to hit its maximum of 30k jobs. This prevented other users from submitting jobs. It took most of the day to carefully resolve the issue. Upcoming downtimes: in June/July will have multiple days center-wide outage. New hardware: Will start ordering new Intel Westmere chip with 6 cores/socket and 12…
  • April 27 PDSF Users Meeting 4/27/10 Attending: Eric and Jay from PDSF and users Marjorie, Jeff P, Andrei and Craig. Cluster status: Utilization has been light. Outages: chos problem on interactives has been fixed (java vm can use a lot of memory). Upcoming downtimes: in June/July will have multiple days center-wide outage - concerns were expressed about the timing - Jay to send out an email. New hardware: Waiting for Dell to start selling new Intel chips. ATLAS storage is up and running. 40 new ALICE…
  • April 13 PDSF Users Meeting 4/13/10 Cluster status: Lots of jobs running with 9 groups running in the past 24 hours. There was a big dip at one point because SGE reporting got turned on. Outages: There were problems with interactive nodes crashing due to a chos/SL5 problem. Upcoming downtimes: Will do a rolling upgrade of interactive soon (15th). New hardware: New ATLAS disk up (eliza1) - need to check failover... cables were delayed. Setting up ALICE nodes now... Need to change shares to reflect new…
  • March 30 PDSF Users Meeting 3/30/10 Cluster status: Utilization has been fairly light. STAR needs more data. Outages: Discussed the outage last Wednesday. Upcoming downtimes: Next Tues memory consumables will be turned on. Website migration will happen in the second week of April. New hardware: New storage is in for /eliza2; waiting for a couple of cables from Dell, then it will be in production for ATLAS. Discussed trying to coordinate 2 procurements/year. Retirement plan: All 32-bit nodes are gone now. …
  • March 16 PDSF Users Meeting 3/16/10 Cluster status: Utilization has been moderate, cluster not loaded to capacity. Outages: None. Upcoming downtimes: Eliza4 work will be scheduled. Also, memory will become a consumable resource in SGE - users will need to set memory limit if not the default 1GB. New hardware: Working on ATLAS storage and ALICE nodes. Retirement plan: Racks 21, 22, 28 are gone. Upgrading rack 24 to SL5. Will be taking out rack 25 and 26 and then we won't have any 32-bit nodes. ATLAS…
  • March 2 PDSF Users Meeting 3/2/10 Cluster status: Utilization has been heaviest in 6 months, loaded to capacity at times (~1000 jobs). Outages: A few problems with some individual batch nodes but otherwise nothing. Upcoming downtimes: None. New hardware: ATLAS storage and ALICE nodes should show up in mid March. Retirement plan: Didn't get started on this yet. ATLAS grid activities: No news to report. SL5: still only on pdsf1 and pdsf5. Squid/Fuse: No new news. Other Topics: - STAR project…
  • February 16 PDSF Users Meeting 2/16/10 Cluster status: Utilization has been light, mostly icecube and STAR. Outages: None. Upcoming downtimes: None, but PDSF will be short-staffed for a while with Iwona taking leave and Jay and Eric taking some vacation time. New hardware: ATLAS storage and ALICE nodes in progress, still working on some details related to ALICE storage. New db nodes finally arrived today. Retirement plan: Jay sent out a document a couple weeks ago - next week we will start retirement.
  • February 2 PDSF Users Meeting 2/2/10 Attending: Eric, Jay and Iwona from PDSF and users Andrei, Marjorie, and Jeff P. Cluster status: Utilization has been light. Outages: None. Upcoming downtimes: None, but Jay mentioned sending a list of nodes for retirement to the pdsf-pi email list. New hardware: Data transfer node has been racked and has power and installation should be completed this week. We will keep the old node alive for a while. ATLAS storage order has been placed. /eliza11 will be retired at…
  • January 19 PDSF Users Meeting 1/19/10 Attending: Eric and Iwona from PDSF and users Andrew, Andrei, Marjorie, Craig, Jeff A. and Jeff P. Cluster status: Utilization has been light. STAR mentioned that they are limited by the io resources being set as low as they are and raising them was discussed. Outages: Discussed the recent issues with pdsf1 and pdsf2. Upcoming downtimes: None. New hardware: Data transfer node is being shipped. eliza11 is going away so ATLAS is buying a replacement. This turned into…
  • January 5 PDSF Users Meeting 1/5/10 Attending: Eric and Jay from PDSF and users Andrew, Andrei, Marjorie, Craig, Jeff A. and Jeff P. Cluster status: Utilization has been very light, mostly STAR jobs. There was a question about the status of /eliza1 and whether it was truly gone and it is. Apparently some users didn't have their data backed up and lost it (despite announcements). Outages: None. Upcoming downtimes: System-wide outage has been rescheduled for 1/11/10. STAR and ATLAS work scheduled were…