Operations Schedule for PDSF
In case of difficulty accessing PDSF or HPSS
please check one of the following:
PDSF Scheduled Shutdowns
NERSC Systems
Message of the Day - includes PDSF and HPSS.
Message of the Day, also known as MOTD,
is maintained 24 hours a day, 7 days a week, by the NERSC Computer
Operations & Network Support staff, and provides the most up-to-date
information on system status. PDSF was recently added to this
system.
NERSC Systems
Availability Log - includes HPSS and PDSF.
History of downtimes for all the NERSC systems, updated once a day.
A good place to check whether
PDSF or HPSS was up or down at some time in the past.
Scheduled Events
- December 10th 10:00am - 2:00pm
- Cluster maintenance.
- November 7th 10:00am - 5:00pm
- Batch system suspended for cluster benchmarking.
- September 12th 2002 9:00am - 6:00pm
- Upgrade to RH 7.2. No logins will be possible during that time. There will be no
queue draining prior to the event. Instead, the queues will be stopped at 9:00am on 09/12/02
and all running jobs will be killed to give LSF a chance to clean up the spool,
as we will be migrating the spool to a new location. This should not affect pending jobs.
- September 5th 2002 9:00am - 5:00 pm
(with a possible overflow into September 6th)
- Installation of new hardware and seismic bracing. During that time there will be
no logins and the batch system will be suspended, but no jobs should die.
- August 15th 2002 6:30am - 7:30am
- Upgrade of pdsfsu00 and pdsfsu05. AFS services will be down across the cluster
at that time, except on pdsflx007 and pdsflx008.
- April 30th 2002 12 pm PDT - PDSF mail
system will stop using any .forward files you have on your PDSF account.
E-mail sent to you@nersc.gov will be delivered to the address that is in the
NERSC database (nim.nersc.gov).
You can update that database yourself.
Addresses specified in bsub will work as before.
- April 29th 2002 12 pm PDT - HSI upgrade,
see MOTD for details.
- April 23rd 2002 9 am - 1 pm PDT - LSF batch system suspended.
Jobs will pause and resume when tests are done. There will
be no draining associated with it and users are allowed to queue jobs at
that time. Interactive nodes will not be affected.
- March 13th 2002 5pm-6pm PDT - network interruptions.
On Wednesday, Mar 13, from 5-6pm, there will be several network
interruptions.
ESnet will be changing the OC12 link from the OSF to Sunnyvale from
ATM to POS. We expect 2-4 network interruptions of less than 10
minutes each. During the interruptions, there will be no traffic
to/from NERSC. Internal traffic on the NERSC network will not be
affected.
After ESnet completes their work, the NERSC network team will be
upgrading the switch on the public subnet. While the switch is being
replaced, the NERSC web servers and DNS servers will not be available.
We expect 2 periods of downtime of less than 5 minutes each.
If ESnet is unable to perform their work on Wednesday,
the network downtime will be moved to Thursday, Mar 14 from 5-6pm.
If this happens an additional announcement will be sent.
- 19 February 2002 9am-6pm (or until announced) -
scheduled PDSF cluster maintenance; all the CPUs will be restarted.
Following users' requests there will be no queue draining. Jobs running at
the time of shutdown will be terminated. We are planning to bring PDSF
GIDs and UIDs in sync with the NERSC database. This is a lot of work
for the staff. We'll bring the system back ASAP.
- 1 February 2002 - ftp server on pdsfsu00 will be turned off.
- 17 January 2002 5:30pm - 6:00pm PST
- the LBNL<=>ESnet connection will be subject to disruption to troubleshoot
the source of errors on the link. LBLnet & ESnet personnel will be working
to isolate & hopefully eliminate the source of these problems.
They apologize for any impact this activity may have on your
operations. Service will be fully restored at the conclusion of this
work.
This will affect only our LBNL users.
In particular, check your AFS tokens; they could be lost.
- 3 January 2002 9am-6pm (or until announced) - scheduled reboot
of the PDSF cluster; all the CPUs will be restarted.
pdsflx000 will be replaced by two new nodes. Following users'
requests there will be no queue draining. Jobs running at the time of
shutdown will be terminated.
- 19 December 2001 10am-2pm - interactive nodes pdsflx001, pdsflx002 and pdsflx003 will be replaced by new hardware. IP addresses will change. All the nodes will be configured like pdsflx008 is now (standalone RH 6.2).
- 12 December 2001 10am-2pm - interactive nodes pdsflx004, pdsflx005, pdsflx006, pdsflx007 and pdsflx008 will be replaced by new hardware. IP addresses will change. All the nodes will be configured like pdsflx008 is now (standalone RH 6.2).
- 4 December 2001 9am-10am - network interruption.
- 26 September 2001 9am-6pm (or until announced) - scheduled reboot
of the PDSF cluster; all the CPUs will be restarted. Following users'
requests there will be no queue draining.
- August 6-12 2001 - Scheduled System
Upgrade of the NERSC AFS cell
The storage group is upgrading the NERSC AFS cell this week.
Starting on Monday 8/6/2001 they will be moving AFS volumes.
Users should experience no outage during the week.
On Sunday 8/12/2001 there will be a 4 hour AFS outage in the NERSC cell
from 12:00-16:00 PDT to move the AFS databases.
During this time the NERSC AFS cell will be unavailable.
- 22 June 2001 10am-11am PDT - NERSC network outage.
The outage is necessary to replace a failed interface in our main router and
for ESnet to upgrade their router's software.
All network connectivity to the nersc.gov domain will be interrupted
including all access to NERSC computational and storage resources as well as
the PDSF cluster. Also, the AFS and DCE servers will be isolated from the
rest of NERSC while the router is down. Although we are allowing one hour
for the outage, we expect the interruption will be much shorter than this.
- 14 June 2001 9am-6pm (or until announced) - scheduled reboot of the PDSF cluster; all the CPUs will be restarted.
- 25 March 2001 9am-12pm - scheduled maintenance of
pdsflx00. At that time the cluster will
be shut down and then restarted. Users requested that queues not be drained, so
all jobs still running at that time will be killed.
- 6 February 2001 9am-12pm - scheduled reboot of the PDSF cluster; all the CPUs will be restarted.
- 22 November 2000 8:00 AM PST - scheduled emergency reboot of
pdsfsu05. The reboot should take less than 15 minutes.
AFS service on Linux nodes will be affected.
- Activities related to the Oakland Move.
- 20 October, Friday - 8 am Long queue shut down.
- 24 October, Tuesday - 8 am Medium queue shut down.
- 26 October, Thursday - 8 am System goes down for packing.
No access to HPSS at this time.
Cluster Network Switch, pdsflx00, and pdsfsu05
moved to Oakland.
NOTE:
Please remember, this means that there will be NO
logins Thursday morning starting
at 8:00 AM. At that time any and all jobs that are still in the
LSF system will be deleted.
Anyone who is logged in or does log in after this time
will be logged off and any work done after this time may be lost.
We will be doing backups in preparation for the Oakland move.
Also note that after Thursday 8:00 AM the data on dv01 - dv10 will
be lost.
When these disk vaults become available again, they will
be in their new configuration. Results from any job that writes to these
disk vaults should therefore be reviewed and, if appropriate, saved to AFS, HPSS,
or the correct area on dv14 or dv15. Affected areas:
pdsfdv01
pdsfdv02
pdsfdv03
pdsfdv04
pdsfdv05
pdsfdv06
pdsfdv07
pdsfdv08
pdsfdv09
pdsfdv10
This will affect the following mount points:
auto/amanda
auto/atlas
auto/babar
auto/cdf
auto/d0
auto/deepsrch
auto/e895
auto/e896
auto/na49
auto/pdsfdv01
auto/pdsfdv02
auto/pdsfdv03
auto/pdsfdv04
auto/pdsfdv05
auto/pdsfdv06
auto/pdsfdv07
auto/pdsfdv08
auto/pdsfdv09
auto/pdsfdv10
auto/pdsfdv13
auto/phenix
auto/sno
auto/star
auto/ucbmep
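As a rough aid for the review suggested in the note above, the following Python sketch checks whether a job's output path falls under one of the affected mount points. The function name `is_affected` and the assumption that the mount points are rooted at `/` are illustrative only; the announcement lists the mount points without a leading slash.

```python
from pathlib import PurePosixPath

# Mount points listed in the announcement as affected by the loss of data
# on dv01 - dv10.
AFFECTED_MOUNTS = [
    "auto/amanda", "auto/atlas", "auto/babar", "auto/cdf", "auto/d0",
    "auto/deepsrch", "auto/e895", "auto/e896", "auto/na49",
    "auto/pdsfdv01", "auto/pdsfdv02", "auto/pdsfdv03", "auto/pdsfdv04",
    "auto/pdsfdv05", "auto/pdsfdv06", "auto/pdsfdv07", "auto/pdsfdv08",
    "auto/pdsfdv09", "auto/pdsfdv10", "auto/pdsfdv13",
    "auto/phenix", "auto/sno", "auto/star", "auto/ucbmep",
]


def is_affected(path, root="/"):
    """Return True if `path` is, or lies under, an affected mount point.

    `root` is an assumption: where the listed mount points are attached.
    Whole path components are compared, so "/auto/starx" does not match
    "auto/star".
    """
    target = PurePosixPath(path)
    for mount in AFFECTED_MOUNTS:
        base = PurePosixPath(root) / mount
        if target == base or base in target.parents:
            return True
    return False


print(is_affected("/auto/star/run1/out.root"))      # True
print(is_affected("/auto/pdsfdv14/run1/out.root"))  # False
```

A path that reports True points at data that should be copied to AFS, HPSS, or dv14/dv15 before the shutdown.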
- 27 October, Friday Pieces are moved to Oakland:
cluster network switch, pdsflx00, and pdsfsu05, dv14 and dv15.
New compute nodes brought online.
- 30 October, Monday Network returns to production mode
- 31 October, Tuesday Partial cluster available:
lx00 for home directories, su05 for afs->nfs, dv14 and dv15 for
disk vaults. Interactive access will be restored.
Partial batch queues will come online, using new equipment
(89 dual PIII 650's). HPSS will NOT be available at that time.
- 2 November HPSS comes back on line, although this could be
as late as 11/06 (a message will be sent with its status).
- 10 November This is the date by which we hope to have the rest of
PDSF installed.
PDSF will be upgraded as tasks are completed. Messages will be
sent when major milestones are reached. At that point we will have
156 Linux nodes, 16 data vaults, and 2 Suns.
General Notes:
The IP numbers of all machines in PDSF will be changing, and the
machine names will be changing too. The IPs of some key machines will be:
pdsflx000 - 128.55.24.100
pdsfsu00 - 128.55.24.20
pdsflx001 - pdsflx008: 128.55.24.101 - 128.55.24.108
Notice the extra digit in the name of the lx machines. And if these
numbers do happen to change, a message will follow with corrected
information. (These numbers will be in effect after 10/27.)
Also note: disk space on pdsflx00 and pdsfsu00 will not be
affected.
For more details check
pdsf-announce mailing list archives
If there are any questions, please direct them to
Cary.
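For anyone updating host tables or scripts, the new numbering in the General Notes above can be sketched in a few lines of Python. This assumes the addresses run consecutively between the stated endpoints (pdsflx000 at .100 through pdsflx008 at .108); the function name `pdsf_interactive_ips` is hypothetical, and the announcement itself warns that the numbers may still change.

```python
def pdsf_interactive_ips(prefix="128.55.24.", first=0, last=8):
    """Map pdsflxNNN host names to addresses, assuming node N gets 100 + N.

    Only the endpoints (pdsflx000, pdsflx001, pdsflx008) are stated in the
    announcement; the consecutive numbering in between is an assumption.
    """
    return {f"pdsflx{n:03d}": f"{prefix}{100 + n}" for n in range(first, last + 1)}


hosts = pdsf_interactive_ips()
# pdsfsu00 does not follow the pattern; its address is given explicitly.
hosts["pdsfsu00"] = "128.55.24.20"

for name in sorted(hosts):
    print(name, hosts[name])
```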
- 5 October 2000 6:30am - 6:45am - network to/from the lab down.
- 4 October 2000 10am - 12am - network down in the cluster area.
- 1 August 2000 9am-12pm - scheduled reboot of the PDSF cluster; all the CPUs were restarted.
Hardware Projects
- Wednesday, December 8, 1999
Many changes. Reconfigured all batch nodes. Added machines from the e895 project
to the cluster. The old starlx machines have now become the interactive nodes. We have
also added 18 new PIII/450 machines, 4 512GB raidzone disk vaults, and 12 PIII/450 processors
to existing dual-processor machines that had only 1 processor. All the batch
nodes now have 2 GB of swap space. Check out the hardware section for more details.
- Thursday, August 5, 1999
2GB of memory was added to starsu00 and 256GB of memory was added to starlx0[1-8].
We also lost the second 23GB drive, so /scratch and /scratch/common have been moved to /data06.
- Saturday, July 10, 1999
972GB of new drive space added to starsu00 in directories /data0[4-8].
- Monday, October 5, 1998
All remaining pdsf HP systems (pdsfhp1 - pdsfhp32) will be decommissioned.
- Wednesday, September 30, 1998
All remaining pdsf Sun systems (pdsfsu1 - pdsfsu24) will be decommissioned.
- Friday, September 18, 1998
The old data vaults will be decommissioned.
- Friday, September 11, 1998
The new data vault bank will be available for general use.
- Friday, September 11, 1998
16 Linux machines (pdsflx01 - pdsflx16) will be available for general use.
- Monday, June 22, 1998
Decommission 8 PDSF Sun systems (pdsfsu25 - pdsfsu32).
Software Projects
- Friday, July 9, 1999
The cluster was moved to 2.2.10-ac10 kernel.
- Wednesday, June 30, 1999
New cluster monitoring service became operational.
- Wednesday, April 28, 1999
LSF 3.2 became available for general use.
- Monday, August 17, 1998
LSF 3.1 will be available for use on the starsu & starlx machines.
- Thursday, June 18, 1998
The following will be completed on the PDSF cluster:
  - Move the PDSF /home/common/gc5 and /home/common/star to the PDSF cluster.
  - Increase the scratch space on the Linux machines.
  - Unmount the /home/user area from the PDSF cluster.
  - Implement a more flexible batch queueing system which will perform load balancing activities.