Office of the Inspector General

Audit Report - A-13-97-12014


Office of Audit

Evaluation of the Social Security Administration's Back-up and Recovery Testing of Its Automated Systems - A-13-97-12014 - 9/24/97

TABLE OF CONTENTS

EXECUTIVE SUMMARY

INTRODUCTION

RESULTS OF REVIEW

FURTHER IMPROVEMENTS ARE NEEDED TO STRENGTHEN SSA’S OVERALL RECOVERY TESTING PROCESS

The Application Test Objectives Were Not Completed

Only 6 of the 12 Critical Workload Areas Have Been Tested to Date at COMDISCO

No Documented Performance Standards Exist to Measure Stress Test Results

Difficulties in Establishing the Support Environment and Incompatibilities between Different Facility Complexes Prevented the Successful Completion of the MTAS Workload

Tapes Sent to COMDISCO from the OSSF May Not Include All Critical Workloads

CONCLUSION AND RECOMMENDATIONS

APPENDICES

Appendix A - Critical Workloads

Appendix B - SSA Comments

Appendix C - Major Contributors to this Report

EXECUTIVE SUMMARY

OBJECTIVE

The objective of this audit was to observe and evaluate the testing of the Social Security Administration's (SSA) back-up and recovery plan (BRP) conducted at COMDISCO from February 28 through March 2, 1997.

BACKGROUND

SSA is required by Office of Management and Budget (OMB) Circular A-130 to have in place a disaster recovery plan for its automated systems. The recovery plan should be fully documented and periodically tested. This is the second report that focuses on SSA's disaster recovery planning. In the first report, we reviewed the BRP document and other related areas and concluded that SSA had made significant improvements since our prior review in 1984. We also found that SSA generally was in compliance with OMB Circular A-130. In this report, we address the periodic testing of the BRP.

SSA operationally tests its recovery plan every 12 to 18 months. The March 1997 test was the fourth opportunity SSA has had to test its on-line environment at COMDISCO in North Bergen, New Jersey. COMDISCO is a commercial recovery facility vendor that SSA has contracted with to provide its disaster recovery support.

The primary objectives for this test were to re-establish data processing and network environments, and test the functionality of a limited number of on-line and batch environment applications. One field office (FO) from each of six SSA regions participated in this test. The applications tested were to:

  • Process initial title II claims through the Modernized Claims System (MCS) and then run the batch jobs at night.
  • Process some critical payments through the Critical Payment System (CPS) and run the batch jobs at night.
  • Process payroll through the Management Time and Attendance System (MTAS).
  • Perform some on-line query responses.

In addition to the applications being tested, SSA performed testing of the network to determine at what point the network began experiencing significant processing delays.

RESULTS OF REVIEW

The March 1997 recovery testing at COMDISCO did not meet all of its objectives. We believe the testing could have been more successful if more time had been made available to solve the start-up problems.

Further improvements are needed to strengthen SSA's overall recovery testing process. Specifically, we found that:

  • THE APPLICATION TEST OBJECTIVES WERE NOT COMPLETED;
  • ONLY 6 OF THE 12 CRITICAL WORKLOAD AREAS HAVE BEEN TESTED TO DATE AT COMDISCO;
  • NO DOCUMENTED PERFORMANCE STANDARDS EXIST TO MEASURE STRESS TEST RESULTS;
  • DIFFICULTIES IN ESTABLISHING THE SUPPORT ENVIRONMENT AND INCOMPATIBILITIES BETWEEN THE MANAGEMENT INFORMATION SERVICE FACILITY (MISF) AND THE PROGRAMMATIC PROCESSING FACILITY (PPF) COMPLEXES PREVENTED THE SUCCESSFUL COMPLETION OF THE MTAS WORKLOAD; AND
  • TAPES SENT TO COMDISCO FROM THE OFF-SITE STORAGE FACILITY (OSSF) MAY NOT INCLUDE CRITICAL FILES.

RECOMMENDATIONS

To improve the recovery testing process, we recommend that SSA:

  • Consider increasing the test time at COMDISCO from 3 to 4 days to allow more time for technicians to solve start-up problems.
  • Develop a master application test plan so that all critical workloads are tested on a cyclical basis. Plan to test the critical workload areas that have not yet been tested. Increase the number of applications being tested for the next test date.
  • Develop performance standards and certify stress test results.
  • Correct the incompatibility problem to assure that all non-PPF workloads will run at COMDISCO.
  • Continue to automate the back-up tape pick list process to select tapes that are critical to the test.

SSA COMMENTS

SSA agreed with our report recommendations to strengthen the back-up and recovery testing at COMDISCO. Appendix B of this report includes a copy of SSA’s comments to our report.

INTRODUCTION

OBJECTIVE

The objective of this audit was to observe and evaluate the testing of SSA's BRP conducted at COMDISCO from February 28 through March 2, 1997.

BACKGROUND

On June 29, 1993, SSA contracted with COMDISCO in North Bergen, New Jersey, to provide SSA's recovery support. The contract was amended in November 1995 to include a COMDISCO satellite location in Columbia, Maryland, which would be used by SSA for test monitoring.

The March 1997 test was the fourth on-line testing opportunity SSA has had at COMDISCO. SSA ran its first test on December 12-14, 1993. As with any initial exercise, SSA had a few start-up problems. SSA gained more experience and expanded the test objective for the August 12-14, 1994 test to include submitting transactions on-line from various FOs directly to COMDISCO. According to SSA, this test was successful and demonstrated the Agency's ability to re-establish the functionality needed to resume critical operations at an alternate site. A test had been scheduled for June 2-4, 1995, but, because SSA requested more direct access storage device (DASD) capacity than the contract called for, the test had to be postponed until January 1996 when more DASD became available at COMDISCO. The January 26-28, 1996 test was expanded to include not only the COMDISCO recovery facility in North Bergen, New Jersey, but also the satellite site in Columbia, Maryland, for test monitoring by SSA.

For the March 1997 test, there were 34 recovery team members at North Bergen and 21 members at Columbia. The primary objectives for this test were to re-establish the data processing and network environments, and test the functionality of a limited number of on-line and batch environment applications. One FO from each of the six regions participated in this test. The applications tested were to:

  • Process initial title II claims through the MCS and then run the batch jobs at night.
  • Process some critical payments through CPS and run the batch jobs at night.
  • Process payroll through MTAS.
  • Perform some on-line query responses.

In addition to the applications being tested, SSA's Division of Integration and Environmental Testing (DIET) performed stress testing of the network to determine at which point the network began experiencing significant processing delays.

SCOPE AND METHODOLOGY

We used several methods to gather evidence for our audit. We reviewed:

1. relevant documents, e.g., previous studies by the Office of the Inspector General and others;
2. SSA's January 31, 1996 BRP document; and
3. SSA's recovery test results documents for tests conducted at COMDISCO during December 1993, August 1994, and January 1996.

We observed the February 28 through March 2, 1997 recovery testing at the COMDISCO facility in North Bergen, New Jersey and interviewed SSA personnel who were at North Bergen, New Jersey and at the satellite location in Columbia, Maryland. Field work was performed at SSA Headquarters in Baltimore, Maryland and at COMDISCO in North Bergen, New Jersey and Columbia, Maryland between March 1997 and April 1997. Our audit was performed in accordance with generally accepted government auditing standards.

RESULTS OF REVIEW

FURTHER IMPROVEMENTS ARE NEEDED TO STRENGTHEN SSA'S OVERALL RECOVERY TESTING PROCESS

The March 1997 recovery testing at COMDISCO did not meet all of its testing objectives. The disaster recovery team (DRT) was able to re-establish the data processing and network environments; however, it was unable to complete the on-line and batch application testing with the FOs. We believe that, with more time, the DRT could have completed more objectives. Improvements are needed to strengthen SSA's overall recovery testing process.

The Application Test Objectives Were Not Completed

This test presented several new circumstances, described below, which resulted in an unstable operating environment when the applications were tested on Saturday, March 1. The instability resulted from the DRT not having enough time to resolve operating and application start-up problems caused by the following factors:

  • missing data files;
  • new release versions of several support software products being introduced at the same time;
  • inexperience of new personnel; and
  • new hardware.

If the DRT had more time up front to solve the start-up problems, we believe most of the test applications could have been successfully completed on March 1. SSA's dynamic data processing and application environments are becoming more complex each year. Given these complexities and interdependencies, we believe that, regardless of the extent of planning by the DRT, there will always be the risk of unanticipated start-up problems.

The window of opportunity for testing on-line applications is on Saturdays when FOs are closed and the network can be switched over to COMDISCO. Late Saturday and Sunday are used to execute the batch systems, perform on-line maintenance, and purge the system of SSA test data. The DRT needs an additional 24 hours up front (starting Thursday at 8 a.m. rather than Friday at 8 a.m.) to resolve any operating start-up problems so on-line application testing can begin on time early Saturday morning.

Only 6 of the 12 Critical Workload Areas Have Been Tested to Date at COMDISCO

After four testing opportunities at COMDISCO (December 1993, August 1994, January 1996, and March 1997), only 6 of the 12 critical workload areas have been tested. Of the six areas that have been tested, only the on-line queries, title XVI claims processing, and MTEXT workloads have been totally successful. There was also limited success in processing post entitlement events (for example, some applications have run successfully while others have not). See Appendix A for a list of the 12 workload areas.

We believe only 6 workloads have been tested to date because of incomplete planning by SSA for testing all the applications in the 12 critical workload areas. Our conclusion is based on the following points:

  • SSA does not have a multi-year (master) application test scheduling plan to ensure that all critical workload areas are tested on a cyclical basis, i.e., every 3 years. According to SSA, each test plan stands on its own merit, which means the results from each test have not been compiled for developing an overall application testing plan schedule.
  • In our discussions with SSA, there were some inconsistencies among SSA components as to what the critical workloads were within the 12 workload areas. The inconsistencies in defining the critical workloads indicate that planning needs improvement. For example, we noted inconsistencies in the latest BRP document, dated January 31, 1996, which identified the critical workloads. We questioned why the 800-number system to schedule appointments and referrals was listed as a critical workload in Appendix F of the BRP but not in the executive summary. One SSA component said it was a critical workload, while another said it was not. In another example, we inquired why the MTEXT workload, which had been scheduled for the March 1997 test, was canceled. The reason given was that SSA now believes this workload is not critical. Originally, it was believed that some new beneficiaries would not get their checks unless the MTEXT notices were generated. Better planning would have resulted in eliminating the MTEXT workload from the critical workload list.
  • For the March 1997 recovery test, one application test objective was to process title II claims through the MCS. However, not all title II claims are processed through MCS. While all claims are initiated through MCS, if MCS identifies exceptions (such as missing Master Beneficiary Record data) the claim must then be processed either through the Claims Automated Process System (CAPS) or through the Manual Adjustment Debit, Credit and Award Process (MADCAP).

In February 1997, MCS processed 70 percent of the claims, CAPS processed 4 percent, and MADCAP processed 16 percent. Testing for only those title II claims that could be processed through MCS overlooks about 30 percent of all title II claims.

Finally, SSA only has the opportunity to test every 12 to 18 months at COMDISCO. Currently, SSA is testing between three and four applications per test date. Testing a larger number of applications would be more efficient.
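
The master application test plan recommended later in this report could take the form of a simple structured record per critical workload area, tracking what was tested, when, and with what result. The sketch below is illustrative only; the record layout, workload entries, dates, and the 3-year cycle are assumptions, not SSA's actual plan.

    from dataclasses import dataclass
    from datetime import date
    from typing import List, Optional

    @dataclass
    class WorkloadTestRecord:
        """One entry in a master application test plan."""
        workload: str                   # critical workload area (see Appendix A)
        applications: List[str]         # applications that make up the workload
        last_tested: Optional[date]     # date of the most recent recovery test, if any
        result: str                     # e.g., "successful", "partial", "not tested"
        next_scheduled: Optional[date]  # next planned recovery test date

    def overdue(plan: List[WorkloadTestRecord], cycle_years: int = 3) -> List[WorkloadTestRecord]:
        """Return workloads never tested, or not tested within the stated cycle."""
        today = date.today()
        return [r for r in plan
                if r.last_tested is None
                or (today - r.last_tested).days > cycle_years * 365]

    # Hypothetical entries; the actual 12 workload areas are listed in Appendix A.
    plan = [
        WorkloadTestRecord("On-line queries", ["Query response"],
                           date(1997, 3, 1), "successful", date(1998, 9, 1)),
        WorkloadTestRecord("Time and attendance for payroll", ["MTAS"],
                           date(1997, 3, 1), "not completed", date(1998, 9, 1)),
        WorkloadTestRecord("Emergency enumeration requests", ["Enumeration"],
                           None, "not tested", None),
    ]
    for record in overdue(plan):
        print(record.workload)

A compiled plan of this kind would also give SSA a single place to record the results of each test so that the next test date can target the workloads that are overdue.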

No Documented Performance Standards Exist to Measure Stress Test Results

The purpose of stress testing is to determine the volume of transactions at which the network would experience significant delays. These tests are designed to simulate how the system will perform under actual conditions with a high volume of transactions being processed at one time. For this test, DIET officials said they reached about 350 transactions per second before the network began experiencing delays. In comparison, we have been told that during the peak time of a normal day, the National Computer Center (NCC) will process over 900 transactions per second. However, the DIET stress test results cannot be measured since there are no documented performance standards. Consequently, the SSA officials we talked with could not say whether this service performance level at COMDISCO would be acceptable in a disaster situation. The results are not meaningful unless they can be measured against a stated service performance standard.

Also, for the March 1997 test, the results achieved (350 transactions per second) were based only on log-on/off and query-only transaction profiles. The profiles used for the test excluded those transactions that would have resulted in an action to update a data base. Since this was not representative of a typical daily production transaction mix at the NCC, these stress test results are even less meaningful. We were told that not all transaction profiles could be used for this test because of technical limitations.
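
To illustrate what a documented service performance standard would make possible, the sketch below compares a measured stress test result against a stated minimum throughput. The 350 and 900 transactions-per-second figures come from this report; the 450 threshold and the certification function are hypothetical assumptions, since SSA has not established an actual standard.

    # Hypothetical sketch: certify a stress test result against a documented
    # performance standard. The disaster-mode threshold below is assumed; SSA
    # would need to set the real value, for example by benchmarking at the NCC.

    NCC_PEAK_TPS = 900            # approximate peak at the NCC on a normal day (per this report)
    DISASTER_STANDARD_TPS = 450   # assumed minimum acceptable throughput in a disaster

    def certify(measured_tps: float, standard_tps: float = DISASTER_STANDARD_TPS) -> str:
        """Compare a measured result against the documented standard."""
        if measured_tps >= standard_tps:
            return f"PASS: {measured_tps} tps meets the {standard_tps} tps standard"
        return f"FAIL: {measured_tps} tps is {standard_tps - measured_tps} tps below the standard"

    print(certify(350))   # March 1997 result: delays began at about 350 tps

A complete standard would also specify the transaction mix (including update transactions) against which the measured throughput must be achieved.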

Difficulties in Establishing the Support Environment and Incompatibilities between Different Facility Complexes Prevented the Successful Completion of the MTAS Workload

Most of SSA's critical workload applications run in the PPF complex environment; however, several applications run outside it. Examples include Falcon, PSC/OCRO batch, and MTAS, which run in the MISF complex, and VTAM and NETVIEW, which reside in the Network Management Facility complex environment.

For the March 1997 test, SSA tested the MTAS application at COMDISCO. This was the third time the time and attendance application did not meet all of the test objectives. One reason for the problem is that SSA has attempted to execute an MISF application in the PPF environment. According to SSA officials, this presents a number of logistical and technical problems, such as record blocking lengths, which, to date, have made the MTAS application incompatible with the PPF environment. Also, because most of the other non-PPF critical workload applications have not been tested to date, SSA has no assurance these applications will work.
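
As one way to surface such incompatibilities before a test, the sketch below compares configuration attributes for an application between its home complex and the recovery environment. The attribute names and values are hypothetical; the report does not identify the actual parameters beyond record blocking lengths.

    # Hypothetical sketch: flag configuration mismatches (e.g., record blocking
    # lengths) between an application's home complex (MISF) and the environment
    # available at the recovery site. All attribute names and values are illustrative.

    misf_profile = {"record_length": 80, "block_size": 27920, "catalog": "MISF.CATALOG"}
    recovery_profile = {"record_length": 80, "block_size": 32760, "catalog": "PPF.CATALOG"}

    def find_mismatches(home: dict, recovery: dict) -> dict:
        """Return attributes whose values differ between the two environments."""
        return {key: (home[key], recovery.get(key))
                for key in home
                if home[key] != recovery.get(key)}

    for attr, (home_value, recovery_value) in find_mismatches(misf_profile, recovery_profile).items():
        print(f"{attr}: home={home_value}, recovery={recovery_value}")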

Tapes Sent to COMDISCO from the OSSF May Not Include All Critical Workloads

Files to be sent to COMDISCO from the OSSF currently are judgmentally selected from the more than 45,000 tapes stored there. This process introduces the risk of human error: critical tapes may not be selected, and valuable time would be lost in a disaster recovery situation. This condition occurred in the March 1997 test, when several MTAS and IDMS files were missing.

While SSA has made some improvements in the development of the back-up tape pick list, further automation of the process is still needed. The recovery pick list should be automated since all the critical workloads are known and all the files associated with these workloads can be identified. The improvements that were made make the process more flexible in that the pick list can be generated outside the SSA complex. Prior to this improvement, the tapes had to be selected by a person located in the NCC complex. The improvements permit Office of Systems Design and Development and Office of Telecommunications and Systems Operations personnel to select tapes from a remote site using a laptop computer and a modem.
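
A minimal sketch of the kind of automation recommended here: given a known mapping from critical workloads to their back-up files, and a tape inventory, generate the pick list of tape volumes to retrieve from the OSSF and flag any critical file with no tape on record. The workload names, file names, and volume serials are hypothetical.

    # Hypothetical sketch of an automated back-up tape pick list. The workload-to-file
    # mapping and tape inventory are illustrative; in practice they would come from
    # SSA's tape management records.

    critical_files = {
        "Time and attendance (MTAS)": ["MTAS.TIME.MASTER", "MTAS.PAYROLL.TRANS"],
        "On-line queries": ["IDMS.QUERY.DB01", "IDMS.QUERY.DB02"],
    }

    # file name -> tape volume serial at the off-site storage facility
    tape_inventory = {
        "MTAS.TIME.MASTER": "VOL001",
        "MTAS.PAYROLL.TRANS": "VOL002",
        "IDMS.QUERY.DB01": "VOL003",
    }

    def build_pick_list(workloads, inventory):
        """Return the tape volumes needed and any critical files with no tape on record."""
        picks, missing = set(), []
        for workload, files in workloads.items():
            for file_name in files:
                if file_name in inventory:
                    picks.add(inventory[file_name])
                else:
                    missing.append((workload, file_name))
        return sorted(picks), missing

    picks, missing = build_pick_list(critical_files, tape_inventory)
    print("Pick list:", picks)
    print("Files with no tape on record:", missing)

Because the list of missing files is produced automatically, gaps like the missing MTAS and IDMS files in the March 1997 test could be detected before the tapes are shipped.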

CONCLUSION AND RECOMMENDATIONS

The March 1997 testing at COMDISCO did not meet all of its objectives. We believe the DRT could have been more successful if more time had been scheduled to resolve start-up problems. The DRT was able to re-establish the data processing and network environments; however, it was unable to complete the on-line and batch application testing with the FOs. Further improvements are needed to strengthen SSA's overall recovery testing process. Specifically, we recommend that SSA:

  • Consider increasing the test time at COMDISCO from 3 to 4 days to allow more time for technicians to solve start-up problems.
  • Develop a master application test plan so that all critical workloads are tested on a cyclical basis. This should include a list and description of all the workloads that would be done in each of the 12 critical workload areas, when the workload was last tested, the results, and when it is next scheduled for testing.
  • Plan to test the critical workload areas that have not yet been tested. Also, for the next recovery test, SSA should increase the number of applications tested.
  • Develop performance standards and certify the DIET stress test results.
  • Perform benchmarking at the NCC to establish an acceptable service performance standard at COMDISCO.
  • Analyze the environmental incompatibility problem, determine the best approach, and implement appropriate corrective action to assure that all non-PPF workloads will run at COMDISCO.
  • Continue planning to automate the back-up tape pick list process to select tapes. The personnel at the OSSF should be able to execute an inventory selection program that would automatically generate the back-up tape pick list.

SSA COMMENTS

SSA agreed with all recommendations and informed us that corrective actions are being taken.

APPENDICES

APPENDIX A

The following critical workloads were identified from page 8 of the Social Security Administration's January 31, 1996 back-up and recovery plan for the National Computer Center.

CRITICAL WORKLOADS

1. Claims, where payment is due within 30 days.

2. Earnings records for disability cases so that development can proceed on insured applicants.

3. Critical payments.

4. Process appeals with allowances.

5. Stop work reports.

6. Emergency enumeration requests.

7. Time and attendance systems for payroll.

8. Certification/accounting system for payments.

9. Interactive direct input for postentitlement events which affect payment.

10. On-line queries.

11. Critical processing center workloads controlled by the Processing Center Action Control System.

12. Critical incomplete notices processed through MTEXT.

To assist in processing critical workloads, the following systems facilities will be available:

  • Administrative-related support facilities such as TOP SECRET, NEWS, NETSTAT, and Model District Office informational releases.
  • The Modernized OCRO System.
  • Falcon Data Entry Software and Program Service Center Workloads.
  • Electronic mail, specifically cc-mail.

APPENDIX C

MAJOR CONTRIBUTORS TO THIS REPORT

Office of the Inspector General

Scott Patterson, Director, Evaluations and Technical Services
Bruce Daugherty, Audit Manager
Randy Townsley, Senior Auditor
