S4PA

What is S4PA?

The Simple, Scalable, Script-Based, Science Product Archive (S4PA) is a radically simplified data archive architecture for supporting our GES DISC users with online access to data. S4PA is already being used operationally and its deployment will be expanded in the months and years to come.

For further information, please also see S4PA Frequently Asked Questions.

Genesis of the S4PA

Until recently, archiving at the Goddard Earth Sciences Data and Information Services Center (GES DISC) has been handled by robotic tape archives, or silos. Tape silos are expensive to deploy and operate but have the advantage of scaling well for large data volumes. However, the viability of disk-based archives has been enhanced by the recent NASA trend toward smaller data systems that serve specific, focused communities rather than the general public. The smaller data volumes of these systems benefit less from economies of scale. One such example is the REASoN CAN, two instances of which will rely on the GES DISC for data management. As a result, the GES DISC is undertaking the construction of a disk-based science archive based on our long experience in archiving requirements, design, and operations. The GES DISC's successful implementation of the Simple, Scalable, Script-based Science Processor (S4P) points the way, demonstrating the utility of Radical Simplification in implementing inexpensive, robust, scalable systems.

Radical Simplification

Owing to the limited funding of these small systems, the paramount driver is cost, for both implementation and operations. So far, the most effective cost reduction strategy has been Radical Simplification (RS). This principle is rooted in the converse observation that cost rises non-linearly with complexity, because both code size and code interactions contribute to the cost. It is one thing to espouse Radical Simplification and quite another to implement it. RS can be applied to all phases of development, but especially to requirements and design, where total life-cycle cost is most sensitive to variations. By itself, RS is more of a goal than a useful guideline. However, it can be expanded into several axioms that a development group can follow to achieve its benefits.

Benefits of Radical Simplification

Less Code = Less Bugs

The most obvious KISS guideline is simply to keep the system small, not through overly elegant (or obscure) code styles, but rather by implementing only the bare minimum functionality required. On a day-to-day basis, this often translates into the Extreme Programming rule "You aren't going to need it" (YAGNI), or the 80/20 rule (Juran & Gryna 1951) that 80% of functionality can be achieved in the first 20% of code (see below).

80/20 Rule (aka Pareto's rule)

80% of the desired functionality is usually achieved in the first 20% of effort. Thus, if at all possible, it is useful to find workarounds, alternatives, or simply ways to live without the remaining, more expensive 20% of desired functionality.

"You Aren't Gonna Need It" (YAGNI)

This is a key principle of Extreme Programming. Rather than design and build many components or features that might be needed, it is more cost effective to build only that which is known to be required, accepting the risk of refactoring later if need be.

Leverage Other People's Effort

This actually covers a broad range of tactics, from simple reuse to buying off the shelf to teaming arrangements. Note that it is often useful to reuse items other than software, such as requirements, architectures, standards and even procedures.

Use the Operating System Wherever Possible

Coupled with the 80/20 rule, this involves using the operating system (i.e., the file system and process scheduling system) rather than writing more sophisticated, customized versions of various search, inventory and job scheduling functions. The idea is to leverage the vast investments that have gone into making operating system kernels fast and robust. For example, filename conventions are used in S4PM to identify different types of work orders, to facilitate searching for data files, and to identify which jobs are nominal, failed, or overdue. Thus, the basic state of the system at any given time is encoded in the file system, allowing simple UNIX commands like 'ls' to be used to monitor system health.
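
As a minimal sketch of how job state can be read straight from filenames, the snippet below classifies work-order files by prefix and tallies them by listing a directory. The naming pattern, the prefixes (DO, FAILED, OVERDUE) and the function names are assumptions made for illustration; they are not taken from S4PM.

```python
from collections import Counter
from pathlib import Path

# Hypothetical naming convention (an assumption for illustration only):
#   <STATE>.<work_order_type>.<job_id>.wo
STATES = {"DO": "nominal", "FAILED": "failed", "OVERDUE": "overdue"}


def classify(filename: str) -> str:
    """Return the job state encoded in the leading token of a filename."""
    prefix = filename.split(".", 1)[0]
    return STATES.get(prefix, "unknown")


def summarize(station_dir: Path) -> Counter:
    """Count jobs per state by listing the directory, much as 'ls' would."""
    return Counter(
        classify(p.name) for p in station_dir.iterdir() if p.is_file()
    )
```

No inventory database is involved here: the directory listing is the only source of state, which is the point of the principle.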

Note that several of the above principles rely on a ruthless approach to requirements. Every possible requirement should be questioned as to whether it is really necessary. Requirements in doubt should be postponed to later releases, when more will be known about their actual necessity.

S4PA Architecture

The S4PA Architecture makes three key breaks with past archive architectures. The first is to store the primary copy of all of the data online. A backup tape copy is kept offsite, but only for disaster recovery purposes. This provides the world with simple and immediate access to the data via FTP or HTTP, without the need to place an order for later pickup. This key feature helps to enable the other two breaks. The second is to limit media distribution, on the premise that ready online access to the data makes media distribution relatively less desirable. The third follows in turn: because the system includes neither a complicated mass storage tape system nor a complex multi-step order fulfillment system, it needs no relational database to track the relationship of data items to files or to monitor order fulfillment. This reduces costs considerably, not only in terms of commercial database licenses, but also in database design and administration costs.

At its root, a science data archive needs to be able to store data in a place where it can find it and serve the right data up to users or applications that request it. In the absence of a mass storage tape system or relational database, the S4PA architecture relies simply on the UNIX directory structure. That is, data are stored by data group, year, and optionally month or day of the year. The idea is to keep the number of inodes in a given directory below 1000. (Larger directories cause problems with file globbing and backups.) The directory structure exists under ~ftp to allow anonymous FTP access.
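
The layout described above can be sketched as a pure path computation. This is an illustration only: the function names, the granularity labels ("year", "month", "doy") and the rule of thumb for choosing between them are assumptions, not S4PA code.

```python
from datetime import date
from pathlib import PurePosixPath


def storage_dir(root: str, data_group: str, day: date,
                granularity: str = "year") -> PurePosixPath:
    """Build <root>/<data_group>/<year>[/<month> or /<day-of-year>]."""
    path = PurePosixPath(root) / data_group / f"{day.year:04d}"
    if granularity == "month":
        path = path / f"{day.month:02d}"
    elif granularity == "doy":
        path = path / f"{day.timetuple().tm_yday:03d}"
    elif granularity != "year":
        raise ValueError(f"unknown granularity: {granularity}")
    return path


def pick_granularity(files_per_day: int, limit: int = 1000) -> str:
    """Pick the coarsest layout that keeps a directory under `limit` entries."""
    if files_per_day * 366 < limit:
        return "year"
    if files_per_day * 31 < limit:
        return "month"
    return "doy"
```

For example, a hypothetical data group delivering 10 files per day would exceed 1000 entries in a yearly directory, so `pick_granularity(10)` selects the monthly layout.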

Furthermore, the filenames under which the data are stored will have a uniform structure to make them easy to understand and locate, including at a minimum an indicator of the dataset and the start time of the data file.
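
A uniform filename of this kind can be built and parsed mechanically. The sketch below assumes a hypothetical pattern, <dataset>.<start time>.<extension>, with the start time written as YYYYMMDDTHHMMSS; the actual S4PA naming convention is not given here and may differ.

```python
from datetime import datetime
from typing import Tuple

TIME_FORMAT = "%Y%m%dT%H%M%S"  # assumed start-time encoding, not from S4PA


def build_filename(dataset: str, start: datetime, extension: str = "hdf") -> str:
    """Compose '<dataset>.<start time>.<extension>'."""
    return f"{dataset}.{start.strftime(TIME_FORMAT)}.{extension}"


def parse_filename(filename: str) -> Tuple[str, datetime]:
    """Recover the dataset indicator and start time from a filename."""
    # Split from the right so dataset names may themselves contain dots.
    dataset, stamp, _extension = filename.rsplit(".", 2)
    return dataset, datetime.strptime(stamp, TIME_FORMAT)
```

Because both fields can be recovered from the name alone, locating a file needs nothing more than the directory listing.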

UNIX directories are also used as an organizing principle for other functions. The Receiving function, for instance, is implemented for multiple data providers by giving each data provider a directory in which to deliver data. Methods with standard names (e.g., IdentifyDataset()) are used to implement provider-specific actions. Likewise, the Media system will also use directories as an organizing principle, in this case to associate the data to be backed up with specific drives (i.e., a DVD writer).
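
One way to picture per-provider directories combined with standard method names is sketched below. Only the name IdentifyDataset() comes from the text (written here as identify_dataset in Python style); the classes, the filename rules and the receive() function are hypothetical.

```python
from pathlib import Path
from typing import Dict, Optional


class Provider:
    """Base class: every provider exposes the same method names."""

    def identify_dataset(self, filename: str) -> Optional[str]:
        raise NotImplementedError


class PrefixProvider(Provider):
    """Hypothetical provider whose filenames begin with '<dataset>_'."""

    def identify_dataset(self, filename: str) -> Optional[str]:
        head, sep, _rest = filename.partition("_")
        return head if sep else None


class SuffixProvider(Provider):
    """Hypothetical provider whose dataset is the token before the extension."""

    def identify_dataset(self, filename: str) -> Optional[str]:
        parts = filename.split(".")
        return parts[-2] if len(parts) >= 3 else None


def receive(incoming_root: Path,
            providers: Dict[str, Provider]) -> Dict[str, Optional[str]]:
    """Scan each provider's delivery directory and identify every file found."""
    results: Dict[str, Optional[str]] = {}
    for name, provider in providers.items():
        delivery_dir = incoming_root / name
        if not delivery_dir.is_dir():
            continue  # nothing delivered by this provider yet
        for path in sorted(delivery_dir.iterdir()):
            if path.is_file():
                results[f"{name}/{path.name}"] = provider.identify_dataset(path.name)
    return results
```

The directory name selects the provider and the shared method name lets the scanning loop stay identical for all of them.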

A further simplification from other science data archive systems is the use of system backups for making offsite tape copies. This is somewhat unwieldy if the entire disk array is treated as one or two file systems, however. Therefore, the file systems will be approximately the same size as the backup tapes. Data will be distributed among these file systems as they arrive, with symbolic links used to organize the data from a user's perspective.
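
The combination of tape-sized file systems and symbolic links can be sketched as follows. This is a simplified illustration under stated assumptions: the function name, the "first volume with enough free space" placement rule and the directory arguments are hypothetical, and the real system's placement policy is not described here.

```python
import shutil
from pathlib import Path
from typing import List


def archive_file(src: Path, volumes: List[Path], public_dir: Path) -> Path:
    """Copy src to the first volume with room, then symlink it into public_dir."""
    size = src.stat().st_size
    for volume in volumes:
        if shutil.disk_usage(volume).free > size:
            stored = volume / src.name
            shutil.copy2(src, stored)
            break
    else:
        raise OSError("no volume has enough free space")

    # Users see one tidy tree; the link hides which volume holds the file.
    public_dir.mkdir(parents=True, exist_ok=True)
    link = public_dir / src.name
    if link.is_symlink():
        link.unlink()
    link.symlink_to(stored.resolve())
    return link
```

Each volume can then be backed up to a single tape with an ordinary system backup, while the user-facing directory tree stays independent of the physical layout.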

S4PA Subsystems

The core of the S4PA comprises Receiving and Storage. S4PA will accommodate any number of add-on components to provide value-added services, but these are not considered part of the core. The Receiving system's primary task is to detect new incoming data, extract metadata, and hand the data off to the appropriate Storage entity (i.e., data collection). Storage is responsible for detecting duplicate data and maintaining the data.

Finally, the S4P kernel is the engine which keeps the various components of the system running. Its chief benefits are robust polling and workflow capabilities. It also provides a graphical user interface for monitoring and failure handling.

In addition, the S4PA interfaces to Mirador and the Web Hierarchical Ordering Mechanism (WHOM) for search and access, as well as to the EOS Clearinghouse (ECHO) and a number of interoperability servers, such as OPeNDAP, GrADS and OpenGIS.




  • Last updated: July 20, 2009 20:32:10 GMT