
Updating the Testbed Configuration for FY 2004

The testbed system has provided us with a useful facility for developing the benchmark methodology and special benchmark codes for the GUPFS project. It has also been useful in helping to establish the credibility of GUPFS with technology vendors and in building relationships with them. However, with the inexorable progression of technical advancements, it became apparent that the testbed was inadequate in size for conducting the types and levels of technology evaluations needed for the GUPFS project in FY 2004.

Technological advancements over the last year have outstripped the ability of the existing testbed to incorporate them. Experience with the testbed and attempts to integrate new storage and fabric technology into it demonstrated that more nodes were needed in the testbed to allow emerging advanced technologies to be integrated into it for evaluation, even in the near term. Given the technologies that the GUPFS project plans to begin evaluating in FY 2004, it was clear that the existing testbed could not accommodate their inclusion.

In addition to the testbed being limited in accommodating new technologies, practical experience during FY 2003 indicated to us that the testbed could be improved in a number of ways that would increase the speed at which file system evaluations could be conducted in conjunction with specific combinations of fabrics and high-performance storage. Foremost of these was that the testbed should be able to conduct multiple, simultaneous, independent evaluations with node sets of various sizes. Next was that the testbed be able to reconfigure the fabric and storage connections much more easily. Other important areas of improvement were to greatly increase the Gigabit Ethernet connectivity for iSCSI and cross-fabric testing, and to use only a single node type in testing. Based on these needs, we designed a testbed upgrade to facilitate more simultaneous evaluations and much more rapid and easy reconfiguration. The design considerations for the update to the testbed are discussed in the following section.

We completed the improved testbed system design at the end of the third quarter of FY 2003. We then identified, tested, and procured the components during the fourth quarter. These were assembled and integrated into the existing testbed system at the end of the fourth quarter of FY 2003 in preparation for the planned FY 2004 activities. The configuration of the expanded testbed system is discussed in detail later in this section.

1. Updated Testbed Design Considerations

The design of the upgrade to the GUPFS project testbed system was predicated on a number of factors: (1) the lessons learned and (2) the limitations encountered in using the FY 2003 testbed to integrate and test new technologies, (3) the new and emerging technologies expected to be investigated in the coming year, and (4) several additional considerations affecting the upgrade. These are discussed below.

Lessons Learned

The FY 2003 testbed included many design features that improved its usability and utility over the FY 2002 testbed. However, a number of lessons were learned through the use of the updated FY 2003 testbed and the evaluations conducted on it. These directly impacted the design of the testbed upgrade for FY 2004. The major lessons learned are:

          Multiple evaluations need to be done simultaneously. An evaluation of a technology component is complex, and time consuming to both set up and conduct. The reality of the situation is that a technology component cannot be evaluated in isolation. In general, one of the GUPFS component technology types can only be evaluated in conjunction with one or more of the other component types. For example, to evaluate a file system, it is necessary to test it in conjunction with storage and some fabric or interconnect connecting the clients to the storage. In addition, each component usually has required versions of software, necessitating client systems being configured with specific OS and driver versions and specific fabric connectivity. Setting up the required configurations to conduct specific evaluations is time consuming, often on the order of several weeks or a month. Often, a single evaluation takes several months and requires dedicated resources. Because of the number of important component technology issues that need to be investigated, particularly in the file system arena and multi-cluster configurations, and because of the rapidly changing component technologies, it is vital to have enough resources to keep extended, unrelated evaluations in progress simultaneously.

          Only a single type of compute node should be used. The FY 2003 testbed contained two types of compute nodes: the 18 dual Pentium-4 compute nodes and the 5 legacy dual Pentium-3 compute nodes. Because of the limited number of available compute nodes and the high volume of items to evaluate or investigate, it was necessary to use both types of nodes for evaluations. This predictably caused problems in a number of areas. First, the different node architectures required that distinct software configurations be built, installed, and tested for each type, thereby increasing the administrative burden for the testbed. Second, the hardware and performance differences between the two types of nodes made it very difficult to compare results obtained from different evaluations, and made evaluations that used both types of nodes at the same time difficult to interpret. The solution to this difficulty is to standardize on a single node type for use as compute nodes in evaluations, and to obtain enough of them to conduct the necessary number of simultaneous evaluations at adequate scale.

          A separate interactive node is needed. The FY 2003 testbed used a single dual Pentium-3 node both as a management node for running and maintaining the testbed environment and as an interactive node for project members to log into and conduct tests and evaluations from. With the increased size and complexity of the testbed and the increased number of evaluations in progress at one time, it became apparent that a single node of this type could not perform both functions without one impacting the other. In addition to periodic failures and substantial delays caused by overloading the management node, the launching of long-running benchmarks frequently prevented timely maintenance activities, such as rebooting the management node to clear software problems or activate patches. Another reason to separate the management and interactive functionality was to reduce the possibility of the management node accidentally being destroyed during the setup of evaluations and the running of benchmarks by project members, both activities that frequently required running with elevated privileges and that led to a number of close calls. Management node functionality is much more difficult and time consuming to configure and install than interactive node functionality. The necessity of separating the management and interactive node functionality contributed to the decision to stop using any of the Pentium-3 nodes as compute nodes and to assign them to testbed support roles.

Limitations

In addition to lessons learned, the design of the FY 2004 testbed was also influenced by certain limitations encountered while using the FY 2003 testbed. These limitations include:

          The EMC CX600 did not meet expectations. The EMC CX600 storage device did not live up to expectations regarding performance scalability. This was partly the result of unexpected architectural limitations and partly the result of configuration constraints. As a consequence, neither the expected aggregate bandwidth nor the desired scalability was achieved. This made the GUPFS project dependent on other scalable storage that was being evaluated, such as the Yotta Yotta NetStorager and the 3PARdata Inserv. With the completion of the 3PARdata evaluation and the Yotta Yotta extended beta test at the beginning of the fourth quarter of FY 2003, the GUPFS project faced the prospect of not having storage of high enough performance or sufficient scalability to use for the file system evaluations planned for the end of FY 2003 and for all of FY 2004.

          The number of Gigabit Ethernet switch ports was inadequate. The GUPFS testbed's Gigabit Ethernet fabric quickly became limited by the 48 available switch ports. The available ports were quickly consumed by the testbed systems, inter-switch links between the two Gigabit switches, the original iSCSI router and Intel iSCSI HBA connections, and the InfiniCon InfiniBand to Gigabit Ethernet bridges. With the introduction of additional equipment requiring Gigabit Ethernet connections, the number of available ports was oversubscribed by at least a factor of two. This resulted in the serialization of evaluations and made it necessary to disconnect various equipment in order to connect and test other equipment. The additional equipment included Topspin InfiniBand to Gigabit Ethernet fabric bridges, Panasas storage devices, Adaptec iSCSI HBAs and TOE cards, and the inter-switch links to the Alvarez management Ethernet switch. Based on the need to maintain adequate inter-switch bandwidth, a 3x expansion of the Gigabit switch ports was needed.

          Reconfiguring the physical connections between storage, fabric elements, and client systems became extremely difficult. In order to conduct evaluations of various file systems, fabric components and bridges, and storage combinations, and to conduct evaluations of various loaner equipment, it was necessary to change the physical fiber-optic connections of the equipment connected to the three 16-port Fibre Channel switches in order to connect the correct set of components. This was made extremely difficult by the rigidity of the bundles of fiber cables and the fragility of the connectors. Moving a fiber between one switch and another often required major efforts to obtain adequate slack to permit the connector to plug into another switch. This was particularly a problem when we were evaluating new equipment such as Fibre Channel switches, which might be physically mounted in a different cabinet. Making such connections required that substantial time be devoted to rebundling fibers or stringing new ones. In addition, replugging the connectors exposed them to mechanical failure (the 2 Gb/s SFPs are especially fragile) and to contamination of the optics with dust. Another problem with making changes to the fiber configuration was that it soon became very difficult to determine what was connected to what and which fibers were active. A fiber patch panel is needed to resolve these problems.

          A single dedicated metadata server node is not enough. It became apparent that a single special-purpose node acting as a dedicated metadata server was inadequate. Nearly all of the shared file systems being tested required either a metadata server or a centralized lock manager. As a consequence, testing of these file systems became serialized because of the single metadata/lock server. Frequently, more than one of these file systems was in some stage of the installation and evaluation cycle; sometimes a file system would be undergoing several tests at once, each instance having different hardware and/or software configurations. This required either very careful alternating of test segments, or suborning other nodes to become additional metadata servers. Using other testbed nodes as secondary metadata servers impacted other activities by limiting the number of nodes available to them. Similar constraints applied to testing configurations supporting redundant metadata/lock server operation and failover. At least one more special-purpose Pentium-4 node with an identical configuration needs to be dedicated to the metadata/lock server role.

          A full complement of eight 4U Pentium-4 nodes is needed. The GUPFS testbed only had six 4U Pentium-4 nodes. Most new technologies are initially implemented on standard-height (4U) PCI/PCI-X cards, and only in the second or third generation of the technology do low-profile (2U) cards become available. The initial Gigabit Ethernet, 1x InfiniBand, 4x InfiniBand, Myrinet 2000, iSCSI HBA, Gigabit Ethernet TOE, and 1 Gb/s Fibre Channel cards were all standard-height cards, requiring 4U cases. Most fabrics provide switching capabilities in powers of 2, frequently with 8 ports as a minimum, leading to a standard purchase of an 8-port switch and eight host interface cards, as four hosts provide little insight into scalability. Earlier constraints limited the testbed to six 4U Pentium-4 nodes, preventing eight-way evaluations when standard-height interface cards were required. Adding two more 4U Pentium-4 nodes would enable eight-way evaluations requiring standard-height interface cards to be conducted, permitting more direct comparisons with other evaluations.

New and Emerging Technologies

The design changes for the FY 2004 testbed were influenced by the new and emerging technologies impacting the GUPFS solution, which are expected to be available for evaluation during the coming year. In this regard, several important issues need to be investigated in the near term, including:

          Conducting cross-platform file system tests. The GUPFS project plans to conduct cross-platform file system tests to explore functionality and deployment issues in a heterogeneous environment involving multiple hardware and OS architectures, designed to mimic the NERSC environment in which GUPFS will be deployed. These tests will require either incorporating additional systems into the testbed or opening up the testbed to other NERSC systems, both of which require additional fabric-switching capabilities.

          Conducting multiple cluster file system tests. The GUPFS project plans to conduct file system tests involving multiple clusters accessing the same file system and storage simultaneously, as is expected in the NERSC environment at deployment. PDSF, Alvarez, and Dev2 are likely candidate peer systems. This will require opening up and connecting the testbed to these other NERSC systems.

          Evaluating 4 Gb/s and 10 Gb/s Fibre Channel. Both 4 Gb/s and 10 Gb/s production-quality Fibre Channel equipment will become available in the time frame of the initial phase of GUPFS deployment. Because of the anticipated aggregate performance needs for production use, and because the backend storage controllers, if not the storage itself, for most shared file system solutions will likely be Fibre Channel connected, these technologies are likely to be important to a successful GUPFS deployment. As such, they need to be evaluated and understood.

          Evaluating 10 Gb/s Ethernet. The 10 Gb/s Ethernet technology is expected to be deployed during FY 2004. Because it is most likely that PDSF will be accessing the GUPFS file system over the Ethernet, 10 Gb/s Ethernet is a likely component of the deployed GUPFS solution and needs to be understood in a storage fabric context.

          Evaluating Panasas file system and storage. The Panasas ActiveScale File System is a very interesting object-based file system implemented over the Ethernet. Architecturally, it is quite similar to Lustre, but is more standards based, being implemented with a variant of iSCSI. The Panasas file system offering is integrated with Ethernet-attached storage devices specific to the file system, and can be accessed either through integrated NFS and CIFS gateways, or as part of a shared file system through the DirectFlow client software. The Panasas file system should be accessible over any IP-based fabric that can bridge to the Ethernet. This is a promising candidate file system and needs to be evaluated for GUPFS.

          Evaluating the IBRIX file system. The IBRIX file system is an interesting potential GUPFS file system solution that is based on federating the individual file systems of storage engines (SEs). The IBRIX file system is distributed over IP networks. It utilizes back-end SAN based storage. The IBRIX file system was originally scheduled to be available in preproduction versions for evaluation in FY 2003. This schedule has slipped into FY 2004.

          Evaluating the IBM TotalStorage SANFS (StorageTank) file system. The IBM StorageTank file system was renamed the TotalStorage SANFS file system at the end of FY 2003. SANFS is expected to become available as a product in the first half of FY 2004. It targets very large numbers of client systems, and supports multiple hardware architectures and operating systems. It uses metadata servers in conjunction with block storage accessed via iSCSI over IP networks (making it largely fabric agnostic), or accessed directly via Fibre Channel. The ability to access remote tanks over the WAN is being developed. SANFS is an extremely promising GUPFS candidate, although quite young, and needs to be investigated thoroughly.

          Further iSCSI investigations. The iSCSI protocol is making an appearance in several promising shared file systems. It also provides a cheap mechanism for accessing block storage over inexpensive fabrics, although at the expense of higher processor overhead. With the ability of most fabrics and interconnects to perform IP transfers, and with the ability of most fabrics to bridge to the Ethernet, iSCSI may facilitate the implementation of heterogeneous fabrics directly tied into cluster interconnects. However, it needs much more investigation as its availability and use expand.
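
As a concrete illustration of the client side of such an iSCSI investigation, the following is a minimal sketch that discovers and logs in to an iSCSI target from a Linux host. The use of the open-iscsi iscsiadm utility is an assumption for illustration, not one of the initiators actually deployed on the testbed, and the portal address and target names are hypothetical.

```python
# Illustrative sketch only: iSCSI discovery and login from a Linux client using
# the open-iscsi "iscsiadm" utility (an assumed tool, not the Cisco/Intel/Adaptec
# initiators evaluated on the testbed). Portal address and targets are made up.

import subprocess

PORTAL = "192.168.100.10:3260"        # hypothetical iSCSI router or target portal


def discover_targets(portal):
    """Ask the portal which target IQNs it exports (sendtargets discovery)."""
    out = subprocess.run(
        ["iscsiadm", "-m", "discovery", "-t", "sendtargets", "-p", portal],
        capture_output=True, text=True, check=True)
    # Each output line looks like: "192.168.100.10:3260,1 iqn.2004-01.example:disk0"
    return [line.split()[-1] for line in out.stdout.splitlines() if line.strip()]


def login(target, portal):
    """Log in to one target; the kernel then presents it as an ordinary SCSI disk."""
    subprocess.run(
        ["iscsiadm", "-m", "node", "-T", target, "-p", portal, "--login"],
        check=True)


if __name__ == "__main__":
    for iqn in discover_targets(PORTAL):
        login(iqn, PORTAL)
```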

Additional Considerations

Other considerations that affected the design of the FY 2004 testbed include:

          InfiniBand technology refresh needed. The testbed InfiniBand technology needs to be refreshed. Current second-generation 4x InfiniBand equipment, particularly HCAs and Fibre Channel and Gigabit Ethernet gateways, needs to be acquired. The original 1x InfiniBand equipment and the loaner first-generation 4x InfiniBand equipment are no longer supported.

          Multiple management nodes needed. The testbed needs at least two management nodes. The management node is currently a single central point of failure in the testbed, and is extremely difficult and complex to configure. A second management node is needed to ensure the testbed and GUPFS project evaluations and investigations can continue if the existing management node fails. In addition, the availability of a second management node would allow the management node software versions and configurations to be upgraded one at a time without disruption.

          Gigabit Ethernet emerging as the standard fabric-to-others bridge. Gigabit Ethernet is emerging as the common fabric to which all other fabrics and interconnects bridge. Because of this, a large number of Gigabit Ethernet switch ports are needed in the testbed, particularly in conjunction with the iSCSI, cross-platform, and multi-cluster file system testing planned for FY 2004.

          The Myrinet to Gigabit Ethernet bridge would expand the file systems that can be tested with Alvarez. Myricom announced a Gigabit Ethernet bridge blade for their Myrinet switches, with 8 Gigabit Ethernet ports. With such a fabric bridge, the number of shared file systems that could be tested on and in conjunction with the LBNL Alvarez Linux cluster increases substantially. IP-based file systems such as Panasas and IBRIX could be evaluated for scalability. Block-based file systems, such as StorNext, could be tested for scalability using iSCSI bridged to storage in the GUPFS testbed. In addition, a Myrinet upgrade to the Rev D cards in late FY 2003 allowed low-profile PCI-X Myrinet 2000 cards to be used, enabling the Myrinet network to be moved into 2U nodes, thus freeing up the 4U nodes for other uses and enabling full use of the Myrinet switch with 8 hosts.

2. Updated Testbed Configuration for FY 2004

Design of the updated testbed configuration was completed at the end of the third quarter of FY 2003. This design was based on all of the considerations presented in the previous section. The central tenet of the updated configuration was to increase the number of simultaneous evaluations that could be conducted, increase the maximum scale of these evaluations, and simplify the process of physically reconfiguring the connectivity from testbed nodes to storage devices through various fabric components.

The updated configuration expanded the number of Pentium-4 nodes from 22 to 36. As in FY 2003, four of these Pentium-4 nodes were retained as dedicated special-purpose nodes, although there were some changes in assigned functions. The remaining 32 Pentium-4 nodes were assigned as compute nodes dedicated to running benchmarks and conducting other investigations. To facilitate conducting multiple simultaneous independent evaluations, the 32 compute nodes were logically partitioned into four sets of eight. This logical partitioning allows up to four independent 8-way investigations to be conducted simultaneously, or investigations in various size combinations, such as one 32-way, two 16-way, or one 16-way and two 8-way tests.
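
To make the partitioning scheme concrete, the following is a small sketch (node and evaluation names are hypothetical) that models the 32 compute nodes as four groups of eight and assigns whole groups to independent evaluations, mirroring the 8-, 16-, 24-, and 32-way combinations described above.

```python
# Minimal sketch of the logical partitioning: 32 compute nodes in four groups of
# eight, with whole groups handed out to independent, simultaneous evaluations.
# All node and evaluation names are hypothetical.

GROUP_SIZE = 8
groups = {f"group{g}": [f"p4-{g * GROUP_SIZE + n:02d}" for n in range(GROUP_SIZE)]
          for g in range(4)}

free_groups = set(groups)     # groups not currently assigned to an evaluation
allocations = {}              # evaluation name -> list of assigned node names


def allocate(evaluation, nodes_needed):
    """Assign enough whole 8-node groups to cover an evaluation's node count."""
    needed = -(-nodes_needed // GROUP_SIZE)          # ceiling division
    if needed > len(free_groups):
        raise RuntimeError(f"only {len(free_groups) * GROUP_SIZE} nodes are free")
    chosen = sorted(free_groups)[:needed]
    free_groups.difference_update(chosen)
    allocations[evaluation] = [node for g in chosen for node in groups[g]]
    return allocations[evaluation]


# Example: one 16-way file system test and two 8-way fabric tests side by side.
allocate("filesystem-16way", 16)
allocate("infiniband-vendor-a-8way", 8)
allocate("iscsi-hba-8way", 8)
print({name: len(nodes) for name, nodes in allocations.items()})
```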

While the partitioning of the compute nodes into groups was at a logical level, there were some physical characteristics related to their partitioning. Each group of eight compute nodes was connected to a separate Dell Power Connect Gigabit Ethernet switch, which allowed the nodes in the group maximum communication performance among themselves. The Dell switches for each of the groups were then connected by four-way trunks to a central Extreme 7i switch. This allowed nodes in any of the groups to communicate with each other, but at reduced aggregate bandwidth and increased latency. Another physical characteristic related to the partitioning of the nodes was the additional PCI-X fabric interface cards each node had. For a variety of reasons, the GUPFS project conducts evaluations of fabric interfaces and interconnects with a minimum of eight hosts for each fabric. In addition to the Fibre Channel interfaces present on all nodes except the management nodes, the GUPFS testbed contains three other sets of high-performance fabrics, each of which is connected to eight compute nodes. These additional fabrics are a Myrinet 2000 interconnect and, after the testbed upgrade, two 4x InfiniBand fabrics from different vendors. The three sets of nodes with extra fabric connections are each placed in logically separate groups to facilitate independent testing of these extra fabrics.
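
The bandwidth trade-off of this trunked topology can be shown with a back-of-the-envelope calculation; the sketch below assumes 1 Gb/s per port and ignores duplexing and protocol overhead.

```python
# Back-of-the-envelope comparison (assumed 1 Gb/s links, overheads ignored) of the
# aggregate bandwidth available within one group of eight nodes on its Dell switch
# versus traffic that must cross the four-way trunk to the central Extreme 7i.

GIGABIT = 1.0            # Gb/s per port
NODES_PER_GROUP = 8
TRUNK_LINKS = 4

intra_group = NODES_PER_GROUP * GIGABIT    # 8 Gb/s staying on one Dell switch
inter_group = TRUNK_LINKS * GIGABIT        # 4 Gb/s across the trunk

print(f"intra-group aggregate: {intra_group:.0f} Gb/s")
print(f"inter-group aggregate: {inter_group:.0f} Gb/s "
      f"({intra_group / inter_group:.0f}:1 oversubscription for cross-group traffic)")
```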

An additional element of the updated configuration was the disposition of the original six Pentium-3 nodes. One of these had always been used as the testbed management node. The remaining five were used as compute nodes in FY 2002 and 2003, although in an auxiliary role during FY 2003. With the addition of more Pentium-4 nodes in the updated configuration, it was possible to stop using the Pentium-3 nodes as compute nodes and assign them to other supporting duties. One was to become a second testbed management node, providing redundancy and simplifying upgrades. Another was to become a dedicated interactive node, offloading this function from the management nodes for the reasons discussed earlier. The remaining three Pentium-3 nodes became dedicated development nodes for benchmark and analysis code development, and possible auxiliary HPSS integration investigation roles.

Another part of the upgraded configuration was the installation of a fiber-optic patch panel, allowing all fiber-optic ports to be centrally connected in a static configuration and then cross-connected as necessary using easily movable fiber-optic patch cords. All fiber-optic Gigabit Ethernet, Myrinet, and Fibre Channel host adapters, switch ports, and device connections were hardwired into the central patch panel to simplify physical reconfiguration of the fabrics and connections.
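
As an illustration of the record keeping the patch panel makes possible, the sketch below (panel port labels and device names are hypothetical) tracks which endpoint is hardwired to each panel port and which patch cords currently cross-connect them, answering the "what is connected to what" question raised under the limitations above.

```python
# Minimal sketch of a patch panel connection map: hardwired endpoints per panel
# port, plus the movable patch cords that cross-connect ports. Port labels and
# device names are hypothetical; this is book-keeping only, not a real inventory.

hardwired = {                   # panel port -> permanently cabled endpoint
    "A01": "compute node p4-01, Qlogic FC port 0",
    "A02": "compute node p4-02, Qlogic FC port 0",
    "B01": "SANbox2-64 Fibre Channel switch, port 12",
    "C01": "NetStorager controller blade 1, FC port 0",
}

patch_cords = {}                # panel port <-> panel port (stored both directions)


def connect(port_a, port_b):
    """Record a patch cord between two panel ports."""
    for port in (port_a, port_b):
        if port in patch_cords:
            raise ValueError(f"{port} is already patched to {patch_cords[port]}")
    patch_cords[port_a], patch_cords[port_b] = port_b, port_a


def endpoint(port):
    """Resolve what the device hardwired to a panel port is ultimately connected to."""
    peer = patch_cords.get(port)
    return hardwired.get(peer, "unpatched") if peer else "unpatched"


connect("A01", "B01")           # node p4-01 -> Fibre Channel switch
connect("C01", "A02")           # storage controller -> node p4-02 (direct attach)
print(endpoint("A01"))          # -> "SANbox2-64 Fibre Channel switch, port 12"
```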

The other major elements of the updated configuration included the purchase of the Yotta Yotta NetStorager as the standard high-performance storage for evaluations (in lieu of the disappointing CX600), the previously mentioned additional Gigabit Ethernet switching capacity, additional Fibre Channel switching capacity, and a 4x InfiniBand technology refresh from two vendors. A front view of the updated testbed for FY 2004 appears as Figure 1. A rear view of the testbed, showing the nodes and cable connections, appears in Figure 2. The updated FY 2004 testbed configuration is shown in Figure 3. The 240-port fiber-optic patch panel is shown in Figure 4.

Figure 1. The FY 2004 testbed, with the refurbished NetStorager in front.

The following major components were added to the testbed as part of its technology upgrade for FY 2004:

       Fourteen additional dual Pentium-4 nodes: 12 in 2U cases and 2 in 4U cases (these nodes were identical to those already in the testbed)

       One Qlogic SANbox2-64 2 Gb/s Fibre Channel switch (64-port chassis), populated with 48 ports

       Five 24-port Dell Power Connect 5224 Gigabit Ethernet switches

       One Yotta Yotta NetStorager GSX 2400 Disk Storage Subsystem

       One InfiniCon ISIS InfinIO 7000 4x InfiniBand switch and fabric bridge upgrade

       One Topspin TS90 4x InfiniBand switch and Fibre Channel gateway

       One Myrinet 2000 MS-SW16-8E switch line card with 8 Gigabit Ethernet ports

       A Fiber Optic patch panel and cables for Gigabit Ethernet, Myrinet, and Fibre Channel

Figure 2. Rear view of the FY2004 testbed.

The new Pentium-4 nodes obtained as part of the upgrade were as identical as possible to those obtained the previous year. They were configured with the same motherboard, the same quantity and performance grade of memory, the same 2.2 GHz Xeon CPUs, and the same Intel Gigabit Ethernet and Qlogic Fibre Channel PCI-X cards. All of these components differed from the originals only in revision numbers. The nodes were all equipped with U160 SCSI disks of the same speed (10,000 RPM) and capacity (36 GB), but from a different manufacturer.

Special efforts were made to configure the new Pentium-4 nodes to be identical to those obtained earlier. This was done to ensure uniformity of performance so that results obtained from both sets would be comparable and they could be intermixed without affecting evaluation results. The motherboard proved to be the most difficult to obtain as it was being phased out. However, a newer revision of the motherboard was acquired that showed nearly identical performance, which allowed the new and existing Pentium-4 nodes to be intermixed with negligible impact on the benchmark results.
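
One way to verify this uniformity in practice, sketched below under the assumption of passwordless ssh access to each node and with hypothetical hostnames, is to collect a few hardware identifiers from every compute node and flag any node that differs from a chosen baseline.

```python
# Illustrative uniformity check (assumes passwordless ssh; hostnames are made up):
# gather CPU, memory, and NIC identifiers from each compute node and report any
# node whose hardware inventory differs from the first node's baseline.

import subprocess

NODES = [f"p4-{i:02d}" for i in range(32)]      # hypothetical compute node names

PROBES = {
    "cpu":    "grep 'model name' /proc/cpuinfo | sort -u",
    "memory": "grep MemTotal /proc/meminfo",
    "nics":   "lspci | grep -i ethernet",
}


def probe(node):
    """Run each probe command on a node via ssh; return its outputs by key."""
    results = {}
    for key, command in PROBES.items():
        out = subprocess.run(["ssh", node, command], capture_output=True, text=True)
        results[key] = out.stdout.strip()
    return results


baseline = probe(NODES[0])
for node in NODES[1:]:
    differing = [key for key, value in probe(node).items() if value != baseline[key]]
    if differing:
        print(f"{node}: differs from {NODES[0]} in {', '.join(differing)}")
```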

Figure 3. Updated GUPFS testbed configuration for FY 2004.

The major components of the updated GUPFS testbed for FY 2004 included:

System nodes

       36 dual Pentium-4 nodes: 28 in 2U cases and 8 in 4U cases; 32 for use as compute nodes and 4 for use as special-purpose nodes

       6 dual Pentium-3 nodes in 4U cases, used as management, interactive, and auxiliary testbed support nodes

Fabric

       One 240-connector fiber-optic patch panel with SC connectors and patch cables (see Figure 4)

       Ethernet

o       One 32-port Extreme 7i Gigabit Ethernet switch

o       One 16-port Extreme 5i Gigabit Ethernet switch

o       Five 24-port Dell Power Connect 5224 Gigabit Ethernet switches

o       Two 10/100 Ethernet switches for system management

       Fibre Channel

o       One 48-port 2 Gb/s Fibre Channel Switch (SANbox2-64)

o       One 16-port 2 Gb/s Fibre Channel Switch (Brocade 3800)

o       One 16-port 1 Gb/s Fibre Channel Switch (Brocade 2800)

o       One Cisco SN5428 iSCSI Router fabric bridge to Ethernet

       Myrinet

o       One Myrinet 2000 8-port switch with 8 Revision D host interface cards

o       One Myrinet 2000 MS-SW16-8E switch card with 8 Gigabit Ethernet ports for bridging between Myrinet and Gigabit Ethernet

       InfiniBand

o       One InfiniCon ISIS InfinIO 7000 4x InfiniBand switch, 8 4x HCA host adapters, and single fabric bridge modules for Fibre Channel and Gigabit Ethernet

o       One Topspin TS90 4x InfiniBand switch, 8 4x HCA host adapters, and single fabric bridge modules for Fibre Channel and Gigabit Ethernet

Storage

       Yotta Yotta NetStorager GSX 2400

       EMC CLARiiON CX600 disk subsystem

       Dot Hill 7124 RAID disk subsystem

       Silicon Gear Mercury II RAID subsystem

       Chaparral A8526 RAID subsystem with attached storage

The expanded testbed, with its increased scale, updated and new technologies, and features to support easier reconfiguration, will facilitate the evaluations to be conducted during FY 2004. The increased scale of the testbed will enable both more simultaneous small-scale initial evaluations and larger-scale single evaluations. The testbed's multiple fabrics and the bridges between them will allow the issues involving heterogeneous fabric environments, such as those expected at NERSC, to be investigated and understood.

Figure 4. The 240-port fiber-optic patch panel.

The increased number of testbed nodes will also facilitate cross-platform and multiple-OS tests of promising file systems that support such capabilities. A great deal of important information and experience stands to be gained through these tests, which will explore the issues associated with deployment in a heterogeneous environment such as that expected at NERSC. The increased Gigabit Ethernet fabric switching capacity and improved fabric bridging capabilities will further facilitate this testing and will enable multiple-cluster testing to be conducted with both the Alvarez cluster and the PDSF systems. This will allow phased deployment to be simulated and will let us begin addressing the networking issues associated with deployment.

