

Cloud // Infrastructure as a Service
News | 2/2/2016 11:11 AM

Brookhaven Lab Finds AWS Spot Instances Hit Sweet Spot

When Brookhaven National Lab needed compute power to meet peak demand, it turned to a new Energy Sciences Network and Amazon Spot Instances.


Last September the Brookhaven National Laboratory discovered a way to expand its compute power for particle research without blowing up its research budget. As its needs outgrew its own facilities, it opted to use Amazon Spot Instances -- the virtual servers that customers can use for as long as their low bid isn't topped by someone else's.

It was a choice that seemed risky at the time. Scientists were lined up to run their research systems against mountains of data generated by the CERN Large Hadron Collider in Geneva, but neither Brookhaven nor participating university departments had enough compute capacity to satisfy their demands.

Furthermore, "science is highly competitive," observed lead computer scientist Michael Ernst for Brookhaven's ATLAS team, tied into CERN, in an interview with InformationWeek.

Ernst and the ATLAS team decided to test Spot Instance use in the cloud over a five-day period last September. Such a move might not sound like rocket science to major enterprises already tapping AWS virtual servers liberally. But the needs of particle researchers are extremely large-scale, 1,500 of them were waiting in line, and there are known drawbacks to using Spot Instances.

A given particle research system might need to run continuously for 24 hours. Just because the research team lined up the Spot Instances they needed at the outset didn't mean they'd still be available as the research ground into its 24th hour. Spot Instances are a bargain in the middle of the night, but their capacity can be reclaimed for higher-priced Spot or On-Demand customers as the business day dawns.

"If a system has run 23 hours and 57 minutes, and the Spot Instance goes away, you lose everything," Ernst noted in an interview. That was one of the hazards of selecting what was, by definition, a temporary resource. Spot Instances are unused compute power in the Amazon cloud that is available at whatever price a customer cares to bid for them. They attract the low bidders and typically cost one-quarter to one-tenth of the AWS On-Demand class of servers, Ernst said.

(Image: Andrey Prokhorov/iStockphoto)

But Brookhaven needed large numbers of them in one location to deal with the terabytes of data being generated by the Large Hadron Collider. For its first major test, Ernst sought the equivalent of 50,000 physical cores to power the Spot Instances needed. The rub was that 99% of them would need to remain available throughout the five-day test period.

All 50,000 wouldn't need to be continuously available. Ernst could afford to have 1% shifted to higher bidders at any one time by pre-arranging for jobs to fail over to other virtual servers. But a surge in demand for Spot Instances during his trial could claim so many servers that many of the running computations would never finish.

"Nodes acquired on the Spot market can be terminated at any time, meaning applications need to tolerate disruptions," said Ernst. If the disruptions exceeded the ability of the applications to failover, there were going to be many disappointed researchers, he said.

As Brookhaven prepared its test run on Amazon, having enough data from the LHC's ATLAS detector loaded into the cloud to host hundreds of research explorations at one time was itself a rare event. It takes a trillion proton collisions in the collider to produce evidence of a single Higgs boson particle's decay. Nevertheless, understanding the Higgs boson -- the goal of many ATLAS research workloads -- promises to provide the next refinements in our understanding of the universe, possibly unlocking the secrets of gravity.

[Want to learn more about AWS 2015 results? See Amazon, AWS Post Strong Results, Fail to Please Wall Street.]

Brookhaven was able to load the data into Amazon over the Energy Sciences Network, operated by the US Department of Energy at 100 Gbps. Moving such a vast amount of data -- 50 PB -- at the slower speeds available over the Internet would not have been tolerable, he said.
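For a sense of scale, the back-of-the-envelope arithmetic below shows why the link speed matters: even at ESnet's 100 Gbps, 50 PB takes weeks to move, and a typical 10 Gbps Internet path would take roughly ten times as long (the 10 Gbps figure is an illustrative assumption, not from the article).

    # Back-of-the-envelope transfer times for 50 PB at different line rates.
    PETABYTE_BITS = 8 * 10**15

    def days_to_transfer(petabytes: float, gbps: float) -> float:
        seconds = petabytes * PETABYTE_BITS / (gbps * 10**9)
        return seconds / 86_400

    print(f"50 PB over 100 Gbps: {days_to_transfer(50, 100):.0f} days")  # ~46 days
    print(f"50 PB over  10 Gbps: {days_to_transfer(50, 10):.0f} days")   # ~463 days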

(Michael Ernst)

In some cases, the workloads use vast amounts of data to simulate what should happen in the proton collisions, then search through mountains of ATLAS detector data looking for evidence that the theories are correct. It's a compute-intensive task, Ernst explained.

When everything was ready, Brookhaven launched the five-day Spot Instance run. "Less than 1% of the instances were terminated," Ernst said, leaving operations with a margin of safety. Afterward, his view of Spot Instances changed from risky experiment to "an ideal resource for deploying our peak demand."

Instead of investing in new data center capacity, Brookhaven met its peak demand with a $45,000 bill for the five-day run.

"AWS has superb availability," Ernst said. "It appears to have unlimited capacity at competitive prices."

Even if that was true last September, it's not necessarily guaranteed for all future large-scale users of Spot Instances. With AWS's rapid 71.7% revenue growth in 2015, compute capacity that's available now might not be available in the future.

Nevertheless, Ernst is getting ready for a second experiment on Amazon this month, relying once again on Spot Instances. He's seeking to establish once and for all that the cloud can serve as "a practical, production-grade, 100,000-core compute platform for doing science." It will be conducted over Amazon's three major North American regions: US East in Northern Virginia, US West in Northern California, and US West in Oregon.

Brookhaven has conducted a smaller, 4,000-core, month-long experiment on Google Compute Engine, but hasn't done any yet on Microsoft Azure. Ernst doesn't rule out use of any cloud site in the future.


Charles Babcock is an editor-at-large for InformationWeek and author of Management Strategies for the Cloud Revolution, a McGraw-Hill book. He is the former editor-in-chief of Digital News, former software editor of Computerworld and former technology editor of Interactive ...

Comments
batye, User Rank: Ninja
2/3/2016 | 10:01:03 AM
Re: The Dept. of Energy's Energy Sciences Network does double duty
@Charlie Babcock, it's interesting to see how technology is improving collaborative research...
Charlie Babcock, User Rank: Author
2/2/2016 | 3:35:38 PM
The Dept. of Energy's Energy Sciences Network does double duty
Brookhaven not only has the Energy Sciences Network over which to connect to Amazon at 100 Gbps, but also, compliments of that network, two 320 Gbps lines to Europe over which it can exchange data with European partners. The high-speed communications make collaborative research much more feasible.
Charlie Babcock, User Rank: Author
2/2/2016 | 3:28:07 PM
Brookhaven liked AWS memory-intensive virtual servers
The Brookhaven Lab gravitated toward the memory-intensive virtual server types in its Spot Instances. They included the R3 double extra-large, R3 quadruple extra-large, and R3 8X extra-large instances; the latter comes with two 320 GB solid-state disks. Also used was the M3 double extra-large server, a balanced compute, memory, and network server used with many different applications, lead scientist Ernst reported.