Friday, January 4, 2013

Janitor Monkey - Keeping the Cloud Tidy and Clean

By Michael Fu and Cory Bennett, Engineering Tools

One of the great advantages of moving from a private datacenter into the cloud is that you have quick and easy access to nearly limitless new resources. The friction around innovation and experimentation is greatly reduced: to push out a new application release you can quickly build up a new cluster, to get more storage you just attach a new volume, to back up your data you just take a snapshot, and to test out a new idea you just create new instances and get to work. The downside of this flexibility is that it is pretty easy to lose track of the cloud resources that are no longer needed or used. Perhaps you forgot to delete the cluster running the previous version of your application, or forgot to destroy a volume when you no longer needed the extra disk. Taking snapshots is great for backups, but do you really need the ones from 12 months ago? It's not just forgetfulness that causes problems: API and network errors can cause a request to delete an unused volume to be lost.

At Netflix, when we analyzed our Amazon Web Services (AWS) usage, we found a lot of unused resources and we needed a solution to rectify this problem. Diligent engineers can manually delete unused resources via Asgard, but we needed a way to automatically detect and clean them up. Our solution was Janitor Monkey.

We have written about our Simian Army in the past, and we are now proud to announce that the source code for the newest member of the Simian Army, Janitor Monkey, is open and available to the public.

What is Janitor Monkey?

Janitor Monkey is a service which runs in the Amazon Web Services (AWS) cloud looking for unused resources to clean up. As with Chaos Monkey, the design of Janitor Monkey is flexible enough to be extended to work with other cloud providers and cloud resources. The service is configured to run, by default, on non-holiday weekdays at 11 AM. The schedule can easily be re-configured to fit your business needs.

Janitor Monkey determines whether a resource should be a cleanup candidate by applying a set of rules to it. If any of the rules determines that the resource is a cleanup candidate, Janitor Monkey marks the resource and schedules a time to clean it up. The open sourced version provides the collection of rules currently used at Netflix, which we believe are general enough for most users. The design of Janitor Monkey also makes it simple to customize these rules or add new ones.
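To make the marking step concrete, here is a minimal Java sketch of the idea: each rule inspects a resource, and the first rule that flags it marks the resource with a scheduled cleanup time. The interface and class names below are illustrative and do not match the actual Janitor Monkey API.

    // Hypothetical sketch of the rule concept -- names are illustrative,
    // not the actual Janitor Monkey API.
    import java.util.Date;
    import java.util.List;

    interface Resource {
        String id();
        void markForCleanup(Date scheduledTime);
    }

    interface JanitorRule {
        // false means the resource is a cleanup candidate
        boolean isValid(Resource resource);
        // the time at which a marked resource should be cleaned up
        Date scheduledCleanupTime(Resource resource);
    }

    class MarkingPass {
        private final List<JanitorRule> rules;

        MarkingPass(List<JanitorRule> rules) {
            this.rules = rules;
        }

        void apply(Resource resource) {
            for (JanitorRule rule : rules) {
                if (!rule.isValid(resource)) {
                    // Any single rule flagging the resource is enough to mark it.
                    resource.markForCleanup(rule.scheduledCleanupTime(resource));
                    return;
                }
            }
        }
    }

Keeping each rule behind a small interface like this is what makes it easy to drop in custom rules without touching the marking logic.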

Since there can be exceptions when you want to keep an unused resource around, the owner of a resource receives a notification a configurable number of days before Janitor Monkey's scheduled cleanup time. This prevents resources that are still needed from being deleted. The resource owner can then flag the resources they want to keep as exceptions, and Janitor Monkey will leave them alone.

Over the last year Janitor Monkey has deleted over 5,000 resources running in our production and test environments. It has helped keep our costs down and has freed up engineering time that would otherwise be spent managing unused resources.

Resource Types and Rules

Four types of AWS resources are currently managed by Janitor Monkey: Instances, EBS Volumes, EBS Volume Snapshots, and Auto Scaling Groups. Each of these resource types has its own rules for marking unused resources. For example, an EBS volume is marked as a cleanup candidate if it has not been attached to any instance for 30 days. Another example is that an instance will be cleaned up by Janitor Monkey if it has not been in any auto scaling group for over 3 days, since we know these are experimentation instances -- all others must be in auto scaling groups. The number of retention days in these rules is configurable, so the rules can easily be customized to fit your business requirements. We plan to make Janitor Monkey support more resource types in the future, such as launch configurations, security groups, and AMIs. The design of Janitor Monkey makes adding new resource types easy.
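As a sketch of how the 30-day retention window for an unattached EBS volume might be evaluated, consider the following. The class and method names are hypothetical and introduced only for illustration; the real rule implementation differs.

    // Illustrative only -- not the actual Janitor Monkey rule implementation.
    import java.util.Date;
    import java.util.concurrent.TimeUnit;

    class UnattachedVolumeRuleSketch {
        private final int retentionDays;

        UnattachedVolumeRuleSketch(int retentionDays) {
            this.retentionDays = retentionDays;  // e.g. 30, configurable per deployment
        }

        // A volume becomes a cleanup candidate once it has been detached
        // from every instance for longer than the retention window.
        boolean isCleanupCandidate(Date lastDetachTime, Date now) {
            if (lastDetachTime == null) {
                return false; // never detached (or unknown), so leave it alone
            }
            long detachedMillis = now.getTime() - lastDetachTime.getTime();
            return detachedMillis > TimeUnit.DAYS.toMillis(retentionDays);
        }
    }

Exposing the retention window as a constructor argument mirrors the configurable retention days described above: the same rule can be reused with different thresholds in different environments.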

How Janitor Monkey Cleans

Janitor Monkey works in three stages: "mark, notify, delete". When Janitor Monkey marks a resource as a cleanup candidate, it schedules a time to delete the resource. The delete time is specified by the rule that marked the resource. Every resource is associated with an owner email, which can be specified as a tag on the resource. You can also easily extend Janitor Monkey to obtain this information from your internal systems, or simply use a default email address, e.g. your team's email list, for all resources.

You can configure how many days before the scheduled termination Janitor Monkey notifies the resource owner. By default the number is 2, which means the owner receives a notification 2 business days ahead of the termination date. During this period the resource owner can decide whether the resource can be deleted. If a resource needs to be retained, the owner can use a simple REST interface to flag it so that Janitor Monkey excludes it; another REST call removes the flag, and Janitor Monkey will then be able to manage the resource again.

When Janitor Monkey sees a resource marked as a cleanup candidate whose scheduled termination time has passed, it deletes the resource. The resource owner can also delete the resource manually to release it earlier and save cost. If the status of the resource changes so that it is no longer a cleanup candidate (e.g. a detached EBS volume is attached to an instance), Janitor Monkey will unmark the resource and no cleanup will occur.
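As an illustration of how the mark, notify, and delete stages could fit together, here is a minimal Java sketch. The type and method names (TrackedResource, notifyOwner, and so on) are hypothetical, business-day handling is omitted, and the real implementation differs.

    // A minimal sketch of the three-stage flow -- illustrative names only.
    import java.util.Date;

    class JanitorStagesSketch {

        enum State { MARKED, NOTIFIED, DELETED }

        static void process(TrackedResource r, Date now, int notificationDaysBefore) {
            long dayMillis = 24L * 60 * 60 * 1000;
            Date notifyAt = new Date(r.scheduledDeleteTime().getTime()
                    - notificationDaysBefore * dayMillis);

            if (r.isFlaggedAsException()) {
                return; // owner asked to keep the resource; leave it alone
            }
            if (r.state() == State.MARKED && !now.before(notifyAt)) {
                r.notifyOwner();            // email the owner ahead of the delete time
                r.setState(State.NOTIFIED);
            } else if (r.state() == State.NOTIFIED && !now.before(r.scheduledDeleteTime())) {
                r.delete();                 // termination time has passed; clean up
                r.setState(State.DELETED);
            }
        }

        interface TrackedResource {
            State state();
            void setState(State s);
            Date scheduledDeleteTime();
            boolean isFlaggedAsException();
            void notifyOwner();
            void delete();
        }
    }

The key property of the flow is that the exception flag is checked before every transition, so an owner can step in at any point before the delete stage actually runs.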

Configuration and Customization

The resource types managed by Janitor Monkey, the rules for each resource type to mark cleanup candidates, and the parameters used to configure each individual rule, are all configurable. You can easily customize Janitor Monkey with the most appropriate set of rules for your resources by setting Janitor Monkey properties in a configuration file. You can also create your own rules or add support for new resource types, and we encourage you to contribute your cleanup rules to the project so that all can benefit.

Auditing, Logging, and Costs

Janitor Monkey events are logged in an Amazon SimpleDB table by default. You can easily check the SimpleDB records to find out what Janitor Monkey has done. The resources managed by Janitor Monkey are also stored in SimpleDB. At Netflix we have a UI for managing the Janitor Monkey resources and we have plans to open source it in the future as well.
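If you want to inspect the records directly rather than through a UI, a query against the SimpleDB domain might look like the sketch below, using the AWS SDK for Java. The domain name and the way results are printed are placeholders for illustration; check your Janitor Monkey configuration for the actual domain it writes to.

    // A hedged sketch using the AWS SDK for Java; the domain name below is a
    // placeholder, not the actual schema Janitor Monkey writes.
    import java.util.List;

    import com.amazonaws.services.simpledb.AmazonSimpleDB;
    import com.amazonaws.services.simpledb.AmazonSimpleDBClient;
    import com.amazonaws.services.simpledb.model.Attribute;
    import com.amazonaws.services.simpledb.model.Item;
    import com.amazonaws.services.simpledb.model.SelectRequest;

    public class JanitorEventQuery {
        public static void main(String[] args) {
            // Uses the default AWS credential chain configured in your environment.
            AmazonSimpleDB sdb = new AmazonSimpleDBClient();

            // "JANITOR_EVENTS" is a hypothetical domain name for illustration.
            String query = "select * from `JANITOR_EVENTS` limit 25";
            List<Item> items = sdb.select(new SelectRequest(query)).getItems();

            for (Item item : items) {
                System.out.println(item.getName());
                for (Attribute attr : item.getAttributes()) {
                    System.out.println("  " + attr.getName() + " = " + attr.getValue());
                }
            }
        }
    }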

There may be costs associated with Amazon SimpleDB, but in most cases the activity of Janitor Monkey should be small enough to fall within Amazon's Free Usage Tier. Ultimately the costs associated with running Janitor Monkey are your responsibility. For your reference, Amazon SimpleDB pricing can be found at http://aws.amazon.com/simpledb/pricing/.

Coming Up

In the near future we are planning to release some new resource types for Janitor Monkey to manage. As mentioned earlier, the next candidate will likely be launch configurations. We will also add support for using Edda to implement existing and new Janitor Monkey rules. Edda allows us to query the history of resources, helping Janitor Monkey find unused resources more accurately and reliably.

Summary

Janitor Monkey helps keep our cloud clean and clutter-free. We hope you find Janitor Monkey to be useful for your business. We'd appreciate any feedback on it. We're always looking for new members to join the team. If you are interested in working on great open source software, take a look at jobs.netflix.com for current openings!



Monday, December 31, 2012

A Closer Look At The Christmas Eve Outage

by Adrian Cockcroft

Netflix streaming was impacted on Christmas Eve 2012 by problems in the Amazon Web Services (AWS) Elastic Load Balancer (ELB) service that routes network traffic to the Netflix services supporting streaming. The postmortem report by AWS can be read here.

We apologize for the inconvenience and loss of service. We’d like to explain what happened and how we continue to invest in higher availability solutions.

Partial Outage

The problems at AWS caused a partial Netflix streaming outage that started at around 12:30 PM Pacific Time on December 24 and grew in scope later that afternoon. The outage primarily affected playback on TV connected devices in the US, Canada and Latin America. Our service in the UK, Ireland and Nordic countries was not impacted.

Netflix uses hundreds of ELBs. Each one supports a distinct service or a different version of a service and provides a network address that your Web browser or streaming device calls. Netflix streaming has been implemented on over a thousand different streaming devices over the last few years, and groups of similar devices tend to depend on specific ELBs. Requests from devices are passed by the ELB to the individual servers that run the many parts of the Netflix application. Out of hundreds of ELBs in use by Netflix, a handful failed, losing their ability to pass requests to the servers behind them. None of the other AWS services failed, so our applications continued to respond normally whenever the requests were able to get through.

The Netflix Web site remained up throughout the incident, supporting sign up of new customers and streaming to Macs and PCs, although at times with higher latency and a likelihood of needing to retry. Overall streaming playback via Macs and PCs was only slightly reduced from normal levels. A few devices saw no impact at all, as they use an ELB configuration that kept running throughout the incident and provided normal playback levels.

At 12:24 PM Pacific Time on December 24 network traffic stopped on a few ELBs used by a limited number of streaming devices. At around 3:30 PM on December 24, network traffic stopped on additional ELBs used by game consoles, mobile and various other devices to start up and load lists of TV shows and movies. These ELBs were patched back into service by AWS at around 10:30 PM on Christmas Eve, so game consoles etc. were impacted for about seven hours. Most customers were fully able to use the service again at this point. Some additional ELB cleanup work continued until around 8 am on December 25th, when AWS finished restoring service to all the ELBs in use by Netflix, and all devices were streaming again.

Even though Netflix streaming for many devices was impacted, this wasn't an immediate blackout. Those devices that were already running Netflix when the ELB problems started were in many cases able to continue playing additional content.

Christmas Eve is traditionally a slow Netflix night as many members celebrate with families or spend Christmas Eve in other ways than watching TV shows or movies. We see significantly higher usage on Christmas Day and increased streaming rates continue until customers go back to work or school.  While we truly regret the inconvenience this outage caused our customers on Christmas Eve, we were also fortunate to have Netflix streaming fully restored before a much higher number of our customers would have been affected.

What Broke And What Should We Do About It

In its postmortem on the outage, AWS reports that “...data was deleted by a maintenance process that was inadvertently run against the production ELB state data”. This caused data to be lost in the ELB service back end, which in turn caused the outage of a number of ELBs in the US-East region across all availability zones starting at 12:24 PM on December 24.

The problem spread gradually, causing broader impact until “at 5:02 PM PST, the team disabled several of the ELB control plane workflows”.

The AWS team had to restore the missing state data from backups, which took all night. By 5:40 AM PST “... the new ELB state data had been verified.” AWS has put safeguards in place against this particular failure, and also says “We are confident that we could recover ELB state data in a similar event significantly faster”.

Netflix is designed to handle failure of all or part of a single availability zone in a region as we run across three zones and operate with no loss of functionality on two.  We are working on ways of extending our resiliency to handle partial or complete regional outages.

Previous AWS outages have mostly been at the availability zone level, and we’re proud of our track record in terms of up time, including our ability to keep Netflix streaming running while other AWS hosted services are down.

Our strategy so far has been to isolate regions, so that outages in the US or Europe do not impact each other.

It is still early days for cloud innovation and there is certainly more to do in terms of building resiliency in the cloud. In 2012 we started to investigate running Netflix in more than one AWS region and got a better sense of the complexity and investment needed to make these changes.

We have plans to work on this in 2013. It is an interesting and hard problem to solve, since there is a lot more data that will need to be replicated over a wide area, and the systems involved in switching traffic between regions must be extremely reliable and capable of avoiding cascading overload failures. Naive approaches could have the downside of being more expensive and more complex, and could cause new problems that might make the service less reliable. Look for upcoming blog posts as we make progress in implementing regional resiliency.

As always, we are hiring the best engineers we can find to work on these problems, and are open sourcing the solutions we develop as part of our platform.

Happy New Year and best wishes for 2013.


Thursday, December 20, 2012

Building the Netflix UI for Wii U

Hello, my name is Joubert Nel and I’m a UI engineer on the TV UI team here at Netflix. Our team builds the Netflix experiences for hundreds of TV devices, like the PlayStation 3, Wii, Apple TV, and Google TV.

We recently launched on Nintendo’s new Wii U game console. Like other Netflix UIs, we present TV shows and movies we think you’ll enjoy in a clear and fast user interface. While this UI introduces the first Netflix 1080p browse UI for game consoles, it also expands on ideas pioneered elsewhere like second screen control.


Virtual WebKit Frame

Like many of our other device UIs, our Wii U experience is built for WebKit in HTML5. Since the Wii U has two screens, we created a Virtual WebKit Frame, which partitions the UI into one area that is output to TV and one area that is output to the GamePad.

This gives us the flexibility to vary what is rendered on each screen as the design dictates, while sharing application state and logic in a single JavaScript VM. We also have a safe zone between the TV and GamePad areas so we can animate elements off the edge of the TV without appearing on the GamePad.

We started off with common Netflix TV UI engineering performance practices such as view pooling and accelerated compositing. View pooling reuses DOM elements to minimize DOM churn, and Accelerated Compositing (AC) allows us to designate certain DOM elements to be cached as a bitmap and rendered by the Wii U’s GPU.

In WebKit, each DOM node that produces visual output has a corresponding RenderObject, stored in the Render Tree. In turn, each RenderObject is associated with a RenderLayer. Some RenderLayers get backing surfaces when hardware acceleration is enabled. These layers are called compositing layers and they paint into their backing surfaces instead of the common bitmap that represents the entire page. Subsequently, the backing surfaces are composited onto the destination bitmap. The compositor applies transformations specified by the layer’s CSS -webkit-transform to the layer’s surface before compositing it. When a layer is invalidated, only its own content needs to be repainted and re-composited. If you’re interested in learning more, I suggest reading GPU Accelerated Compositing in Chrome.


Performance

After modifying the UI to take advantage of accelerated compositing, we found that the frame rate on device was still poor during vertical navigation, even though it rendered at 60fps in desktop browsers.

When the user browses up or down in the gallery, we animate 4 rows of poster art on TV and mirror those 4 rows on the GamePad. Preparing, positioning, and animating only 4 rows allows us to reduce (expensive) structural changes to the DOM while being able to display many logical rows and support wrapping. Each row maintains up to 14 posters, requiring us to move and scale a total of 112 images during each up or down navigation. Our UI’s posters are 284 x 405 pixels and eat up 460,080 bytes of texture memory each, regardless of file size. (You need 4 bytes to represent each pixel’s RGBA value when the image is decompressed in memory.)


Layout of poster art in the gallery



To improve performance, we tried a number of animation strategies, but none yielded sufficient gains. We knew that when we kicked off an animation, there was an expensive style recalculation. But the WebKit Layout & Rendering timeline didn’t help us figure out which DOM elements were responsible.

WebKit Layout & Rendering Timeline


We worked with our platform team to help us profile WebKit, and were then able to see how DOM elements relate to the Recalculate Style operations.

Our instrumentation helps us visualize the Recalculate Style call stack over time:
Instrumented Call Stack over Time



Through experimentation, we discovered that for our UI, there is a material performance gain when setting inline styles instead of modifying classes on elements that participate in vertical navigation.

We also found that some CSS selector patterns cause deep, expensive Recalculate Style operations. It turns out that the mere presence of the following pattern in CSS triggers a deep Recalculate Style:

.list-showing #browse { … }

Moreover, a -webkit-transition with duration greater than 0 causes the Recalculate Style operations to be repeated several times during the lifetime of the animation.
After removing all CSS selectors of this pattern, the resulting Recalculate Style shape is shallower and consumes less time.


Delivering great experiences

Our team builds innovative UIs, experiments with new concepts using A/B testing, and continually delivers new features. We also have to make sure our UIs perform fast on a wide range of hardware, from inexpensive consumer electronics devices all the way up to more powerful devices like the Wii U and PS3.

If this kind of innovation excites you as much as it does me, join our team!











Monday, December 17, 2012

Complexity In The Digital Supply Chain

Netflix launched in Denmark, Norway, Sweden, and Finland on Oct. 15th. I just returned from a trip to Europe to review the content deliveries with European studios that prepared content for this launch.

This trip reinforced for me that today’s Digital Supply Chain for the streaming video industry is awash in accidental complexity. Fortunately the incentives to fix the supply chain are beginning to emerge. Netflix needs to innovate on the supply chain so that we can effectively increase licensing spending to create an outstanding member experience. The content owning studios need to innovate on the supply chain so that they can develop an effective, permanent, and growing sales channel for digital distribution customers like Netflix. Finally, post production houses have a fantastic opportunity to pivot their businesses to eliminate this complexity for their content owning customers.

Everyone loves Star Trek because it paints a picture of a future that many of us see as fantastic and hopefully inevitable. Warp factor 5 space travel, beamed transport over global distances, and automated food replicators all bring simplicity to the mundane aspects of living and free up the characters to pursue existence on a higher plane of intellectual pursuits and exploration.

The equivalent of Star Trek for the Digital Supply Chain is an online experience for content buyers where they browse available studio content catalogs and make selections for content to license on behalf of their consumers. Once an ‘order’ is completed on this system, the materials (video, audio, timed text, artwork, meta-data) flow into retailers’ systems automatically and out to customers in a short and predictable amount of time, 99% of the time. Eliminating today’s supply chain complexity will allow all of us to focus on continuing to innovate with production teams to bring amazing new experiences like 3D, 4K video, and many innovations not yet invented to our customers’ homes.

We are nowhere close to this supply chain today but there are no fundamental technology barriers to building it. What I am describing is largely what www.netflix.com has been for consumers since 2007, when Netflix began streaming. If Netflix can build this experience for our customers, then conceivably the industry can collaborate to build the same thing for the supply chain. Given the level of cooperation needed, I predict it will take five to ten years to gain a shared set of motivations, standards, and engineering work to make this happen. Netflix, especially our Digital Supply Chain team, will be heavily involved due to our early scale in digital distribution.

To realize the construction of the Starship Enterprise, we need to innovate on two distinct but complementary tracks. They are:
  1. Materials quality: Video, audio, text, artwork, and descriptive meta data for all of the needed spoken languages
  2. B2B order and catalog management: Global online systems to track content orders and to curate content catalogs

Materials Quality
Netflix invested heavily in 2012 in making it easier to deliver high quality video, audio, text, artwork, and meta data to Netflix. We expanded our accepted video formats to include the de facto industry standard of Apple ProRes. We built a new team, Content Partner Operations, to engage content owners and post production houses and mentor their efforts to prepare content for Netflix.

The Content Partner Operations team also began to engage video and audio technology partners to include support for the file formats called out by the Netflix Delivery Specification in the equipment they provide to the industry to prepare and QC digital content. Throughout 2013 you will see the Netflix Delivery Specification supported by a growing list of those equipment manufacturers. Additionally, the Content Partner Operations team will establish a certification process for post production houses’ ability to prepare content for Netflix. Content owners that are new to Netflix delivery will be able to turn to any one of many post production houses certified to deliver to Netflix from all of our regions around the world.

Content owners’ ability to prepare content for Netflix varies considerably. Those content owners who perform the best are those who understand the lineage of all of the files they send to Netflix. Let me illustrate this ‘lineage’ reference with an example.

There is a movie available for Netflix streaming that was so magnificently filmed, it won an Oscar for Cinematography. It was filmed widescreen in a 2.20:1 aspect ratio but it was available for streaming on Netflix in a modified 4:3 aspect ratio. How can this happen? I attribute this poor customer experience to an industry wide epidemic of ‘versionitis’. After this film was produced, it was released in many formats. It was released in theaters, mastered for Blu-ray, formatted for airplane in-flight viewing and formatted for the 4:3 televisions that prevailed in the era of this film. The creation of many versions of the film makes perfect sense, but versioning becomes versionitis when retailers like Netflix neglect to clearly specify which version they want and when content owners don’t have a good handle on which versions they have. The first delivery made to Netflix of this film must have been derived from the 4:3 broadcast television cut. Netflix QC initially missed this problem and we put this version up for our streaming customers. We eventually realized our error and issued a re-delivery request to the content owner to receive this film in the original aspect ratio that the filmmakers intended for viewing the film. Versionitis from the initial delivery resulted in a poor customer experience, and then Netflix and the content owner incurred new and unplanned spending to execute new deliveries to fix the customer experience.

Our recent trip to Europe revealed that the common theme of those studios that struggled with delivery was versionitis. They were not sure which cut of video to deliver or if those cuts of video were aligned with language subtitle files for the content. The studios that performed the best have a well established digital archive that avoids versionitis. They know the lineage of all of their video sources and those video files’ alignment with their correlated subtitle files.

There is a link between content owner revenue and content owner delivery skill. Frequently Netflix finds itself looking for opportunities to grow its streaming catalogs quickly with budget dollars that have not yet been allocated. Increasingly the Netflix deal teams are considering the effectiveness of a content owner’s delivery abilities when making those spending decisions. Simply put, content owners who can deliver quickly and without error are getting more licensing revenue from Netflix than those content owners suffering from versionitis and the resulting delivery problems.

B2B order and catalog management
Today Netflix has a set of tools for managing content orders and curating our content catalogs. These tools are internal to our business and we currently engage the industry for delivery tracking through phone calls and emails containing spreadsheets of content data.

We can do a lot better than to engage the industry with spreadsheets attached to email. We will rectify this in the first half of 2013 with the release of the initial versions of our Content Partner Portal. The universal reaction to reviewing our Nordic launch with content owners was that we were showing them great data (timeliness, error rates, etc.) about their deliveries, but that they needed to see such data much more frequently. The Content Partner Portal will allow all of these metrics to be shared in real time with content owner operations teams while the deliveries are happening. We also foresee that the Content Partner Portal will be used by the Netflix deal team to objectively assess the delivery performance of content owners when planning additional spending.

We also see a role for shared industry standards to help with delivery tracking and catalog curation. The EIDR initiative, for identifying content and versions of content, offers the potential for alignment across companies in the Digital Supply Chain. We are building the ability to label titles with EIDR into our new Content Partner Portal.

Final thoughts
Today’s supply chain is messy and not well suited to help companies in our industry to fully embrace the rapidly growing channel of internet streaming. We are a long way from the Starship Enterprise equivalent of the Digital Supply Chain but the growing global consumer demand for internet streaming clearly provides the incentive to invest together in modernizing the supply chain.

Netflix has many initiatives underway to innovate in developing the supply chain in 2013, some of which were discussed in this post, and we look forward to continuing to collaborate with our content owning partners on their supply chain innovation efforts.

Netflix is hiring for open positions in our Digital Supply Chain team. Please visit http://jobs.netflix.com to see our open positions. We also put together a short video about the supply chain for a recent job fair. Here is a link to that video.

Kevin McEntee
VP Digital Supply Chain

Tuesday, December 11, 2012

Hystrix Dashboard + Turbine Stream Aggregator

by Ben Christensen, Puneet Oberai and Ben Schmaus

Two weeks ago we introduced Hystrix, a library for engineering resilience into distributed systems. Today we're open sourcing the Hystrix dashboard application, as well as a new companion project called Turbine that provides low latency event stream aggregation.


The Hystrix dashboard has significantly improved our operations by reducing discovery and recovery times during operational events. The duration of most production incidents (already less frequent due to Hystrix) is far shorter, with diminished impact, because we are now able to get realtime insights (1-2 second latency) into system behavior.

The following snapshot shows six HystrixCommands being used by the Netflix API. Under the hood of this example dashboard, Turbine is aggregating data from 581 servers into a single stream of metrics supporting the dashboard application, which in turn streams the aggregated data to the browser for display in the UI.


When a circuit is failing, it changes color (a gradient from green through yellow, orange and red), such as this:

The diagram below shows one "circuit" from the dashboard along with explanations of what all of the data represents.

We've purposefully tried to pack a lot of information into the dashboard so that engineers can quickly consume and correlate data.



The following video shows the dashboard operating with data from a Netflix API cluster:



The Turbine deployment at Netflix connects to thousands of Hystrix-enabled servers and aggregates realtime streams from them. Netflix uses Turbine with a Eureka plugin that handles instances joining and leaving clusters (due to autoscaling, red/black deployments, or just being unhealthy).

Our alerting systems have also started migrating to Turbine-powered metrics streams, so that a single metric can have dozens or hundreds of data points per minute. This high resolution of metrics data makes for better and faster alerting.

The Hystrix dashboard can be used either to monitor an individual instance without Turbine or in conjunction with Turbine to monitor multi-machine clusters:



Turbine can be found on GitHub at: https://github.com/Netflix/Turbine

Dashboard documentation is at: https://github.com/Netflix/Hystrix/wiki/Dashboard
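Whether you point it at a single instance or at Turbine, the dashboard consumes a plain text/event-stream feed, so it can be handy to tail a stream with a simple HTTP client before wiring up the UI. The sketch below assumes the Hystrix metrics stream servlet is mapped at the conventional /hystrix.stream path on one of your instances; for Turbine, substitute the URL of your aggregated cluster stream.

    // Minimal sketch: tail a Hystrix (or Turbine) event stream and print its data lines.
    // Assumes the metrics stream servlet is mapped at the conventional /hystrix.stream path.
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.net.URLConnection;

    public class StreamTail {
        public static void main(String[] args) throws Exception {
            String streamUrl = args.length > 0 ? args[0]
                    : "http://localhost:8080/hystrix.stream";  // adjust host/port for your app

            URLConnection conn = new URL(streamUrl).openConnection();
            conn.setReadTimeout(0);  // the stream is long-lived; don't time out reads

            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    // Server-sent events prefix each metrics payload with "data:".
                    if (line.startsWith("data:")) {
                        System.out.println(line.substring("data:".length()).trim());
                    }
                }
            }
        }
    }

Running this against a single instance first is a quick sanity check; once that works, pointing it (or the dashboard) at the Turbine URL exercises the aggregated cluster view.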

We expect people to want to customize the UI, so the JavaScript modules have been implemented in a way that they can easily be used standalone in existing dashboards and applications. We also expect different perspectives on how to visualize and represent data, and look forward to contributions back to both Hystrix and Turbine.

We are always looking for talented engineers so if you're interested in this type of work contact us via jobs.netflix.com.


Monday, December 10, 2012

Videos of the Netflix talks at AWS Re:Invent

by Adrian Cockcroft

Most of the talks and panel sessions at AWS Re:Invent were recorded, but there are so many sessions that it's hard to find the Netflix ones. Here's a link to all of the videos posted by AWS that mention Netflix: http://www.youtube.com/user/AmazonWebServices/videos?query=netflix

They are presented below in what seems like a natural order that tells the Netflix story, starting with the migration and video encoding talks, then talking about availability, Cassandra based storage, "big data" and security architecture, ending up with operations and cost optimization. Unfortunately a talk on Chaos Monkey had technical issues with the recording and is not available.

Embracing the Cloud

Presented by Neil Hunt - Chief Product Officer, and Yury Izrailevsky - VP Cloud and Platform Engineering.

Join the product and cloud computing leaders of Netflix to discuss why and how the company moved to Amazon Web Services. From early experiments for media transcoding, to building the operational skills to optimize costs and the creation of the Simian Army, this session guides business leaders through real world examples of evaluating and adopting cloud computing.

Slides: http://www.slideshare.net/AmazonWebServices/ent101-embracing-the-cloud-final



Netflix's Encoding Transformation

Presented by Kevin McEntee, VP Digital Supply Chain.

Netflix designed a massive scale cloud based media transcoding system from scratch for processing professionally produced studio content. We bucked the common industry trend of vertical scaling and, instead, designed a horizontally scaled elastic system using AWS to meet the unique scale and time constraints of our business. Come hear how we designed this system, how it continues to get less expensive for Netflix, and how AWS represents a transformative opportunity in the wider media owning industry.

Slides: http://www.slideshare.net/AmazonWebServices/med202-netflixtranscodingtransformation



Highly Available Architecture at Netflix

Presented by Adrian Cockcroft (@adrianco) Director of Architecture

This talk describes a set of architectural patterns that support highly available services that are also scalable, low cost, low latency and allow agile continuous deployment development practices. The building blocks for these patterns have been released at netflix.github.com as open source projects for others to use.

Slides: http://www.slideshare.net/AmazonWebServices/arc203-netflixha




Optimizing Your Cassandra Database on AWS

Presented by Ruslan Meshenberg - Director of Cloud Platform Engineering and Gregg Ulrich - Cassandra DevOps Manager

For a service like Netflix, data is crucial. In this session, Netflix details how they chose and leveraged Cassandra, a highly-available and scalable open source key/value store. In this presentation they discuss why they chose Cassandra, the tools and processes they developed to quickly and safely move data into AWS without sacrificing availability or performance, and best practices that help Cassandra work well in AWS.
Slides: http://www.slideshare.net/AmazonWebServices/dat202-cassandra



Data Science with Elastic Map Reduce

Presented by Kurt Brown - Director, Data Science Engineering Platform

In this talk, we dive into the Netflix Data Science & Engineering architecture. Not just the what, but also the why. Some key topics include the big data technologies we leverage (Cassandra, Hadoop, Pig + Python, and Hive), our use of Amazon S3 as our central data hub, our use of multiple persistent Amazon Elastic MapReduce (EMR) clusters, how we leverage the elasticity of AWS, our data science as a service approach, how we make our hybrid AWS / data center setup work well, and more.
Slides: http://www.slideshare.net/AmazonWebServices/bdt303-netflix-data-science-with-emr



Security Panel

Featuring Jason Chan, Director of Cloud Security Architecture.

Learn from fellow customers, including Jason Chan of Netflix, Khawaja Shams of NASA, and Rahul Sharma of Averail, who have leveraged the AWS secure platform to build business critical applications and services. During this panel discussion, our panelists share their experiences utilizing the AWS platform to operate some of the world’s largest and most critical applications.




How Netflix Operates Clouds for Maximum Freedom and Agility

Presented by Jeremy Edberg (@jedberg), Reliability Architect

In this session, learn how Netflix has embraced DevOps and leveraged all that Amazon has to offer to allow our developers maximum freedom and agility.
Slides: http://www.slideshare.net/AmazonWebServices/rmg202-devops-atnetflixreinvent



Optimizing Costs with AWS

Presented by Coburn Watson - Manager, Cloud Performance Engineering

Find out how Netflix, one of the largest, most well-known and satisfied AWS customers, develops and runs its applications efficiently on AWS. The manager of the Netflix Cloud Performance Engineering team outlines a common-sense approach to effectively managing AWS usage costs while giving the engineers unconstrained operational freedom.
Slides: http://www.slideshare.net/cpwatson/aws-reinvent-optimizing-costs-with-aws


Intro to Chaos Monkey and the Simian Army

Presented by Ariel Tseitlin - Director of Cloud Solutions

Why the monkeys were created, what makes up the Simian Army, and how we run and manage them in the production environment.
Slides: http://www.slideshare.net/AmazonWebServices/arc301netflixsimianarmy

Unfortunately the video recording had technical problems.

In Closing...


We had a great time and enjoyed the opportunity to have a large number of Netflix executives, managers and architects tell the "Netflix in the Cloud" story in much more detail than usual. Hopefully this summary makes it easier to watch all our talks and follow that story.

Monday, December 3, 2012

AWS Re:Invent was Awesome!

by Adrian Cockcroft

There was a very strong Netflix presence at AWS Re:Invent in Las Vegas this week, from Reed Hastings appearing in the opening keynote, to a packed series of ten talks by Netflix management and engineers, and our very own expo booth. The event was a huge success: over 6,000 attendees, great new product and service announcements, and smooth organization. We are looking forward to doing it again next year.

Wednesday Morning Keynote

The opening keynote with Andy Jassy contains an exciting review of the Curiosity Mars landing showing how AWS was used to feed information and process images for the watching world. Immediately afterwards (at 36'40") Andy sits down with Reed Hastings.


Reed talks about taking inspiration from Nicholas Carr's book "The Big Switch" to realize that cloud would be the future, and over the last four years, Netflix has moved from initial investigation to having deployed about 95% of our capacity on AWS. By the end of next year Reed aims to be 100% on AWS and to be the biggest business entirely hosted on AWS apart from Amazon Retail. Streaming in 2008 was around a million hours a month; now it's over a billion hours a month. A thousandfold increase over four years is difficult to plan for, and while Netflix took the risk of being an early adopter of AWS in 2009, we were avoiding a bigger risk of being unable to build out capacity for streaming ourselves. "The key is that now we're on a cost curve and an architecture... that as all of this room does more with AWS we benefit, by that collective effect that gets you to scale and brings prices down."

Andy points out that Amazon Retail competes with Netflix in the video space, and asks what gave Reed the confidence to move to AWS. Reed replies that Jeff Bezos and Andy have both been very clear that AWS is a great business that should be run independently and the more that Amazon Retail competes with Netflix, the better symbol Netflix is that it's safe to run on AWS. Andy replies "Netflix is every bit as important a customer of AWS as Amazon Retail, and that's true for all of our external customers".

The discussion moves on to the future of cloud, and Reed points out that as wonderful as AWS is, we are still in the assembly language phase of cloud computing. Developers shouldn't have to be picking individual instance types, just as they no longer need to worry about CPU register allocation because compilers handle that for them. Over the coming years, the cloud will add the ability to move live instances between instance types. We can see that this is technically possible because VMware does that today with VMotion, but bringing this capability to the public cloud would allow cost optimization, improvements in bisection bandwidth and great improvements in efficiency. There are great technical challenges to doing this seamlessly at scale, and Reed wished Andy well in tackling these hard problems in the coming years.

The second area of future development is consumer devices that are touch based, understand voice commands and are backed by ever more powerful cloud based services. For Netflix, the problem is to pick the best movies to show on a small screen for a particular person at that point in time, from a huge catalog of TV shows and movies. The ability to cheaply throw large amounts of compute power at this ranking problem lets Netflix experiment rapidly to improve the customer experience.

In the final exchange, Andy asks what advice he can give to the audience, and Reed says to build products that you find exciting, and to watch House of Cards on Netflix on February 1st next year.

Next Andy talks about the rate at which AWS introduces and updates products, from 61 in 2010, to 82 in 2011, to 158 in 2012. He then goes on to introduce Amazon Redshift, a low cost data warehouse as a service that we are keen to evaluate as we replace our existing datacenter based data warehouse with a cloud based solution.

Along with presentations from NASDAQ and SAP, Andy finished up with examples of mission critical applications that are running on AWS, including a huge diagram showing the Obama For America election back end, consisting of over 200 applications. We were excited to find out that the OFA tech team were using the Netflix open source management console Asgard to manage their deployments on AWS, and to see the Asgard icon scattered across this diagram. During the conference we met the OFA team and many other AWS end users who have also started using various @NetflixOSS projects.

Thursday Morning Keynote

The second day keynote with Werner Vogels started off with Werner talking about architecture. Starting around 43 minutes in, he describes some 21st Century Architectural patterns which are being used by Amazon.com and AWS itself, and which are also very similar to the Netflix architectural practices. After a long demo from Matt Wood that used the AWS Console to laboriously do what Asgard does in a few clicks, there is an interesting description of how S3 was designed for resilience and scalability by Alyssa Henry, the VP of Storage Services for AWS.


Werner returns to talk about some more architectural principles, followed by a customer talk from Animoto, and then announces two new high end instance types that will become available in the coming weeks. The cr1.8xlarge has 240GB of RAM and two 120GB solid state disks, making it ideal for running in-memory analytics. The hs1.8xlarge has 114GB of RAM and twenty-four 2TB hard drives in the instance, making it ideal for running data warehouses; it is clearly the raw back end instance behind the Redshift data warehouse product announced the day before. Finally he discusses data driven architectures and introduces AWS Data Pipeline, then Matt Wood comes on again to do a demo.

Thursday Afternoon Fireside Chat

The final keynote, fireside chat with Werner Vogels and Jeff Bezos has interesting discussions of lean start-up principles and the nature of innovation. At 29'50" they discuss Netflix and the issues of competition between Amazon Prime and Netflix. Jeff says there is no issue, "We bust our butt every day for Netflix", and Werner says the way AWS works is the same for everyone, there are no special cases for Amazon.com, Netflix or anyone else.


The discussion continues with an introduction to the 10,000 year clock and the Blue Origin vertical take off and vertical landing spaceship that Jeff is also involved in as side projects.

Netflix in the Expo Hall and @NetflixOSS

The exhibition area was impressive, with many interesting vendors that highlight the strong ecosystem around AWS. Netflix had a small booth which was aimed primarily at recruiting, but also provided a place to meet with the speakers and to meet people using the @NetflixOSS platform components. Over the last year Netflix has been gradually open sourcing our platform. While we aren't finished yet, it is now emerging as a way for other companies to rapidly adopt the same highly available architecture on AWS that has been very successful for Netflix.

More Coming Soon

There were a large number of presentations at AWS Re:Invent, the organizers have stated that videos of all the presentations will be posted to their YouTube channel, and some slides are already on http://www.slideshare.net/amazonwebservices. Netflix also archives its presentations on slideshare.net/netflix and we plan to link to the videos of Netflix talks when they are posted, here's a list of what's coming, with links to some of the slides.


Wed 1:00-1:45
Coburn Watson
Optimizing Costs with AWS

Wed 2:05-2:55
Kevin McEntee
Netflix’s Transcoding Transformation

Wed 3:25-4:15
Neil Hunt / Yury Izrailevsky
Netflix: Embracing the Cloud

Wed 4:30-5:20
Adrian Cockcroft
High Availability Architecture at Netflix

Thu 10:30-11:20
Jeremy Edberg
Rainmakers – Operating Clouds

Thu 11:35-12:25
Kurt Brown
Data Science with Elastic Map Reduce (EMR)

Thu 11:35-12:25
Jason Chan
Security Panel: Learn from CISOs working with AWS

Thu 3:00-3:50
Adrian Cockcroft
Compute & Networking Masters Customer Panel

Thu 3:00-3:50
Ruslan Meshenberg/Gregg Ulrich
Optimizing Your Cassandra Database on AWS

Thu 4:05-4:55
Ariel Tseitlin
Intro to Chaos Monkey and the Simian Army