Computer vision at scale with Hadoop and Storm

Recently, the team at Flickr has been working to improve photo search. Before our work began, Flickr only knew about photo metadata — information about the photo included in camera-generated EXIF data, plus any labels the photo owner added manually like tags, titles, and descriptions. Ironically, Flickr had never been able to “see” what’s in the photograph itself.

Over time, many of us have started taking more photos, and it has become routine — especially with the launch last year of our free terabyte* — for users to have many un-curated photos with little or no metadata. This has made it difficult in some cases to find photos, either your own or from others.

So for the first time, Flickr has started looking at the photo itself**. Last week, the Flickr team presented this technology at the May meeting of the San Francisco Hadoop User’s Group at our new offices in San Francisco. The presentation focuses on how we scaled computer vision and deep learning algorithms to Flickr’s multi-billion image collection using technologies like Apache Hadoop and Storm. (In a future post here, we’ll describe the learning and vision systems themselves in more detail.)

Slides available here: Flickr: Computer vision at scale with Hadoop and Storm

Thanks very much to Amit Nithian and Krista Wiederhold (organizers of the SFHUG meetup) for giving us a chance to share our work.

If you’d like to work on interesting challenges like this at Flickr in San Francisco, we’d like to talk to you! Please look here for more information: http://www.flickr.com/jobs

* Today is the first anniversary of the terabyte!

** Your photos are processed by computers – no humans look at them. The automatic tagging data is also protected by your privacy settings.

Flickr API Going SSL-Only on June 27th, 2014

If you read my recent post, you know that the Flickr API fully supports SSL. We’ve already updated our web and mobile apps to use HTTPS, and you no longer need to use “secure.flickr.com” to access the Flickr API via SSL. Simply update your code to call:

https://api.flickr.com/

We want communication with Flickr to be secure, all the time. So, we are tightening things up. Effective this week, all new API keys will work via HTTPS only. On June 27th, we will deprecate non-SSL access to the API. If you haven’t already made the change to HTTPS, now is the time!

Blackout Tests

In preparation for the June 27th cut-off date, we will run two “blackout” tests, each for 2 hours, so that you can ensure that API calls in your app no longer use HTTP. If you have changed your code to use HTTPS, your app should function normally during the blackout window. If you have not changed your code to use HTTPS, then during the 2-hr blackout window all API calls from your application will fail. The API will return a 403 status code for non-SSL requests.

Important Dates and Times

  • Change in new API keys:  6 May 2014 (If you request a new API key after 6 May, it will be issued for HTTPS only)
  • First blackout window: 3 June 2014, 10:00-12:00 Pacific Daylight Time (PDT) / 17:00-19:00 GMT
  • Second blackout window: 17 June 2014, 18:00-20:00 (PDT) / 18 June 2014, 01:00-03:00 GMT
  • Non-SSL calls deprecated: 27 June 2014, 10:00 (PDT) / 17:00 GMT

More About the Flickr API Endpoints

In the API documentation, all of the endpoints have been updated to HTTPS. While OAuth adds security by removing the need to authenticate by username and password, sending all traffic over SSL further protects our users’ data while in transit.

The SSL endpoints for the Flickr API are:

https://api.flickr.com/services/rest/
https://api.flickr.com/services/soap/
https://api.flickr.com/services/xmlrpc/

And for uploads:

https://up.flickr.com/services/upload/
https://up.flickr.com/services/replace/

For applications that use well-established HTTP client libraries, this switch should only require updating the protocol and (maybe) some configuration.
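
As a concrete illustration (a sketch, not Flickr-provided sample code), the only change for a typical client is the protocol in the request URL. The key value below is a placeholder:

// A minimal sketch of a browser-side call to the REST endpoint over HTTPS.
// flickr.test.echo simply echoes the request parameters back; YOUR_API_KEY
// is a placeholder for your own key. Cross-origin rules apply in the browser;
// server-side HTTP clients need only the same protocol change.
var xhr = new XMLHttpRequest();
xhr.open('GET', 'https://api.flickr.com/services/rest/' +
  '?method=flickr.test.echo&format=json&nojsoncallback=1' +
  '&api_key=YOUR_API_KEY');
xhr.onload = function() {
  // after June 27th, the same call made over plain HTTP returns a 403 instead
  console.log(xhr.status, xhr.responseText);
};
xhr.send();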

We realize that this change might be more difficult for some. We will follow the Developer Support Group closely, so please let us hear your questions. We will respond to them there, and will collect questions of general interest in the FAQ below.

FAQs About the Transition to SSL-Only for the Flickr API

Question:  I only have a Flickr API key because I use an application or plugin that calls the Flickr API.  Will that application or plugin continue to work after June 27th?  Do I need to do something?
Answer:  An application or plugin that calls the Flickr API will stop working on June 27th if its owner does not make the changes we’ve described above.  There are many, many providers of such services and plugins.  We have notified them about this transition via email, blog post, developer lists, and on Twitter.  As a user of such a service or plugin, you don’t need to take any action for this transition unless the application or plugin owner asks you to upgrade to a new version.  You can also reach out to the application or plugin owner to confirm their plans for handling this transition.

Question: Are all http://www.flickr.com urls going to be HTTPS from now on?
Answer: Yes, all http://www.flickr.com urls returned by the API are now HTTPS, and all requests to HTTP in the browser are redirected to HTTPS.

Question: Should I switch my code to HTTPS right away or should I wait a bit?
Answer: Switch now. The important thing is for your app to be changed to HTTPS before the first blackout on 3 June 2014. In fact, if your app is a mobile app, the earlier the better, so that your users will be more likely to upgrade before the first blackout.

Question:  Do I need a new API key to replace my old one for this transition?
Answer:  No, you don’t need a new API key for this transition.  You keep your existing key, and you change the code where you call the Flickr API, so that you call it with the HTTPS protocol, instead of HTTP. Change it to this:

https://api.flickr.com/

Question:  What do I do with the new API key that is being issued by Flickr as of May 6?
Answer:  We are not automatically issuing a new key to you.  What happened on 6 May was a change to how we handle new API keys.  From now on, if you submit a request for a new key, that new key will only support calls to the Flickr API over HTTPS; it will not support calls to the API over HTTP.  You do not need to request a new API key for this transition.

Question:  I received your email about the API going SSL-only on June 27th.  Do I need to do something?
Answer:  Maybe not.  If you already call the API over HTTPS, then you’re good.  No action needed.  But if your code currently calls the API over HTTP, then, YES, you do need to do something.  In your code you need to change the protocol to HTTPS.  Like this:

https://api.flickr.com/

Question: If I use a protocol-less call to the API or match the protocol of the page that is making the call, do I need to change anything?  
Answer:  Yes, you should change your calls to specifically use HTTPS. During the blackout periods and starting on June 27th, protocol-less calls to the Flickr API from non-SSL pages will fail.

Question:  Will the old https://secure.flickr.com endpoints continue to be supported in addition to the new https://api.flickr.com endpoints, or will only the latter be supported?
Answer:  Yes, https://secure.flickr.com endpoints will still be supported.  If you use that today, your application will continue to work during the blackout windows and after 27 June.


Building Flickr’s new Hybrid Signed-Out Homepage

Adventures in Frontend-Landia

tl;dr: Chrome’s DevTools: still awesome. Test carefully on small screens, mobile/tablets. Progressively enhance “extraneous” but shiny features where appropriate.

Building a fast, fun Slideshow / Web Page Hybrid

Every so often, dear reader, you may find yourself with a unique opportunity. Sometimes it’s a chance to take on some crazy ideas, break the rules and perhaps get away with some front-end skullduggery that would be neither allowed nor encouraged under normal circumstances. In this instance, Flickr’s newest Signed-Out Homepage turned out to be just that sort of thing.

The 2014 signed-out flickr.com experience (flickr.com/new/) is a hybrid, interactive blend of slideshow and web page combining scroll and scaling tricks, all the while highlighting the lovely new Flickr mobile apps for Android and iPhone with UI demos shown via inline HTML5 video and JS/CSS-based effects.

Flickr.com scroll-through demo

Features

In 2013, we covered performance details of developing a vertical-scrolling page using some parallax effects, targeting and optimizing for a smooth experience. In 2014, we are using some of the same techniques, but have added some new twists and tricks. In addition, there is more consideration for smaller screens this year, given the popularity of tablets and other portable devices.

Briefly:

  • Fluid slideshow-like UI, scale3d() and zoom-based scaling of content for larger screens

  • Inline HTML5 <video>, “retina” / hi-DPI scale (with fallback considerations)

  • Timeline-based HTML transition effects, synced to HTML5 video

  • “Hijacking” of touch/mouse/keyboard scroll actions, where appropriate to experience

  • Background parallax, scale/zoom and blur effects (where supported)

Usability Considerations: Scrolling

In line with current trends, our designers intended to have a slideshow-like experience. The page was to be split into multiple “slides” of a larger presentation, with perhaps some additional navigation elements and cues to help the user move between slides.

Out in the wild, implementations of the slideshow-style web page vary widely in their flexibility. Controlling the presentation like this is challenging and dangerous from a technical perspective, as the first thing you are doing is trying to prevent the browser from doing what it does well (arbitrary bi-directional scrolling, in either staggered steps or smooth inertia-based increments depending on the method used) in favour of your own method, which is more likely to have holes in its implementation.

If you’re going to hijack a basic interaction like scrolling, attention to detail is critical. Because you’ve built something non-standard, even in the best case the user may notice and think, “That’s not how it normally scrolls, but it responded and now I’m seeing the next page.” If you’re lucky, they could be using a touchpad to scroll and may barely notice the difference.

By carefully managing the display of content to fit the screen and accounting for common scroll actions, we are able to confidently override the browser’s default scroll behaviour in most cases to present a unique experience that’s a hybrid of web page and slideshow.

The implementation itself is fairly straightforward; you can listen to the mouse wheel event (triggered both by physical wheels and touchpads), determine which direction the user is moving in, debounce further wheel events and then run an animation to transition to the next slide. It’s imperfect and subject to double-scrolling, but most users will not “throw” the scroll so hard that it retains enough inertia and continues to fire after your animation ends.
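
A rough sketch of that approach follows. The .slide selector and 250 ms cooldown are illustrative stand-ins rather than Flickr’s actual implementation, and slideIntoView() is a hypothetical helper that animates scrollTop to the target slide:

// Sketch: debounced wheel handling to move between "slides".
var slides = document.querySelectorAll('.slide'),
    currentSlide = 0,
    animating = false;

window.addEventListener('wheel', function(e) {
  if (animating) {
    e.preventDefault();
    return;
  }
  // positive deltaY (or negative legacy wheelDelta) means scrolling down
  var direction = (e.deltaY || -e.wheelDelta) > 0 ? 1 : -1,
      next = Math.max(0, Math.min(slides.length - 1, currentSlide + direction));
  if (next !== currentSlide) {
    e.preventDefault();
    animating = true;
    currentSlide = next;
    slideIntoView(slides[next], function() {
      // short cooldown so leftover wheel inertia doesn't trigger a double-scroll
      setTimeout(function() { animating = false; }, 250);
    });
  }
}, { passive: false }); // preventDefault() requires a non-passive listener in newer browsers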

Additionally, if the user is on an OS that shows a scrollbar (i.e., non-OS X or OS X with a mouse plugged in), they should be able to grab and drag the scrollbar and navigate through the page that way. Don’t even try messing with that stuff – your users will kill you with pitchforks, ensuring you will be sent to Web Developer Usability Anti-Pattern Hell. You will not pass Go, and will not collect $200.

Content Sizing

In order to get a slideshow-like experience, each “slide” had to be designed to fit within common viewport dimensions. We assumed roughly 1024×768, but ended up targeting a minimum viewport height of around 600px – roughly what you’d get on a typical 13″ MacBook laptop with a maximized window and a visible dock. In retrospect, that doesn’t feel like a whole lot of space; it’s important to keep in mind if you’re aiming to display your work on mobile screens as well.

Once each slide fit within our target dimensions, the positioning of each slide’s content could be tightly controlled. Each is in a relatively-positioned container so they stack vertically as normal, and the height is, at a minimum, the height of the viewport or the natural offsetHeight dictated by the content itself. Reasonable defaults are first assigned by CSS, and future updates are done via JS at initial render and on window.resize().
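
In rough terms, the per-slide sizing might look something like the following; the .slide class name and the currentSlide bookkeeping are assumptions for illustration, not the actual page code:

// Sketch: size each slide to at least the viewport height on load and resize,
// then re-align scrollTop so the active slide stays in view.
var slides = document.querySelectorAll('.slide'),
    currentSlide = 0; // maintained elsewhere by the scroll-handling code

function resizeSlides() {
  var viewportHeight = window.innerHeight;
  for (var i = 0; i < slides.length; i++) {
    // content taller than the viewport keeps its natural offsetHeight
    slides[i].style.minHeight = viewportHeight + 'px';
  }
  // prevent vertical shift of content: snap back to the top of the active slide
  document.documentElement.scrollTop = document.body.scrollTop =
    slides[currentSlide].offsetTop;
}

window.addEventListener('resize', resizeSlides);
resizeSlides();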

With each slide being one viewport high, one might assume we could then let the user scroll freely through the content, perusing at will. We decided to go against this and control the scrolling for a few reasons.

  • Web browsers’ default “page down” (spacebar or page up/down keys, etc.) does not scroll through 100% of the viewport, as we would want in this case; there is always some overlap from the previous page. While this is completely logical considering the context of reading a document, etc., we want to scroll precisely to the beginning of the next frame. Thus, we use JS to animate and set scrollTop.

  • Content does not normally shift vertically when the user resizes their browser, but will now due to JS adjusting each slide’s height to fit the viewport as mentioned. Thus, we must also adjust scrollTop to re-align to the current slide, preventing the content from shifting as the user resizes the window. Sneaky.

  • We want to know when a user enters and leaves a slide, so we can play or reset HTML5 <video> elements and related animations as appropriate. By controlling scroll, we have discrete events for both.

Content Scaling

Given that we know the dimensions of our content and the dimensions of the browser viewport, we are able to “zoom” each slide’s absolutely-positioned content to fit nicely within the viewport of larger screens. This is a potential minefield-type feature, but can be applied selectively after careful testing. Just like min and max-width, you can implement your own form of min-scale and max-scale.

Content Scaling demo

Avoiding Pixelation

Scaling raster-based content, of course, is subject to degrading pretty quickly in terms of visual quality. To help combat pixelation, scaling is limited to a reasonable maximum – i.e., 150% – and where practical, retina/hi-DPI (@2x) assets are used for elements like icons, logos and so forth, regardless of screen type. This works rather well on standard LCDs. On the hi-DPI side, thankfully, huge retina screens are not common and there is less potential for scaling.

Depending on browser, content scaling can be done via scale3d() or the old DOM .style.zoom property (yes, it wasn’t just meant for triggering layout in old IE.) From my findings, Webkit appears to rasterize all content before scaling it. As a result, vector-based content like text is blurred in Webkit when using scale3d(). Thus, Webkit gets the older .style.zoom approach. Firefox doesn’t support .style.zoom, but does render crisp text when using scale3d().

There are a few tricks to getting scaling to work beyond updating it alongside the initial render and window.resize() events. For example, overflow: hidden may need to be applied to the frame container in the scale3d() case.
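
A simplified sketch of that branching is below; the 150% cap matches the figure mentioned above, while the design dimensions and element handling are assumptions for illustration:

// Sketch: clamp a computed scale factor, then apply it via .style.zoom (WebKit)
// or scale3d() (other browsers), per the rendering differences described above.
var MIN_SCALE = 1,
    MAX_SCALE = 1.5,
    useZoom = 'webkitTransform' in document.documentElement.style;

function scaleSlideContent(content, designWidth, designHeight) {
  var scale = Math.min(window.innerWidth / designWidth,
                       window.innerHeight / designHeight);
  scale = Math.max(MIN_SCALE, Math.min(MAX_SCALE, scale));
  if (useZoom) {
    // WebKit rasterizes before scaling with scale3d(), blurring text
    content.style.zoom = scale;
  } else {
    content.style.transform = 'scale3d(' + scale + ', ' + scale + ', 1)';
    content.style.transformOrigin = '50% 0';
  }
}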

JS Performance: window.onscroll() and window.onresize()

It’s no secret: scroll and resize are two popular JavaScript events that can cause a lot of layout thrashing. Some cost is incurred by the browser’s own layout, decoding of images, compositing and painting, but the most notable thrashing is caused by developers attaching expensive UI refresh-related functions to these events. Parallax effects on scroll are a popular example, but resize can trigger the same kind of work.

In this case, synchronous code fires on resize so that the frames immediately resize themselves to fit the new window dimensions, and the window’s scrollTop property is adjusted to prevent any vertical shift of content. This is expensive, but is justified in keeping the view consistent with what the user would expect during resize.

Scroll events on this page are throttled (that is, there is not a 1:1 event-firing-to-code-running ratio) so that the parallax, zoom and blur effects on the page – which can be expensive when combined – are updated at a lower, yet still responsive interval, thus lowering the load on rendering during scroll.
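
A minimal sketch of that throttling follows; the 33 ms interval and updateEffects() are placeholders for illustration, not the production values:

// Sketch: throttle scroll-driven effect updates to roughly 30 fps,
// rather than running them for every scroll event the browser fires.
var lastUpdate = 0,
    THROTTLE_MS = 33;

window.addEventListener('scroll', function() {
  var now = Date.now();
  if (now - lastUpdate < THROTTLE_MS) {
    return;
  }
  lastUpdate = now;
  // updateEffects() stands in for the parallax, zoom and blur updates
  updateEffects(window.pageYOffset || document.documentElement.scrollTop);
});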

Fun stuff: Background sizing, Parallax, Scale-based Motion, Blur Effects via Opacity, Video/HTML Timelines

The parallax thing has been done before, by Flickr and countless other web sites. This year, some twists on the style included a gradual blur effect introduced as the user scrolls down the page, and in some cases, a slight motion effect via scaling.

Backgrounds and Overlays

For this fluid layout, the design needed to be flexible enough that exact background positioning was not a requirement. We wanted to retain scale, and also cover the browser window. A fixed-position element is used in this case, with width/height: 100%, background-size: cover and background-position: 50% 0px, which works nicely for the main background and additional image-based overlays that are sometimes shown.

The background tree scene becomes increasingly blurry as the user scrolls through the page. CSS-based filters and canvas were options, but it was simpler to apply these as background images with identical scaling and positioning, and overlay them on top of the existing tree image. As the user scrolls through the top half of the page, a “semi-blur” image is gradually made visible by adjusting opacity. For the latter half, the semi-blur is at 100% and a third “full-blur” image is faded in using the same opacity approach.
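
One way to implement that crossfade is sketched below; the element ids are invented for illustration, and the scroll-progress math assumes the full page height drives the effect:

// Sketch: crossfade pre-blurred background overlays based on scroll progress.
// Intended to be called from the throttled scroll handler described earlier.
var semiBlur = document.getElementById('bg-semi-blur'),
    fullBlur = document.getElementById('bg-full-blur');

function updateBlur(scrollTop) {
  var maxScroll = document.body.scrollHeight - window.innerHeight,
      progress = maxScroll > 0 ? scrollTop / maxScroll : 0; // 0..1 through the page
  if (progress < 0.5) {
    // top half of the page: gradually reveal the semi-blur layer
    semiBlur.style.opacity = progress * 2;
    fullBlur.style.opacity = 0;
  } else {
    // latter half: semi-blur stays at 100%, fade in the full-blur layer
    semiBlur.style.opacity = 1;
    fullBlur.style.opacity = (progress - 0.5) * 2;
  }
}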

Where supported, the background also scales up somewhat as the user scrolls through the page, giving the effect of forward motion toward the trees. It is subtle when masked by the foreground content, but still noticeable.

Here is an example with the content hidden, showing how the background moves during scroll.

Background parallax/blur/zoom demo

Parallax + Scaling

In terms of parallax, a little extra image area is needed for the background to be able to move. Thus, the element containing the background images is width: 100% and height: 110%. The background is scaled by the browser to fit the container as previously described, and the additional 10% height is off-screen “parallax buffering” content. This way, the motion is always relative in scale and consistent with the background.

HTML5 Video and “Timelines” in JS

One of the UI videos in this page shows live filters being applied – “Iced Tea”, “Throwback” and so on, and we wanted those filters to show outside the video area as well, if possible. Full-screen video was considered briefly, but wasn’t appropriate for this design. Thus, it was JS to the rescue. By listening to a video’s timeupdate event and watching the currentTime attribute, events could be queued in JS with an associated time, and subsequently fired roughly in sync with effects in the video.

Filter UI demo

In this case, the HTML-based effects were simple CSS opacity transitions triggered by changing className values on a parent element.

When a user leaves a slide, the video can be reset when the scroll animation completes, and any filter / transition-based effects can also be faded out. If the user returns to the slide, the video and effects seamlessly restart from their original position.
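
A sketch of that timeline approach is below. The times, filter class names and element ids are invented for illustration; only the timeupdate/currentTime mechanism and the reset behaviour reflect the description above:

// Sketch: fire CSS-based effects roughly in sync with points in an HTML5 video.
var demo = document.getElementById('filters-demo'),
    video = demo.getElementsByTagName('video')[0],
    timeline = [
      { time: 2.0, className: 'filter-iced-tea' },
      { time: 4.5, className: 'filter-throwback' }
    ],
    nextEvent = 0;

video.addEventListener('timeupdate', function() {
  while (nextEvent < timeline.length &&
         video.currentTime >= timeline[nextEvent].time) {
    // changing the parent className triggers the CSS opacity transitions
    demo.className = 'filters-demo ' + timeline[nextEvent].className;
    nextEvent++;
  }
});

// called when the user leaves the slide: rewind the video and the timeline
function resetDemo() {
  video.pause();
  video.currentTime = 0;
  demo.className = 'filters-demo';
  nextEvent = 0;
}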

HTML5 Video Fallbacks

Some clients treat inline HTML5 video specially, or may lack support for the video formats you provide. Both MP4 (H.264) and WebM are used in this case, but there’s still no guarantee of support. Tablet and mobile devices are unlikely to allow auto-play of video, may show a play arrow-style overlay, or may only play video in full-screen mode. It’s good to keep these factors in mind when developing a multimedia-rich page; many users are on smaller screens – tablets, phones and the like – which need to be given consideration in terms of their features and support.

Some clients also support a poster attribute on the video element, which takes a URL to a static poster frame image. This can sometimes be a good fallback, where a device may have video support but fails to decode or play the provided video assets. Some browsers don’t support the poster attribute, so in those instances you may want to listen for error events thrown from the video element. If it looks like the video can’t be played, you can use this event as a signal to hide the video element and show an image of the poster frame URL in its place.
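
A minimal sketch of that fallback, assuming the video element carries a poster attribute and an illustrative id:

// Sketch: if the video fails to decode or play, swap in a static image
// built from the poster frame URL.
var video = document.getElementById('app-demo-video');

// capture-phase listener also catches error events fired on child <source> elements
video.addEventListener('error', swapToPoster, true);

function swapToPoster() {
  var img = document.createElement('img');
  img.src = video.getAttribute('poster');
  img.alt = '';
  video.parentNode.replaceChild(img, video);
}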

Considerations for Tablets and Smaller Screens

The tl;dr of this section: Start with a simple CSS-only layout, and (carefully) progressively enhance your effects via JS depending on the type of device.

2014 Flickr Signed-Out Homepage
ALL THE SCREENS

Smaller devices don’t have the bandwidth, CPU or GPU of their laptop and desktop counterparts. Additionally, they typically do not fire resize and scroll events with the same rapid interval because they are optimized for touch and inertia-based scrolling. Therefore, it is best to avoid “scroll hijacking” entirely; instead, allow users to swipe or otherwise scroll through the page as they normally would.

Given the points about video support and auto-play not being allowed, the benefits offered by controlled scrolling are largely moot on smaller devices. Users who tap on videos will find that they do play where supported, in line with their experience on other web sites. The iPad with iOS 7 and some Samsung tablets, for example, are capable of playing inline video, but the iPhone will go to a full-screen view and then return to the web page when “done” is tapped.

Without controlled scrolling and regular scroll events being fired, the parallax, blur and zoom effects are also not appropriate to use on smaller screens. Even if scroll events were fired or a timer were used to force regular updates at a similar interval, the effects would be too heavy for most devices to draw at any reasonable frame rate. The images for these effects are also fairly large, contributing to page weight.

Rendering Performance

Much of what helped for this page was covered in the 2013 article, but is worth a re-tread.

  • Do as little DOM “I/O” as possible.

  • Cache DOM attributes that are expensive (cause layout) to read. Possible candidates include offsetWidth, offsetHeight, scrollTop, innerWidth, innerHeight etc.

  • Throttle your function calls, particularly layout-causing work, for listeners attached to window scroll and resize events as appropriate.

  • Use translate3d() for moving elements (i.e., fast parallax), and for promoting selected elements to layers for GPU-accelerated rendering.

It’s helpful to look at measured performance in Chrome’s DevTools “Timeline” / frames view, and the performance pane of IE 11’s “F12 Developer Tools” during development to see if there are any hotspots in your CSS or JS in particular. It can also be helpful to have a quick way to disable JS, to see if there are any expensive bits present just when scrolling natively and without regular events firing. JS aside, browsers still have to do layout, decode, resize and compositing of images for display, for example.


Chrome DevTools: Initial page load, and scroll-through. There are a few expensive image decode and resize operations, but overall the performance is quite smooth.

Flickr.com SOHP, IE 11 "F12 Developer Tools" Profiling

IE 11 + Windows 8.1, F12 Developer Tools: “UI Responsiveness” panel. Again, largely smooth with a few expensive frames here and there. The teal-coloured frames toward the middle are related to image decoding.

For the record, I found that Safari 7.0.3 on OS X (10.9.2) renders this page incredibly smoothly when scrolling, as seen in the demo videos. I suspect some of the overhead may stem from JS animating scrollTop. If I were to do this again, I might look at using a transition and applying something sneaky like translate3d() to move the whole page, effectively bypassing scrolling entirely. However, that approach would require eliminating the scrollbar altogether, which is a usability trade-off.

What’s Next?

While a good number of Flickr users are on desktop or laptop browsers, tablets and mobile devices are here to stay. With a growing number of users on various forms of portable web browsers, designers and developers will have to work closely together to build pages that are increasingly fluid, responsive and performant across a variety of screens, platforms and device capabilities.

Flickr flamily floto

Did I mention we’re hiring? We have openings in our San Francisco office. Find out more at flickr.com/jobs.

New SSL Endpoints for the Flickr API

Sometime in the last few months, we went and updated our API SSL endpoints. Shame on us for not making a bigger deal about it!

In the past, to access the Flickr API via SSL you needed to use the “secure.flickr.com” subdomain… Not anymore!

Now calling the API via SSL is as easy as updating your code to call:

https://api.flickr.com/

In fact, it’s so easy that we want everyone to use it.

You’ll notice in the API documentation that all of the endpoints have been updated to https. While OAuth adds security by removing the need to authenticate by username and password, sending all traffic over SSL will further protect our users’ data while in transit.

The SSL endpoints for the Flickr API are:

https://api.flickr.com/services/rest/
https://api.flickr.com/services/soap/
https://api.flickr.com/services/xmlrpc/

And for uploads:

https://up.flickr.com/services/upload/
https://up.flickr.com/services/replace/

Later this year we will be migrating the Flickr API to SSL access only. We’ll let you know the exact date in advance, and will run a blackout test before the big day. For applications that use well-established HTTP client libraries, this should only require updating the protocol and (maybe) some configuration. We’ll also be working with API kit developers, so updating many apps will be a git pull away.

Of course we realize that this change might be more difficult for some. We’ll be following the Developer Support Group closely, so please let us hear your questions, comments, and concerns.

This is the first step of many improvements that we’ll be making to the API and our developer tools this year, and we’ll post additional details and timelines as we go. Want to help? We’re hiring!

A Summer at Flickr

This summer I had the unforgettable opportunity to work side-by-side with some of the smartest, photography-loving software engineers I’ve ever met. Looking back on my first day at Flickr HQ – beginning with a harmonious welcome during Flickr Engineering’s weekly meeting – I can confidently say that over the past ten weeks I have become a much better software engineer.

One of my projects for the summer was to build a new and improved locking manager that controls the distribution of locks for offline tasks (or OLTs for short). Flickr uses OLTs all the time for data migration, uploading photos, updates to accounts, and more. An OLT needs to acquire a lock on a shared resource, such as a group or an account, to prevent other OLTs from accessing the same resource at the same time. The OLT will then release the lock when it’s done modifying the shared data. Myles wrote an excellent blog post on how Flickr uses offline tasks here.

When building a distributed lock system, we need to take into account a couple of important details. First, we need to make sure that all the lock servers are consistent. One way to maintain consistency is to elect one server to act as a master and the remaining servers as slaves, where the master server is responsible for data replication among the slave servers. Second, we need to account for network and hardware failures – for instance, if the master server goes down for some reason, we need to quickly elect a new master server from one of the slave servers. The good news is that Apache ZooKeeper provides an open-source implementation of master-slave data replication, automatic leader election, and atomic distributed data reads and writes.


Offline tasks send lock acquire and release requests through ZooLocker. ZooLocker in turn interfaces with the ZooKeeper cluster to create and delete znodes that correspond to the individual locks.

In the new locking system (dubbed “ZooLocker”), each lock is stored as a unique data node (or znode) on the ZooKeeper servers. When a client acquires a lock, ZooLocker creates a znode that corresponds to the lock. If the znode already exists, ZooLocker will tell the client that the lock is currently in use. When a client releases the lock, ZooLocker deletes the corresponding znode from memory. ZooLocker stores helpful debugging information, such as the owner of the lock, the host it was created on, and the maximum amount of time to hold on to the lock, in a JSON-serialized format in the znode. ZooLocker also periodically scans through each znode in the ZooKeeper ensemble to release locks that are past their expiration time.

My locking manager is already serving locks in production. In spite of sudden spikes in lock acquire and release requests by clients, the system holds up pretty well.


A graph of the number of lock acquire requests in ZooLocker per second

My summer internship at Flickr has been an incredibly valuable experience for me. I have demystified the process of writing, testing, and integrating code into a running system that millions of people around the world use each and every day. I have also learned about the amazing work going on in the engineering team, the ups and downs of the code deploy process, and how to dodge the incoming flying finger rockets that the Flickr team members fling at each other.  My internship at Flickr is an experience I will never forget, and I am very grateful to the entire Flickr team for giving me the opportunity to work and learn from them this summer.

Pre-generating Justified Views

On May 20th, we introduced our Justified layout to the Photostream page. Ever since launch, we’ve been working hard to improve the performance of this page, and this week we’ve deployed a change that dramatically reduces the time it takes to display photos. The secret? Cookies.

At a high level, our Justified algorithm works like this:

  1. Take, as input, the browser viewport width and a list of photos
  2. Begin to lay those photos out in a row sequentially, using the maximum allowed height and scaling the width proportionately
  3. If a row becomes longer than the viewport width, reduce the height of that row and all the photos in it until the width is correct

Because we need the viewport width, we have to run this algorithm entirely in the browser. And because we won’t know which particular photo size to request until we’ve run the algorithm, we can’t start downloading the photos until very late in the process. This is why, up until Friday, when you loaded a photostream page, you saw the spinning blue and pink balls before the photos loaded.
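
A simplified sketch of those row-building steps is shown below, assuming each photo exposes its pixel width and height; the real layout module also handles margins, cropping and minimum row heights, which are omitted here:

// Simplified sketch of the row-building algorithm described above.
function justify(photos, viewportWidth, maxRowHeight) {
  var rows = [], row = [], rowWidth = 0;
  photos.forEach(function(photo) {
    // scale each photo to the maximum allowed height, keeping its aspect ratio
    var scaledWidth = photo.width * (maxRowHeight / photo.height);
    row.push({ photo: photo, width: scaledWidth, height: maxRowHeight });
    rowWidth += scaledWidth;
    if (rowWidth >= viewportWidth) {
      // row is too long: shrink every photo in it until the row fits
      var ratio = viewportWidth / rowWidth;
      row.forEach(function(item) {
        item.width *= ratio;
        item.height *= ratio;
      });
      rows.push(row);
      row = [];
      rowWidth = 0;
    }
  });
  if (row.length) {
    rows.push(row); // the last row may be left unfilled at the maximum height
  }
  return rows;
}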

Last week we were able to make one key change: we now pre-generate the layout on the server. This means that we know exactly which image sizes we need at the very top of the page, and can start downloading them immediately. It also means the spinning balls aren’t needed anymore. The end result is that the first photo on the page now loads seven times faster than on May 20th.

“Time to First Photo” on the Photostream page

One question remains: we need the client viewport width in order to generate the layout, so how are we able to pre-generate it on the server? The first time you come to any Flickr page, we store the width of your browser window in a cookie. We can then read that cookie on the server on subsequent page loads. This means we aren’t able to pre-generate the photostream layout the very first time you come to the site. It also means that the layout will occasionally be incorrect, if you have resized the browser window since the last time you visited Flickr; we deal with this by always correcting the layout on the client, if a mismatch is detected.
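
The client side of that bookkeeping can be as small as the sketch below; the cookie name and lifetime are illustrative, not Flickr’s actual values:

// Sketch: record the viewport width in a cookie so the server can
// pre-generate the justified layout on the next page load.
function storeViewportWidth() {
  document.cookie = 'vp_width=' + window.innerWidth +
    '; path=/; max-age=' + (60 * 60 * 24 * 365);
}

window.addEventListener('resize', storeViewportWidth);
storeViewportWidth();

// On render, the client compares the server-assumed width with the real
// viewport and re-runs the layout only if they differ.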

This is one of many performance improvements we’re working on after our 5/20 release (we’ve also deployed some improvements to the homepage activity feed). Expect to see the performance continue to improve on the redesigned pages in the coming weeks and months.

Adventures in Jank Busting: Parallax, performance, and the new Flickr Home Page

tl;dr version: translate3d() is your friend, but use sparingly. Chrome Dev Tools are awesome.

What’s old is new again: Stealing from the arcade

Back in 1985, games like Super Mario Bros. popularized the effect of horizontal parallax scrolling – a technique wherein the background moves at a slower speed relative to the foreground, giving the impression of depth. In 2013, the web is seeing a trend of vertical parallax effects as developers look to make pages feel more interactive and responsive. Your author’s $0.25, for the record, is that we’ll continue to see arcade and demoscene-era effects being ported over to the mainstream web in creative ways as client-side performance improves.

While the effect can be aesthetically pleasing when executed correctly, parallax motion tends to be expensive to render – and when performance is lacking, the user’s impression of your site may follow suit.

Vertical parallaxing is pretty straightforward in behaviour: For every Y pixels of vertical axis scrolling, move an absolutely-positioned image in the same direction at scale. There are some additional considerations for the offset of the element on the page and how far it can move, but the implementation remains quite simple.

The following are some findings made while working on Flickr’s redesigned signed-out homepage at flickr.com/new/, specifically related to rendering and scrolling performance.

Events and DOM performance

To optimize performance in the browser environment, it’s important to consider the expensive parts of DOM “I/O”. You ideally want a minimal amount of both reads and writes, particularly since this is work being done during scrolling. Executing JavaScript on scroll is one of the worst ways to interrupt the browser, because it typically introduces reflow/layout and painting – thus denying the browser the chance to use GPU/hardware-accelerated scrolling. window.onscroll() can also fire very rapidly on desktops, making way for a veritable flood of expensive scroll → reflow → paint operations if your “paths” are not fast.

A typical parallax implementation will hook into window.onscroll(), and will update the backgroundPosition or marginTop of an element with a background image attached in order to make it move. An <img> could be used here, but backgrounds are convenient because they allow for positioning in relative units, tiling and so on.

A minimal parallax example, just the script portion:

window.onscroll = function(e) {
  var parallax = document.getElementById('parallax-background');
  parallax.style.marginTop = (window.scrollY / 2) + 'px';
};

This could work for a single element, but quickly breaks down if multiple elements are to be updated. In any case, references to the DOM should be cached for faster look-ups; reading window.scrollY and other DOM attributes can be expensive due to their potential to cause layout/reflow themselves, and thus should also be stored in a local variable for each onscroll() event to minimize thrashing.

An additional performance consideration: Should all parallax elements always be moved, even those which are outside of the viewport? Quick tests suggested savings were negligible in this case at best. Even if the additional work to determine in-view elements were free, moving only one element did not notably improve performance relative to moving only three at a time. It appears that Webkit’s rendering engine is smart about this, as the only expensive operations seen here involve painting things within the viewport.

In any event, using marginTop or backgroundPosition alone will not perform well as neither take advantage of hardware-accelerated compositing.

And now, VOIDH (Video Or It Didn’t Happen) of marginTop-based parallax performing terribly:

Look at that: Terrible. Jank city! You can try it for yourself in your browser of choice, via ?notransform=1.

Enter the GPU: Hardware Acceleration To The Rescue

Despite caching our DOM references and scroll properties, the cost of drawing all these pixels in software is very high. The trick to performance here is to have the GPU (hardware) take on the job of accelerating the compositing of the expensive bits, which can be done much faster than in software rendering.

Elements can be promoted to a “layer” in rendering terms, via CSS transforms like translate3d(). When this and other translateZ()-style properties are applied, the element will then be composited by the GPU, avoiding expensive repaints. In this case, we are interested in having fast-moving parallax backgrounds. Thus, translate3d() can be applied directly to the parallax element.

window.onscroll = function(e) {
  // note: most browsers presently use prefixes: webkitTransform, mozTransform etc.
  var parallax = document.getElementById('parallax-background');
  parallax.style.transform = 'translate3d(0px, ' + (window.scrollY / 2) + 'px, 0px)';
};

In webkit-based browsers like Safari and Chrome, the results speak for themselves.

Look, ma! Way less jank! The performance here is comparable to the same page with parallax disabled (i.e., regular browser scrolling/drawing.)

GPU acceleration sounds like a magic bullet, but it comes at the cost of both video memory and cache. If you create too many layers, you may see rendering problems and even degraded performance – so use them selectively.

It’s also worth noting that browser-based hardware acceleration and performance rests on having agreement between the browser, drivers, OS, and hardware. Firefox might be sluggish on one machine, and butter-smooth on another. High-density vs. standard-resolution screens make a big difference in paint cost. All GPUs are made equal, but some GPUs are made more equal than others. The videos and screenshots for this post were made on my work laptop, which may perform quite differently than your hardware. Ultimately, you need to test your work on a variety of devices to see what real-world performance is like – and this is where Chrome’s dev tools come in handy.

Debugging Render Performance

In brief, Chrome’s Developer Tools are awesome. Chrome Canary typically has the freshest features in regards to profiling, and Safari also has many of the same. The features most of interest to this entry are the Timeline → Frames view, and the gear icon’s “Show Paint rectangles” and “Show composited layer borders” options.



Timeline → Frames view: Helpful in identifying expensive painting operations.



Paint rectangles + composited layer borders, AKA “Plaid mode.” Visually identify layers.

Timeline → Frames view

Timeline’s “Frames” view allows you to see how much time is spent calculating and drawing each frame during runtime, which gives a good idea of rendering performance. To maintain a refresh rate of 60 frames per second, each frame must take no longer than 16 milliseconds to draw. If you have JS doing expensive things during scroll events, this can be particularly challenging.

Expensive frames in Flickr’s case stem primarily from occasional decoding of JPEGs and non-cached image resizes, and more frequently, compositing and painting. The less of each that your page requires, the better.

Paint rectangles

It is interesting to see what content is being painted (and re-painted) by the browser, particularly during scroll events. Enabling paint rectangles in Chrome’s dev tools results in red highlights on paint areas. In the ideal case when scrolling, you should see red only around the scrollbar area; in the best scenario, the browser is able to efficiently move the rest of the HTML content in hardware-accelerated fashion, effectively sliding it vertically on the screen without having to perform expensive paint operations. Script-based DOM updates, CSS :hover and transition effects can all cause painting to happen when scrolling, so keep these things in mind as well.

Composited layer borders

As mentioned previously, layers are elements that have been promoted and are to be composited by the GPU, via CSS properties like translate3d(), translateZ() and so on. With layer borders enabled, you can get a visual representation of promoted elements and review whether your CSS is too broad or too specific in creating these layers. The browser itself may also create additional layers based on a number of scenarios such as the presence of child elements, siblings, or elements that overlap an existing layer.

Composited borders are shown in brown. Cyan borders indicate a tile, which combines with other tiles to form a larger composited layer.

Other notes

Image rendering costs

When using parallax effects, “full-bleed” images that cover the entire page width are also popular. One approach is to simply use a large centered background image for all clients, regardless of screen size; alternately, a responsive @media query-style approach can be taken where separate images are cut to fit within common screen widths like 1024, 1280, 1600, 2048 etc. In some cases, however, the single-size approach can work quite nicely.

In the case of the Flickr homepage, the performance cost in using 2048-pixel-wide background images for all screens seemed to be negligible – even in spite of “wasted” pixels for those browsing at 1024×768. The approach we took uses clip-friendly content, typically a centered “hero” element with shading and color that extends to the far edges. Using this approach, the images are quite width-agnostic. The hero-style images also compress quite nicely as JPEGs thanks to their soft gradients and lighting; as one example, we got a 2048×950-pixel image of a flower down to 68 KB with little effort.

Bandwidth aside, the 2048-pixel-wide images clip nicely on screens down to 1024 pixels in width and with no obvious flaws. However, Chrome’s dev tools also show that there are costs associated with decoding, compositing, re-sizing and painting images which should be considered.

Testing on my work laptop*, “Image resized (non-cached)” is occasionally shown in Chrome’s timeline → frames view after an Image Decode (JPEG) operation – both of which appear to be expensive, contributing to a frame that took 60 msec in one case. It appears that this happens the first time a large parallax image is scrolled into the viewport. It is unclear why there is a resize (and whether it can be avoided), but I suspect it’s due to the retina display on this particular laptop. I’m not using background-size or otherwise applying scaling/sizing in CSS, merely positioning, e.g., background-position: 50% 50%; and background-repeat: no-repeat;. As curiosity sets in, this author will readily admit he has some more research to do on this front. ;)

There are also aspects to RAM and caching that can affect GPU performance. I did not dig deeply into this specifically for Flickr’s new homepage, but it is worth considering the impact of the complexity and number of layers present and active at any given time. If regular scrolling triggers non-cached image resizes each time an asset is scrolled into view, there may be a cache eviction problem stemming from having too many layers active or present at once.

* Work laptop: Mid-2012 15″ Retina MBP, 16 GB RAM, 2.6 Ghz Intel i7, NVIDIA GeForce GT 650M/1024 MB, OS X 10.8.3.

Debugging in action

Here are two videos showing performance of flickr.com/new/ with transforms disabled and enabled, respectively.

Transforms off (marginTop-based parallax)

Notice the huge spikes on the timeline with transforms disabled, indicating many frames that are taking up to 80 msec to draw; this is terrible for performance, “blowing the frame budget” of 16 ms, and lowers UI responsiveness significantly. Red paint rectangles indicate that the whole viewport is being repainted on scroll, a major contributor to performance overhead. With compositing borders, you see that every “strip” of the page – each parallax background, in effect – is rendered as a single layer. A quick check of the FPS meter and “continuous repaint” graphs does not look great.

Side note: Continuous repaint is most useful when not scrolling. The feature causes repeated painting of the viewport, and displays an FPS graph with real-time performance numbers. You can go into the style editor while continuous repaint is on and flip things off, e.g., disabling box-shadow, border-radius or hiding expensive elements via display to see if the frame rate improves.

Transforms on (translate3d()-based parallax)

With GPU acceleration, you see much-improved frame times, thus a higher framerate and a smoother, more-responsive UI. There is still the occasional spike when a new image scrolls into view, but there is much less jank overall than before. Paint rectangles are much less common thanks to the GPU-accelerated compositing, and layer borders now indicate that more individual elements are being promoted. The FPS meter and continuous repaint mode does have a few dips and spikes, but performance is notably improved.

You may notice that I intentionally trigger video playback in this case, to see how it performs. The flashing red is the result of repainting as the video plays back – and in spite of overflow: hidden-based clipping we apply for the parallax effect on the video, it’s interesting to notice that the overflowed content, while not visible, is also being painted.

Miscellany

Random bits: HTML5 <video>



A frame from a .webm and H.264-encoded video, shown in the mobile portion of Flickr’s redesigned home page.

We wanted the signed-out Flickr homepage to highlight our mobile offerings, including an animation or video showing a subtly-rotating iPhone demoing the Flickr iOS app on a static background. Instead of an inline video box, it felt appropriate to have a full-width video following the pattern used for the parallax images. The implementation is nearly the same, and simply uses a <video> element instead of a CSS background.

With video, the usual questions came up around performance and file size: Would a 2048-pixel-wide video be too heavy for older computers to play back? What if we cropped the video only to be as wide as needed to cover the area being animated (e.g., 500 pixels), and used a static JPEG background to fill in the remainder of the space?

As it turned out, encoding a 2048×700 video and positioning it like a background element – even including a slight parallax effect – was quite reasonable. Playback was flawless on modern laptops and desktops, and even a 2006-era 1.2 GHz Fujitsu laptop running WinXP was able to run the video at reasonable speed. Per rendering documentation from the Chrome team, <video> elements are automatically promoted to layers for the GPU where applicable and thus benefit from accelerated rendering. Due to the inline nature of the video, we excluded it from display on mobile devices, and show a static image to clients that don’t support HTML5 video.

Perhaps the most interesting aspect of the video was file size. In our case, the WebM-encoded video (supported natively in Chrome, Firefox, and Opera) was clearly able to optimize for the low amount of motion within the wide frame, as eight seconds of 2048×700 video at 24 fps resulted in a .webm file of only 900 KB. Impressive! By comparison, the H.264-encoded equivalent ended up being about 3.8 MB, with a matching data rate of ~3.8 Mbps.

The “Justified” View

It’s worth mentioning that the Justified photos at the bottom of the page lazy-load in, and have been excluded from any additional display optimizations in this case. There is an initial spike with the render and subsequent loading of images, but things settle down pretty quickly. Blindly assigning translate-type transforms to the Justified photo container – a complex beast in and of itself – causes all sorts of rendering hell to break loose.

In Review

This article represents my findings and approach to getting GPU-accelerated compositing working for background images in the Webkit-based Chrome and Safari browsers, in May 2013. With ever-changing rendering engines getting smarter over time, your mileage may vary as the best route to the “fast path” changes. As numerous other articles have said regarding performance, “don’t guess it, test it.”

To recap:

  • Painting: Expensive. Repaints should be minimal, and limited to small areas. Reduce by carefully choosing layers.
  • Compositing: Good! Fast when done by the GPU.
  • Layers: The secret to speed, when done correctly. Apply sparingly.


… And finally, did I mention we’re hiring? (hint: view-source :))

Using Redis as a Secondary Index for MySQL

Hey, did you notice, on the brand-spanking-new Yahoo homepage, right there on the side of the page, it’s photos from your Flickr contacts (or maybe your groups)! No? Go check it out, I’ll wait.

Ok, great, you’re back! What you should have seen, assuming you have Flickr contacts (or are a member of some groups), is photos from your most recently active contact (or group!). Something like… this:

Flickr on Yahoo.com

(thanks schill!)

What you see above is the 10 most recent photos from my contact who most recently uploaded any photos. The homepage retrieves this data by making a call to a specially tailored method from the Flickr API.

The Latency Problem

In order to ensure performance for Yahoo.com, this API method had very tight SLAs; we can’t slow down the page for the millions of homepage visitors to pull in Flickr content, after all. While the method can return different data sets depending on the most recent activity from your Flickr contacts and groups, for the sake of this post we’re going to focus exclusively on the contacts case. The first step to returning data is to get your most recently active contact, and to do that we need a list of all of your contacts sorted by how recently they’ve uploaded a photo. In an ideal world, the SQL query would be something along the lines of:

SELECT contact_id FROM Contacts WHERE user_id = ? ORDER BY date_upload DESC LIMIT 1;

Easy, right? Sadly, no. Due to the way our contact relationships are stored, the query we need to run is more complex than the above. Still, for the vast majority of Flickr users, the *actual* SQL query performed just fine, usually in less than 1ms. However, Flickr users come in all shapes and sizes. Some users have a few contacts, some have hundreds, and some… have tens of thousands. Unfortunately, the runtime for the query scaled proportionally to the number of contacts the user has. I won’t delve into the specifics of the query or how we store contact relationships, but suffice it to say we investigated possible changes to both the query and to the indexes in MySQL, but neither was sufficient to give us the performance we needed across all possible use cases. With that in mind, we had to look elsewhere for optimizations.

The First Attempt

With a pure MySQL solution off the table, our first thoughts for optimization avenues turned to the obvious: denormalization. If we stored the 10 most recent photos from your most recently active contact ahead of time, then getting that list when the homepage was rendered would be trivial regardless of how many contacts you have. At Flickr we’re big fans of Redis, so our thoughts immediately turned to using it as a store for the denormalized contact data.

In order to denormalize the data, we need to constantly process six different user actions:

  • Add a contact
  • Remove a contact
  • Change a contact relationship
  • User uploads a photo
  • User deletes a photo
  • User changes a photo’s privacy

Each one of these actions can impact what photos you should see on the Yahoo homepage, and therefore must update the denormalized data appropriately. Because the photo-related actions can potentially impact thousands of users, if not tens or hundreds of thousands (everyone who calls the photo owner a contact would need to have their denormalized data updated on upload), they must be processed by our offline task system to ensure site performance is not adversely impacted. Unfortunately, this is where we ran into a snag. The nature of Flickr’s offline task system is such that the order of processing is not guaranteed. In most cases this is not a problem; however, when it comes to maintaining an accurate list of denormalized data, not being able to predict the outcome of running multiple tasks is problematic.

Out of Order

Imagine the following scenario: you add a contact then quickly remove them. This results in two tasks being added, one to add and one to remove. If they’re processed in the order in which the actions were taken, everything is fine. However, if they’re processed out of order then when everything is said and done the denormalized data no longer accurately reflects your actual contact relationships. Maybe the next action that updates your data will correct this problem, or maybe it will introduce another problem. Over time, it’s likely that a not-insignificant number of users will have denormalized data that is out of sync with reality.


Out of order

Out of order by Foomandoonian

Ok, let’s try this another way

Looking back at our original problem, the bottleneck was not in generating the list of photos for your most recently active contact, it was just in finding who your most recently active contact was (specifically if you have thousands or tens of thousands of contacts). What if, instead of fully denormalizing, we just maintain a list of your recently active contacts? That would allow us to optimize the slow query, much like a native MySQL index would; instead of needing to look through a list of 20,000 contacts to see which one has uploaded a photo recently, we only need to look at your most recent 5 or 10 (regardless of your total contacts count)!

In the end, this is exactly what we ended up doing. We still process offline tasks to keep track of your contacts’ activity, but now the set of actions we need to track is smaller:

  • Add a new contact
  • Remove a contact
  • User uploads a photo

Each task maintains a per-user sorted set where the set member is the contact id, and the score is the timestamp of when that contact last uploaded a photo. So, for example, if a user (user id 12345) adds a new contact (user id 98765) we simply do:

ZADD user_12345 1363665432 98765

Removing a contact is the opposite, using ZREM as the yin to ZADD’s yang:

ZREM user_12345 98765

When a user uploads a new photo it’s once again a ZADD (though it’s a ZADD against the set for every user that calls the uploading user a contact). If the uploader is already in a given user’s set, this will simply update the score, if not then the uploader will be added to the set.

As things stand right now, the set will grow without bound, which is obviously not much help if our goal is to limit the number of contacts we need to check against when querying the DB. The solution to this is to cap the set. After each ZADD we check to see if the size of the set has exceeded a threshold (generally we store an additional 20% on top of the data we absolutely need), and if so, we remove all of the extra records using ZREMRANGEBYRANK:

ZREMRANGEBYRANK user_12345 0 ($collection_size - $max_size) - 1

Where $collection_size is the current number of members in the set and $max_size is the maximum number of members we want to store. Note that we’re removing from the head of the set. Redis stores data in sorted sets in ascending order, so the least recently active contacts are at the beginning of the set.

Akin to how we must cap the set to keep it below a maximum size, we also have a threshold on the other end of the spectrum to keep the set above a minimum size. If a user happens to remove their 10 most recently active contacts then there would be no data in their set, and the Redis index would be of little value. With that in mind, any time the user removes a contact, we check to see if the size of the set has dropped under the minimum threshold, and if so we repopulate the data based on their remaining contacts. This is slightly more complex than a simple Redis command, so we’ll use actual PHP to explain:

//
// $key is the redis ZSET key, e.g. user_12345
//
$count = redis_zset_zcard($key);
if ($count < $MIN_SET_SIZE) {
    if ($count > 0) {
        $current_contacts = redis_zset_zrange($key, 0, -1);
    } else {
        $current_contacts = array();
    }

    //
    // In this call, the second parameter is the number of contacts
    // to return and the third parameter is a list of users to
    // exclude from the response. Since they’re already in the
    // set, there’s no need to add them again
    //
    $contacts = contacts_most_recently_uploaded_list($user, $MAX_SET_SIZE - $count,
        $current_contacts);

    foreach ($contacts as $contact) {
        redis_zset_zadd($key, $contact['last_upload'], $contact['id']);
    }
}

Remember, this is all being done outside of the context of a page request, so there's no harm in spending a little bit of extra time to ensure that, when the API is called, we have a reliable index we can use to optimize the DB query.

The final action we need to take is to query the set for the list of contacts so we can actually perform the DB query optimization described above. This is done with a standard ZREVRANGE:

ZREVRANGE user_12345 0 10

Similar to how we cap the set by removing members from the beginning because those are the least recently active, when we want the most recently active we use ZREVRANGE to get members at the end of the set.
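To tie the pieces together, the read path might look something like the following sketch; redis_zset_zrevrange() and photos_recent_from_contacts() are hypothetical stand-ins for the Redis wrapper and the existing DB query:

//
// Hypothetical sketch of the read path
//
$recent_contact_ids = redis_zset_zrevrange("user_" . $user_id, 0, 10);

//
// The live DB query only has to consider these few contacts (and still
// re-checks the relationship and photo visibility), rather than scanning
// all 20,000 contacts
//
$photos = photos_recent_from_contacts($user_id, $recent_contact_ids);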

You're probably wondering: what about the other three events that generated tasks for the fully denormalized solution? How are we able to get by without them? Well, because we'll just be using the list of contacts to optimize a live DB query, we can take some liberties with data purity in Redis. Because we store multiple recently active contacts, it doesn't matter if, for example, the most recent contact in your Redis set has deleted their most recent photos or made them all private. When we query the DB, we further restrict the list based on your current relationship with the contact and on photo visibility, so any issues with Redis being out of sync sort themselves out automatically.

Take the above scenario where a user adds and quickly removes a contact. If the remove is processed first and the contact remains in the user’s Redis set, when we go to query the live contacts DB, we’ll get no results for that relationship and move on to the next contact in the set—crisis averted. There is a slight problem with the reverse scenario wherein a user removes a contact and then adds them back. If the tasks are processed out of order, there’s a chance that the add may not end up being reflected in the Redis set. This is the only chance for corruption under this system (as opposed to the fully denormalized solution where many tasks could interact and corrupt one another), and it’s likely to be fixed the next time any of the other actions occur. Furthermore, the only downside of this corruption is a user seeing photos from their second most recently active contact; this is a condition we’re willing to live with given the overall gains provided by the solution as a whole.

Not All Wine and Roses

The dataset involved in this solution is one of the largest we’ve pushed at Redis so far, and it’s not without its pitfalls. Namely, as the size of the dataset increases, the amount of time spent doing RDB saves also increases. This can introduce latency into Redis commands while the save is in progress. From Redis Persistence:

RDB needs to fork() often in order to persist on disk using a child process. Fork() can be time consuming if the dataset is big, and may result in Redis to stop serving clients for some millisecond or even for one second if the dataset is very big and the CPU performance not great. AOF also needs to fork() but you can tune how often you want to rewrite your logs without any trade-off on durability.

This is a key factor to be aware of when optimizing Redis performance. The overall speed of the system can be adversely impacted by RDB saves, so taking steps to minimize the time spent saving is critical. We've solved this problem by isolating these writes to their own Redis instance, thereby limiting the size of the dataset to only keys related to contacts activity. In the long term, as activity increases, it's likely that we'll need to further reduce the number of writes per instance by sharding this contact activity across a number of Redis instances. In some cases, multiple small Redis instances running on a single host can be preferable to one large Redis instance. As mentioned in the Redis Persistence guide, RDB saves can suffer with a slow CPU; if upgrading hardware (mostly faster CPUs to improve fork() performance) is possible, that's certainly another option to investigate.
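If we do end up sharding, one simple approach is to hash the user id to pick an instance; here is a minimal sketch, assuming a static list of instances:

//
// Hypothetical sketch: choose a Redis instance for a user's contact-activity set
//
$instances = array('redis-activity-01:6379', 'redis-activity-02:6379');   // illustrative instance list
$index     = abs(crc32("user_" . $user_id)) % count($instances);

// All reads and writes for this user's set go to $instances[$index]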

Flickr flamily floto

Like this post? Have a love of online photography? Want to work with us? Flickr is hiring engineers, designers and product managers in our San Francisco office. Find out more at flickr.com/jobs.

Redis Global Locks Redux

In my last post I described how we use Redis to manage a global lock that allows us to automatically failover to a backup process if there was a problem in the primary process. The method described allegedly allowed for any number of backup processes to work in conjunction to pick up on primary failures and take over processing.

Locks #1 by Christoph Kummer

Thanks to an astute reader who pointed out that the code in that post wouldn't actually work as advertised.

The Problem

Nolan correctly noticed that when a backup process attempts to acquire the lock via SETNX, the lock key will already exist from when it was acquired by the primary, so every subsequent attempt will just keep trying to acquire a lock that can never be acquired. As a reminder, here's what we do when we check back on the status of a lock:

function checkLock(payload, lockIdentifier) {
    client.get(lockIdentifier, function(error, data) {
        // Error handling elided for brevity
        if (data !== DONE_VALUE) {
            acquireLock(payload, data + 1, lockCallback);
        } else {
            client.del(lockIdentifier);
        }
    });
}

And here’s the relevant bit from acquireLock that calls SETNX:

    client.setnx(lockIdentifier, attempt, function(error, data) {
        if (error) {
            logger.error("Error trying to acquire redis lock for: %s", lockIdentifier);
            return callback(error, dataForCallback(false));
        }

        return callback(null, dataForCallback(data === 1));
    });

So, you're thinking, how could this vaunted failover process ever actually work? The answer is simple: the code from that post isn't what we actually run. The actual production code has a single backup process, so it doesn't try to re-acquire the lock in the event of failure; it just skips right to trying to send the message itself. In the previous post, I described a more general solution that would work for any number of backup processes, but I missed this one important detail.

That being said, with some relatively minor changes, it's absolutely possible to support an arbitrary number of backup processes and still maintain the use of the global lock. The trivial solution is to simply have the backup process delete the key before trying to re-acquire the lock (or, technically, acquire it anew). However, the problem with that becomes apparent pretty quickly: if there are multiple backup processes all deleting the lock and trying to SETNX a new lock again, there's a good chance of a race condition wherein one of the backups deletes a lock that was acquired by another backup process, rather than the failed lock from the primary.

The Solution

Thankfully, Redis has a solution to help us out here: transactions. By using a combination of WATCH, MULTI, and EXEC, we can perform actions on the lock key and be confident that no one has modified it before our actions complete. The process to acquire a lock remains the same: many processes will issue a SETNX and only one will win. The changes come into play when the processes that didn't acquire the lock check back on its status. Whereas before we simply checked the current value of the lock key, now we must go through the transaction flow described above. First we watch the key; then we do what amounts to a check-and-set (albeit with a few different actions to perform based on the outcome of the check):

function checkLock(payload, lockIdentifier, lastCount) {
    client.watch(lockIdentifier);
    client.multi()
        .get(lockIdentifier)
        .exec(function(error, replies) {
            if (!replies) {
                // Lock value changed while we were checking it, someone else got the lock
                client.get(lockIdentifier, function(error, newCount) {
                    setTimeout(checkLock, LOCK_EXPIRY, payload, lockIdentifier, newCount);
                });

                return;
            }

            var currentCount = replies[0];
            if (currentCount === null) {
                // No lock means someone else completed the work while we were checking on its status and the key has already been deleted
                return;
            } else if (currentCount === DONE_VALUE) {
                // Another process completed the work, let’s delete the lock key
                client.del(lockIdentifier);
            } else if (currentCount == lastCount) {
                // Key still exists, and no one has incremented the lock count, let’s try to reacquire the lock
                reacquireLock(payload, lockIdentifier, currentCount, doWork);
            } else {
                // Key still exists, but the value does not match what we expected, someone else has reacquired the lock, check back later to see how they fared
                setTimeout(checkLock, LOCK_EXPIRY, payload, lockIdentifier, currentCount);
            }
        });
}

As you can see, there are five basic cases we need to deal with after we get the value of the lock key:

  1. If we got a null reply back from Redis, that means that something else changed the value of our key, and our exec was aborted; i.e. someone else got the lock and changed its value before we could do anything. We just treat it as a failure to acquire the lock and check back again later.
  2. If we get back a reply from Redis, but the value for the key is null, that means that the work was actually completed and the key was deleted before we could do anything. In this case there’s nothing for us to do at all, so we can stop right away.
  3. If we get back a value for the lock key that is equal to our sentinel value, then someone else completed the work, but it’s up to us to clean up the lock key, so we issue a Redis DEL and call our job done.
  4. Here’s where things get interesting: if the key still exists, and its value (the number of attempts that have been made) is equal to our last attempt count, then we should try and reacquire the lock.
  5. The last scenario is where the key exists but its value (again, the number of attempts that have been made) does not equal our last attempt count. In this case, someone else has already tried to reacquire the lock and failed. We treat this as a failure to acquire the lock and schedule a timeout to check back later to see how whoever did acquire the lock got on. The appropriate action here is debatable. Depending on how long your underlying work takes, it may be better to actually try and reacquire the lock here as well, since whoever acquired the lock may have already failed. This can, however, lead to premature exhaustion of your attempt allotment, so to be safe, we just wait.

So, we've checked on our lock, and, since the previous process with the lock failed to complete its work, it's time to actually try and reacquire the lock. The process here is similar to the above inasmuch as we must use Redis transactions to manage the reacquisition; thankfully, however, the steps are (somewhat) simpler:

function reacquireLock(payload, lockIdentifier, attemptCount, callback) {
    client.watch(lockIdentifier);
    client.get(lockIdentifier, function(error, data) {
        if (!data) {
            // Lock is gone, someone else completed the work and deleted the lock, nothing to do here, stop watching and carry on
            client.unwatch();
            return;
        }

        var attempts = parseInt(data, 10) + 1;

        if (attempts > MAX_ATTEMPTS) {
            // Our allotment has been exceeded by another process, unwatch and expire the key
            client.unwatch();
            client.expire(lockIdentifier, ((LOCK_EXPIRY / 1000) * 2));
            return;
        }

        client.multi()
            .set(lockIdentifier, attempts)
            .exec(function(error, replies) {
                if (!replies) {
                    // The value changed out from under us, we didn't get the lock!
                    client.get(lockIdentifier, function(error, currentAttemptCount) {
                        setTimeout(checkLock, LOCK_TIMEOUT, payload, lockIdentifier, currentAttemptCount);
                    });
                } else {
                    // Hooray, we acquired the lock!
                    callback(null, {
                        "acquired" : true,
                        "lockIdentifier" : lockIdentifier,
                        "payload" : payload
                    });
                }
            });
    });
}

As with checkLock, we start out by watching the lock key, and proceed to do a (comparatively) simplified check and set. In this case, we've “only” got three scenarios to deal with:

  1. If we've already exceeded our allotment of attempts, it's time to give up. In this case, the allotment was actually exceeded in another worker, so we can just stop right away. We make sure to unwatch the key, and set it to expire at some point far enough in the future that any remaining processes attempting to acquire locks will also see that it's time to give up.

Assuming we’re still good to keep working, we try and update the lock key within a MULTI/EXEC block, where we have our remaining two scenarios:

  2. If we get no replies back, that again means that something changed the value of the lock key during our transaction and the EXEC was aborted. Since we failed to acquire the lock, we just check back later to see what happened to whoever did acquire the lock.
  3. The last scenario is the one in which we managed to acquire the lock. In this case we just go ahead and do our work and hopefully complete it!

Bonus!

To make managing global locks even easier, I’ve gone ahead and generalized all the code mentioned in both this and the previous post on the subject into a tidy little event based npm package: https://github.com/yahoo/redis-locking-worker. Here’s a quick snippet of how to implement global locks using this new package:

var RedisLockingWorker = require("redis-locking-worker");

var SUCCESS_CHANCE = 0.15;

var lock = new RedisLockingWorker({
    "lockKey" : "mylock",
    "statusLevel" : RedisLockingWorker.StatusLevels.Verbose,
    "lockTimeout" : 5000,
    "maxAttempts" : 5
});

lock.on("acquired", function(lastAttempt) {
    if (Math.random() <= SUCCESS_CHANCE) {
        console.log("Completed work successfully!", lastAttempt);
        lock.done(lastAttempt);
    } else {
        // oh no, we failed to do work!
        console.log("Failed to do work");
    }
});
lock.acquire();

There are also a few other events you can use to track the lock status:

lock.on("locked", function() {
    console.log("Did not acquire lock, someone beat us to it");
});

lock.on("error", function(error) {
    console.error("Error from lock: %j", error);
});

lock.on("status", function(message) {
    console.log("Status message from lock: %s", message);
});

More Bonus!

If you don't need the added complexity of multiple backup processes, I also want to give credit to npm user pokehanai, who took the methodology described in the original post and created a generalized version of the two-worker solution: https://npmjs.org/package/redis-paired-worker.

Wrapping Up

So there you have it! Coordinating work on any number of processes across any number of hosts couldn’t be easier! If you have any questions or comments on this, please feel free to follow up on Twitter.

Flickr flamily floto

Like this post? Have a love of online photography? Want to work with us? Flickr is hiring engineers, designers and product managers in our San Francisco office. Find out more at flickr.com/jobs.

Highly Available Real Time Push Notifications and You

One of the goals of our recently launched (and awesome!) new Flickr iPhone app was to further increase user engagement on Flickr. One of the best ways to drive engagement is to make sure Flickr users know what's happening on Flickr in as near-real time as possible. We already have email notifications, but email is no longer a good mechanism for real-time updates. Users may have many email accounts and may not check them frequently, causing timeliness to go right out the window. Clearly this called for… PUSH NOTIFICATIONS!

Motor bike racer getting a push start at the track, Brisbane by State Library of Queensland, Australia

I know, you’re thinking, “anyone can build push notifications, we’ve been doing it since 2009!” Which is, of course, absolutely true. The process for delivering push notifications is well trod territory by this point. So… let’s just skip all that boring stuff and focus on how we decided on the underlying architecture for our implementation. Our decisions focused on four major factors:

  1. Impact to normal page serving times should be minimal
  2. Delivery should be in near-real time
  3. Handle thousands of notifications per second
  4. The underlying services should be highly available

Baby Steps

Given these goals, we started by looking at systems we already have in place. Everyone loves not writing new code, right? Our thoughts immediately went to Flickr's existing PuSH infrastructure. Our PuSH implementation is a great way to get an overview of relevant activity on Flickr, but it has limitations that made it unsuitable for powering mobile push notifications. The primary concern is that it's not as close to real time as we'd like it to be. On average, activities occurring on Flickr will be delivered to a subscribed PuSH endpoint within one minute. That's certainly better than waiting for an email to arrive or waiting until the next time you log in to the site and see your activity feed, but it's not good enough for mobile notifications! This delay is due to some design decisions at the core of the PuSH system. PuSH is designed to aggregate activity and deliver a periodic digest and, because of this, it has a built-in window that allows multiple changes to the same photo to be accumulated. PuSH is also focused on guaranteed delivery, so it maintains an up-to-date list of all subscribers. These features, which make PuSH great for the purpose it was designed for, make it not-so-great for real-time notifications. So, repurposing the PuSH code for a more real-time use case proved to be untenable.

Tentative Plans

So, what to do? In the end we wound up building a new lightweight event system that is broken up into three phases:

  1. Event Generation
  2. Event Targeting
  3. Message Delivery

Event Generation

The event generation phase happens while processing the response to a user request. As such, we wanted to ensure that there was little to no impact on the response times as a result. To ensure this was the case, all we do here is a lightweight write into a global Redis queue. We store the minimum amount of data possible, just a few identifiers, so we don’t have to make any extra DB calls and slow down the response just to (potentially) kick off a push notification. Everything after this initial Redis action is processed out of band by our deferred task system and has no impact on site performance.
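As a sketch (the names here are illustrative, not the production code), the write can be little more than a JSON-encoded handful of ids pushed onto a list:

//
// Hypothetical sketch: enqueue a minimal event during the request
//
$event = json_encode(array(
    'type'      => 'photo_comment',     // illustrative event type
    'object_id' => $photo_id,
    'actor_id'  => $actor_id,
));

redis_queue_lpush('push_notification_events', $event);   // assumed thin wrapper around LPUSH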

Event Targeting

Next in the process is the event targeting phase. Here we have many workers reading from the global Redis queue. When a worker receives an event from the queue it rehydrates the data and loads up any additional information necessary to act on the notification. This includes checking to see what users should be notified, whether those users have devices that are registered to receive notifications, if they’ve opted out of notifications of this type, and finally if they’ve muted activity for the object in question.
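A deferred-task worker for this phase might look roughly like the following; every function named here is a hypothetical stand-in for the corresponding internal check:

//
// Hypothetical sketch of an event-targeting worker
//
$event = json_decode(redis_queue_rpop('push_notification_events'), true);   // assumed wrapper around RPOP

foreach (notification_recipients_for($event) as $recipient_id) {
    if (!user_has_registered_devices($recipient_id)) { continue; }
    if (user_has_opted_out($recipient_id, $event['type'])) { continue; }
    if (user_has_muted_object($recipient_id, $event['object_id'])) { continue; }

    // Hand the message off to the delivery phase described below
    publish_for_delivery($recipient_id, $event);
}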

Message Delivery

Flickr’s web-serving stack is PHP, and, up until now, everything described has been processed by PHP. Unfortunately, one area where PHP does not excel is long-lived processes or network connections, both of which make delivering push notifications in real time much easier. Because of this we decided to build the final phase, message delivery, as a separate endpoint in Node.js.

So, the question arose: how do we get messages pending delivery from these PHP workers over to the Node.js endpoints that will actually deliver them? For this, we again turned to Redis, this time using its built in pub/sub functionality. The PHP workers simply publish a message to a Redis channel with the assumption that there’s a Node.js process subscribed to that channel eagerly awaiting some data on which it can act.
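The hand-off itself can be as small as a PUBLISH to a channel whose name includes a shard number. This is a minimal sketch; the shard count and the wrapper function are assumptions, but the channel prefix matches what the Node.js code below subscribes to:

//
// Hypothetical sketch: publish a pending notification to a sharded channel
//
$num_shards = 16;                                        // illustrative shard count
$shard      = abs(crc32((string) $recipient_id)) % $num_shards;
$channel    = "notification_" . $shard;

redis_pubsub_publish($channel, json_encode(array(
    'identifier'   => $event_identifier,                 // the Node.js side builds its lock key from this
    'recipient_id' => $recipient_id,
)));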

After that the Node process delivers the notification to Apple’s APNS push notification system. Communicating with APNS is a well-documented topic, and not one that’s particularly interesting. In fact, I can sum it up with a single link: https://github.com/argon/node-apn, a great npm package for talking to APNS.

The Real Challenge

There is, however, a much more interesting problem to discuss at this point: how do we ensure that delivery to APNS is both scalable and highly available? At first blush, this seems like it could be problematic. What if the Node.js worker has crashed? The message will just be lost to the ether! Solving this problem turned out to be the majority of the work involved in implementing push notifications.

Scalability

The first step to ensuring a service is scalable is to divide the workload. Since Node.js is single threaded, we would already be dividing the workload across individual Node.js processes anyway, so this works out well! When we publish messages to the Redis pub/sub channel, we simply publish to a sharded channel. Each Node.js process subscribes to some subset of those sharded channels, and so will only act on that subset of messages.

APNS, Redis Pub/Sub

Configuring our Node.js processes in this way makes it easy to scale horizontally. Whenever we need to add more processing power to the cluster, we can just add more servers and more shards. This also makes it easy to pull hosts out of rotation for maintenance without impacting message delivery: we simply reconfigure the remaining processes to subscribe to additional channels to pick up the slack.

Availability

Designing for high availability proved to be somewhat more challenging. We needed to ensure that we could lose individual Node processes, a whole server or even an entire data center without degrading our ability to deliver messages. And we wanted to avoid the need for a human in the loop — automatic failover.

We already knew that we’d have multiple hosts running in multiple data centers, so the main question was how to get them coordinating with each other so that we would not lose messages in the event of an outage while also ensuring we would not deliver the same message multiple times. Our first thought experiment along these lines was to implement a relatively complex message passing scheme, where two hosts would subscribe to a given channel, one as the primary and one as the backup. The primary would pass a message to the backup saying that it was starting to process a message, and another when it completed. The backup would wait a certain amount of time to receive the first and then the second message from the primary. If a message failed to arrive, it would assume something had gone wrong with the primary and attempt to complete delivery to Apple’s push notification gateway.

Initial Failover Plan

This plan had two major problems: hosts had to be aware of each other, and increasing the number of hosts working in conjunction raised the complexity of ensuring reliable delivery.

We liked the idea of having one host serve as a backup for another, but we didn’t like having to coordinate the interaction between so many moving pieces. To solve this issue we went with a convention based approach. Instead of each host having to maintain a list of its partners, we just use Redis to maintain a global lock. Easy enough, right? Perhaps some code is in order!

Finally, some code!

First we create our Redis clients. We need one client for regular Redis commands we use to maintain the lock, and a separate client for Redis pub/sub commands.

var redis = require("redis");
var client = redis.createClient(config.port, config.host);
var pubsubClient = redis.createClient(config.port, config.host);

Next, subscribe to the sharded channel and set up a message handler:

// We could be subscribing to multiple shards, but for the sake of simplicity we’ll just subscribe to one here
pubsubClient.subscribe("notification_" + shard);
pubsubClient.on("message", handleMessage);

Now, the interesting part. We have multiple Node.js processes subscribed to the same Redis pub/sub channel, and each process is in a different data center. Whenever any of them receive a message, they attempt to acquire a lock for that message:

function handleMessage(channel, message) {
    // Error handling elided for brevity
    var payload = JSON.parse(message);

    acquireLock(payload, 1, lockCallback);
}

Managing locks with Redis is made easy using the SETNX command. SETNX is a “set if not exists” primitive. From the Redis docs:

Set key to hold string value if key does not exist. In that case, it is equal to SET. When key already holds a value, no operation is performed.

If we have multiple processes calling SETNX on the same key, the command will only succeed for the process that first makes the call, and in that case the response from Redis will be 1. For subsequent SETNX commands, the key will already exist, and the response from Redis will be 0. The value we try to set with SETNX keeps track of how many attempts have been made to deliver the message; initially set to one, it allows us to retry failed messages a predefined number of times before giving up entirely.

function acquireLock(payload, attempt, callback) {
    var lockIdentifier = "lock." + payload.identifier;

    function dataForCallback(acquired) {
        return {
            "acquired" : acquired,
            "lockIdentifier" : lockIdentifier,
            "payload" : payload,
            "attempt" : attempt
        };
    }

    // The value of the lock key indicates how many lock attempts have been made
    client.setnx(lockIdentifier, attempt, function(error, data) {
        if (error) {
            logger.error("Error trying to acquire redis lock for: %s", lockIdentifier);
            return callback(error, dataForCallback(false));
        }

        return callback(null, dataForCallback(data === 1));
    });
}

At this point our attempt to acquire the lock has either succeeded or failed, and our callback is invoked. What we do next depends on whether we managed to acquire the lock. If we did acquire the lock, we simply attempt to send the message. If we did not acquire the lock, then we will check back later to see if the message was sent successfully (more on this later):

function lockCallback(error, data) {
    // Again, error handling elided for brevity
    if (data && data.acquired) {
        return sendMessage(data.payload, data.lockIdentifier, data.attempt === MAX_ATTEMPTS);
    } else if (data && !data.acquired) {
        return setTimeout(checkLock, LOCK_EXPIRY, data.payload, data.lockIdentifier);
    }
}

Finally, it’s time to actually send the message! We do some work to process the payload into a form we can use to pass to APNS and send it off. If all goes well, we do one of two things:

  1. If this was our first attempt to send the message, we update the lock key in Redis to a sentinel value indicating we were successful. This is the value the backup processes will check for to determine whether or not sending succeeded.
  2. If this was our last attempt to send the message (i.e. the primary process failed to deliver and now a backup process is handling delivery), we simply delete the lock key.

function sendMessage(payload, lockIdentifier, lastAttempt) {
    // Does some work to process the payload and generate an APNS notification object
    var notification = generateApnsNotification(payload);

    if (notification) {
        // The APNS connection is defined/initialized elsewhere
        apnsConnection.sendNotification(notification);

        if (lastAttempt) {
            client.del(lockIdentifier);
        } else {
            client.set(lockIdentifier, DONE_VALUE);
        }
    }
}

There’s one final piece of the puzzle: checking the lock in the process that did not acquire it initially. Here we issue a Redis GET to retrieve the current value of the lock key. If the process that won the lock managed to send the message, this key should be set to a well known sentinel value. If so, we don’t have any work to do, and we can simply delete the lock. However, if this value is not set to that sentinel value, then something went wrong with delivery in the process that originally acquired the lock and we should step up and try to deliver the message from this backup process:

function checkLock(payload, lockIdentifier) {
    client.get(lockIdentifier, function(error, data) {
        // Error handling elided for brevity
        if (data !== DONE_VALUE) {
            acquireLock(payload, data + 1, lockCallback);
        } else {
            client.del(lockIdentifier);
        }
    });
}

Summing Up

So, there you have it in a nutshell. This method of coordinating between processes makes it very easy to adjust the number of processes subscribing to a given shard's channels. There's no need for any process subscribed to a channel to be aware of how many other processes are also subscribed. As long as we have at least two processes in separate data centers subscribing to each shard, we are protected from all of the following scenarios:

  • The crash of any individual Node.js process
  • The loss of a single host running the Node.js processes
  • The loss of an entire data center containing many hosts running the Node.js processes

Let’s go back over our initial goals and see how we fared:

  1. Impact to normal page serving times should be minimal

We accomplish this by minimizing the workload done as part of the normal browser-driven request/response processing. The deferred task system picks up from there, out of band.

  2. Delivery should be in near-real time

Processing stats from our implementation show that time from user actions leading to event generation to message delivery averages about 400ms and is completely event driven (no polling).

  3. Handle thousands of notifications per second

In stress tests of our system, we were able to process more than 2,000 notifications per second on a single host (8 Node.js workers, each subscribing to multiple shards).

  4. The underlying services should be highly available

The availability design is resilient to a variety of failure scenarios, and failover is automatic.

We hope you’re enjoying push notifications in the new Flickr iPhone app.

Addendum!

There was a minor problem with the code in this post when supporting more than two workers. For a full explanation of the problem and the solution, check out Redis Global Locks Redux.

Flickr flamily floto

Like this post? Have a love of online photography? Want to work with us? Flickr is hiring engineers, designers and product managers in our San Francisco office. Find out more at flickr.com/jobs.