Tuesday, January 8, 2013

Improving Twitter search with real-time human computation

One of the magical things about Twitter is that it opens a window to the world in real time. An event happens, and seconds later, people share it across the planet.

Consider, for example, what happened when Flight 1549 crashed in the Hudson River:


Or when Osama bin Laden was killed:


Or when Obama was re-elected:


When each of these events happened, people instantly came to Twitter, and in particular they searched on Twitter, to discover what was happening.

From a search and advertising perspective, however, these sudden events pose several challenges:

  1. The queries people perform have probably never been seen before, so without very specific context it's impossible to know what they mean. How would you know that #bindersfullofwomen refers to politics, and not office accessories, or that people searching for "horses and bayonets" are interested in the Presidential debates?

  2. Since these spikes in search queries are so short-lived, there’s only a small window of opportunity to learn what they mean.

So an event happens, people instantly come to Twitter to search for the event, and we need to teach our systems what these queries mean as quickly as we can, because in just a few hours the search spike will be gone.

How do we do this? We’ve built a real-time human computation engine to help us identify search queries as soon as they're trending, send these queries to real humans to be judged, and then incorporate the human annotations into our back-end models.

Overview

Before we delve into the details, here's an overview of how the system works.

  1. First, we monitor which search queries are currently popular.
    Behind the scenes: we run a Storm topology that tracks statistics on search queries.
    For example, the query [Big Bird] may suddenly see a spike in searches from the US.

  2. As soon as we discover a new popular search query, we send it to our human evaluators, who are asked a variety of questions about the query.
    Behind the scenes: when the Storm topology detects that a query has reached sufficient popularity, it connects to a Thrift API that dispatches the query to Amazon's Mechanical Turk service, and then polls Mechanical Turk for a response.
    For example: as soon as we notice "Big Bird" spiking, we may ask judges on Mechanical Turk to categorize the query, or provide other information (e.g., whether there are likely to be interesting pictures of the query, or whether the query is about a person or an event) that helps us serve relevant Tweets and ads.

  3. Finally, after a response from an evaluator is received, we push the information to our backend systems, so that the next time a user searches for a query, our machine learning models will make use of the additional information. For example, suppose our evaluators tell us that [Big Bird] is related to politics; the next time someone performs this search, we know to surface ads by @barackobama or @mittromney, not ads about Dora the Explorer.

Monitoring for popular queries

Storm is a distributed system for real-time computation. In contrast to batch systems like Hadoop, which often introduce delays of hours or more, Storm allows us to run online data processing algorithms to discover search spikes as soon as they happen. In brief, running a job on Storm involves creating a Storm topology that describes the processing steps that must occur, and deploying this topology to a Storm cluster. A topology itself consists of three things:

  1. Tuple streams of data. In our case, these may be tuples of (search query, timestamp).

  2. Spouts that produce these tuple streams. In our case, we attach spouts to our search logs, which get written to every time a search occurs.

  3. Bolts that process tuple streams. In our case, we use bolts for operations like updating total query counts, filtering out non-English queries, and checking whether an ad is currently being served up for the query.
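
To make the spout/bolt vocabulary concrete, here is a tiny, Storm-independent Python sketch of the same pipeline shape. The function names and the English check are placeholders, not Twitter's topology code (which uses Storm's native APIs).

```python
# A tiny, Storm-independent Python sketch of the spout/bolt pipeline shape
# described above. Function names and the English check are placeholders,
# not Twitter's topology code (which uses Storm's native APIs).
import time
from typing import Dict, Iterator, Tuple

QueryTuple = Tuple[str, float]  # (search query, timestamp)

def search_log_spout(log_lines: Iterator[str]) -> Iterator[QueryTuple]:
    """Spout: turn raw search-log lines into (query, timestamp) tuples."""
    for line in log_lines:
        yield (line.strip(), time.time())

def english_filter_bolt(stream: Iterator[QueryTuple]) -> Iterator[QueryTuple]:
    """Bolt: drop tuples whose query doesn't look English."""
    for query, ts in stream:
        if query.isascii():  # crude stand-in for a real language classifier
            yield (query, ts)

def counting_bolt(stream: Iterator[QueryTuple]) -> Dict[str, int]:
    """Bolt: keep a running count of how often each query has been seen."""
    counts: Dict[str, int] = {}
    for query, _ts in stream:
        counts[query] = counts.get(query, 0) + 1
    return counts

# Wiring the "topology": spout -> filter bolt -> counting bolt.
fake_log = iter(["big bird", "big bird", "dora the explorer"])
print(counting_bolt(english_filter_bolt(search_log_spout(fake_log))))
```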

Here’s a step-by-step walkthrough of how our query topology works:

  1. Whenever you perform a search on Twitter, the search request gets logged to a Kafka queue.

  2. The Storm topology attaches a spout to this Kafka queue, and the spout emits a tuple containing the query and other metadata (e.g., the time the query was issued and its location) to a bolt for processing.

  3. This bolt updates the count of the number of times we've seen this query, checks whether the query is "currently popular" (using various statistics like time-decayed counts, the geographic distribution of the query, and the last time this query was sent for annotations), and dispatches it to our human computation pipeline if so.

One interesting feature of our popularity algorithm is that we often re-judge queries that have been annotated before, since the intent of a search can change. For example, people may normally search for [Clint Eastwood] because they're interested in his movies, but during the 2012 Republican National Convention users may have wanted to see Tweets related to his speech there.
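
As a rough illustration of the statistics mentioned in step 3, here is one way a time-decayed popularity check could look, including the "re-judge after a while" behavior. The half-life, threshold, and re-judge interval are illustrative values, not Twitter's actual parameters.

```python
# A rough sketch of a time-decayed counter plus the popularity check from
# step 3. The half-life, threshold, and re-judge interval are illustrative.
import math
import time

DECAY_HALF_LIFE_S = 600.0       # counts lose half their weight every 10 minutes
SPIKE_THRESHOLD = 50.0          # decayed count above which a query is "popular"
REJUDGE_INTERVAL_S = 6 * 3600   # don't re-send a query for judgment too often

class QueryStats:
    def __init__(self):
        self.decayed_count = 0.0
        self.last_update = time.time()
        self.last_judged = 0.0      # epoch seconds of the last annotation

    def observe(self, now=None):
        """Record one search for this query, decaying the old count first."""
        now = now or time.time()
        elapsed = now - self.last_update
        self.decayed_count *= math.exp(-math.log(2) * elapsed / DECAY_HALF_LIFE_S)
        self.decayed_count += 1.0
        self.last_update = now

    def should_dispatch(self, now=None):
        """True if the query looks popular and hasn't been judged recently."""
        now = now or time.time()
        return (self.decayed_count >= SPIKE_THRESHOLD
                and now - self.last_judged >= REJUDGE_INTERVAL_S)

stats = QueryStats()
for _ in range(60):                 # simulate a burst of [Big Bird] searches
    stats.observe()
print(round(stats.decayed_count, 1), stats.should_dispatch())
```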

Human evaluation of popular search queries

We use human computation for a variety of tasks. (See also Clockwork Raven, an open-source project we built that makes launching tasks easier.) For example, we often run experiments to measure ad relevance and search quality, and we use it to gather data to train and evaluate our machine learning models. In this section we'll describe how we use it to boost our understanding of popular search queries.

Suppose that our Storm topology has detected that the query [Big Bird] is suddenly spiking. Since the query may remain popular for only a few hours, we send it off to live humans, who can help us quickly understand what it means; this dispatch is performed via a Thrift service that allows us to design our tasks in a web frontend, and later programmatically submit them to Mechanical Turk using any of the different languages we use across Twitter.
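
Twitter's dispatch goes through that internal Thrift service, but the underlying "create a task, then poll for answers" pattern can be sketched against Amazon's public Mechanical Turk API. The example below uses boto3 (a present-day client, not part of this pipeline), and the task title, reward, and question XML are placeholders.

```python
# Illustrative only: Twitter dispatches through an internal Thrift service,
# but the same "create a task, then poll for answers" pattern looks roughly
# like this against the public Mechanical Turk API via boto3. The title,
# reward, and QUESTION_XML below are placeholders.
import time
import boto3

mturk = boto3.client("mturk", region_name="us-east-1")

QUESTION_XML = "..."  # HTMLQuestion/QuestionForm XML designed in a web frontend

def dispatch_query(query: str) -> str:
    """Create a HIT asking judges to categorize a trending search query."""
    resp = mturk.create_hit(
        Title=f"Categorize the Twitter search query [{query}]",
        Description="Tell us what this trending search query refers to.",
        Keywords="search, categorization",
        Reward="0.05",
        MaxAssignments=1,                  # a single trusted judge is enough
        LifetimeInSeconds=3600,            # the spike won't last much longer
        AssignmentDurationInSeconds=300,
        Question=QUESTION_XML,
    )
    return resp["HIT"]["HITId"]

def poll_for_answer(hit_id: str, interval_s: int = 30) -> str:
    """Poll until at least one judge has submitted an answer."""
    while True:
        result = mturk.list_assignments_for_hit(
            HITId=hit_id, AssignmentStatuses=["Submitted", "Approved"])
        if result["Assignments"]:
            return result["Assignments"][0]["Answer"]   # raw answer XML
        time.sleep(interval_s)
```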

On Mechanical Turk, judges are asked several questions about the query that help us serve better ads. Without going into the exact questions, here are a few examples of the kinds of things we ask:

- What category does the query belong to? For example, [Stanford] may typically be an education-related query, but perhaps there's a football game between Stanford and Berkeley at the moment, in which case the current search intent would be sports.

- Does the query refer to a person? If so, who? And, what is their Twitter handle if they have one? For example, the query [Happy Birthday Harry] may be trending, but it's hard to know beforehand which of the numerous celebrities named Harry it's referring to. Is it One Direction's Harry Styles, in which case the searcher is likely to be interested in teen pop? Harry Potter, in which case the searcher is likely to be interested in fantasy novels? Or someone else entirely?

Turkers in the machine

Since humans are core to this system, our workforce was designed to give us fast and reliable results.

To complete all our tasks, we use a small custom pool of Mechanical Turk judges to ensure high quality. Other typical possibilities in the crowdsourcing world are to use a static set of in-house judges, to use the standard worker filters that Amazon provides, or to go through an outside company like Crowdflower. We've experimented with these other solutions, and while they have their own benefits, we found that a custom pool fit our needs best for a few reasons:

- A typical industry standard for human evaluation is to use in-house judges. They usually provide high-quality work as they become experts in the evaluation task's domain. In-house judges are unfortunately hard to scale, as they require standardized hiring processes to be in place. They also tend to be relatively more expensive and harder to communicate with, and their schedules can be difficult to work with.

- Using the standard sets of workers that Mechanical Turk or other crowdsourcing platforms provide makes it easy to scale the workforce, but we’ve found that their quality doesn’t always meet our needs. Two methods of ensuring high quality are to seed gold-standard examples for which you know the true response throughout your task, or to use statistical analysis to determine which workers are the good ones. But these can be time-consuming and expensive to create, and we often run tasks that have free-response questions or require some background research, for which these solutions don't work. Another problem is that using these filters gives you a fluid and constantly changing set of workers, which makes the workers hard to train.

In contrast:

- Our custom pool of judges works virtually all day. For many of them, this is a full-time job, and they're geographically distributed, so our tasks complete quickly at all hours. We can easily ask for thousands of judgments before lunch and have them finished by the time we need them, which makes iterating on our experiments much easier.

- We have several forums, mailing lists, and even live chat rooms set up, all of which makes it easy for judges to ask us questions and to respond to feedback. Our judges will even give us suggestions on how to improve our tasks; for example, when we run categorization tasks, they'll often report helpful categories that we should add.

- Since we only launch tasks on demand, and Amazon provides a ready source of workers if we ever need more, our judges are never twiddling their thumbs waiting for tasks or completing busywork, and our jobs are rarely backlogged.

- Because our judges are culled from the best of Mechanical Turk, they're experts at the kinds of tasks we send, and can often provide higher quality at a faster rate than what even in-house judges provide. For example, they'll often use the forums and chatrooms to collaborate amongst themselves to give us the best judgments, and they're already familiar with the Firefox and Chrome scripts that help them be the most efficient at their work.

All the benefits described above are especially valuable in this real-time search annotation case:

- Having highly trusted workers means we don't need to wait for multiple annotations on a single search query to confirm validity, so we can send responses to our backend as soon as a single judge responds. This entire pipeline is designed for real time, after all, so the lower the latency on the human evaluation part, the better.

- The static nature of our custom pool means that the judges are already familiar with our questions, and don't need to be trained again.

- Because our workers aren't limited to a fixed schedule or location, they can work anywhere, anytime — which is a requirement for this system, since global event spikes on Twitter are not limited to a standard 40-hour work week.

- And with the multiple easy avenues of communication we have set up, it's easy for us to answer questions that might arise when we add new questions or modify existing ones.

Singing telegram summary

As an example of the kind of top quality our workers provide, we crowdsourced a singing telegram to celebrate the project's launch. Here's what they came up with:

This video was created entirely by our workers, from the crowdsourced lyrics, to the crowdsourced graphics, and even the piano playing and singing. Special thanks in particular to our amazing Turker, workasaurusrex, the musician and silky smooth crooner who brought the masterpiece together.

Many thanks to the Revenue and Storm teams, as well as our Turkers, for their help in launching this project.

Posted by Edwin Chen and Alpa Jain

Friday, December 21, 2012

Right-to-left support for Twitter Mobile

Thanks to the efforts of our translation volunteers, last week we were able to launch right-to-left language support for our mobile website in Arabic and Farsi. Two interesting challenges came up during development for this feature:

1) We needed to support a timeline that has both right-to-left (RTL) and left-to-right (LTR) tweets. We also needed to make sure that specific parts of each tweet, such as usernames and URLs, were always displayed as LTR.

2) For our touch website, we wanted to flip our UI so that it was truly an RTL experience. But this meant we would need to change a lot of our CSS rules to have reversed values for properties like padding, margins, etc. — both time-consuming and unsustainable for future development. We needed a solution that would let us make changes without having to worry about adding in new CSS rules for RTL every time.

In this post, I detail how we handled these two challenges and offer some general RTL tips and other findings we gleaned during development.

General RTL tips

The basis for supporting RTL lies in the dir element attribute, which can be set to either ltr or rtl. This allows you to set an element’s content direction, so that any text or child nodes render in the specified orientation. You can see the difference below:


In the first row, the text in the LTR column is correct, but in the second it’s the text in the RTL column.

Since this attribute can be used on any element, it can a) change the direction of inline elements, such as links (see “Handling bidirectional tweet content” below), and b) when added to the root html node, make the browser automatically flip the order of all the elements on the page (see “Creating a right-to-left UI” below).

The other way to change content direction lies in the direction and unicode-bidi CSS properties. Just like the dir attribute, the direction property allows you to specify the direction within an element. However, there is one key difference: while direction will affect any block-level elements, for it to affect inline elements the unicode-bidi property must be set to embed or override. Using the dir attribute acts as if both those properties were applied, and is the preferred method as bidi should be considered a document change, not a styling one.

For more on this, see the “W3C directionality specs” section below.

Handling bidirectional tweet content

One of the things we had to think about was how to properly align each tweet depending on the dominant directionality of the content characters. For example, a tweet with mostly RTL characters should be right-aligned and read from right to left. To figure out which chars were RTL, we used this regex:

/[\u0600-\u06FF]|[\u0750-\u077F]|[\u0590-\u05FF]|[\uFE70-\uFEFF]/m

Then, depending on how many characters matched, we could figure out which direction to apply to the tweet.

However, this would also affect the different entities in a tweet’s content. Tweet entities are special parts of the text that have their own context applied to them, such as usernames and hashtags. Usernames and URLs should always be displayed as LTR, while hashtags may be RTL or LTR depending on their first character. To solve this, while parsing out entities we also made sure that the correct direction was applied to the element each entity was contained in.
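
To make the character counting concrete, here is a small Python sketch that reuses the Unicode ranges from the regex above. The simple majority rule and the dir="ltr" wrapping of entities are illustrative choices, not the exact production logic.

```python
# A small sketch of the direction check, reusing the Unicode ranges from the
# regex above. The majority rule and entity wrapping are illustrative.
import re

RTL_CHARS = re.compile(
    r"[\u0600-\u06FF]|[\u0750-\u077F]|[\u0590-\u05FF]|[\uFE70-\uFEFF]")
LTR_CHARS = re.compile(r"[A-Za-z]")

def tweet_direction(text: str) -> str:
    """Return 'rtl' if the tweet is dominated by RTL characters, else 'ltr'."""
    rtl = len(RTL_CHARS.findall(text))
    ltr = len(LTR_CHARS.findall(text))
    return "rtl" if rtl > ltr else "ltr"

def wrap_entity_ltr(entity_html: str) -> str:
    """Usernames and URLs always render LTR, whatever the tweet's direction."""
    return '<span dir="ltr">' + entity_html + '</span>'

print(tweet_direction("\u0645\u0631\u062d\u0628\u0627"))  # Arabic "hello" -> rtl
print(tweet_direction("hello world"))                     # -> ltr
```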

If you are looking to add RTL support for your site and you have dynamic text with mixed directionality, besides using the dir attribute or direction property, you could also look into the \u200e (left-to-right mark) and \u200f (right-to-left mark) characters. These are invisible control characters that tell the browser how the following text should be displayed. But be careful: conflicts can arise if both the dir / direction and character-marker methods are used together. Alternatively, if you are using Ruby, Twitter has a great localization gem called TwitterCldr which can take a string and insert these markers appropriately.

Creating a right-to-left UI

For our mobile touch website, we first detect what language the user’s browser is set to. When it’s one of our supported RTL languages, we add the dir attribute to our page. The browser then flips the layout of the site so that everything is rendered on the right-hand side first.

This worked fairly well for the basic alignment of the page; however, it did not change how all the elements were styled. Properties like padding, margin, text-align, and float still have the same values, which means that the layout looks just plain wrong in areas where they are applied. This can be the most cumbersome part of adding RTL support to a website, as it usually means adding special rules to your stylesheets to handle the flipped layout.

For our mobile touch website, we are using Google Closure as our stylesheet compiler. It has an extremely convenient flag called --output-orientation, which goes through your stylesheets and adjusts the rules according to the value (LTR or RTL) you pass in. By running the stylesheet compilation twice, once with this flag set to RTL, we get two stylesheets that are mirror images of each other. This fixed nearly all of the styling issues that came from needing to flip CSS values. In the end, there were only two extra rules we needed to add to the RTL stylesheet; those were put into rtl.css, which gets added as the last input file for the RTL compilation, thus overriding any previously generated rules.

After that, it’s just a matter of including the right stylesheet for the user’s language, and voilà: a very nicely RTL’d site with minimal extra effort on the development side.
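
Putting those pieces together, the per-request decision reduces to something like the sketch below. The language set and the output stylesheet file names are assumptions for illustration.

```python
# A minimal sketch of the per-request decision: pick the page direction and
# the stylesheet based on the user's language. The language set and the
# output file names (mobile-ltr.css / mobile-rtl.css) are assumptions.
RTL_LANGUAGES = {"ar", "fa"}   # Arabic and Farsi today; "he" and "ur" later

def page_assets(lang: str) -> dict:
    is_rtl = lang.split("-")[0].lower() in RTL_LANGUAGES
    return {
        "dir": "rtl" if is_rtl else "ltr",   # added to the root html node
        "stylesheet": "mobile-rtl.css" if is_rtl else "mobile-ltr.css",
    }

print(page_assets("fa"))     # {'dir': 'rtl', 'stylesheet': 'mobile-rtl.css'}
print(page_assets("en-US"))  # {'dir': 'ltr', 'stylesheet': 'mobile-ltr.css'}
```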

One last thing we needed to think about was element manipulation with JS. Since elements will now be pushed as far to the right as possible instead of to the far left, the point at which an element starts may be very different from what you'd expect, possibly even outside the visible area of its container.

For example, we had to change the way the media strip in our photo gallery moves based on the page’s directionality. Besides the coordinates changing, an LTR user drags starting from the right and ending at the left in order to see more photos; for an RTL user, the natural inclination is to start at the left and drag to the right. This isn’t something that can be handled automatically the way our stylesheet compiler handles CSS, so it comes down to good old-fashioned programming to figure out how we want elements to move.

Improving translations

We would like to thank our amazing translations community for helping us get to this point. Without your efforts, we would not have been able to launch this feature on mobile Twitter. And although we've made great strides in supporting RTL, we still have more work to do.

We would love to have more translations for other languages that are not complete yet, such as our other two RTL languages Hebrew and Urdu. Visit translate.twitter.com to see how you can help us add more languages to Twitter.

Helpful Resources

W3C directionality specs:

Posted by Christine Tieu (@ctieu)
Engineer, Mobile Web Team

Thursday, December 20, 2012

How our photo filters came into focus

The old adage “a picture is worth a thousand words” is very apt for Twitter: a single photo can express what otherwise might require many Tweets. Photos help capture whatever we’re up to: kids’ birthday parties, having fun with our friends, the world we see when we travel.

Like so many of you, lots of us here at Twitter really love sharing filtered photos in our tweets. As we got into doing it more often, we began to wonder if we could make that experience better, easier and faster. After all, the now-familiar process for tweeting a filtered photo has required a few steps:

1. Take the photo (with an app)
2. Filter the photo (probably another app)
3. Finally, tweet it!

Constantly needing to switch apps takes time, and results in frustration and wasted photo opportunities. So we challenged ourselves to make the experience as fast and simple as possible. We wanted everyone to be able to easily tweet photos that are beautiful, timeless, and meaningful.

With last week’s photo filters release, we think we accomplished that on the latest versions of Twitter for Android and Twitter for iPhone. Now we'd like to tell you a little more about what went on behind the scenes in order to develop this new photo filtering experience.

It’s all about the filters

Our guiding principle: to create filters that amplify what you want to express, and to help that expression stand the test of time. We began with research, user stories, and sketches. We designed and tested multiple iterations of the photo-taking experience, and relied heavily on user research to make decisions about everything from filter nomenclature and iconography to the overall flow. We refined and distilled until we felt we had the experience right.



We spent many hours poring over the design of the filters. Since every photo is different, we did our analyses across a wide range of photos including portraits, scenery, indoor, outdoor and low-light shots. We also calibrated details ranging from color shifts, saturation, and contrast, to the shape and blend of the vignettes before handing the specifications over to Aviary, a company specializing in photo editing. They applied their expertise to build the algorithms that matched our filter specs.



Make it fast!

Our new photo filtering system is a tight integration of Aviary's cross-platform GPU-accelerated photo filtering technology with our own user interface and visual specifications for filters. Implementing this new UI presented some unique engineering challenges. The main one was the need to create an experience that feels instant and seamless to use — while working within constraints of memory usage and processing speed available on the wide range of devices our apps support.

To make our new filtering experience work, our implementation keeps up to four full-screen photo contexts in memory at once: we keep three full-screen versions of the image for when you’re swiping through photos (the one you’re currently looking at plus the next to the right and the left), and the fourth contains nine small versions of the photo for the grid view. And every time you apply or remove a crop or magic enhance, we update the small images in the grid view to reflect those changes, so it’s always up to date.

Without those, you could experience a lag when scrolling between photos — but mobile phones just don't have a lot of memory. If we weren't careful about when and how we set up these chunks of memory, one result could be running out of memory and crashing the app. So we worked closely with Aviary's engineering team to achieve a balance that would work well for many use cases.
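
As a platform-neutral sketch of that bookkeeping (three full-screen renders for the previous, current, and next filter, plus one grid of small previews), the cache might look roughly like the following. The class and the render callbacks are hypothetical; the real implementations live in the iOS and Android apps and call into Aviary's GPU-accelerated pipeline.

```python
# A platform-neutral sketch of the memory scheme described above. The class
# and the render callbacks are hypothetical stand-ins.
class FilterPreviewCache:
    def __init__(self, photo, filters, render_full, render_thumbs):
        self.photo = photo
        self.filters = filters              # ordered list of filter names
        self.render_full = render_full
        self.render_thumbs = render_thumbs
        self.full = {}                      # filter index -> full-screen image
        self.grid = render_thumbs(photo)    # small previews for the grid view

    def show(self, i):
        """Keep only the current, previous, and next full-screen renders."""
        keep = {j for j in (i - 1, i, i + 1) if 0 <= j < len(self.filters)}
        self.full = {j: self.full.get(j) or self.render_full(self.photo, self.filters[j])
                     for j in keep}
        return self.full[i]

    def apply_edit(self, edit):
        """Crop or magic enhance: re-render so the grid stays up to date."""
        self.photo = edit(self.photo)
        self.full.clear()
        self.grid = self.render_thumbs(self.photo)

# Toy usage with string "images" and made-up filter names.
cache = FilterPreviewCache(
    photo="raw.jpg", filters=["A", "B", "C"],
    render_full=lambda p, f: p + "+" + f, render_thumbs=lambda p: [p + "-thumb"] * 9)
print(cache.show(1))   # renders filters 0, 1, 2; returns "raw.jpg+B"
```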

Test and test some more

As soon as engineering kicked off, we rolled out this new feature internally so that we could work out the kinks, sanding down the rough spots in the experience. At first, the team tested it, and then we opened it up to all employees to get lots of feedback. We also engaged people outside the company for user research. All of this was vital to get a good sense of which aspects of the UI would resonate and which wouldn't.

After much testing and feedback, we designed an experience in which you can quickly and easily choose between different filtering options – displayed side by side, and in a grid. Auto-enhancement and cropping are both a single tap away in an easy-to-use interface.



Finally, a collaborative team of engineers, designers and product managers was able to ship a set of filters wrapped in a seamless UI that anyone with our Android or iPhone app can enjoy. And over time, we want our filters to evolve so that sharing and connecting become even more delightful. It feels great to be able to share this with all of you at last.


Posted by @ryfar
Tweet Composer Team



Thursday, December 13, 2012

Class project: “Analyzing Big Data with Twitter”

Twitter partnered with UC Berkeley this past semester to teach Analyzing Big Data with Twitter, a class with Prof. Marti Hearst. In the first half of the semester, Twitter engineers went to UC Berkeley to talk about the technology behind Twitter: from the basics of scaling up a service to the algorithms behind user recommendations and search. These talks are available online, on the course website.

In the second half of the course, students applied their knowledge and creativity to build data-driven applications on top of Twitter. They came up with a range of products that included tracking bands or football teams, monitoring Tweets to find calls for help, and identifying communities on Twitter. Each project was mentored by one of our engineers.

Last week, 40 of the students came to Twitter HQ to demo their final projects in front of a group of our engineers, designers and engineering leadership team.

The students' enthusiasm and creativity inspired and impressed all of us who were involved. The entire experience was really fun, and we hope to work with Berkeley more in the future.

Many thanks to the volunteer Twitter engineers, to Prof. Hearst, and of course to our fantastic students!




Posted by Gilad Mishne - @gilad
Engineering Manager, Search

Tuesday, December 11, 2012

Blobstore: Twitter’s in-house photo storage system

Millions of people turn to Twitter to share and discover photos. To make it possible to upload a photo and attach it to your Tweet directly from Twitter, we partnered with Photobucket in 2011. As soon as photos became a more native part of the Twitter experience, more and more people began using this feature to share photos.

In order to introduce new features and functionality, such as filters, and continue to improve the photos experience, Twitter’s Core Storage team began building an in-house photo storage system. In September, we began to use this new system, called Blobstore.

What is Blobstore?

Blobstore is Twitter’s low-cost and scalable storage system built to store photos and other binary large objects, also known as blobs. When we set out to build Blobstore, we had three design goals in mind:

  • Low Cost: Reduce the amount of money and time Twitter spent on storing Tweets with photos.
  • High Performance: Serve images in the low tens of milliseconds, while maintaining a throughput of hundreds of thousands of requests per second.
  • Easy to Operate: Be able to scale operational overhead with Twitter’s continuously growing infrastructure.

How does it work?

When a user tweets a photo, we send the photo off to one of a set of Blobstore front-end servers. The front-end understands where a given photo needs to be written, and forwards it on to the servers responsible for actually storing the data. These storage servers, which we call storage nodes, write the photo to a disk and then inform a Metadata store that the image has been written, instructing it to record the information required to retrieve the photo. This Metadata store, a non-relational key-value store cluster with automatic multi-DC synchronization capabilities, spans all of Twitter’s data centers, providing a consistent view of the data that is in Blobstore.

The brain of Blobstore, the blob manager, runs alongside the front-ends, storage nodes, and index cluster. The blob manager acts as a central coordinator for the management of the cluster. It is the source of all of the front-ends’ knowledge of where files should be stored, and it is responsible for updating this mapping and coordinating data movement when storage nodes are added, or when they are removed due to failures.

Finally, we rely on Kestrel, Twitter’s existing asynchronous queue server, to handle tasks such as replicating images and ensuring data integrity across our data centers.

We guarantee that when an image is successfully uploaded to Twitter, it is immediately retrievable from the data center that initially received the image. Within a short period of time, the image is replicated to all of our other data centers, and is retrievable from those as well. Because we rely on a multi-data-center Metadata store for the central index of files within Blobstore, we are aware in a very short amount of time whether an image has been written to its original data center; we can route requests there until the Kestrel queues are able to replicate the data.
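
As a highly simplified sketch of that write path, the flow looks roughly like the following. Every class below is a stand-in for a real service (front-end, storage nodes, Metadata store, Kestrel), just to show who talks to whom.

```python
# A highly simplified, runnable sketch of the write path. Every class is a
# stand-in for a real service, just to show who talks to whom.
class StorageNode:
    def __init__(self, name):
        self.name, self.blobs = name, {}
    def write(self, blob_id, data):
        self.blobs[blob_id] = data            # really: append to a fat file

class MetadataStore:                          # really: multi-DC key-value store
    def __init__(self):
        self.index = {}
    def record(self, blob_id, nodes, dc):
        self.index[blob_id] = {"nodes": [n.name for n in nodes], "dcs": [dc]}

class ReplicationQueue:                       # really: Kestrel
    def __init__(self):
        self.tasks = []
    def enqueue(self, task):
        self.tasks.append(task)

def upload(blob_id, data, nodes, metadata, queue, local_dc="dc-east"):
    for node in nodes:                        # storage nodes persist the bytes
        node.write(blob_id, data)
    metadata.record(blob_id, nodes, local_dc)         # readable locally now
    queue.enqueue(("replicate", blob_id, local_dc))   # other DCs catch up later

nodes = [StorageNode("n1"), StorageNode("n2")]
meta, queue = MetadataStore(), ReplicationQueue()
upload("photo123", b"\x89PNG...", nodes, meta, queue)
print(meta.index["photo123"], queue.tasks)
```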

Blobstore Components

How is the data found?

When an image is requested from Blobstore, we need to determine its location in order to access the data. There are a few approaches to solving this problem, each with its own pros and cons. One such approach is to map or hash each image individually to a given server by some method. This method has a fairly major downside in that it makes managing the movement of images much more complicated. For example, if we were to add or remove a server from Blobstore, we would need to recompute a new location for each individual image affected by the change. This adds operational complexity, as it would necessitate a rather large amount of bookkeeping to perform the data movement.

We instead created a fixed-size container for individual blobs of data, called a “virtual bucket”. We map images to these containers, and then we map the containers to the individual storage nodes. We keep the total number of virtual buckets unchanged for the entire lifespan of our cluster. In order to determine which virtual bucket a given image is stored in, we perform a simple hash on the image’s unique ID. As long as the number of virtual buckets remains the same, this hashing will remain stable. The advantage of this stability is that we can reason about the movement of data at a much coarser-grained level than that of individual images.
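
A minimal sketch of that stable mapping follows; the bucket count and the choice of hash are illustrative, since the post specifies neither.

```python
# A minimal sketch of the stable image-to-bucket mapping. The bucket count and
# the hash choice are illustrative. The key point: the image-to-bucket mapping
# never changes, while the bucket-to-node mapping can be recomputed per bucket
# when the cluster changes.
import hashlib

NUM_VIRTUAL_BUCKETS = 16384   # fixed for the lifetime of the cluster

def virtual_bucket(photo_id: str) -> int:
    """Stable mapping from a photo's unique ID to a virtual bucket."""
    digest = hashlib.md5(photo_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_VIRTUAL_BUCKETS

bucket_to_nodes = {}   # virtual bucket -> list of storage node replicas

def storage_nodes_for(photo_id: str):
    return bucket_to_nodes.get(virtual_bucket(photo_id), [])

print(virtual_bucket("photo123"))   # always the same bucket for this ID
```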

How do we place the data?

When mapping virtual buckets to physical storage nodes, we keep some rules in mind to make sure that we don’t lose data when we lose servers or hard drives. For example, if we were to put all copies of a given image on a single rack of servers, losing that rack would mean that particular image would be unavailable.

If we were to completely mirror the data on a given storage node on another storage node, it would be unlikely that we would ever have unavailable data, as the likelihood of losing both nodes at once is fairly low. However, whenever we lost a node, we would have only a single node to source from to re-replicate the data. We would have to recover slowly, so as not to impact the performance of the single remaining node.

If we were to take the opposite approach and spread each node's data in small ranges across every other server in the cluster, then we would avoid a bottleneck when recovering lost replicas, as we would essentially be able to read from the entire cluster in order to re-replicate data. However, we would also have a very high likelihood of data loss if we were to lose more than the replication factor of the cluster (two) per data center, as the chance that any two nodes would share some piece of data would be high. So the optimal approach is somewhere in the middle: for a given piece of data, there should be a limited number of machines that could share the range of data of its replica, more than one but less than the entire cluster.

We took all of these things into account when we determined the mapping of data to our storage nodes. As a result, we built a library called “libcrunch” which understands the various data placement rules (such as rack awareness), knows how to replicate the data in a way that minimizes the risk of data loss while also maximizing the throughput of data recovery, and attempts to minimize the amount of data that needs to be moved upon any change in the cluster topology (such as when nodes are added or removed). It also gives us the power to fully map the network topology of our data center, so storage nodes have better data placement and we can take into account rack awareness and placement of replicas across PDU zones and routers.
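
libcrunch's actual placement algorithm is beyond the scope of this post, but as a toy illustration of the middle-ground idea (each bucket's replicas drawn from a bounded set of peers rather than the whole cluster), consider:

```python
# A toy illustration of the middle ground: each bucket's replicas come from a
# bounded window of peer nodes rather than from the whole cluster. This is NOT
# libcrunch, which also accounts for racks, PDU zones, and recovery throughput.
SPREAD = 4       # each primary shares data with at most 4 peers
REPLICAS = 2     # replication factor per data center

def place_bucket(bucket: int, nodes: list) -> list:
    """Pick a primary by bucket id, then replicas from its bounded window."""
    n = len(nodes)
    primary = bucket % n
    window = [(primary + 1 + i) % n for i in range(SPREAD)]
    replicas = [window[(bucket // n + i) % SPREAD] for i in range(REPLICAS - 1)]
    return [nodes[primary]] + [nodes[r] for r in replicas]

print(place_bucket(42, ["node%d" % i for i in range(10)]))  # ['node2', 'node3']
```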

Keep an eye out for a blog post with more information on libcrunch.

How is the data stored?

Once we know where a given piece of data is located, we need to be able to efficiently store and retrieve it. Because of their relatively high storage density, we are using standard hard drives inside our storage nodes (3.5” 7200 RPM disks). Since this means that disk seeks are very expensive, we attempted to minimize the number of disk seeks per read and write.

We pre-allocate ‘fat’ files of around 256MB each on every storage node disk using fallocate(). We store each blob of data sequentially within a fat file, along with a small header. The offset and length of the data are then stored in the Metadata store, which uses SSDs internally, as the access pattern for index reads and writes is very well-suited for solid state media. Furthermore, splitting the index from the data saves us from needing to scale out memory on our storage nodes, because we don’t need to keep any local indexes in RAM for fast lookups. The only time we end up hitting disk on a storage node is once we already have the fat file location and byte offset for a given piece of data. This means that we can generally guarantee a single disk seek for that read.
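
A simplified sketch of that append-and-index scheme follows. The 8-byte length header and the in-memory dict standing in for the Metadata store are assumptions for illustration; the real store is an SSD-backed, multi-DC key-value store.

```python
# A simplified sketch of the append-and-index scheme. The 8-byte length header
# and the in-memory "metadata store" are assumptions for illustration.
import struct

metadata_store = {}   # blob id -> (fat file path, data offset, data length)

def write_blob(fat_path: str, blob_id: str, data: bytes) -> None:
    """Append the blob sequentially to a pre-allocated fat file and index it."""
    with open(fat_path, "ab") as f:
        offset = f.tell()                        # end of the fat file
        f.write(struct.pack(">Q", len(data)))    # small header: payload length
        f.write(data)
    metadata_store[blob_id] = (fat_path, offset + 8, len(data))

def read_blob(blob_id: str) -> bytes:
    """One index lookup, then (ideally) a single disk seek to the payload."""
    fat_path, offset, length = metadata_store[blob_id]
    with open(fat_path, "rb") as f:
        f.seek(offset)
        return f.read(length)

write_blob("/tmp/fatfile_0001", "photo123", b"\x89PNG...")
print(read_blob("photo123"))
```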


Topology Management

As the number of disks and nodes increases, the rate of failure increases. Capacity needs to be added, disks and nodes need to be replaced after failures, and servers need to be moved. To make Blobstore operationally easy, we put a lot of time and effort into libcrunch and the tooling associated with making cluster changes.


When a storage node fails, data that was hosted on that node needs to be copied from a surviving replica to restore the correct replication factor. The failed node is marked as unavailable in the cluster topology, and so libcrunch computes a change in the mapping from the virtual buckets to the storage nodes. From this mapping change, the storage nodes are instructed to copy and migrate virtual buckets to new locations.

Zookeeper

Topology and placement rules are stored internally in one of our Zookeeper clusters. The blob manager handles this interaction, using the information stored in Zookeeper whenever an operator makes a change to the system. A topology change can consist of adjusting the replication factor; adding, failing, or removing nodes; or adjusting other input parameters for libcrunch.

Replication across Data centers

Kestrel is used for cross-data-center replication. Because Kestrel is a durable queue, we use it to asynchronously replicate our image data across data centers.

Data center-aware Routing

TFE (Twitter Frontend) is one of Twitter’s core components for routing. We wrote a custom plugin for TFE that extends the default routing rules. Our Metadata store spans multiple data centers, and because the metadata stored per blob is small (a few bytes), we typically replicate this information much faster than the blob data. If a user tries to access a blob that has not yet been replicated to the data center they are routed to, we look up this metadata information and proxy the request to the nearest data center that has the blob data stored. This gives us the property that if replication gets delayed, we can still route requests to the data center that stored the original blob, serving the user the image at the cost of slightly higher latency until it’s replicated to the closer data center.
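
The routing decision itself reduces to something like the sketch below; the metadata layout and data center names are hypothetical, and the real logic lives in the custom TFE plugin.

```python
# A sketch of the routing rule: serve from the local data center if the blob
# has replicated there, otherwise proxy to a data center that already has it.
# The metadata layout and data center names are hypothetical.
def route_blob_request(blob_id: str, local_dc: str, metadata: dict) -> str:
    dcs_with_blob = metadata[blob_id]["data_centers"]
    if local_dc in dcs_with_blob:
        return local_dc
    return dcs_with_blob[0]   # e.g. the DC that took the original upload

meta = {"photo123": {"data_centers": ["dc-east"]}}
print(route_blob_request("photo123", "dc-west", meta))   # -> dc-east
```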

Future work

We have shipped the first version of Blobstore internally. Although Blobstore started with photos, we are adding support for other features and use cases that require blob storage. We are also continuously iterating on it to make it more robust, scalable, and easier to maintain.

Acknowledgments

Blobstore was a group effort. The following folks have contributed to the project: Meher Anand (@meher_anand), Ed Ceaser (@asdf), Harish Doddi (@thinkingkiddo), Chris Goffinet (@lenn0x), Jack Gudenkauf (@_jg), and Sangjin Lee (@sjlee).

Posted by Armond Bigian @armondbigian
Engineering Director, Core Storage & Database Engineering