
Today's System-Wide Upgrade


Today, from 21:00 - 23:00 UTC, CloudFlare held a scheduled maintenance window. During that time, CloudFlare's interface was offline. While the window was only two hours long (and we finished a bit early, at 22:16 UTC), what went on during it had been in the works for several months. I wanted to take a second and let you know what we just did and why.

When Dinosaurs Roamed the Web

Michelle, Lee and I started working on CloudFlare in early 2009. However, it wasn't until the beginning of 2010 that we invited the first users to sign up. Here's an early invite that went out to Project Honey Pot users. While CloudFlare's network today spans the globe, back then we only had one data center (in Chicago) and about 512 IP addresses (two /24 CIDRs).

Over the course of 2010, we built the product and continued to sign up customers. It was a struggle to get our first 100 customers and, when we did, we took the whole team (at the time there were 6 of us) to Las Vegas. One of our very first blog posts documented that adventure. While today we regularly sign up 100 new customers an hour, we're really proud of the fact that a lot of those original customers are still CloudFlare customers today.

Cloudflare_vegas_trip

Over the course of the summer of 2010, about 1,000 customers signed up. On September 27, 2010, Michelle and I stepped on stage at TechCrunch Disrupt and launched the service live to the public. We were flooded with new signups, more than tripling in size in 48 hours. Our growth has only accelerated since then.

Provisioning and Accounting

One of the hardest non-obvious challenges in running CloudFlare is the accounting and provisioning of our resources across our customers' sites. When someone signs up, we run hundreds of thousands of tests on the characteristics of the site in order to find the best pool to assign the site to. If a site signs up for a plan tier that supports HTTPS, we automatically issue and deploy an SSL certificate. And we spread sites across resource pools to ensure that we don't have hot spots on our network.

Accounting

Originally, we were fairly arbitrary about which customers were assigned to which pool of resources. Over time, we've developed much more sophisticated systems to put new customers into the best pool of resources for them at the moment. However, the system has been relatively static: the pool a site was placed in when it first signed up generally remained its pool over time.

Moving Sucks

Since provisioning had been relatively static, we had sites that were frozen in time. Those first 100 customers on CloudFlare's first IP addresses were a mix of free and paying customers. This led to less efficient allocation of our server resources and, over time, kept us from automating a number of systems that would better distribute load and isolate sites under attack from the rest of the network.

Moving_sucks

The solution was to migrate everyone from the various old systems to a new system. Lee began planning for this two months ago and stuck around the office over the holidays in order to ensure all the prep work was in place. To give you a sense of the scope, the migration script was 2,487 lines of code. And it would only be run once.

We picked a day after the holidays when the whole team would be in the office. We run a global network with customers around the world, so there is no quiet time during which to do system-wide maintenance; instead, we picked a time when everyone would be on hand and fully alert. We ordered a pizza lunch for the team and then, about an hour after lunch, began migrating everyone from the various old deployments to a new, modern system.

Replacing the Engine (and Wings) of the Plane in Flight

It is non-trivial to move more than half a million websites around IP space. Sites were rebalanced to new pools of IP addresses. Where necessary, new SSL certificates were seamlessly issued. Custom SSL certificates were redeployed to new machines. From start to finish, the process took about an hour and sixteen minutes.

The process was designed to ensure that there would be no interruption in service. Unless you knew the IP addresses CloudFlare announced for your site, you likely wouldn't notice anything. And for most of our customers, it went very smoothly.

We had two issues that affected a handful of customers. First, there was a conflict with some of our web server configurations that prevented a "staging" SSL environment from coming up properly. This staging environment was used as a temporary home for some sites that used SSL as they migrated from their old IP space to their new IP space. As a result, some customers saw SSL errors for about 10 minutes.

Second, a small number of customers were assigned to an IP address that had recently been under a DDoS attack and had been null routed at the routers. This null route would usually be recorded in our database, keeping sites from being assigned to the space until the null route was removed. In this case, the information wasn't correctly recorded and for a short time a small number of sites were on a pair of IP addresses that was unreachable. We removed the null route within a few minutes of realizing the mistake and the sites were again online.

Flexibility

We have known we needed to do this migration for quite some time. Now that it's done, CloudFlare's network is significantly more flexible and robust, ensuring fast performance and keeping attacks against one site from ever affecting any other customers.

To give you some sense of the flexibility the new system offers, here's a challenge we've faced. As CloudFlare looks to expand its network, some regions where we want to add data centers have restrictions on certain kinds of content being served. For example, in many Middle Eastern countries it is illegal to serve adult content from within their borders. CloudFlare is a reflection of the Internet, so there are adult-oriented sites that use our network. Making matters more difficult, what counts as an "adult" site can change over time.

The new system allows both our automated systems and our ops team to tag sites with certain characteristics. Now we can label a site as "adult" and the system automatically migrates it to a pool of resources that doesn't need to be announced from a particular region where serving the content would be illegal.

A similar use case is a site that is under attack. The new provisioning system allows us to isolate the site from the rest of the network to mitigate any collateral damage to other customers. We can also automatically dedicate additional resources (e.g., data centers in parts of the world that are in a lull of traffic based on the time of day) in order to better mitigate the attacks. In the end, the benefit here is extreme flexibility.

Flexible

We never like to take our site and API offline for any period of time, and I am disappointed we didn't complete the migration entirely without incident, but overall this was a very important, surprisingly complex transition that went very smoothly. CloudFlare's network is now substantially more robust and flexible, allowing us to continue to grow and expand as we continue on our mission to build a better web.


App: Clearspike automates search engine optimization

Clearspike logo

You care about your website, and you want it to be found. For many visitors, finding your website starts with search engines. Together, Google, Bing, Baidu and others are huge sources of traffic for every website.

The extra speed and security CloudFlare delivers are helpful for search engine ranking, but there are many other factors, including site content, organization and proper promotion.

The newest CloudFlare App, Clearspike, automates the search engine optimization (SEO) process to help your website attract more organic search engine traffic.

We know you cared enough to make your website faster and safer. Improving your SEO is a complementary step, and we're pleased to make it easy to use the Clearspike service and tap into the expertise of the Clearspike team for additional benefits.

How it works

Clearspike dashboard

Like other CloudFlare Apps, Clearspike is easy to activate, with different levels of service available immediately, and no long-term commitment.

  • Self-Service Plan: Get custom recommendations and update your website yourself. $24 / month.
  • Automated Plan: Use Clearspike tools to get your website optimized automatically. $49 / month.
  • Do-It-For-Me Plan: Get Clearspike experts to optimize your website. $199 / month.

There are no tricks: the experts at Clearspike have captured a wealth of experience in an easy-to-use service that makes their expertise easy to apply.

At every level of service, Clearspike actively reviews your site for possible improvements, making recommendations and giving you tools to take action. The service includes keyword recommendations, page title optimizations, submission to appropriate directories, finding broken links, checking sitemaps and more. Clearspike helps you measure your progress, too, so you can see the return on your investment in SEO.

Try Clearspike now.

P.S. Clearspike made their service available to CloudFlare customers using the app development platform. CloudFlare is hiring to extend the platform.


CloudFlare: Fastest Free DNS, Among Fastest DNS Period

Solvedns_december_report

CloudFlare runs one of the largest networks of DNS servers in the world. Over the last few months, we've invested in making our DNS as fast and responsive as possible. We were happy to see these efforts pay off in third-party DNS test results.

The good folks at SolveDNS conduct a monthly survey of the fastest DNS providers in the world. CloudFlare has regularly been in the top-5 fastest DNS providers. This month we're up to number two, with SolveDNS's tests showing an average 4.51ms response time. That's just a hair behind number one (at 4.38ms) and almost twice as fast as number three (at 8.85ms). And, unlike most of the other DNS providers in the top-10, CloudFlare's fast Anycast DNS service is provided even on our free plans.

Lest you think we're resting on our laurels, we've got a major DNS release (which we've dubbed RRDNS) scheduled for the next few months that we think will allow us to squeeze a bit more speed out of our DNS lookups. We're shooting for number one!


CloudFlare's 2012: Happy New Year!

Happy_cloudflare_new_year_2013

For about half the world (and about half of CloudFlare's data centers) it's already 2013. As our team (most of whom are in San Francisco) gets ready to celebrate New Year's Eve, we wanted to quickly look back on CloudFlare's 2012. Here are some stats that tell the story of our last year:

  • Page views served by CloudFlare in 2012: 679,237,127,874
  • Hits served via CloudFlare's network in 2012: 3,691,532,490,107
  • Bandwidth served from CloudFlare's network in 2012: 76.5 Petabytes
  • Bandwidth we saved our customers in 2012: 43.6 Petabytes
  • New sites that signed up for CloudFlare in 2012: 573,177
  • Threats stopped by CloudFlare in 2012: 281,701,624,076
  • New CloudFlare data centers added in 2012: 10

Over 2012, we saw more than 720 million unique IPs connect to CloudFlare's network. Our best estimate is that behind each of those IPs there are 1.8 Internet users. In other words, we saw approximately 1.3 billion Internet users pass through CloudFlare's network in 2012. That's well over half of the Internet's total population of users.

We also saved a ton of time that those Internet users would have otherwise spent waiting for websites to load. If you add up all the time that people would have spent waiting for websites to load had CloudFlare not existed in 2012, you get more than 891 lifetimes' worth of time saved. We're really proud of that.

We have a number of improvements, new features, new data centers, and other surprises lined up for 2013. From everyone at CloudFlare, Happy New Year! Here's to an even faster, safer Internet in the year ahead.


Optimizing Your Linux Stack for Maximum Mobile Web Performance

The following is a technical post written by Ian Applegate (@AppealingTea), a member of our Systems Engineering team, on how to optimize the Linux TCP stack for mobile connections. The article was originally published as part of the 2012 Web Performance Calendar. At CloudFlare, we spend a significant amount of time ensuring our network stack is tuned to whatever kind of network or device is connecting to us. We wanted to share some of the technical details to help other organizations that are looking to optimize for mobile network performance, even if they're not using CloudFlare. And, if you are using CloudFlare, you get all these benefits and the fastest possible TCP performance when a mobile network accesses your site.


Mobile_web

We spend a lot of time at CloudFlare thinking about how to make the Internet fast on mobile devices. Currently there are over 1.2 billion active mobile users and that number is growing rapidly. Earlier this year mobile Internet access passed fixed Internet access in India and that's likely to be repeated the world over. So, mobile network performance will only become more and more important.

Most of the focus today on improving mobile performance is on Layer 7 with front end optimizations (FEO). At CloudFlare, we've done significant work in this area with front end optimization technologies like Rocket Loader, Mirage, and Polish that dynamically modify web content to make it load quickly on whatever device is being used. However, while FEO is important to make mobile fast, the unique characteristics of mobile networks also mean we have to pay attention to the underlying performance of the technologies down at Layer 4 of the network stack.

This article is about the challenges mobile devices present, how the default TCP configuration is ill-suited for optimal mobile performance, and what you can do to improve performance for visitors connecting via mobile networks. Before diving into the details, a quick technical note: at CloudFlare, we've built most of our systems on top of a custom version of Linux so, while the underlying technologies can apply to other operating systems, the examples I'll use are from Linux.

TCP Congestion Control

To understand the challenges of mobile network performance at Layer 4 of the networking stack you need to understand TCP Congestion Control. TCP Congestion Control is the gatekeeper that determines how to control the flow of packets from your server to your clients. Its goal is to prevent Internet congestion by detecting when congestion occurs and slowing down the rate data is transmitted. This helps ensure that the Internet is available to everyone, but can cause problems on mobile networks when TCP mistakes mobile network problems for congestion.

TCP Congestion Control holds back the floodgates if it detects congestion (i.e. packet loss) on the remote end. A network is, inherently, a shared resource. The purpose of TCP Congestion Control is to ensure that every device on the network cooperates and doesn't overwhelm that shared resource. On a wired network, if packet loss is detected it is a fairly reliable indicator that a port along the connection is overburdened. What is typically going on in these cases is that a memory buffer in a switch somewhere has filled beyond its capacity because packets are coming in faster than they can be sent out, so data is being discarded. TCP Congestion Control on clients and servers is set up to "back off" in these cases in order to ensure that the network remains available for all its users.

But figuring out what packet loss means on a mobile network is a different matter. Radio networks are inherently susceptible to interference, which results in packet loss. If packets are being dropped, does that mean a switch is overburdened, as we can infer on a wired network? Or did someone travel from an undersubscribed wireless cell to an oversubscribed one? Or did someone just turn on a microwave? Or maybe it was just a random solar flare? Since it's not as clear what packet loss means on a mobile network, it's not clear what action a TCP Congestion Control algorithm should take.

A Series of Leaky Tubes

To optimize for lossy networks like mobile networks, it's important to understand exactly how TCP Congestion Control algorithms are designed. While the high level concept makes sense, the details of TCP Congestion Control are not widely understood by most people working in the web performance industry. That said, it is a core part of what makes the Internet reliable and the subject of very active research and development.

Ted-stevens

To understand how TCP Congestion Control algorithms work, imagine the following analogy. Think of your web server as your local water utility plant. You've built out a large network of pipes in your hometown and you need to guarantee that each pipe is as pressurized as possible for delivery, but you don't want to burst the pipes. (Note: I recognize the late Senator Ted Stevens got a lot of flack for describing the Internet as a "series of tubes," but the metaphor is surprisingly accurate.)

Your client, Crazy Arty, runs a local water bottling plant that connects to your pipe network. Crazy Arty's infrastructure is built on old pipes that are leaky and brittle. For you to get water to him without bursting his pipes, you need to infer the capacity of Crazy Arty's system. If you don't know it in advance, you run a test: you send a known amount of water into the line and then measure the pressure. If the pressure is suddenly lost, you can infer that you broke a pipe. If not, then that level is likely safe and you can add more water pressure and repeat the test. You can iterate this test until you burst a pipe, see the drop in pressure, write down the maximum water volume, and, going forward, ensure you never exceed it.

Imagine, however, that there's some exogenous factor that could decrease the pressure in the pipe without actually indicating a pipe had burst. What if, for example, Crazy Arty ran a pump that he only turned on randomly from time to time and that you didn't know about? If the only signal you have is a loss in pressure, you'd have no way of knowing whether you'd burst a pipe or Crazy Arty had just plugged in the pump. The effect would be that you'd likely record a pressure level much lower than the amount the pipes could actually withstand, leaving all your customers on the network with potentially lower water pressure than they should have.

Optimizing for Congestion or Loss

If you've been following up to this point then you already know more about TCP Congestion Control than you would guess. The initial amount of water we talked about is known in TCP as the Initial Congestion Window (initcwnd): it is the initial number of packets in flight across the network. The congestion window (cwnd) either shrinks, grows, or stays the same depending on how many packets make it back and how fast (in ACK trains) they return after the initial burst. In essence, TCP Congestion Control is just like the water utility: measuring the pressure a network can withstand and then adjusting the volume in an attempt to maximize flow without bursting any pipes.

When a TCP connection is first established it attempts to ramp up the cwnd quickly. This phase of the connection, where TCP grows the cwnd rapidly, is called Slow Start. That's a bit of a misnomer since it is generally an exponential growth function, which is quite fast and aggressive. Just as the water utility in the example above turns down the volume of water when it detects a drop in pressure, when TCP detects packets are lost it reduces the size of the cwnd and delays the time before another burst of packets is delivered. The time between packet bursts is known as the Retransmission Timeout (RTO). The algorithm within TCP that controls these processes is called the Congestion Control Algorithm. There are many congestion control algorithms, and clients and servers can use different strategies based on the characteristics of their networks. Most Congestion Control Algorithms focus on optimizing for one type of network loss or another: congestive loss (like you see on wired networks) or random loss (like you see on mobile networks).

Crazy_plumber

In the example above, a pipe bursting would be an indication of congestive loss. There was a physical limit to the pipes, it was exceeded, and the appropriate response was to back off. On the other hand, Crazy Arty's pump is analogous to random loss. The capacity is still available on the network and only a temporary disturbance causes the water utility to see the pipes as overfull. The Internet started as a network of wired devices, and, as its name suggests, congestion control was largely designed to optimize for congestive loss. As a result, the default Congestion Control Algorithm in many operating systems is good for communicating with wired networks but not as good for communicating with mobile networks.

A few Congestion Control algorithms try to bridge the gap by using the delay between the "pressure increase" and the "expected capacity" to figure out the cause of the loss. These are known as bandwidth estimation algorithms, and examples include Vegas, Veno and Westwood+. Unfortunately, all of these methods are reactive and don't reuse information across similar streams.

At companies that see a significant amount of network traffic, like CloudFlare or Google, it is possible to map the characteristics of the Internet's networks and choose a specific congestion control algorithm in order to maximize performance for a given network. Unfortunately, unless you are seeing large amounts of traffic as we do and can record data on network performance, instrumenting your congestion control or building a network "weather forecast" is usually impossible. Fortunately, there are still several things you can do to make your server more responsive to visitors even when they're coming from lossy, mobile devices.

Compelling Reasons to Upgrade Your Kernel

The Linux network stack has been under extensive development to bring about sensible defaults and mechanisms for dealing with the network topology of 2012. A mixed network of high bandwidth, low latency connections and high bandwidth, high latency, lossy connections was never fully anticipated by the kernel developers of 2009, and if you check your server's kernel version chances are it's running a 2.6.32.x kernel from that era.

uname -a

There are a number of reasons that, if you're running an old kernel on your web server and want to increase web performance, especially for mobile devices, you should investigate upgrading. To begin, Linux 2.6.38 bumps the default initcwnd and initrwnd (initial receive window) from 3 to 10. This is an easy, big win. It allows for 14.2KB (vs 5.7KB) of data to be sent or received in the initial round trip before slow start grows the cwnd further. This is important for HTTP and SSL because it gives you more room to fit the headers in the initial set of packets. If you are running an older kernel you may be able to run the following command on a bash shell (use caution) to set all of your routes' initcwnd and initrwnd to 10. On average, this small change can be one of the biggest boosts when you're trying to maximize web performance.

ip route | while read p; do ip route change $p initcwnd 10 initrwnd 10; done
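
Note that ip route change only affects the running routing table, so the setting won't survive a reboot; you may want to re-apply it from a startup script. To confirm it took effect, list your routes and look for the new initcwnd and initrwnd values on each entry:

ip route show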

Linux kernel 3.2 implements Proportional Rate Reduction (PRR). PRR decreases the time it takes for a lossy connection to recover its full speed, potentially improving HTTP response times by 3-10%. The benefits of PRR are significant for mobile networks. To understand why, it's worth diving back into the details of how previous congestion control strategies interacted with loss.

Many congestion control algorithms halve the cwnd when a loss is detected. When multiple losses occur this can result in a case where the cwnd is lower than the slow start threshold. Unfortunately, the connection never goes through slow start again. The result is that a few network interruptions can result in TCP slowing to a crawl for all the connections in the session.

This is even more deadly when combined with the tcp_no_metrics_save=0 sysctl setting on unpatched kernels before 3.2. This setting saves data about past connections and attempts to use it to optimize the network. Unfortunately, it can actually make performance worse because TCP will apply the exception case to every new connection from a client within a window of a few minutes. In other words, in some cases, one person surfing your site from a mobile phone who hits some random packet loss can reduce your server's performance for that visitor even after their temporary loss has cleared.

If you expect your visitors to be coming from mobile, lossy connections and you cannot upgrade or patch your kernel I recommend setting tcp_no_metrics_save=1. If you're comfortable doing some hacking, you can patch older kernels.
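
As a sketch of what that looks like with the standard sysctl interface (the knob lives under net.ipv4; add the same line to /etc/sysctl.conf if you want it to persist across reboots):

# Stop TCP from caching metrics from previous connections
sysctl -w net.ipv4.tcp_no_metrics_save=1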

The good news is that Linux 3.2 implements PRR, which decreases the amount of time that a lossy connection will impact TCP performance. If you can upgrade, it may be one of the most significant things you can do in order to increase your web performance.

More Improvements Ahead

Linux 3.2 also has another important improvement with RFC 2988bis. The initial Retransmission Timeout (initRTO) has been changed from 3s to 1s. If loss happens after sending the initcwnd, two seconds of waiting time are saved when trying to resend the data. With TCP streams being so short, this can be a very noticeable improvement if a connection experiences loss at the beginning of the stream. Like the PRR patch, this can also be applied (with modification) to older kernels if for some reason you cannot upgrade (here's the patch).

Looking forward, Linux 3.3 has Byte Queue Limits which, when teamed with CoDel (controlled delay) in the 3.5 kernel, help fight the long-standing issue of Bufferbloat by intelligently managing packet queues. Bufferbloat is when the caching overhead on TCP becomes inefficient because it's littered with stale data. Linux 3.3 also has features to automatically prioritize important packets (SYN/DNS/ARP/etc.) and keep down buffer queues, thereby reducing bufferbloat and improving latency on loaded servers.

Linux 3.5 implements TCP Early Retransmit with some safeguards for connections that have a small amount of packet reordering. This allows connections, under certain conditions, to trigger fast retransmit and bypass the costly Retransmission Timeout (RTO) mentioned earlier. By default it is enabled in the failsafe mode tcp_early_retrans=2. If for some reason you are sure your clients have loss but no reordering then you could set tcp_early_retrans=1 to save one quarter of an RTT on recovery.
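
A rough sketch with sysctl (assuming the knob sits under net.ipv4, as it does on 3.5+ kernels):

# Check the current Early Retransmit mode (2 is the failsafe default)
sysctl net.ipv4.tcp_early_retrans
# Only if you're confident your clients see loss but little reordering
sysctl -w net.ipv4.tcp_early_retrans=1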

One of the most extensive changes in 3.6 that hasn't gotten much press is the removal of the IPv4 routing cache. In a nutshell, it was an extraneous caching layer in the kernel that mapped interfaces to routes to IPs and saved a lookup to the Forwarding Information Base (FIB). The FIB is a routing table within the network stack. The IPv4 routing cache was intended to eliminate a FIB lookup and increase performance. While a good idea in principle, unfortunately it provided a very small performance boost in less than 10% of connections. In the 3.2.x-3.5.x kernels it was extremely vulnerable to certain DDoS techniques, so it has been removed.

Finally, one important setting you should check, regardless of the Linux kernel you are running, is tcp_slow_start_after_idle. If you're concerned about web performance, it has been proclaimed sysctl setting of the year. It can be changed in almost any kernel. By default it is set to 1, which will aggressively reduce the cwnd on idle connections and negatively impact any long-lived connections such as SSL. The following command will set it to 0 and can significantly improve performance:

sysctl -w tcp_slow_start_after_idle=0
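
A sysctl -w change only lasts until the next reboot. To make it permanent, you would typically also add the setting to /etc/sysctl.conf (or a file under /etc/sysctl.d/) and reload:

# Persist the setting and apply it
echo 'net.ipv4.tcp_slow_start_after_idle = 0' >> /etc/sysctl.conf
sysctl -p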

The Missing Congestion Control Algorithm

You may be curious as to why I haven't made a recommendation for a quick and easy change of congestion control algorithms. Since Linux 2.6.19, the default congestion control algorithm in the Linux kernel is CUBIC, which is time based and optimized for high speed, high latency networks. Its killer feature, known as Hybrid Slow Start (HyStart), allows it to safely exit slow start by measuring the ACK trains and not overshoot the cwnd. It can improve startup throughput by up to 200-300%.

Ack

While other Congestion Control Algorithms may seem like performance wins on connections experiencing high amounts of loss (>.1%) (e.g., TCP Westwood+ or Hybla), unfortunately these algorithms don't include HyStart. The net effect is that, in our tests, they underperform CUBIC for general network performance. Unless a majority of your clients are on lossy connections, I recommend staying with CUBIC.
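
If you want to check which congestion control algorithm your server is currently using (and which ones the kernel has available), the relevant sysctls are:

# Algorithm used for new connections
sysctl net.ipv4.tcp_congestion_control
# Algorithms compiled in or loaded as modules
sysctl net.ipv4.tcp_available_congestion_control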

Of course the real answer here is to dynamically swap out congestion control algorithms based on historical data to better serve these edge cases. Unfortunately, that is difficult for the average web server unless you're seeing a very high volume of traffic and are able to record and analyze network characteristics across multiple connections. The good news is that loss predictors and hybrid congestion control algorithms are continuing to mature, so maybe we will have an answer in an upcoming kernel.


App: GamaSec Web Application Security and Vulnerability Scanning

We enjoy working with companies who share a focus on website security. When GamaSec, an online web vulnerability-assessment service, inquired about ways to integrate, we were excited to make their scanning service available as a CloudFlare app, where any CloudFlare customer can easily turn on GamaSec. 

GamaSec’s cloud-based security scan serves as an early-warning system of defense for web operations, applications, and online information. GamaSec can be used by any website of any size and is now available to all CloudFlare customers: https://www.cloudflare.com/apps/gamasec

Vulnerability Scanning

GamaSec goes beyond signature-based tools to find more "real" vulnerabilities.

The GamaSec Application Vulnerability Scanner identifies application vulnerabilities such as Cross Site Scripting (XSS), SQL injection, and Code Inclusion, as well as site exposure risks. It also ranks threat priority, produces highly graphical, intuitive HTML reports, and indicates site security posture by vulnerabilities and threat exposure. 

Screen_shot_2012-12-21_at_10

Benefits of GamaSec

Regular use of GamaSec’s on-demand vulnerability assessment service provides the following benefits:

  • Fully automated scans
  • Easy dashboard & reporting
  • Web application SaaS Scanner
  • Update vulnerability protection
  • Trusted Website Security Seal
  • Web Application Scan via Cloud Computing


Plans, pricing and getting started

Like all CloudFlare apps, GamaSec is one-click simple, turned on in a customer's app dashboard.

There are two plans to fit the varied needs of different customers: Basic for $7.99 a month per domain and Premium for $16.99 a month per domain.

Visit the GamaSec app page to learn more and to get signed up!

 

P.S. GamaSec followed the CloudFlare app development process. CloudFlare is hiring to extend our platform.


Railgun in the real world: faster web page load times

In past blog posts I've described CloudFlare's Railgun technology, which is designed to greatly speed up the delivery of non-cached pages. Although CloudFlare caches about 65% of the resources needed to make up a page, something like 35% can't be cached because those resources are dynamically generated or marked as 'do not cache'. And that 35% often includes the initial HTML of the page, which must be downloaded before anything else.

Cacheing-the-uncacheable-cloudflares-railgun-73454

To solve that problem CloudFlare came up with a delta compression technique that recognizes that even dynamically-generated or personalized pages change only a little over time or between users. Railgun uses that compression technique to greatly reduce the amount of data that is sent over the Internet to CloudFlare's data centers from backend web servers. The result is faster delivery of the critical HTML that the browser must receive before it can download the rest of the page.

Testing with Railgun showed that very large compression ratios were possible and they resulted in a large speedup in page delivery. But two questions remained: "what's the effect in the real world?" and "how much difference does that make to page load time?".

We're now able to give some answers to those questions. The first hosting partner to roll out Railgun is Montreal-based Vexxhost. They gave us a sample of 51 web sites that they've enabled Railgun on and allowed us to run performance tests to see what difference Railgun makes. We decided to measure three things: how much faster the HTML is delivered, what the compression ratio is and how much page load time changes.

To get useful numbers we decided to load pages multiple times (each page was loaded 20 times with and without Railgun for a total of 40 downloads) and median values were used. Testing was done by downloading the pages from a machine in London, UK. The median round trip time between the nearest CloudFlare data center (where Railgun was running) and the origin web servers was 78ms.

HTML Delivery Speedup, Time To First Byte and Compression Ratio

On the 51 sites supplied by Vexxhost we saw a median speedup in downloading the HTML of 1.43x. To put that another way, the median time to download the HTML of the web pages decreased to 70% of what it was without Railgun.

Of the 51 sites, 11 saw a speedup of greater than 2x (i.e. the time to download the HTML of the web page more than halved) and for 8 of the sites the speedup was greater than 3x (i.e. the time to download the HTML of the web page was cut to a third of the original).

Median_change_html_download

The median compression ratio achieved by Railgun was 0.65% (i.e. the page was reduced to 0.65% of its size). Of the 51 sites, only 9 saw a compression ratio greater than 3% (i.e. most of the pages were reduced to just a tiny percentage of their original size).

It's this huge compression that enables Railgun to speed up HTML delivery dramatically.

Median_compression

Another measurement to look at is Time To First Byte (how long it takes for the first byte of a page to be delivered to the browser). This is measured as the time from starting the TCP connection to the server to the moment the first byte is received from the server. Railgun has an effect on TTFB as well. The median improvement in TTFB was to drop it to 90% of the non-Railgun-accelerated value.

Ttfb

But HTML delivery is one thing; what's the real, end-user-visible effect? In other words, how does this translate into a difference in page load time?

Page Load Time

Railgun makes a difference to page load time because it accelerates the download of the initial HTML which has to occur before the rest of the page downloads. Downloading the HTML faster helps the entire page download more quickly. Here's an example of the effect of Railgun on CloudFlare's Plans page. This small test was done from the same machine in London as all the other tests. First here's the waterfall for that page without Railgun enabled.

Screen_shot_2012-12-20_at_2
The page load time was 1.83s. Now with Railgun enabled the page load time dropped to 1.15s because the time to download the initial HTML dropped.

Screen_shot_2012-12-20_at_2
Of course, that's just one test. Repeating the test 10 times with and without Railgun gave a median page load time of 1.59s with Railgun and 2.59s without (making the Railgun-accelerated time 61% of the non-accelerated page load time). A similar test with CloudFlare's home page showed a median Railgun-accelerated page load time of 2.56s versus 3.2s without Railgun (i.e. Railgun makes the page load time drop to 80% of what it was).

To measure page load time on the 51 sites supplied by Vexxhost we set up PhantomJS (a headless browser that uses the WebKit engine) on the same machine as used for the measurements above. A small script enabled us to generate HAR files of the download of entire web pages (including the JavaScript, CSS, HTML and images) and to extract the page load time (we use the 'onload' time).
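
Our exact script isn't reproduced here, but as a rough sketch of the approach: PhantomJS ships with a netsniff.js example that writes a HAR for a page load, and the onload time can then be pulled out of the resulting file. Assuming that example script and a hypothetical URL, a single measurement looks something like:

# Load the page in PhantomJS and capture a HAR of every request it makes
phantomjs netsniff.js http://example.com/ > example.har
# Pull out the recorded onLoad time (in milliseconds) from the HAR's pageTimings
grep -o '"onLoad":[^,]*' example.har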

These page load times include assets that are not accelerated by CloudFlare or by Railgun, so they show realistic figures of how Railgun helps. Nevertheless, Railgun helps across the sites picked by Vexxhost, with a median decrease in page load time to 89% of the original time. The best-performing site saw its median page load time drop to 56% of the original. A small number of sites didn't see an improvement in page load time (they correspond to sites that didn't get a significant Railgun speedup because they typically only had a small amount of HTML).

A comparison of the same site downloaded with and without Railgun can be seen in these two images. The decrease in page load time is due to the decrease in time to get the initial HTML. Here's the page loading without Railgun:

Screen_shot_2012-12-21_at_10

And with Railgun the initial HTML load is accelerated, resulting in a faster overall load time.

Screen_shot_2012-12-21_at_10

The difficulty with measuring page load time to see the Railgun-related improvement is that page load time is highly variable, as different assets (especially those from sites that are not accelerated by a CDN like CloudFlare) cause the page load time to vary enormously. To get a picture of the expected page load time improvement, we can move from measurement to estimation and check that the measurements are in line with the expected improvement.

Estimating the Railgun Improvement

One obvious question to ask is how much improvement Railgun can bring to a web site. To work that out you need to know two numbers: the page load time (call it p) and the time to download the initial HTML (call that h). Both values can be obtained from the Developer view in Safari or Chrome, or from Firebug.

Railgun will be able to decrease time h. Using the figures above, the median improvement would be to 70% of the original, so you'd expect a page that takes p seconds to load to take roughly p - 0.3 * h seconds with Railgun. In the CloudFlare example above p was 1.83s and h was 0.949s. The formula gives a Railgun page load time of 1.83 - 0.3 * 0.949 = 1.55s (the actual value was 1.15s because Railgun did better than the median for that particular page).
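
If you want to plug your own numbers into that formula, a one-liner does the arithmetic; here it is with the Plans page figures quoted above (p and h in seconds):

# Estimated Railgun page load time: p - 0.3 * h
p=1.83; h=0.949
awk -v p="$p" -v h="$h" 'BEGIN { printf "estimated page load time: %.2fs\n", p - 0.3 * h }'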

In general, the larger the initial HTML the more Railgun can help. Very small pages won't require many round trips between the origin server and Railgun edge server, but larger pages will benefit from the delta compression. And Railgun helps when the web browser and origin server are far apart (for example, when a web site is accessed from around the world, Railgun will help eliminate the round trip time between a web surfer in one country and a web server in another).

To double check the measured performance above we ran a prediction for the sites that Vexxhost gave us. To predict the speedup generated by Railgun we first loaded each page 20 times using PhantomJS and extracted the median page load time (p) and the median time to download the initial HTML (h).

Then, using the measured median speedup in the initial HTML load (see the first section above), we predicted the change in page load time by accelerating the initial HTML load and leaving all the other asset load times fixed.

The prediction showed that the median page would load in 93% of the non-Railgun-accelerated time. The measured times were 89%. As with the prediction for the speedup of a CloudFlare page, the measured times are better than the crude predictor, but both show the importance of accelerating the initial HTML load.

Conclusion

In real world tests Railgun gives a median decrease in the page load time to 89% of the non-accelerated time. That translates directly into an improved experience for the end user, and because Railgun runs everywhere in the CloudFlare network it means that page load times are improved for users wherever they are in the world.

Of course, none of this means that web page authors can be complacent about page load time. CloudFlare provides many tools to accelerate web page delivery, and web page authors need to be mindful of slow assets and use tools like YSlow to make web pages as fast as possible. They need to be particularly mindful of slow third-party assets (such as JavaScript libraries or Like and Share buttons loaded from other domains) as these directly affect page load time.

In fact, the greatest benefit from Railgun comes for sites that have already optimized page load time. Railgun will help drastically reduce the time taken for the already optimized page to reach our edge servers and be sent on to end users. In contrast, a page that has not been optimized may rely on tens of slow or third-party assets that must be downloaded for the page to be ready, masking the effects of Railgun.

In a future post I'll look at Railgun performance when accelerating RESTful APIs. And I'll look at the effect of Railgun on subsequent page loads where static assets will be in local cache: in that case Railgun acceleration will be even more noticeable as the HTML download time will be a greater proportion of the total page load time.

 

 

 

 


Hackers love the holidays

This article was written by John Graham-Cumming on the CloudFlare team and originally published by VentureBeat. We're republishing it here.


Looking at the latest DDoS attack statistics from CloudFlare's network, it seems that hackers love the holidays.

Zooming in on November and December 2012 it's not hard to spot when Thanksgiving 2012 happened. Fully 1/5 of the attacks that CloudFlare saw in November and December (so far) happened on the Thursday and Friday of Thanksgiving:

Novdec

In the past we've seen drops in DDoS attacks on some holidays because the home and office machines used as bots in those attacks have been turned off. For example, this year we noticed a large drop in attack activity on Earth Day (when people are encouraged to switch off their machines to save the planet). But this year's Thanksgiving attack statistics indicate that plenty of hacked machines were online through the holiday.

But what does this tell us about the coming Christmas holiday period? To answer that we can look back to December 2011. CloudFlare has DDoS data for December 11, 2011 to January 1, 2012 which shows two distinct peaks of attack activity: one just before Christmas and one just after.

Dec2011

So, if 2011 is a guide, DDoS attackers will be taking a few days off over Christmas, but will be keeping the pressure on just before and immediately after. That's probably not a surprise, as some of the attackers will be attempting to disrupt businesses during critical periods for pre- and post-Christmas sales.

Even though there's a Christmas lull, that doesn't mean that CloudFlare staff will be letting down their guard. We'll be here working to ensure that whenever and from wherever attacks arise, we're ready to absorb and deflect them.


It's the Most Wonderful Time of the Year...For Ecommerce Sites

Fe_da_onlineshopping_holidayshoppingslideshow

Forecasters have estimated that online holiday shopping will account for almost 25 percent of total ecommerce sales in 2012. That's more than $54 billion in online transactions. With so much shopping happening online, we thought we'd talk to one of our ecommerce customers to hear what they do to prepare their site for the busiest time of the year.

Luxury Link curates exclusive travel experiences with luxury properties around the world at insider prices. Chris Holland is the Director of Technology at Luxury Link, and has more than 16 years of web development experience. I recently spoke with Chris to learn more about Luxury Link, what he has seen over the years in the ecommerce industry, and what it's like to run an ecommerce site when the holidays hit.


Can you tell me a little about Luxury Link's story and technical background?

Since 1997, luxurylink.com has evolved from an exclusive e-mail list to exclusive online listings. Luxury Link has pioneered the web-based auction model for Luxury Travel. Our audience is extremely savvy, discerning and demanding of the greatest possible value for the most outstanding luxury vacation experiences. 

While we used to have the niche to ourselves, the online travel landscape is competitive and so we are constantly working to optimize our website to make sure our visitors get the most out of the experience. Our web property experience includes everything from design to merchandising to site performance to SEO and the conversion funnel, as well as offering valuable insights to travelers while accommodating innovative marketing and product strategies. We, in the Tech Team, have our work cut out for ourselves catering to many business functions.


As the company has grown, how have your technology needs changed?

We've had to evolve beyond merely "selling online." It's no longer sufficient to put up a page clamoring "Here are 12 amazing vacations this week." We've seen travelers increasingly seeking inspiration and guidance. Finding the right vacation is a personalized and, at times, challenging process as many variables need to be juggled. While we've dramatically improved search and categorization on our site, we're just getting started. Solving these problems is less about using a specific search technology like Lucene, SphinX, SLI Systems, or Endeca and more about information architecture and accommodating a critical factor: Human curation. Everything you see on our site is an ever-evolving blend of human and machine curation. While search engines will seek out what you want, we have the added responsibility of helping visitors shape their traveling desires.


What are some tips/tricks you can offer other ecommerce site owners?

These core fundamentals really matter: performance, SEO, business intelligence, merchandising, and seasonal relevance.

For site performance, one of the tools we use is CloudFlare. To audit and monitor site speed, we use a blend of inexpensive resources such as webpagetest.org, Google PageSpeed Insights, and WatchMouse (now Nimsoft Cloud Monitor). I'll defer to your local expert to cover SEO.

We leverage Google Analytics and in-house-built event frameworks and data warehousing for various aspects of business intelligence. I'm a big fan of Tableau Software to crunch data.

For any site, and especially commerce sites, analyzing your marketing channels and respective conversion rates can uncover valuable insights: A/B testing is a very important part of this process. We've found PHP Scenario very helpful and we've integrated it into our A/B testing platform.

You might also consider giving your customers a voice by launching a community around your brand. While we've had a community on our site for some time, participation in it had died down. In 2012 we completely revamped it into "The Luxury Lounge" -- This initiative has brought about renewed interest from our loyal members in sharing their travel experiences. It's a veritable trove of great travel insights. It is positive for SEO, as well.


How has ecommerce changed in the last five years?

Online commerce has had to evolve beyond just listing and selling products, as competition and margins have become fierce. Consumers seek insights and guidance. Commerce sites featuring fresh, relevant and timely content in the form of editorial and consumer insights, tend to do better than sites that don't. In recognition of this, Google's algorithm updates have shaken things up. Incumbent sites that once merely listed products are finding themselves displaced by sites offering relevant content about those products. SEO is an exciting world where quality content is king, and this has had an impact on every commerce site I've worked on.


Do you see an increase in traffic during the holidays?

It is typically all about Q1 for the travel industry. We expect a 50% jump in traffic in January over November. While we do have plenty of capacity, CloudFlare's "always-on" feature is a nice safety net.

 
What do you do to prepare the site for the holidays?

We ensure our zabbix monitors are well-tuned, stick to best practices when deploying new code, don't stray away from our phones at nights, and generally do everything we can to ensure the site is running fast.


What are the "hot spots" your site visitors are looking into right now?

The top pages and locations people are looking at include:

  • Ski and Snow Destinations
  • Caribbean
  • Cabo San Lucas
  • London
  • Bali
  • Guided Tours


How has CloudFlare impacted your site?

Our origin server is in downtown Los Angeles. We have seen a big speed difference in the average time it takes to download a dynamic web page weighing 23,000 bytes from Texas:

Without CloudFlare: 569 milliseconds
With CloudFlare: 332 milliseconds

This is for dynamic content. In other words, for this type of request, CloudFlare has to fetch the dynamic content from our system, and then pass it along to the user, every time. Going through CloudFlare for dynamic content delivery is 42 percent faster.

Overall, I believe CloudFlare is the best thing to happen to the Web in recent memory, and by extension, the Internet at large. CloudFlare’s infrastructure is staggering and the architecture and pace of innovation are simply impressive. CloudFlare’s offerings have an incredibly positive impact on site owners and web visitors.


CDNJS: The Fastest Javascript Repo on the Web

Cdnjs_faster_than_google_microsoft_javascript_cdns

More than a year ago, Ryan Kirkman and Thomas Davis approached us about a project they were working on. Dubbed CDNJS, the project had a noble goal: make the world's Javascript resources load as fast as possible. They had been hosting the service on Amazon's CloudFront CDN, but as it got more popular the costs started to be significant. They approached us about whether we'd mind them using CloudFlare. We thought it was a great idea and we've been working together ever since. Today they just sent us data that shows CDNJS is the fastest Javascript repository on the Internet. More on that in a second, but first a bit about why CDNJS is so cool.

Why Do You Need a CDN for Javascript

There is a core set of Javascript resources that are used across the web. Packages such as jQuery, Bootstrap, Backbone.js, and YUI underpin many of the web's pages. In order for these pages to load, the Javascript resources need to be downloaded. As a result, it makes sense for the resources to be served over the fastest connections possible. However, that's only half the story.

The other benefit involves browser caching. If two sites use jQuery, ideally your browser only needs to download it once and can then use the same code across both sites. In order to take advantage of this browser caching, both sites need to reference the same code via the same URL. As a result, it not only makes sense to reference a CDN for your Javascript code, but also to use the same CDN that other sites are using.

The Big Boys

Google and Microsoft have understood the benefits of having a central repository of Javascript resources and both provide their own public repositories. The challenge is that they only host a limited number of the most popular resources. Moreover, since running the repos isn't their primary job, they are slow to update as new versions of code come out.

Cdnjs_library

Everything so far is what Ryan and Thomas from CDNJS explained to us. They wanted to build a central repository for Javascript that was fast and reliable. They wanted to make sure it contained a wide range of the web's Javascript resources. They wanted to ensure that the latest versions would always be available. And they wanted to provide it to the web for free. We thought that sounded great, so we took over the job of serving the CDNJS resources from CloudFlare's global network.

Fast Wins

Today Ryan and Thomas sent us the latest data on the performance of CDNJS versus the Google and Microsoft Javascript CDNs. The results are terrific. Graphs are at the top of this post, but here's the data: on average, CDNJS is 50% faster than Google's Javascript CDN (100ms vs. 157ms), and more than four times as fast as Microsoft's CDN (100ms vs. 432ms). That's based on data gathered using Pingdom to download the same Javascript resource (jQuery 1.8.3 minified) from the three CDNs from multiple points around the globe over the last week.
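
To give a concrete sense of what that looks like in practice, CDNJS serves libraries under a predictable path on cdnjs.cloudflare.com; for the jQuery build used in the test above, the request would look roughly like this (check cdnjs.com for the exact current path of any library before relying on it):

# Fetch response headers for jQuery 1.8.3 minified from CDNJS
curl -I https://cdnjs.cloudflare.com/ajax/libs/jquery/1.8.3/jquery.min.js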

Bootstrap_full_package

CDNJS is also expanding beyond just Javascript. They've recently added CSS and Images for popular packages like Bootstrap. In other words, you can load the entire Bootstrap package directly from CDNJS, saving you bandwidth and ensuring it is delivered as quickly as possible. What's also great is that since CloudFlare's network supports SSL, SPDY, and IPv6 by default, these benefits also extend to CDNJS. In other words, if you're using any Javascript resources on your websites it's a no-brainer that you should be loading them from the CDNJS network.
