Git 2.11 has been released

The open source Git project has just released Git 2.11.0, with features and bugfixes from over 70 contributors. Here's our look at some of the most interesting new features:

Abbreviated SHA-1 names

Git 2.11 prints longer abbreviated SHA-1 names and has better tools for dealing with ambiguous short SHA-1s.

You've probably noticed that Git object identifiers are really long strings of hex digits, like 66c22ba6fbe0724ecce3d82611ff0ec5c2b0255f. They're generated from the output of the SHA-1 hash function, which is always 160 bits, or 40 hexadecimal characters. Since the chance of any two SHA-1 names colliding is roughly the same as getting struck by lightning every year for the next eight years[1], it's generally not something to worry about.

You've probably also noticed that 40-digit names are inconvenient to look at, type, or even cut-and-paste. To make this easier, Git often abbreviates identifiers when it prints them (like 66c22ba), and you can feed the abbreviated names back to other git commands. Unfortunately, collisions in shorter names are much more likely. For a seven-character name, we'd expect to see collisions in a repository with only tens of thousands of objects[2].

To deal with this, Git checks for collisions when abbreviating object names. It starts at a relatively low number of digits (seven by default), and keeps adding digits until the result names a unique object in the repository. Likewise, when you provide an abbreviated SHA-1, Git will confirm that it unambiguously identifies a single object.
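
To see this in action, you can ask git rev-parse for a short name of a given length; if that prefix would be ambiguous, Git silently uses more digits (the output below is illustrative, not from a real repository):

```
# Ask for a 5-character abbreviation of HEAD; Git extends it
# as far as needed to keep the name unique in this repository
$ git rev-parse --short=5 HEAD
66c22ba
```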

So far, so good. Git has done this for ages. What's the problem?

The issue is that repositories tend to grow over time, acquiring more and more objects. A name that's unique one day may not be the next. If you write an abbreviated SHA-1 in a bug report or commit message, it may become ambiguous as your project grows. This is exactly what happened in the Linux kernel repository; it now has over 5 million objects, meaning we'd expect collisions with names shorter than 12 hexadecimal characters. Old references like this one are now ambiguous and can't be inspected with commands like git show.

To address this, Git 2.11 ships with several improvements.

First, the minimum abbreviation length now scales with the number of objects in the repository. This isn't foolproof, as repositories do grow over time, but growing projects will quickly scale up to larger, future-proof lengths. If you use Git with even moderate-sized projects, you'll see commands like git log --oneline produce longer SHA-1 identifiers. [source]
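
You can see the length Git picks for your repository, or pin a fixed length, with ordinary commands (the hash length and commit subject below are only illustrative):

```
# The abbreviation length now depends on how many objects you have
$ git log --oneline -1
66c22ba6fbe0 Update the release notes

# Or choose a fixed minimum length yourself
$ git config core.abbrev 12
```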

That still leaves the question of what to do when you somehow do get an ambiguous short SHA-1. Git 2.11 has two features to help with that. One is that instead of simply complaining of the ambiguity, Git will print the list of candidates, along with some details of the objects. That usually gives enough information to decide which object you're interested in. [source]

SHA-1 candidate list
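
For example, running git show with an ambiguous prefix now produces something along these lines (the candidates shown here are invented for illustration):

```
$ git show b2e1
error: short SHA1 b2e1 is ambiguous
hint: The candidates are:
hint:   b2e1196 tag v2.8.0-rc1
hint:   b2e11d1 tree
hint:   b2e1632 commit 2007-11-14 - Merge branch 'maint'
fatal: ambiguous argument 'b2e1': unknown revision or path not in the working tree.
```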

Of course, it's even more convenient if Git simply picks the object you wanted in the first place. A while ago, Git learned to use context to figure out which object you meant. For example, git log expects to see a commit (or a tag that points to a commit). But other commands, like git show, operate on any type of object; they have no context to guess which object you meant. You can now set the core.disambiguate config option to prefer a specific type. [source]

Automatically disambiguating between objects
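
Setting it is a one-liner; for instance, to prefer commit-ish objects when a short name is ambiguous:

```
# Prefer commits (and tags pointing at commits) for ambiguous short SHA-1s
$ git config core.disambiguate committish

# Other accepted values: none, commit, tree, treeish, blob
```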

Performance Optimizations

One of Git's goals has always been speed. While some of that comes from the overall design, there are a lot of opportunities to optimize the code itself. Almost every Git version ships with more optimizations, and 2.11 is no exception. Let's take a closer look at a few of the larger examples.

Delta Chains

Git 2.11 is faster at accessing delta chains in its object database, which should improve the performance of many common operations. To understand what's going on, we first have to know what the heck a delta chain is.

You may know that Git avoids storing files multiple times, because all data is stored in objects named after the SHA-1 of the contents. But in a version control system, we often see data that is almost identical (i.e., your files change just a little bit from version to version). Git stores these related objects as "deltas": one object is chosen as a base that is stored in full, and other objects are stored as a sequence of change instructions from that base, like "remove bytes 50-100" and "add in these new bytes at offset 50". The resulting deltas are a fraction of the size of the full object, and Git's storage ends up proportional to the size of the changes, not the size of all versions.

As files change over time, the most efficient base is often an adjacent version. If that base is itself a delta, then we may form a chain of deltas: version two is stored as a delta against version one, and then version three is stored as a delta against version two, and so on. But these chains can make it expensive to reconstruct the objects when we need them. Accessing version three in our example requires first reconstructing version two. As the chains get deeper and deeper, the cost of reconstructing intermediate versions gets larger.

For this reason, Git typically limits the depth of a given chain to 50 objects. However, when repacking using git gc --aggressive, the default is bumped to 250, on the assumption that deeper chains would produce a significantly smaller pack. But that number was chosen somewhat arbitrarily, and it turns out that the ideal balance between size and CPU is actually around 50. So that's the default in Git 2.11, even for aggressive repacks. [source]
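
If you want to experiment with the trade-off yourself, both knobs are exposed; the numbers below are just the values discussed above, not a recommendation:

```
# Repack with an explicit delta-chain depth and delta search window
$ git repack -a -d -f --depth=50 --window=250

# Or override the depth used by aggressive gc
$ git -c gc.aggressiveDepth=50 gc --aggressive
```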

Even 50 deltas is a lot to go through to construct one object. To reduce the impact, Git keeps a cache of recently reconstructed objects. This works out well because deltas and their bases tend to be close together in history, so commands like git log which traverse history tend to need those intermediate bases again soon. That cache has an adjustable size, and has been bumped over the years as machines have gotten more RAM. But due to storing the cache in a fairly simple data structure, Git kept many fewer objects than it could, and frequently evicted entries at the wrong time.

In Git 2.11, the delta base cache has received a complete overhaul. Not only should it perform better out of the box (around 10% better on a large repository), but the improvements will scale up if you adjust the core.deltaBaseCacheLimit config option beyond its default of 96 megabytes. In one extreme case, setting it to 1 gigabyte improved the speed of a particular operation on the Linux kernel repository by 32%. [source, source]
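
Raising the limit is a single config change; for example, to try the 1-gigabyte setting mentioned above:

```
# The Git 2.11 default is 96m; larger values trade RAM for speed
$ git config core.deltaBaseCacheLimit 1g
```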

Object Lookups

The delta base improvements help with accessing individual objects. But before we can access them, we have to find them. Recent versions of Git have optimized object lookups when there are multiple packfiles.

When you have a large number of objects, Git packs them together into "packfiles": single files that contain many objects along with an index for optimized lookups. A repository also accumulates packfiles as part of fetching or pushing, since Git uses them to transfer objects over the network. The number of packfiles may grow from day-to-day usage, until the next repack combines them into a single pack. Even though looking up an object in each packfile is efficient, if there are many packfiles Git has to do a linear search, checking each packfile in turn for the object.
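
You can check how many packs a repository has accumulated, and collapse them back into one, with standard commands (the counts shown are illustrative):

```
# Report loose objects and the number of packs
$ git count-objects -v
count: 123
size: 756
in-pack: 5236990
packs: 187
size-pack: 1299231

# Consolidate everything into a single pack
$ git repack -a -d
```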

Historically, Git has tried to reduce the cost of the linear search by caching the last pack in which an object was found and starting the next search there. This helps because most operations look up objects in order of their appearance in history, and packfiles tend to store segments of history. Looking in the same place as our last successful lookup often finds the object on the first try, and we don't have to check the other packs at all.

In Git 2.10, this "last pack" cache was replaced with a data structure to store the packs in most recently used (MRU) order. This speeds up object access, though it's only really noticeable when the number of packs gets out of hand.

In Git 2.11, this MRU strategy has been adapted to the repacking process itself, which previously did not even have a single "last found" cache. The speedups are consequently more dramatic here; repacking the Linux kernel from a 1000-pack state is over 70% faster. [source, source]

Patch IDs

Git 2.11 speeds up the computation of "patch IDs", which are used heavily by git rebase.

Patch IDs are a fingerprint of the changes made by a single commit. You can compare patch IDs to find "duplicate" commits: two changes at different points in history that make the exact same change. The rebase command uses patch IDs to find commits that have already been merged upstream.
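
You can compute one yourself with the git patch-id plumbing command, which reads a patch on standard input and prints the patch ID followed by the commit it came from (both IDs below are made up):

```
# Fingerprint the change introduced by a single commit
$ git show 66c22ba | git patch-id
5ef2d1b3a0e1ab4e08c69c554ee439cf9fc6ef34 66c22ba6fbe0724ecce3d82611ff0ec5c2b0255f
```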

Patch ID computation now avoids both merge commits and renames, improving the runtime of the duplicate check by a factor of 50 in some cases. [source, source]

Advanced filter processes

Git includes a "filter" mechanism which can be used to convert file contents to and from a local filesystem representation. This is what powers Git's line-ending conversion, but it can also execute arbitrary external programs. The Git LFS system hooks into Git by registering its own filter program.

The protocol that Git uses to communicate with the filter programs is very simple. It executes a separate filter for each file, writes the filter input, and reads back the filter output. If you have a large number of files to filter, the overhead of process startup can be significant, and it's hard for filters to share any resources (such as HTTP connections) among themselves.

Git 2.11 adds a second, slightly more complex protocol that can filter many files with a single process. This can reportedly improve checkout times with many Git LFS objects by as much as a factor of 80.
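
Filters are wired up through configuration. As a sketch using Git LFS's documented settings, the classic clean/smudge filters spawn one process per file, while the newer process setting points Git at a single long-running filter:

```
# Original protocol: one filter process per file
$ git config filter.lfs.clean  "git-lfs clean -- %f"
$ git config filter.lfs.smudge "git-lfs smudge -- %f"

# New in Git 2.11: one long-running process handles every file
$ git config filter.lfs.process "git-lfs filter-process"
$ git config filter.lfs.required true
```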

Git LFS improvements

The original protocol is still available for backwards compatibility, and the new protocol is designed to be extensible. Already there has been discussion of allowing it to operate asynchronously, so the filter can return results as they arrive. [source]

Sundries

  • In our post about Git 2.9, we mentioned some improvements to the diff algorithm to make the results easier to read (the --compaction-heuristic option). That algorithm did not become the default because there were some corner cases that it did not handle well. But after some very thorough analysis, Git 2.11 has an improved algorithm that behaves similarly but covers more cases and does not have any regressions. The new option goes under the name --indent-heuristic (and diff.indentHeuristic), and will likely become the default in a future version of Git. [source]

  • Ever wanted to see just the commits brought into a branch by a merge commit? Git now understands negative parent-number selectors, which exclude the given parent (rather than selecting it). It may take a minute to wrap your head around that, but it means that git log 1234abcd^-1 will show all of the commits that were merged in by 1234abcd, but none of the commits that were already on the branch. You can also use ^- (omitting the 1) as a shorthand for ^-1; see the command examples after this list. [source]

  • There's now a credential helper in contrib/ that can use GNOME libsecret to store your Git passwords. [source]

  • The git diff command now understands --submodule=diff (as well as setting the diff.submodule config to diff), which will show changes to submodules as an actual patch between the two submodule states. [source]

  • git status has a new machine-readable output format that is easier to parse and contains more information. Check it out if you're interested in scripting around Git. [source]

  • Work has continued on converting some of Git's shell scripts to C programs. This can drastically improve performance on platforms where extra processes are expensive (like Windows), especially in programs that may invoke sub-programs in a loop. [source, source]
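
A few of the items above, in command form (the commit name is a placeholder):

```
# Use the improved diff heuristic (likely a future default)
$ git diff --indent-heuristic

# Show only the commits merged in by 1234abcd
$ git log --oneline 1234abcd^-1    # or the shorthand: 1234abcd^-

# Show submodule changes as an actual patch
$ git diff --submodule=diff

# The new machine-readable status format
$ git status --porcelain=v2 --branch
```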

The whole shebang

That's just a sampling of the changes in Git 2.11, which contains over 650 commits. Check out the full release notes for the complete list.


[1] It's true. According to the National Weather Service, the odds of being struck by lightning are 1 in a million. That's about 1 in 2^20, so the odds of it happening in 8 consecutive years (starting with this year) are 1 in 2^160.

[2] It turns out to be rather complicated to compute the probability of seeing a collision, but there are approximations. With 5 million objects, there's about a 1 in 10^35 chance of a full SHA-1 collision, but the chance of a collision in 7 characters approaches 100%. The more commonly used metric is "number of items to reach a 50% chance of collision", which is the square root of the total number of possible items. If you're working with exponents, that's easy; you just halve the exponent. Each hex character represents 4 bits, so a 7-character name has 2^28 possibilities. That means we expect a collision around 2^14, or 16,384 objects.
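
Written as formulas, the same back-of-the-envelope arithmetic is:

```
% 7 hex characters give 16^7 possible names
N = 16^{7} = 2^{28}
% the birthday bound puts a ~50% collision chance near sqrt(N)
\sqrt{N} = 2^{28/2} = 2^{14} = 16384
```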

Git Merge 2017 tickets are now available

Tickets for Git Merge 2017 are now on sale 🎉

Git Merge is the pre-eminent Git-focused conference: a full day offering technical talks and user case studies, plus a full day of pre-conference, add-on workshops for Git users of all levels. Git Merge 2017 will take place February 2-3 in Brussels.

Confirmed Speakers

  • Durham Goode, Facebook
  • Santiago Perez De Rosso, MIT
  • Carlos Martin Nieto, GitHub

Workshops
Git users of all levels are invited to dive into a variety of topics with some of the best Git trainers in the world. Learn about improving workflows with customized configurations, submodules and subtrees, getting your repo under control, and much more. Workshops are included in the cost of a conference ticket, but space is limited. Make sure to RSVP when you get your conference ticket.

Sponsorship Opportunities
By sponsoring Git Merge, you are supporting a community of users and developers dedicated to a tool that's become integral to your development workflow. Check out the Sponsorship Prospectus for more information.

Tickets
Tickets are €99 and all proceeds are donated to the Software Freedom Conservancy. General admission also includes entrance to the after party.

New in the shop: The Octoplush

It's time to cozy up with the all-new Octoplush collectable, available now in the GitHub Shop.

Share the Octoplush with friends and family. Just don't feed these octocats. They're already stuffed.

Now through Tuesday, enjoy 30% off everything in the GitHub Shop with discount code OCTOCYBER2016 and free shipping for orders over $30.

GitHub Extension now supports Visual Studio 2017 RC

The GitHub Extension for Visual Studio now supports Visual Studio 2017 RC, including support for cloning repositories directly from the Visual Studio Start Page. We've also improved our startup time to get you productive as fast as possible.

This release, version 2.1.1.0, is available as an optional installation component in the installer for both Visual Studio 2015 and Visual Studio 2017 RC. You can also install it directly from the Visual Studio gallery or download it from visualstudio.github.com.

Last year we introduced the GitHub Extension for Visual Studio as an open source project under the MIT license. You can log issues and contribute to the extension in the repository.

Clone from GitHub on Start Page

New in the Shop: GitHub Activity Book

Go ahead, color outside the lines with the GitHub Activity Book starring our very own Mona the Octocat! Now available in the GitHub Shop.

Activity Book

GitKraken joins the Student Developer Pack

GitKraken is now part of the Student Developer Pack. Students can manage Git projects in a faster, more user-friendly way with GitKraken's Git GUI for Windows, Mac, and Linux.

GitKraken is a cross-platform GUI for Git that makes Git commands more intuitive. The interface equips you with a visual understanding of branching, merging and your commit history. GitKraken works directly with your repositories with no dependencies—you don’t even need to install Git on your system. You’ll also get a built-in merge tool with syntax highlighting as well as one-click undo and redo for when you make mistakes. Other features of GitKraken are:

  • Drag and drop to merge, rebase, reset, push
  • Resizable, easy-to-understand commit graph
  • File history and blame
  • View image diffs in app
  • Fuzzy finder and command palette
  • Submodules and Gitflow support
  • Easily clone, add remotes, and open pull requests in app
  • Keyboard shortcuts
  • Dark and light color themes
  • GitHub integration

Members of the pack get GitKraken Pro free for one year. With GitKraken Pro, Student Developer Pack members will get all the features of GitKraken plus:

  • The ability to resolve merge conflicts in the app
  • Multiple profiles for work and personal use
  • Support for GitHub Enterprise

Students can get free access to professional developer tools from companies like Datadog, Travis CI, and Unreal Engine. The Student Developer Pack lets you learn, experiment, and build software with the tools developers use at work every day without worrying about cost.

Students, get a Git GUI now with your pack.

Operation Code: connecting tech and veterans

Today is Veterans Day here in the United States, or Remembrance Day in many places around the world, when we recognize those who have served in the military. Today many businesses will offer veterans a cup of coffee or a meal, but one organization goes further.

You might have watched ex-Army Captain David Molina speak at CodeConf LA, or GitHub Universe about Operation Code, a nonprofit he founded in 2014 after he couldn’t use the benefits of the G.I. Bill to pay for code school. Operation Code lowers the barrier of entry into software development and helps military personnel in the United States better their economic outcomes as they transition to civilian life. They leverage open source communities to provide accessible online mentorship, education, and networking opportunities.

The organization is also deeply invested in facilitating policy changes that will allow veterans to use their G.I. Bill benefits at coding schools and boot camps, speeding up their re-entry to the workforce. Next week Captain Molina will testify in Congress as to the need for these updates. The video below explains more about their work.

Operation Code - On a mission to expand the GI Bill

Although Operation Code currently focuses on the United States, they hope to develop a model that can be replicated throughout the world.

Why Operation Code matters

Operation Code is working to address a problem that transcends politics. Here's a look into the reality U.S. veterans face:

  • The unemployment rate for veterans over the age of 18 as of August 2016 is 3.9% for men and 7.0% for women.
  • As of 2014, less than seven percent of enlisted personnel have a Bachelor’s degree or higher
  • More than 200,000 active service members leave the military every year, and are in need of employment
  • U.S. studies show that members of underrepresented communities are more frequently joining the military to access better economic and educational opportunities

How you can help

GitHub is headed to AWS re:Invent

The GitHub team is getting ready for AWS re:Invent on November 28, and we'd love to meet you there.

Why? GitHub works alongside AWS to ensure your code is produced and shipped quickly and securely, giving you a platform that plugs right into existing workflows, saving time and allowing your team to use tools they’re already familiar with. And at AWS re:Invent, we’re hosting events throughout the week to help you learn how GitHub and AWS work together.

Level up your DevOps program

DevOps is a never-ending journey, and implementing the best tools and practices for the job is only the beginning. Hear from Accenture Federal Services’ Natalie Bradley and GitHub’s Matthew McCullough about how GitHub Enterprise and AWS formed the backbone of a DevOps program that not only raised code quality and shipping speed, but defined how to scale tools for thousands of users.

Unwind at TopGolf

Join us on Tuesday, the 29th for a party at TopGolf—the perfect place to unwind from a full day of travel, training, or meetings. Tee time is 7:30 PM at the MGM Grand. RSVP today.

Meet with GitHub Engineers

You'll also have a chance to get some in-depth advice from our team of technical Hubbers headed to Vegas by scheduling a 1:1 chat with them.

You can visit the Octobooth on the expo floor to watch live demos, talk to one of our product specialists, or just grab some swag. Stop by and say hi at booth #607.

GitHub Enterprise 2.8 is now available with code review, project management tools, and Jupyter notebook rendering

GitHub Enterprise 2.8 adds power and versatility directly into your workflow with Reviews for more streamlined code review and discussion, Projects to bring development-centric project management into GitHub, and Jupyter notebook rendering to visualize data.

Code better with Reviews

Reviews help you build flexible code review workflows into your pull requests, streamlining conversations, reducing notifications, and adding more clarity to discussions. You can comment on specific lines of code, formally "approve" or "request changes" to pull requests, batch comments, and have multiple conversations per line. These initial improvements are only the first step of a much greater roadmap toward faster, friendlier code reviews.

Organize projects while staying close to your code

With Projects, you can manage work directly from your GitHub repositories. Create task cards from pull requests, issues, or notes and drag and drop them into categorized columns. You can use categories like "In-progress", "Done", "Never going to happen", or any other framework your team prefers. Move the cards within a column to prioritize them or from one column to another as your work progresses. And with notes, you can capture every early idea that comes up as part of your standup or team sync, without polluting your list of issues.

Visualize data-driven workflows with Jupyter Notebook rendering

Producing and sharing data on GitHub is a common challenge for researchers and data scientists. Jupyter notebooks make it easy to capture those data-driven workflows that combine code, equations, text, and visualizations. And now they render in all your GitHub repositories.

Share your story as a developer

This release takes the contribution graph to new heights with your GitHub timeline—a snapshot of your most important triumphs and contributions. Curate and showcase your defining moments, from pinned repositories that highlight your best work to a profile timeline that chronicles important milestones in your career.

Amp up administrator visibility and security enforcement

GitHub Enterprise 2.8 gives administrators more ways to enforce security policies, understand and improve performance, and get developers the support they need. Site admins can now enforce the use of two-factor authentication at the organization level, efficiently visualize LDAP authentication-related problems—like polling, repeated failed login attempts, and slow servers—and direct users to their support website throughout the appliance.

Upgrade today

Upgrade to GitHub Enterprise 2.8 today to start using these new features and keep improving the way your team works. You can also check out the release notes to see what else is new or enable update checks to automatically check for the latest releases of GitHub Enterprise.

What's new in GitHub Pages with Jekyll 3.3

GitHub Pages has upgraded to Jekyll 3.3.0, a release with some nice quality-of-life features.

First, Jekyll 3.3 introduces two new convenience filters, relative_url and absolute_url. They provide an easy way to ensure that your site's URLs will always appear correctly, regardless of where or how your site is viewed. To make it easier to use these two new filters, GitHub Pages now automatically sets the site.url and site.baseurl properties, if they're not already set.

This means that starting today {{ "/about/" | relative_url }} will produce /repository-name/about/ for Project Pages (and /about/ for User Pages). Additionally, {{ "/about/" | absolute_url }} will produce https://baxterthehacker.github.io/repository-name/about/ for Project Pages (and https://baxterthehacker.github.io/about/ for User Pages or http://example.com/about/ if you have a custom domain set up).

Second, with Jekyll 3.3, when you run jekyll serve in development, it will override your url value with http://localhost:4000. No more confusion when your locally-modified CSS isn't loading because the URL is set to the production site. Additionally, site.url and absolute_url will now yield http://localhost:4000 when running a server locally.

Finally, to make it easier to vendor third-party dependencies via package managers like Bundler or NPM (or Yarn), Jekyll now ignores the vendor and node_modules directories by default, speeding up build times and avoiding potential errors. If you need those directories included in your site, set exclude: [] in your site's configuration file.

For more information, see the Jekyll changelog and if you have any questions, please let us know.

Happy publishing!

Game Off Theme Announcement

GitHub Game Off 2016 Theme is Hacking, Modding, or Augmenting

We announced the GitHub Game Jam, our very own month-long game jam, a few weeks ago. Today, we're announcing the theme and officially kicking it off. Ready player one!

The Challenge

You have the entire month of November to create a game loosely based on the theme hacking, modding and/or augmenting.

What do we mean by loosely based on hacking, modding and/or augmenting? Here are some examples:

  • an endless runner where you hack down binary trees in your path with a pixelated axe
  • a modern take on a classic e.g. a roguelike set in a 3D or VR world
  • an augmented reality game bringing octopus/cat hybrids into the real world

Unleash your creativity. You can work alone or with a team and build for any platform or device. The use of open source game engines and libraries is encouraged but not required.

We'll highlight some of our favorites games on the GitHub blog, and the world will get to enjoy (and maybe even contribute to or learn from) your creations.

How to participate

  • Sign up for a free personal account if you don't already have one
  • Fork the github/game-off-2016 repository to your personal account (or to a free organization account)
  • Clone the repository on your computer and build your game
  • Push your game source code to your forked repository before December 1st (a command-line sketch of these steps follows this list)
  • Update the README.md file to include a description of your game, how to play or download it, how to build and compile it, what dependencies it has, etc.
  • Submit your final game using this form
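
If you're new to the Git side of those steps, they look roughly like this on the command line (replace <username> with your own account; the commit message is just an example):

```
# Clone your fork of the game jam repository
$ git clone https://github.com/<username>/game-off-2016.git
$ cd game-off-2016

# ...build your game, committing as you go...
$ git add .
$ git commit -m "Add a playable prototype"

# Push your work back to your fork on GitHub
$ git push origin master
```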

It's dangerous to go alone

If you're new to Git, GitHub, or version control


  • Git Documentation: everything you need to know about version control and how to get started with Git
  • GitHub Help: everything you need to know about GitHub
  • Questions about GitHub? Please contact our Support team and they'll be delighted to help you
  • Questions specific to the GitHub Game Off? Please create an issue. This will be the official FAQ

The official Twitter hashtag for the Game Off is #ggo16. We look forward to playing your games.

GLHF! <3

Save the Date: Git Merge 2017

Git Merge 2017 February 2-3 in Brussels

We’re kicking off 2017 with Git Merge, February 2-3 in Brussels. Join us for a full day of technical talks and user case studies, plus a day of pre-conference workshops for Git users of all levels (RSVP is required, as space is limited). If you’ll be in Brussels for FOSDEM, come in early and stop by. Just make sure to bundle up!

Git Merge is the pre-eminent Git-focused conference dedicated to amplifying new voices from the Git community and to showcasing thought-provoking projects from contributors, maintainers, and community managers. When you participate in Git Merge, you’ll contribute to one of the largest and most forward-thinking communities of developers in the world.

Call for Speakers
We're accepting proposals starting now through Monday, November 21. Submit a proposal and we’ll email you back by Friday, December 9. For more information on our process and what kind of talks we’re seeking, check out our Call For Proposals (CFP).

Code of Conduct
Git Merge is about advancing the Git community at large. We value the participation of each member and want all attendees to have an enjoyable and fulfilling experience. Check out our Code of Conduct for complete details.

Sponsorship
Git Merge would not be possible without the help of our sponsors and community partners. If you're interested in sponsoring Git Merge, you can download the sponsorship prospectus for more information.

Tickets
Tickets are €99 and all proceeds are donated to the Software Freedom Conservancy. General Admission includes access to the pre-conference workshops and after party in addition to the general sessions.

See you in Brussels!

Incident Report: Inadvertent Private Repository Disclosure

On Thursday, October 20th, a bug in GitHub’s system exposed a small amount of user data via Git pulls and clones. In total, 156 private repositories of GitHub.com users were affected (including one of GitHub's). We have notified everyone affected by this private repository disclosure, so if you have not heard from us, your repositories were not impacted and there is no ongoing risk to your information.

This was not an attack, and no one was able to intentionally retrieve this data. There was no outsider involved in exposing this data; this was a programming error that resulted in a small number of Git requests retrieving data from the wrong repositories.

Regardless of whether or not this incident impacted you specifically, we want to sincerely apologize. It’s our responsibility not only to keep your information safe but also to protect the trust you have placed in us. GitHub would not exist without your trust, and we are deeply sorry that this incident occurred.

Below is the technical analysis of our investigation, including a high-level overview of the incident, how we mitigated it, and the specific measures we are taking to safeguard against incidents like this from happening in the future.

High-level overview

In order to speed up unicorn worker boot times, and simplify the post-fork boot code, we applied the following buggy patch:

diff

The database connections in our Rails application are split into three pools: a read-only group, a group used by Spokes (our distributed Git back-end), and the normal Active Record connection pool. The read-only group and the Spokes group are managed manually by our own connection-handling code. The previous snippet disconnected all ConnectionPool objects held in memory, but the new line of code disconnected only the ConnectionPool objects managed by Active Record. As a result, the manually managed pools remained connected and were shared between all child processes of the Rails application running with this change.

The impact of this bug for most queries was a malformed response, which errored and caused a near-immediate rollback. However, a very small percentage of the query responses were interpreted as legitimate data in the form of the file server and disk path where repository data was stored. Some repository requests were routed to the location of another repository. The application could not differentiate these incorrect query results from legitimate ones, and as a result, users received data that they were not meant to receive.

When properly functioning, the system works as sketched out roughly below. However, during this failure window, the MySQL response in step 4 was returning malformed data that would end up causing the git proxy to return data from the wrong file server and path.

System Diagram

Our analysis of the ten-minute window in question uncovered:

  • 17 million requests to our git proxy tier, most of which failed with errors due to the buggy deploy
  • 2.5 million requests successfully reached git-daemon on our file server tier
  • Of the 2.5 million requests that reached our file servers, the vast majority were "already up to date" no-op fetches
  • 40,000 of the 2.5 million requests were non-empty fetches
  • 230 of the 40,000 non-empty requests were susceptible to this bug and served incorrect data
  • This represented 0.0013% of the total operations at the time

Deeper analysis and forensics

After establishing the effects of the bug, we set out to determine which requests were affected in this way for the duration of the deploy. Normally, this would be an easy task, as we have an in-house monitor for Git that logs every repository access. However, those logs contained some of the same faulty data that led to the misrouted requests in the first place. Without accurate usernames or repository names in our primary Git logs, we had to turn to data that our git proxy and git-daemon processes sent to syslog. In short, the goal was to join records from the proxy, to git-daemon, to our primary Git logging, drawing whatever data was accurate from each source. Correlating records across servers and data sources is a challenge because the timestamps differ depending on load, latency, and clock skew. In addition, a given Git request may be rejected at the proxy or by git-daemon before it reaches Git, leaving records in the proxy logs that don’t correlate with any records in the git-daemon or Git logs.

Ultimately, we joined the data from the proxy to our Git logging system using timestamps, client IPs, and the number of bytes transferred and then to git-daemon logs using only timestamps. In cases where a record in one log could join several records in another log, we considered all and took the worst-case choice. We were able to identify cases where the repository a user requested, which was recorded correctly at our git proxy, did not match the repository actually sent, which was recorded correctly by git-daemon.

We further examined the number of bytes sent for a given request. In many cases where incorrect data was sent, the number of bytes was far larger than the on-disk size of the repository that was requested but instead closely matched the size of the repository that was sent. This gave us further confidence that indeed some repositories were disclosed in full to the wrong users.

Although we saw over 100 misrouted fetches and clones, we saw no misrouted pushes, signaling that the integrity of the data was unaffected. This is because a Git push operation takes place in two steps: first, the user uploads a pack file containing files and commits. Then we update the repository’s refs (branch tips) to point to commits in the uploaded pack file. These steps look like a single operation from the user’s point of view, but within our infrastructure, they are distinct. To corrupt a Git push, we would have to misroute both steps to the same place. If only the pack file is misrouted, then no refs will point to it, and git fetch operations will not fetch it. If only the refs update is misrouted, it won’t have any pack file to point to and will fail. In fact, we saw two pack files misrouted during the incident. They were written to a temporary directory in the wrong repositories. However, because the refs-update step wasn’t routed to the same incorrect repository, the stray pack files were never visible to the user and were cleaned up (i.e., deleted) automatically the next time those repositories performed a “git gc” garbage-collection operation. So no permanent or user-visible effect arose from any misrouted push.

A misrouted Git pull or clone operation consists of several steps. First, the user connects to one of our Git proxies, via either SSH or HTTPS (we also support git-protocol connections, but no private data was disclosed that way). The user’s Git client requests a specific repository and provides credentials, an SSH key or an account password, to the Git proxy. The Git proxy checks the user’s credentials and confirms that the user has the ability to read the repository he or she has requested. At this point, if the Git proxy gets an unexpected response from its MySQL connection, the authentication (which user is it?) or authorization (what can they access?) check will simply fail and return an error. Many users were told during the incident that their repository access “was disabled due to excessive resource use.”

In the operations that disclosed repository data, the authentication and authorization step succeeded. Next, the Git proxy performs a routing query to see which file server the requested repository is on, and what its file system path on that server will be. This is the step where incorrect results from MySQL led to repository disclosures. In a small fraction of cases, two or more routing queries ran on the same Git proxy at the same time and received incorrect results. When that happened, the Git proxy got a file server and path intended for another request coming through that same proxy. The request ended up routed to an intact location for the wrong repository. Further, the information that was logged on the repository access was a mix of information from the repository the user requested and the repository the user actually got. These corrupted logs significantly hampered efforts to discover the extent of the disclosures.

Once the Git proxy got the wrong route, it forwarded the user’s request to git-daemon and ultimately Git, running in the directory for someone else’s repository. If the user was retrieving a specific branch, it generally did not exist, and the pull failed. But if the user was pulling or cloning all branches, that is what they received: all the commits and file objects reachable from all branches in the wrong repository. The user (or more often, their build server) might have been expecting to download one day’s commits and instead received some other repository’s entire history.

Users who inadvertently fetched the entire history of some other repository, surprisingly, may not even have noticed. A subsequent “git pull” would almost certainly have been routed to the right place and would have corrected any overwritten branches in the user’s working copy of their Git repository. The unwanted remote references and tags are still there, though. Such a user can delete the remote references, run “git remote prune origin,” and manually delete all the unwanted tags. As a possibly simpler alternative, a user with unwanted repository data can delete that whole copy of the repository and “git clone” it again.
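
In command form, the cleanup described above might look like the following sketch (the tag and repository names are placeholders):

```
# Drop remote-tracking references that no longer exist on the remote
$ git remote prune origin

# Delete any unwanted tags by name
$ git tag -d unwanted-tag

# Or simply make a fresh clone and discard the polluted copy
$ git clone https://github.com/<user>/<repository>.git fresh-copy
```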

Next steps

To prevent this from happening again, we will modify the database driver to detect and only interpret responses that match the packet IDs sent by the database. On the application side, we will consolidate the connection pool management so that Active Record's connection pooling will manage all connections. We are following this up by upgrading the application to a newer version of Rails that doesn't suffer from the "connection reuse" problem.

We will continue to analyze the events surrounding this incident and use our investigation to improve the systems and processes that power GitHub. We consider the unauthorized exposure of even a single private repository to be a serious failure, and we sincerely apologize that this incident occurred.

Introducing Projects for Organizations

You can now use GitHub Projects at the Organization level. All users in your Organization will have access to its Projects, so you and your team can plan and manage work across repositories. With organization-wide Projects, everyone can see what's already in motion and work together without duplicating efforts.

Projects for Organizations

Organization-wide projects can contain issues and pull requests from any repository that belongs to an organization. If an organization-wide project includes issues or pull requests from a repository that you don't have permission to view, you won't be able to see those cards.

Projects launched in September 2016. Check out the documentation to see how you can use them, and stay tuned—there's more to come.

Meet Nahi: Developer and Ruby Contributor

To highlight the people behind projects we admire, we bring you the GitHub Developer Profile blog series.

Hiroshi “Nahi” Nakamura

Hiroshi “Nahi” Nakamura, currently a Site Reliability Engineer (SRE) and Software Engineer at Treasure Data, is a familiar face in Ruby circles. Over the last 25 years, he has not only grown his own career but also supports developers all over the world as a Ruby code contributor. We spoke to Nahi about his work with Ruby and open source, as well as his inspiration for getting started as a developer.

You’ll notice this interview is shared in both Japanese (which the interview was conducted in) and English—despite our linguistic differences, open source connects people from all corners of the globe.

Aki: Give me the brief overview—who is Nahi and what does he do?

簡単に自己玹介をお願いしたす。䞭村浩士さんずいうのは、どんな方で、䜕をなさっおいる方でしょうか

Nahi: I have been an open source software (OSS) developer since I encountered Ruby in 1999, as well as a committer to CRuby and JRuby. Right now, I am an SRE and software engineer at Treasure Data.

1999幎にRubyず出䌚っお以来のOSS開発者で、CRuby、JRubyのコミッタです。 珟圚勀めおいるTreasure Data Inc.ずいう䌚瀟では、SRE兌゜フトりェア゚ンゞニアをやっおいたす。

Aki: How long have you been developing software?

今たでどのくらいの期間に枡っお゜フトり゚アの開発を行っおこられたのでしょうか

Nahi: I started to write my first Basic program when I was about twelve. During college, I began work at a Japanese system development company, and for the past 25 years, I’ve worked in software development at various companies and projects.

初めおBasicでプログラムを曞き始めたのは12才の頃でした。倧孊圚孊䞭に日本のシステム開発䌚瀟でアルバむトを始め、以埌様々な䌚瀟、プロゞェクトで25幎ほど、゜フトりェア開発に携わっおいたす。

Aki: Who did you look up to in your early days?

゜フトり゚ア開発を始められた圓初、どなたを尊敬されおいたか教えお頂けたすか

Nahi: The research lab that I was part of in college had wonderful mentors. In addition, Perl and Common Lisp (of course!) had open source code and taught me that I could freely enhance those programming languages by myself.

The first addition that I made was to Perl (version 4.018), and I believe it was an enhancement on string processing to make it faster. Each program that runs Perl benefited from the change, and though it was small, it gave me an incredible feeling of accomplishment.

Since then, I have had great respect for the creator of the Perl programming language, Larry Wall, whose work has provided me with opportunities like this.

倧孊で圚籍しおいた研究宀には玠晎らしい先茩がたくさんいお、PerlやCommon Lispなどのプログラミング蚀語にももちろん゜ヌスコヌドがあり、自分で自由に拡匵できるこずを教えおくれたした。

はじめお拡匵したのはPerlversion 4.018で、ある文字列凊理の高速化だったず思いたす。Perlで動く各皮プログラムすべおがよい圱響を受け、小さいながらも、玠晎らしい達成感を埗られたした。

その頃から、このような機䌚を䞎えおくれた、Perl䜜者のLarry Wallさんを尊敬しおいたす。

Aki: Tell us about your journey into the world of software development (first computer, first project you contributed to, first program you wrote?)

゜フトり゚ア開発の䞖界に入っお行かれた頃のお話をお聞かせ頂けたすか最初に䜿ったコンピュヌタヌ、最初に参画されたプロゞェクト、最初に曞いたプログラム等

Nahi: I discovered Ruby shortly after I started to work as a software engineer. Until then, I had written in languages like C, C++, and SQL for software for work, and in Perl for my own development support tools.

Without a strong understanding of object-oriented programming, I studied and picked up tools on my own and started contributing to projects. Back then the Ruby community was small, and even a neophyte like myself had many opportunities to interact with brilliant developers working on the project. Through Ruby, I learned many things about software development.

The first open source (we called it ‘free software’ back then) Ruby program I distributed was a logger library. To this day, whenever I type require ‘logger’ in Ruby, it reminds me of that embarrassing code I wrote long ago. The logger library distributed along with Ruby today no longer shows any vestiges of the previously-existing code—it has evolved magnificently, molded into shape on a variety of different platforms and for a variety of different use cases.

゜フトりェア゚ンゞニアずしお働き始めおしばらくしお、Rubyに出䌚いたした。それたでは、C、C++、SQLなどで仕事甚の゜フトりェアを曞き、Perlで自分向けの開発支揎ツヌルを曞いおいたした。

オブゞェクト指向のなんたるかもよくわからず、勉匷がおらそれらツヌルを移怍しおいき、たたRubyコミュニティの流儀にしたがっお配垃し始めたした。その頃はRubyコミュニティも小さく、私のような新参者でも、Rubyコミュニティにいた玠晎らしい開発者たちに觊れ合える機䌚が倚くあり、Rubyを通じ、゜フトりェア開発のいろいろなこずを孊びたした。

最初にOSSその頃はfree softwareず呌んでいたしたずしお配垃したRubyのプログラムは、ログ取埗ラむブラリです。今でもRubyでrequire 'logger'するず、い぀でも昔の恥ずかしいコヌドを思い出すこずができたす。今Rubyず共に配垃されおいるものは、いろいろなプラットフォヌム、いろいろな甚途の元で叩かれお、立掟に成長しおおり、その頃の面圱はもうありたせん。

Aki: What resources did you have available when you first got into software development?

゜フトり゚ア開発を始められた圓初、お䜿いになっおいたリ゜ヌスがどのようなものだったか教えお頂けたすか

Nahi: I wrote SQL, Common Lisp, C—and everything on vi and Emacs. Perl was easy to modify and worked anywhere, so I really treasured it as a resource in my software developer’s toolbelt.

SQL、Common Lisp、C、なんでもviずemacsで曞いおいたした。゜フトりェア開発者のツヌルベルトに入れる道具ずしお、どこでも動き、倉曎がし易いPerlは倧倉重宝したした。

Aki: What advice would you give someone just getting into software development now?

゜フトり゚ア開発の䞖界に入ったばかりの方に、どのようなアドバむスを差し䞊げたすか

Nahi: I think that I came to be the software engineer I am today by participating in the open source community with loads of great developers and engaging in friendly competition with them, as well as trying out the knowledge I learned from the community in my professional life. As opposed to when I first came across Ruby, there are several unique communities now and a great deal of opportunities to leverage them professionally. I really don’t have much advice to share, but I hope that everyone will seek the opportunity to get to know a lot of great engineers.

゜フトりェア開発者ずしおの私は、よい技術者がたくさん居るOSSコミュニティに参加し、圌らの切磋琢磚に参加するこずず、そこで埗た経隓を業務で詊した経隓により䜜られたず思っおいたす。 でも、私がRubyず出䌚った頃ずは違い、今はそのようなコミュニティがたくさんありたすし、それを業務に掻かすチャンスもたくさんありたすね。私ができるアドバむスはほずんどありたせん。みなさんがよい技術者ずたくさん知り合えるこずを祈っおいたす。

Aki: If you forgot everything you knew about software development, and were to start learning to code today, what programming language might you choose and why?

もし゜フトり゚ア開発に関しお珟圚お持ちの知識を党お忘れお、今日からプログラミングを孊ぶこずずなった堎合、どのプログラミング蚀語を遞びたすかたたその理由を教えお頂けたすか

Nahi: I would choose either Ruby or Python. If I still knew what I know now, it would be Python. I would select a language in which the OS and network are hidden only behind a thin veil and easily identified.

RubyかPythonを遞びたす。もし珟圚の知識が残っおいればPythonですね。薄い皮の䞋に、OSやネットワヌクがすぐに芋えるような蚀語をたた遞びたいず思いたす。

Aki: On that note, you make a huge impact as part of Ruby's core contributing team. How specifically did you get started doing that?

Rubyのコアコントリビュヌトチヌムの䞀員ずしお、コミュニティヌに倧きなむンパクトを䞎えおこられたしたが、具䜓的にどのような圢きっかけでRubyコミュニティヌぞの貢献を始められたか教えお頂けたすか

Nahi: After releasing my first open source software, I went on to release several Ruby libraries that I made for work, such as network proxy, csv, logger, soap, httpclient, and others. With Ruby 1.8, Matz (Yukihiro “Matz” Matsumoto, the chief designer of Ruby) put a policy in place to expand the Standard Library in order to spread Ruby. This allowed the user to do everything they needed to do without additional libraries by simply installing it. A number of the libraries that I had made were chosen as candidates at the time, and I have mainly maintained the Standard Library ever since. The announced policy to expand the Standard Library was a great coincidence for me, since it allowed me to build experience.

初めおOSSで公開しお以埌、業務で䜿うために䜜ったRubyのラむブラリをいく぀か公開しおいきたした。network proxy、csv、logger、soap、httpclientなど。Ruby 1.8の時、MatzがRubyを広めるために、暙準添付ラむブラリを拡充する方針を立おたした。むンストヌルすれば、远加ラむブラリなしに䞀通りのこずができるようにしよう、ずいうわけです。その際に、私の䜜っおいたラむブラリもいく぀か候補に遞ばれ、以埌䞻に、暙準ラむブラリのメンテナンスをするようになりたした。暙準添付ラむブラリ拡充方針は、私が経隓を積むこずが出来たずいう点で、倧倉よい偶然でした。

Aki: For new contributors to Ruby, what do you think is the most surprising thing about the process?

Rubyの新たなコントリビュヌタヌの方にずっお、Rubyコミュニティヌの開発プロセスに関し、どのような郚分が最も驚きのある郚分ずお考えになりたすか

Nahi: To be honest, I haven’t been able to contribute to Ruby itself over the past few years, so I am not aware of the details on the specific development process. However, I think the most surprising part is that it clearly does not look like there is a process.

In reality, a group of core contributors discuss and make decisions on the direction of development and releases, so to contribute to Ruby itself, you must ultimately propose an idea or make a request to those core contributors.

That’s the same with any community, though. One defining characteristic of the process might be that the proposals can be fairly relaxed, as there is no culture of creating formal documents.

正盎に蚀うず、この数幎はRubyそのものぞのコントリビュヌトを行えおいないので、具䜓的な開発プロセス詳现に぀いおは把握しおいたせん。が、明らかに、プロセスがあるように芋えないのが、䞀番驚きのある郚分だず思いたす。

実際には、開発の方向性決定、リリヌスの決定に぀いおは、䞀郚のコアなコントリビュヌタが盞談し぀぀行っおいお、Rubyそのものぞのコントリビュヌトは、最終的には圌/圌女らに察する提案、芁望ずなる必芁がありたす。でもそれは、どのコミュニティでも同じですね。文曞化の文化がない分、提案もわりずルヌズで構わないのは特城かもしれたせん。

Aki: Okay, we have to ask. What is the most interesting pull request you've received for Ruby?

お尋ねしなくおはならないこずなのですが。。今たでRubyの開発を行っおこられたなかで、䞭村さんがお受けになった最も興味深い面癜いPull Requestはどのようなものでしょうか

Nahi: While not necessarily a “pull request,” I have received all sorts of suggestions that stand out: replacing the Ruby execution engine, swapping out the regular expression library, gemifying the Standard Library, etc. As for the most memorable pull request I have received personally, one was a request to swap out the CSV library I made for a different high-speed library. When I think about it with a clear mind, it was a legitimate request, but it took forever to make the right decision.

"Pull request"ずいう名前ではありたせんが、印象深いものはたくさんありたす。Ruby実行゚ンゞンの差し替え、正芏衚珟ラむブラリの眮き換え、暙準ラむブラリのgem化など。私個人に関するものずしおは、自身の䜜ったcsvラむブラリを、別の高速ラむブラリで眮き換えたい、ずいうリク゚ストが䞀番印象深いものでした。冷静に考えお正しいリク゚ストでしたが、適切な刀断をするために、いちいち時間がかかりたした。

Aki: Outside of your open source work, you also work full time as a developer. Does your participation in open source inform choices you make at work? How?

Open Sourceに関する掻動ずは別に、フルタむムの゜フトり゚ア開発者ずしおご勀務されおいたすが、Open Sourceコミュニティぞの参加は職堎における日々の意思決定にどのような圱響を䞎えおいたすか

Nahi: Active involvement in open source is one of the pillars of business at the company I currently work for, and it informs the choices the other engineers and I make unconsciously. When developing something new for the business, we never begin work on a project without examining existing open source software and the open source community. As much as possible, we try not to make anything that replicates what something else does. However, if we believe it necessary, even if existing software does the same thing, we make products the way they should be made. Then, we compete with that and contribute our version back to the world as open source. The experiences and knowledge that we pick up, and also give back through the process, is the lifeblood of software development.

Until I came to my current company a year and a half ago, I led dozens of system development projects, mainly as a technical architect in the enterprise IT world for about 15 years. Back then, I participated in open source individually rather than at my company.

珟圚所属しおいる䌚瀟は、Open Sourceぞの積極的な関䞎をビゞネスの柱の䞀぀ずしおいるこずもあり、特に意識せずずも、私および各゚ンゞニアの意思決定に圱響を䞎えおいたす。ビゞネスのため、䜕か新しい物を開発する時、既存のOpen Source゜フトりェア、たたOSSコミュニティの調査なしに䜜り始めるこずはありたせん。可胜な限り、甚途が重耇するものは䜜りたせん。しかしそうず信じれば、甚途が同じでも、あるべき姿のものを䜜りたす。そしおそれは、Open Sourceずしお䞖の䞭に還元し、競争しおいきたす。そのような䞭で埗られる、たた提䟛できる経隓、知芋は、゜フトりェア開発の血液のようなものです。

唐突ですが、1幎半前に珟圚の䌚瀟に来る前たでは、15幎ほど、゚ンタヌプラむズITの䞖界で、䞻にテクニカルアヌキテクトずしお数十のシステム開発プロゞェクトをリヌドしおいたした。その頃は、䌚瀟ではなく個人でOSS掻動を行っおいたした。

Aki: Tell us about your view on where the enterprise IT world is lagging behind. How do you see the open source developer community making a contribution to change that?

゚ンタヌプラむズITの䞖界がどのような点でOpen Source等の䞖界から遅れおいるずお考えになるか教えお頂けたすかOpen Sourceコミュニティヌの゜フトり゚ア開発者の方々が、゚ンタヌプラむズITの状況を倉革させるこずに、どのような貢献ができるずお考えになっおいるか教えお頂けたすか

Nahi: In the enterprise IT world, we were trying to create a future that was predictable in order to control the complexity of business and the possibility of change. Now, however, it is hard to predict what things will be like one or two years down the road. The influence of this unpredictability is growing so significant that it cannot be ignored. Luckily, I was given the opportunity to lead a variety of projects, and what helped me out then was the experiences and knowledge I had picked up by being involved in the open source community.

To be honest, developers participating in the open source community now have already made a variety of contributions to the enterprise IT world, and I am one of those beneficiaries. To enhance the software development flow, developers in the enterprise IT world need to participate more in open source. I would venture to say that establishing such an environment and showing understanding towards it may be thought of as further contributions on the enterprise side.

゚ンタヌプラむズITの䞖界では、ビゞネスの耇雑さず倉曎可胜性をcontrolするため、予枬可胜な未来を䜜ろうずしおいたした。しかし今では、1幎、2幎埌を予枬するのは困難です。この予枬できないこずの圱響は、無芖できないほど倧きくなっおいたす。私は幞いにも、各皮プロゞェクトをリヌドする機䌚を䞎えられたした。その時に圹立ったのは、Open Sourceコミュニティずの関わりの䞭で埗られた経隓、知芋でした。

正盎に蚀うず、珟圚Open Sourceコミュニティに参加しおいる開発者は、゚ンタヌプラむズITの䞖界に、既に様々な貢献をされおいるず思いたす。私もその恩恵を受けた䞀人です。 ゜フトりェア開発の血液を埪環させるためには、゚ンタヌプラむズITの䞖界に居る開発者が、もっずOpen Sourceコミュニティに参加できるようにならないずいけたせん。しいお蚀えば、そのような環境を敎えるこず、理解を瀺すこず、などは、曎なる貢献ずしお考えられるこずかもしれたせん。

To learn more about Nahi’s contributions to Ruby, visit his GitHub profile page here. You can also learn more about Ruby itself by visiting their homepage.