Open Source Blog
News about Google's open source student programs and software releases
GitHub on BigQuery: Analyze all the code
Wednesday, June 29, 2016
Posted by Felipe Hoffa, Google Developer Advocate
Google, in collaboration with GitHub, is releasing an incredible new open dataset on Google BigQuery. So far you've been able to monitor and analyze GitHub's pulse since 2011 (thanks to the GitHub Archive project!) and today we're adding the perfect complement to this. What could you do if you had access to analyze all the open source software in the world, with just one SQL command?
The Google BigQuery Public Datasets program now offers a full snapshot of the content of more than 2.8 million open source GitHub repositories in BigQuery. Thanks to our new collaboration with GitHub, you'll have access to analyze the source code of almost 2 billion files with a simple (or complex) SQL query. This will open the doors to all kinds of new insights and advances that we're just beginning to envision.
For example, let's say you're the author of a popular open source library. Now you'll be able to find every open source project on GitHub that's using it. Even more, you'll be able to guide the future of your project by analyzing how it's being used, and improve your APIs based on what your users are actually doing with it.
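As a rough illustration of the idea above, the Python sketch below scans a handful of made-up file records for imports of a library and lists the projects that use it. The record layout (`repo`, `path`, `content`), the sample rows and the library name `fastjson` are assumptions for illustration only. They are not the dataset's actual schema, and a real analysis would run as a SQL query in BigQuery rather than in local Python.

```python
import re
from collections import defaultdict

# Hypothetical sample rows; not the real dataset schema or contents.
SAMPLE_FILES = [
    {"repo": "alice/webapp", "path": "app/main.py",
     "content": "import fastjson\nfastjson.loads(s)\n"},
    {"repo": "alice/webapp", "path": "app/util.py",
     "content": "from fastjson import dumps\n"},
    {"repo": "bob/cli", "path": "cli.py",
     "content": "import json\n"},
    {"repo": "carol/etl", "path": "load.py",
     "content": "import os\nimport fastjson as fj\n"},
]


def projects_using(library, files):
    """Map each repo to the paths whose content imports `library`."""
    # Match "import <library>" or "from <library>" at the start of any line.
    pattern = re.compile(
        r"^\s*(?:import|from)\s+" + re.escape(library) + r"\b", re.MULTILINE
    )
    hits = defaultdict(list)
    for row in files:
        if pattern.search(row["content"]):
            hits[row["repo"]].append(row["path"])
    return dict(hits)


usage = projects_using("fastjson", SAMPLE_FILES)
for repo, paths in sorted(usage.items()):
    print(repo, len(paths))
```

Grouping the matching paths by repository is what lets a library author see not only who uses the library but where, which is the starting point for studying how an API is actually called.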
On the security side, we've seen how the most popular open source projects benefit from having multiple eyes and hands working on them. This visibility helps projects get hardened and buggy code cleaned up. What if you could search for errors with similar patterns in every other open source project? Would you notify their authors and send them pull requests? Well, now you can.
Some concepts to keep in mind while working with BigQuery and the GitHub contents dataset:
With BigQuery everyone gets a terabyte every month to run queries. If you've never tried BigQuery before, follow these getting started instructions.
The contents table has all the non-binary files in GitHub that are less than 1MB. It's a huge table, with more than 1.5 terabytes of data! This means the monthly terabyte for BigQuery queries won't last long if you want to query this table. To make your life easier, we've created extracts with only a sample of 10% of all files of the most popular projects, as well as another dataset with all the .go, .rb, .js, .php, .py, and .java code. Use them to make your free quota last!
If these tables are not enough, you can always create your own extracts (but you'll be billed for the respective storage). To do so, you could sign up for $300 in Google Cloud Platform credits. These credits could be used to store terabytes (and more) of data in BigQuery.
BigQuery makes it easy to join different datasets. How about ranking coding patterns by the number of stars their projects get? See a related post looking at the Hacker News effect on a project’s GitHub stars.
SQL is not enough? Learn how BigQuery allows you to run arbitrary JavaScript code inside SQL to enable a full range of possibilities.
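The quota arithmetic behind the second point can be sketched in a few lines of Python. The 1 TB monthly allowance and the 1.5 TB table size come from the text above. The 150 GB extract size is a made-up figure for illustration, the sketch assumes every query scans the whole table, and it rounds 1 TB to 1,000 GB.

```python
FREE_QUOTA_GB = 1000       # monthly free query allowance (1 TB, from the post)
CONTENTS_TABLE_GB = 1500   # lower bound on the contents table size (from the post)
HYPOTHETICAL_EXTRACT_GB = 150  # assumed extract size, for illustration only


def full_scans_per_month(table_gb, quota_gb=FREE_QUOTA_GB):
    """How many complete scans of a table fit in the monthly quota."""
    return quota_gb // table_gb


def quota_used(table_gb, quota_gb=FREE_QUOTA_GB):
    """Fraction of the monthly quota consumed by one full scan."""
    return table_gb / quota_gb


print(full_scans_per_month(CONTENTS_TABLE_GB))        # 0
print(quota_used(CONTENTS_TABLE_GB))                  # 1.5
print(full_scans_per_month(HYPOTHETICAL_EXTRACT_GB))  # 6
```

Under these assumptions a single scan of the full contents table already exceeds the free allowance, while a smaller extract leaves room for several queries a month.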
To learn more, read GitHub's announcement and try some sample queries. Share your queries and findings in our reddit.com/r/bigquery and Hacker News posts. The ideas are endless, and I'll start collecting tips and links to other articles on this post on Medium.
Stay curious!
More statistics from Google Summer of Code 2016
Tuesday, June 28, 2016
Google Summer of Code (GSoC) 2016 is officially at its halfway point. Mentors and students have just completed their midterm evaluations and it’s time for our second stats post. This time we take a closer look at our participating students.
First, we’d like to highlight the universities with the most student participants. Congratulations are due to the International Institute of Information Technology - Hyderabad for claiming the top spot for the third consecutive year!
Country | School | 2016 Accepted Students | 2015 Accepted Students | 12-Year Total
India | International Institute of Information Technology - Hyderabad | 50 | 62 | 252
Sri Lanka | University of Moratuwa | 29 | 44 | 320
Romania | University POLITEHNICA of Bucharest | 24 | 14 | 155
India | Birla Institute of Technology and Science Pilani, Goa Campus | 22 | 15 | 110
India | Birla Institute of Technology and Science, Pilani Campus | 22 | 18 | 116
India | Indian Institute of Technology, Bombay | 18 | 13 | 75
India | Indian Institute of Technology, Kharagpur | 15 | 8 | 92
India | Indian Institute of Technology, Roorkee | 15 | 8 | 57
India | Indraprastha Institute of Information Technology Delhi | 15 | 7 | 27
India | Amrita School of Engineering, Amrita University, Amritapuri Campus | 13 | 5 | 33
India | Indian Institute of Technology, Guwahati | 13 | 5 | 38
Cameroon | University of Buea | 12 | 10 | 26
India | Delhi Technological University | 12 | 9 | 60
India | Indian Institute of Technology BHU Varanasi | 12 | 12 | 37
Germany | TU Munich | 11 | 7 | 45
Next, we are proud to announce that 2016 marks the largest number of female GSoC participants to date: 12% of accepted students are female, up 2.2% from 2015. This is good progress, but we are certain we can do better in the future to diversify our program. The Google Open Source team will continue our outreach to many organizations, for example, Grace Hopper and Black Girls Code, to increase this number even more in 2017. If you have any suggestions of organizations we should work with, please let us know in the comments.
Finally, each year we like to look at the majors of students. As expected, the most common area of study for our participants is Computer Science (approximately 78%), but this year we have a wide variety of studies including Linguistics, Law, Music Technology and Psychology. The majority of our students this year are undergraduates (67%), followed by Masters (23%) and then PhD students (9%).
Although reviewing GSoC statistics each year is great fun, we want to stress that being “first place” is not the point of the program. Our goal is to get more and more students involved in creating free and open source software. We hope Google Summer of Code encourages contributions to projects that have the potential to make a difference worldwide. Congratulations to the students from all over the globe and keep up the good work!
By Mary Radomile, Open Source Programs Office
Coding has begun for Google Summer of Code 2016
Monday, May 23, 2016
Today marks the start of coding for the 12th annual Google Summer of Code. With the community bonding period complete, about 1,200 students now begin 12 weeks of writing code for 178 different open source organizations.
We are excited to see the contributions this year’s students will make to the open source community.
For more information on important dates for the program please visit our timeline. Stay tuned as we will highlight some of the new mentoring organizations over the next few months.
Have a great summer and happy coding!
By Josh Simmons, Open Source Programs Office
Google Summer of Code 2016 statistics: Part one
Monday, May 23, 2016
We share statistics from Google Summer of Code (GSoC) every year, and now that 2016 is chugging along we’ve got some exciting numbers to share! 1,206 students from all over the globe are currently in the community bonding period, a time when participants learn more about the organization they will be contributing to before coding officially begins on May 23. This includes becoming familiar with the community practices and processes, setting up a development environment, or contributing small (or large) patches and bug fixes.
We’ll start our statistics reporting this year with the total number of students participating from each country:
Country | Accepted Students | Country | Accepted Students | Country | Accepted Students
Albania | 1 | Greece | 10 | Romania | 31
Algeria | 1 | Guatemala | 1 | Russian Federation | 52
Argentina | 3 | Hong Kong | 2 | Serbia | 2
Armenia | 3 | Hungary | 7 | Singapore | 7
Australia | 6 | India | 454 | Slovak Republic | 3
Austria | 19 | Ireland | 3 | Slovenia | 4
Belarus | 5 | Israel | 2 | South Africa | 2
Belgium | 5 | Italy | 23 | South Korea | 6
Bosnia-Herzegovina | 1 | Japan | 12 | Spain | 33
Brazil | 21 | Kazakhstan | 2 | Sri Lanka | 54
Bulgaria | 2 | Kenya | 3 | Sweden | 5
Cambodia | 1 | Latvia | 3 | Switzerland | 2
Cameroon | 16 | Lithuania | 1 | Taiwan | 7
Canada | 23 | Luxembourg | 1 | Thailand | 1
China | 34 | Macedonia | 1 | Turkey | 12
Croatia | 2 | Mexico | 2 | Ukraine | 13
Czech Republic | 6 | Netherlands | 9 | United Kingdom | 18
Denmark | 2 | New Zealand | 2 | United States | 118
Egypt | 10 | Pakistan | 4 | Uruguay | 1
Estonia | 1 | Paraguay | 1 | Venezuela | 1
Finland | 3 | Philippines | 2 | Vietnam | 4
France | 19 | Poland | 28 | |
Germany | 66 | Portugal | 7 | |
We’d like to welcome a new country to the GSoC family. 2016 brings us one student from Albania!
In our upcoming statistics posts, we will delve deeper into the numbers by looking at universities with the most accepted students, gender numbers, mentor countries and more. If you have additional statistics that you would like us to share, please leave a comment below and we will consider including them in an upcoming post.
By Mary Radomile, Open Source Programs
Correction: A previous version of this blog post erroneously reported the total number of students as 1,202 and the number of students from Cameroon as 1. This has been updated to reflect the actual totals as 1,206 and 16 respectively.
Announcing SyntaxNet: The World’s Most Accurate Parser Goes Open Source
Friday, May 13, 2016
Originally posted on the Google Research Blog
By Slav Petrov, Senior Staff Research Scientist
At Google, we spend a lot of time thinking about how computer systems can read and understand human language in order to process it in intelligent ways. Today, we are excited to share the fruits of our research with the broader community by releasing SyntaxNet, an open-source neural network framework implemented in TensorFlow that provides a foundation for Natural Language Understanding (NLU) systems. Our release includes all the code needed to train new SyntaxNet models on your own data, as well as Parsey McParseface, an English parser that we have trained for you and that you can use to analyze English text.
Parsey McParseface is built on powerful machine learning algorithms that learn to analyze the linguistic structure of language, and that can explain the functional role of each word in a given sentence. Because Parsey McParseface is the most accurate such model in the world, we hope that it will be useful to developers and researchers interested in automatic extraction of information, translation, and other core applications of NLU.
How does SyntaxNet work?
SyntaxNet is a framework for what’s known in academic circles as a syntactic parser, which is a key first component in many NLU systems. Given a sentence as input, it tags each word with a part-of-speech (POS) tag that describes the word's syntactic function, and it determines the syntactic relationships between words in the sentence, represented in the dependency parse tree. These syntactic relationships are directly related to the underlying meaning of the sentence in question. To take a very simple example, consider the following dependency tree for Alice saw Bob:
This structure encodes that Alice and Bob are nouns and saw is a verb. The main verb saw is the root of the sentence and Alice is the subject (nsubj) of saw, while Bob is its direct object (dobj). As expected, Parsey McParseface analyzes this sentence correctly, but also understands the following more complex example:
This structure again encodes the fact that Alice and Bob are the subject and object respectively of saw. It also shows that Alice is modified by a relative clause with the verb reading, that saw is modified by the temporal modifier yesterday, and so on. The grammatical relationships encoded in dependency structures allow us to easily recover the answers to various questions, for example whom did Alice see?, who saw Bob?, what had Alice been reading about? or when did Alice see Bob?.
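To make the idea of recovering answers from a dependency structure concrete, here is a minimal Python sketch of the simple Alice saw Bob parse described above. The `Token` class, the tag names `NOUN` and `VERB`, the `ROOT` label and the helper functions are assumptions made for illustration. They are not SyntaxNet's data format or API, and the parse is written by hand rather than produced by Parsey McParseface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    index: int   # 1-based position in the sentence
    word: str
    pos: str     # part-of-speech tag (tag names are an assumption)
    head: int    # index of the governing token, 0 for the root
    label: str   # dependency relation to the head


# "Alice saw Bob", annotated by hand to match the description in the text.
PARSE = [
    Token(1, "Alice", "NOUN", 2, "nsubj"),
    Token(2, "saw", "VERB", 0, "ROOT"),
    Token(3, "Bob", "NOUN", 2, "dobj"),
]


def root(parse):
    """Return the token that has no head, i.e. the root of the sentence."""
    return next(t for t in parse if t.head == 0)


def dependents(parse, head, label):
    """Return the words attached to `head` with the given relation."""
    return [t.word for t in parse if t.head == head.index and t.label == label]


verb = root(PARSE)
print("who saw Bob?", dependents(PARSE, verb, "nsubj"))
print("whom did Alice see?", dependents(PARSE, verb, "dobj"))
```

Each question reduces to following one labeled edge from the main verb, which is why a correct parse makes this kind of lookup easy.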
Why is Parsing So Hard For Computers to Get Right?
One of the main problems that makes parsing so challenging is that human languages show remarkable levels of ambiguity. It is not uncommon for moderate-length sentences (say 20 or 30 words in length) to have hundreds, thousands, or even tens of thousands of possible syntactic structures. A natural language parser must somehow search through all of these alternatives, and find the most plausible structure given the context. As a very simple example, the sentence Alice drove down the street in her car has at least two possible dependency parses:
The first corresponds to the (correct) interpretation where Alice is driving in her car; the second corresponds to the (absurd, but possible) interpretation where the street is located in her car. The ambiguity arises because the preposition in can either modify drove or street; this example is an instance of what is called prepositional phrase attachment ambiguity.
Humans do a remarkable job of dealing with ambiguity, almost to the point where the problem is unnoticeable; the challenge is for computers to do the same. Multiple ambiguities such as these in longer sentences conspire to give a combinatorial explosion in the number of possible structures for a sentence. Usually the vast majority of these structures are wildly implausible, but are nevertheless possible and must be somehow discarded by a parser.
SyntaxNet applies neural networks to the ambiguity problem. An input sentence is processed from left to right, with dependencies between words being incrementally added as each word in the sentence is considered. At each point in processing many decisions may be possible, due to ambiguity, and a neural network gives scores for competing decisions based on their plausibility. For this reason, it is very important to use beam search in the model. Instead of simply taking the first-best decision at each point, multiple partial hypotheses are kept at each step, with hypotheses only being discarded when there are several other higher-ranked hypotheses under consideration. An example of a left-to-right sequence of decisions that produces a simple parse is shown below for the sentence I booked a ticket to Google.
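The difference between taking the first-best decision and keeping several hypotheses can be seen in a toy Python sketch of beam search. The probability table `TOY_MODEL` and the action names are invented for illustration. In SyntaxNet the decisions would be parser actions scored by a neural network, and this sketch makes no claim about how SyntaxNet implements its search.

```python
import math

# Hypothetical model: probability of each next action given the actions so far.
TOY_MODEL = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"X": 0.55, "Y": 0.45},
    ("B",): {"X": 0.9, "Y": 0.1},
}


def beam_search(model, steps, beam_size):
    """Return the best (log_score, actions) pair found with the given beam size."""
    beam = [(0.0, ())]  # each hypothesis: (cumulative log-probability, actions)
    for _ in range(steps):
        candidates = []
        for log_score, actions in beam:
            for action, prob in model[actions].items():
                candidates.append(
                    (log_score + math.log(prob), actions + (action,))
                )
        # Keep only the highest-scoring partial hypotheses.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:beam_size]
    return beam[0]


greedy = beam_search(TOY_MODEL, steps=2, beam_size=1)
wider = beam_search(TOY_MODEL, steps=2, beam_size=2)
print(greedy[1], round(math.exp(greedy[0]), 2))  # ('A', 'X') 0.33
print(wider[1], round(math.exp(wider[0]), 2))    # ('B', 'X') 0.36
```

With a beam of one, the search commits to the locally best first action and ends with a less probable sequence. Keeping two hypotheses lets the initially lower-ranked choice survive long enough to win.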
Furthermore, as described in our paper, it is critical to tightly integrate learning and search in order to achieve the highest prediction accuracy. Parsey McParseface and other SyntaxNet models are some of the most complex networks that we have trained with the TensorFlow framework at Google. Given some data from the Google-supported Universal Treebanks project, you can train a parsing model on your own machine.
So How Accurate is Parsey McParseface?
On a standard benchmark consisting of randomly drawn English newswire sentences (the 20-year-old Penn Treebank), Parsey McParseface recovers individual dependencies between words with over 94% accuracy, beating our own previous state-of-the-art results, which were already better than any previous approach. While there are no explicit studies in the literature about human performance, we know from our in-house annotation projects that linguists trained for this task agree in 96-97% of the cases. This suggests that we are approaching human performance, but only on well-formed text. Sentences drawn from the web are a lot harder to analyze, as we learned from the Google WebTreebank (released in 2011). Parsey McParseface achieves just over 90% parse accuracy on this dataset.
While the accuracy is not perfect, it’s certainly high enough to be useful in many applications. The major sources of errors at this point are examples such as the prepositional phrase attachment ambiguity described above, which require real-world knowledge (e.g. that a street is not likely to be located in a car) and deep contextual reasoning. Machine learning (and in particular, neural networks) has made significant progress in resolving these ambiguities. But our work is still cut out for us: we would like to develop methods that can learn world knowledge and enable equal understanding of natural language across all languages and contexts.
To get started, see the SyntaxNet code and download the Parsey McParseface parser model. Happy parsing from the main developers, Chris Alberti, David Weiss, Daniel Andor, Michael Collins & Slav Petrov.
Googlers on the road: OSCON 2016 in Austin
Monday, May 9, 2016
Developers and open source enthusiasts converge on Austin, Texas in just under two weeks for O’Reilly Media’s annual open source conference, OSCON, and the Community Leadership Summit (CLS) that precedes it. CLS runs May 14-15 at the Austin Convention Center followed by OSCON from May 16-19.
OSCON 2014 program chairs including Googler Sarah Novotny. Photo licensed by O'Reilly Media under CC-BY-NC 2.0.
This year we have 10 Googlers hosting sessions covering topics including web development, machine learning, devops, astronomy and open source. A list of all of the talks hosted by Googlers alongside related events can be found below.
If you’re a student, educator, mentor, past or present participant in Google Summer of Code or Google Code-in, or just interested in learning more about the two programs, make sure to join us Monday evening for our Birds of a Feather session.
Have questions about Kubernetes, Google Summer of Code, open source at Google or just want to meet some Googlers? Stop by booth #307 in the Expo Hall.
Thursday, May 12th - GDG Austin
7:00pm Google Developers Group Austin Meetup
Sunday, May 15th - Community Leadership Summit
10:00am Occupational Hazard by Josh Simmons
Monday, May 16th
9:00am Kubernetes: From scratch to production in 2 days by Brian Dorsey and Jeff Mendoza
7:00pm Google Summer of Code and Google Code-in Birds of a Feather
Tuesday, May 17th
9:00am Kubernetes: From scratch to production in 2 days by Brian Dorsey and Jeff Mendoza
9:00am Diving into machine learning through TensorFlow by Julia Ferraioli, Amy Unruh and Eli Bixby
Wednesday, May 18th
1:50pm Open source lessons from the TODO Group by Chris DiBona, Chris Aniszczyk, Nithya Ruff, Jeff McAffer and Benjamin VanEvery
5:10pm Scalable bidirectional communication over the Web by Wenbo Zhu
Thursday, May 19th
11:00am Kubernetes hackathon at OSCON Contribute hosted by Brian Dorsey, Nikhil Jindal, Janet Kuo, Jeff Mendoza, John Mulhausen, Sarah Novotny, Terrence Ryan and Chao Xu
2:40pm Blocks in containers: Lessons learned from containerizing Minecraft by Julia Ferraioli
5:10pm PANOPTES: Open source planet discovery by Jennifer Tong and Wilfred Gee
5:10pm Stop writing JavaScript frameworks by Joseph Gregorio
Haven’t registered for OSCON yet? You can knock 25% off the cost of registration by using discount code Google25, or attend parts of the event including our Birds of a Feather session for free by using discount code OSCON16XPO.
See you at OSCON!
By Josh Simmons, Open Source Programs Office
XRay: a function call tracing system
Tuesday, May 3, 2016
At Google we spend a lot of time debugging and tuning the performance of our production systems. Some standard practices when doing this involve using profilers, debuggers, and analysis of logs and execution traces. Doing this at scale, in production, is difficult. One of the ways of getting high-fidelity data from production systems is to build applications with instrumentation, and then reconstruct the instrumentation data into a form humans can consume (summary statistics, reports, etc.). Instrumentation comes at a cost though, sometimes too high to make it feasible to deploy in production.
Getting this balance right is hard. This is why we've developed XRay, a function call tracing system that has very little overhead when not enabled, but can be turned on dynamically at only moderate cost. XRay works as a combination of compiler-inserted instrumentation points which functionally do nothing (called "nop sleds") and a library that can be enabled and disabled at runtime, which replaces the nop sleds with the appropriate instrumentation instructions.
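The general pattern of an instrumentation point that costs almost nothing until it is switched on can be sketched in Python. This is only an analogy: XRay patches compiler-inserted nop sleds in machine code, whereas the sketch below checks a module-level handler on each call. The names `traced`, `enable_tracing` and `disable_tracing` are hypothetical and are not part of XRay.

```python
import functools
import time

_handler = None  # None means tracing is off and calls take the fast path


def traced(func):
    """Wrap func with an instrumentation point that is cheap when disabled."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        handler = _handler
        if handler is None:
            return func(*args, **kwargs)
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Report the function name and elapsed seconds to the handler.
            handler(func.__name__, time.perf_counter() - start)
    return wrapper


def enable_tracing(handler):
    """Turn tracing on at runtime by installing a handler."""
    global _handler
    _handler = handler


def disable_tracing():
    """Turn tracing off again; instrumented calls return to the fast path."""
    global _handler
    _handler = None


@traced
def handle_request(n):
    return sum(range(n))


events = []
handle_request(1000)  # tracing off: nothing recorded
enable_tracing(lambda name, seconds: events.append(name))
handle_request(1000)  # tracing on: one event recorded
disable_tracing()
handle_request(1000)  # tracing off again
print(events)         # ['handle_request']
```

The design choice mirrored here is that instrumentation is compiled in everywhere but only pays its full cost while a handler is installed, so it can stay in production builds.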
We've been using XRay to debug internal systems, from core infrastructure services like Bigtable to ad serving systems. XRay's detailed function tracing has enabled several teams in Google to debug issues that would be really hard to solve without XRay.
We think XRay is an important piece of technology, not only at Google, but for developers around the world. It's because of this that we're working on making XRay open source. To kick-start that process, we're releasing a white paper describing the technical details of XRay. In the following weeks, we will be engaging the LLVM community, where we are committed to making XRay available for wide use and distribution.
We hope that by open-sourcing XRay we can contribute to the advancement of debugging real-world applications. We're looking forward to working with the LLVM community and other projects to make the data XRay generates useful for debugging a wide variety of applications.
By Dean Michael Berris, Google Engineering
Students announced for Google Summer of Code 2016
Friday, April 22, 2016
It's that time of year again: 1,206 students have been accepted for our 2016 Google Summer of Code! Congratulations all around. We want to thank everyone who applied: it was another competitive year with 178 mentoring organizations receiving 7,543 proposals from 5,107 students.
Now we enter the community bonding period when students get acquainted with their mentors and familiarize themselves with their new community before they begin coding in May. In this period, students will do things like hang out in IRC channels and read documentation, become familiar with the code base and set their deadlines and milestones with their mentors.
If you want to review important dates or learn more about the 178 organizations that the accepted students will be working with over the summer, please visit the program website.
Here's to another exciting and productive summer of contributing to open source.
By Josh Simmons, Open Source Programs Office
CCTZ v2.0 — now with more civil time
Tuesday, April 12, 2016
Last September we announced an open source project called CCTZ, a C++ library that enables computing with arbitrary time zones. Today we're announcing CCTZ v2.0 which introduces a new civil time library. Civil time is a legally recognized representation of time used by humans (i.e., year, month, day, hour, minute and second). The most common example of a civil time is a time zone independent date. In version 2.0, CCTZ's time zone and new civil time libraries cooperate with the standard C++ <chrono> library to give programmers a complete (and simple!) framework in which to reason about and solve even the most complicated time programming problems.
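The relationship between an absolute point in time, a time zone and a civil time can be sketched with Python's standard library. This is not CCTZ's API. The helper names `to_civil` and `to_absolute` are hypothetical, and fixed UTC offsets stand in for real time zones, which also carry daylight-saving rules and historical changes.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed offsets stand in for real time zones in this sketch.
UTC = timezone.utc
NEW_YORK_SUMMER = timezone(timedelta(hours=-4))
SYDNEY_WINTER = timezone(timedelta(hours=10))


def to_civil(absolute, zone):
    """Convert an absolute time to civil-time fields in the given zone."""
    local = absolute.astimezone(zone)
    return (local.year, local.month, local.day,
            local.hour, local.minute, local.second)


def to_absolute(civil, zone):
    """Convert civil-time fields in a zone back to an absolute time."""
    return datetime(*civil, tzinfo=zone).astimezone(UTC)


instant = datetime(2016, 4, 12, 23, 30, 0, tzinfo=UTC)
print(to_civil(instant, NEW_YORK_SUMMER))  # (2016, 4, 12, 19, 30, 0)
print(to_civil(instant, SYDNEY_WINTER))    # (2016, 4, 13, 9, 30, 0)
```

The same instant maps to different civil times, and even different dates, depending on the zone. A civil time on its own does not identify an instant until a time zone is supplied.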
To learn more, please check out the project page on GitHub. Pay particular attention to the fundamental concepts section which establishes a simple, cross-platform and language agnostic mental model that will help you reason about time programming challenges with ease and confidence. And don't forget to subscribe to the new CCTZ mailing list to ask questions and learn about future announcements.
By Greg Miller and Bradley White, Google Engineering
Google Summer of Code marches on!
Friday, April 1, 2016
Google Summer of Code 2016 (GSoC) is well underway and we’ve already seen some impressive numbers, all record highs!
18,981 total registered students (up 36% from 2015)
17.34% female registrants
142 countries
5,107 students submitting 7,543 project proposals
Student proposals are currently being reviewed by over 2,300 mentors and organization administrators from the 180 participating mentor organizations. We will announce accepted students on April 22, 2016 on the Open Source blog and on the program site.
Last week, members of the Google Open Source Programs team attended FOSSASIA in Singapore, Asia’s premier open technology event, to talk about GSoC and Google Code-in. There, we met dozens of former GSoC and GCI students and mentors who were excited to embark on another great year. To learn more about Google Summer of Code, please visit our program site.
By Stephanie Taylor, Open Source Programs