Google Safe Browsing API

Developer's Guide

The Safe Browsing API is an experimental API that allows client applications to check URLs against Google's constantly-updated blacklists of suspected phishing and malware pages. Your client application can use the API to download an encrypted table for local, client-side lookups of URLs.

In addition to providing some background on the capabilities of the Safe Browsing API, this document provides examples for interacting with the API by sending HTTP messages to download lists or perform lookups. This document also includes information and examples on how to perform client-side lookups from a downloaded list.

Audience

This document is intended for programmers who want to use Google anti-phishing and anti-malware data to protect users from potentially malicious websites. It provides a series of examples of basic data API interactions, a set of guidelines for how the data may be accessed, and information on how the data can be used. Please note that the data format is likely to change in the future.

Overview

Google publishes phishing and malware data in two separate blacklists (goog-black-hash and goog-malware-hash). Each is a list of MD5 hash values. The client should keep a local copy of the lists and consult them for every URL the user visits. The client should store the lists as it receives them and make no attempt to convert a hashed list to plaintext. Ideally, clients should update their lists every 30 minutes by contacting Google to check for new data. Your application is not permitted to show warnings to end users unless it has requested an update in the last 30 minutes without receiving an error response. Also, if you do show warnings to end users, you must adhere to Google's guidelines for the warning text and provide appropriate attribution as discussed below under End-user visible warnings.

We will limit the number of users you can support with a single API key. If you expect to have more than 10,000 users sending regular requests to the API, you must contact us by sending email to antiphish-malware-cap-req@google.com.

Lists are versioned with a major and minor number. The major version is currently 1, and is used to describe the wire format for serializing the lists (see below). The minor number indicates the version of the list. When we add or remove items from a list, we increment its minor version number.

The client should record which lists it knows about and their version numbers (-1 for no existing version). To request an update from the update server, the client should use GET with the desired list names and version numbers as a query parameter: version=type:major:minor[,type:major:minor]*. For example:

http://sb.google.com/safebrowsing/update?client=api&apikey=<yourkey>&version=goog-black-hash:1:432,goog-malware-hash:1:-1

The server responds with updates to all lists in the wire format. For each list the response includes either a completely new list or a diff between the client's version of the list and the most current version, whichever is smaller. If there is no change to a list since the last update, the server will not include it in the response.

Getting Started

First you need to request an API key, which will authenticate you as an API user. In order to obtain an API key, you must have a Google account. You may create a Google account or log in with your existing Google account and sign up for the API at http://www.google.com/safebrowsing/key_signup.html

Please note that if you violate the requirements detailed in the Acceptable Usage in Clients section, your key may be disabled for a period of time.

Next you should decide whether you wish to use a Message Authentication Code (MAC) for your list updates. A MAC allows you to verify the integrity of the data that you receive from the server. If you choose to use a MAC, you must first make a getkey request as described below. Otherwise, you only need to use the update request.

GetKey Requests (optional)

Description

The getkey request may be used at client startup to create a shared secret key between the client and the server. The secret key is optional and can be used to authenticate list updates. To be secure, the getkey request uses SSL.

This is the URL for the getkey request:

https://sb-ssl.google.com/safebrowsing/getkey?client=api

Server Response

The server responds with key-value pairs, in this format: key:<value length>:value

In this case, the server will respond with a clientkey and wrappedkey. For example:

clientkey:24:pOAblTUiZFkLSv3xRiXKKQ==
wrappedkey:24:MTqdJvrixHRGAyfebvaQWYda

The client key is a 16-byte, base-64 encoded random nonce, generated by the server when receiving the GetKey request. The wrapped key is the random nonce encrypted by a server key. The wrappedkey is opaque to the client and a server may implement any encryption algorithm it sees fit. The wrappedkey allows the server to reconstruct the client key without requiring per-client state. It is up to the server to include verification information into the wrapped key that might allow it to determine if decrypting it was successful. If the server key changes, the server can prepend pleaserekey to responses to tell the client to request a new client key.

The GetKey request should only be called once per client, as well as once per pleaserekey response.
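
The key:<value length>:value format described above can be parsed with a few lines of code. The following is a minimal sketch, not a reference implementation: the function name parse_getkey_response is hypothetical, and it assumes one pair per line and a standard base-64 alphabet for the client key, neither of which is stated explicitly in this guide.

```python
import base64


def parse_getkey_response(body):
    """Parse 'name:<value length>:value' lines into a dict (hypothetical helper)."""
    pairs = {}
    for line in body.splitlines():
        name, sep1, rest = line.partition(":")
        length, sep2, value = rest.partition(":")
        # Skip lines that lack either colon or a numeric length.
        if not (sep1 and sep2 and length.isdecimal()):
            continue
        # Skip lines whose value does not match the declared length.
        if len(value) != int(length):
            continue
        pairs[name] = value
    return pairs


# The sample response shown in this section.
sample = (
    "clientkey:24:pOAblTUiZFkLSv3xRiXKKQ==\n"
    "wrappedkey:24:MTqdJvrixHRGAyfebvaQWYda\n"
)
keys = parse_getkey_response(sample)

# Assumption: the client key uses the standard base-64 alphabet.
client_key_bytes = base64.b64decode(keys["clientkey"])
```

The wrapped key is kept as an opaque string, since the client only ever sends it back to the server.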

Update Requests

Description

The update request is used to ask for updates to the phishing or malware data. The client should provide the lists that it wants the server to update. The server either provides the full content of the current lists or incremental updates to bring the client's lists up to the current version.

This is the URL for the update request:

http://sb.google.com/safebrowsing/update?client=api

Parameters

The update request takes three required parameters (client=api, apikey and version) and one optional parameter (wrkey). The apikey parameter should be the key you received by signing up for API usage (see the Getting Started section). The version parameter specifies the lists and versions that the client has, for example "version=goog-black-hash:1:432,goog-malware-hash:1:32". The optional wrkey parameter should be the wrapped key sent by the server in response to a getkey request and should only be used if the client wishes to receive a MAC. See the GetKey Requests and Message Authentication Code for Updates sections for more information on the MAC.
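
As an illustration of how these parameters fit together, here is a small sketch that assembles an update URL. The helper name build_update_url and its argument names are hypothetical. Percent-encoding the apikey and wrkey values is an assumption based on ordinary URL practice; this guide does not say how the server expects special characters to be encoded.

```python
from urllib.parse import quote

UPDATE_URL = "http://sb.google.com/safebrowsing/update"
MAJOR_VERSION = 1  # current wire-format version, per the Overview


def build_update_url(api_key, minor_versions, wrapped_key=None):
    """Build an update request URL (hypothetical helper).

    minor_versions maps list name -> minor version, with -1 meaning
    the client has no copy of that list yet.
    """
    version = ",".join(
        "{}:{}:{}".format(name, MAJOR_VERSION, minor)
        for name, minor in minor_versions.items()
    )
    url = "{}?client=api&apikey={}&version={}".format(
        UPDATE_URL,
        quote(api_key, safe=""),
        quote(version, safe=":,"),  # keep ':' and ',' readable
    )
    if wrapped_key is not None:
        # Only sent when the client wants a MAC on the response.
        url += "&wrkey=" + quote(wrapped_key, safe="")
    return url


url = build_update_url(
    "KEY", {"goog-black-hash": 432, "goog-malware-hash": -1}
)
```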

Server Response

The serialized form of the lists is called the wire format. This is the form of update responses and is a simple line-oriented protocol. Clients should ignore any malformed lines.

It consists of a sequence of sections, each consisting of a header line such as [type major.minor [update]][[mac=<digest>]] followed by lines of data comprising the list described by the header. If the update token appears in the header line, the data following constitutes an update to the client's existing list. If not, the data specifies a full, new list and the client should discard any old data for that list. If the client provided a wrapped key (wrkey) in the request, the response must include the MAC. Here are a few possible header lines:

  [goog-black-hash 1.372 update]
  [goog-black-hash 1.372]
  [goog-malware-hash 1.10][mac=iA5vLUidpXAPwfcAH9+8OQ==]
  [goog-malware-hash 1.10 update][mac=iA5vLUidpXAPwfcAH9+8OQ==]

The header line will be followed by data lines which begin with a + or -. A plus indicates an addition to the table and is followed by a tab-separated key/value pair. A minus means to remove a key from the table and is followed by the key itself.

An example update response is:

  [goog-black-hash 1.419 update]
  -5a76ceebafdc7b72883e5c8212d0b046
  +76fa3d25e1dd28913ff829143fec7aa3
  +a1b2324852d1368fbe14df5920881a08
  -b3c780524ac86cdfe51fe6709c49e8a6
  ...
  [goog-malware-hash 1.201]
  +000a8a2973c056d87ac25a6900f3a720
  +01faf1b9baf0b4f3284cc3f56b9bafb7
  +2f0cf74439cffbe0c988b034f6dd46ba
  ...

In this example, the client has some version of goog-black-hash prior to 419 and the server is telling the client to bring itself up to version 419 by applying the adds and deletes that follow. The second header indicates that a full, new table is replacing goog-malware-hash. This suggests that the client has some version of goog-malware-hash earlier than version 201, but the diff would be longer than the entire version 201 table, so the server is sending a complete replacement.
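
The wire format lends itself to a simple line-by-line parser. The sketch below shows one possible way to apply a response to locally stored tables; apply_update, the regular expression, and the dict-based storage are all illustrative assumptions rather than part of the API. Malformed lines, including data lines that appear before any header, are ignored as recommended above. MAC verification is left out.

```python
import re

# [type major.minor [update]] optionally followed by [mac=<digest>]
HEADER = re.compile(
    r"^\[([\w-]+) (\d+)\.(\d+)( update)?\](?:\[mac=([^\]]+)\])?$"
)


def apply_update(tables, versions, response):
    """Apply a wire-format response in place (hypothetical helper).

    tables:   list name -> {key: value}
    versions: list name -> (major, minor)
    """
    name = None
    for line in response.splitlines():
        header = HEADER.match(line)
        if header:
            name = header.group(1)
            versions[name] = (int(header.group(2)), int(header.group(3)))
            if header.group(4):
                tables.setdefault(name, {})  # incremental update
            else:
                tables[name] = {}  # full list: discard old data
        elif name and line.startswith("+"):
            key, _, value = line[1:].partition("\t")
            if key:
                tables[name][key] = value
        elif name and line.startswith("-"):
            tables[name].pop(line[1:].partition("\t")[0], None)
        # Anything else is treated as malformed and skipped.


tables = {
    "goog-black-hash": {
        "5a76ceebafdc7b72883e5c8212d0b046": "",
        "b3c780524ac86cdfe51fe6709c49e8a6": "",
    },
    "goog-malware-hash": {"stale-entry": ""},
}
versions = {"goog-black-hash": (1, 418), "goog-malware-hash": (1, 200)}
response = (
    "[goog-black-hash 1.419 update]\n"
    "-5a76ceebafdc7b72883e5c8212d0b046\n"
    "+76fa3d25e1dd28913ff829143fec7aa3\t\n"
    "+a1b2324852d1368fbe14df5920881a08\t\n"
    "-b3c780524ac86cdfe51fe6709c49e8a6\n"
    "[goog-malware-hash 1.201]\n"
    "+000a8a2973c056d87ac25a6900f3a720\t\n"
)
apply_update(tables, versions, response)
```

After the call, goog-black-hash holds only the two added hashes, and goog-malware-hash has been replaced wholesale by the single entry from the full list.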

Message Authentication Code for Updates

If the client provided a wrapped key in the update request, the server also needs to compute a MAC for the response data that the client can use to verify the integrity of the lists. The MAC is computed from an MD5 digest over the following information: client_key|separator|table data|separator|client_key. The separator is the string :coolgoog: (that is, a colon followed by "coolgoog" followed by a colon). The resulting 128-bit MD5 digest is websafe base-64 encoded and provided via [mac=<encoded digest>] on the first line of a table update; see below.

Here is an example that you can use to verify the MAC algorithm in your server:

client key: "8eirwN1kTwCzgWA2HxTaRQ=="

A sample query to get table data using a MAC looks like this:

http://sb.google.com/safebrowsing/update?client=api&apikey=<yourkey>&version=goog-black-hash:1:179&wrkey=MTrILkq3LDJtWp8V8zHJaJc2

Below is a sample response including a correct MAC based on the keys provided to the server above.

[goog-black-hash 1.180 update][mac=dRalfTU+bXwUhlk0NCGJtQ==]
+8070465bdf3b9c6ad6a89c32e8162ef1	
+86fa593a025714f89d6bc8c9c5a191ac
+bbbd7247731cbb7ec1b3a5814ed4bc9d

*Note that each data line ends with a tab: the tab separates the hash key from its value, which is blank for these lists.
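
The MAC construction described in this section can be sketched as follows. The function names compute_mac and verify_mac are hypothetical. Two details are not spelled out in this guide and are therefore left to the caller: whether client_key is the base-64 text or the decoded 16 bytes, and exactly which bytes of the response count as table data. The sketch also accepts both the websafe and the standard base-64 alphabet when comparing, because the sample MAC above contains a "+" character.

```python
import base64
import hashlib
import hmac

SEPARATOR = b":coolgoog:"


def compute_mac(client_key, table_data):
    """MD5 over client_key|separator|table data|separator|client_key.

    Both arguments are bytes; which exact bytes to pass is an
    assumption the caller has to make.
    """
    digest = hashlib.md5(
        client_key + SEPARATOR + table_data + SEPARATOR + client_key
    ).digest()
    return base64.urlsafe_b64encode(digest).decode("ascii")


def verify_mac(client_key, table_data, received_mac):
    """Compare a received MAC with the locally computed one."""
    # Map the standard alphabet onto the websafe one before comparing.
    normalized = received_mac.replace("+", "-").replace("/", "_")
    expected = compute_mac(client_key, table_data)
    return hmac.compare_digest(
        expected.encode("ascii"), normalized.encode("utf-8")
    )


# Illustrative values only; not taken from a real server exchange.
key = b"0123456789abcdef"
data = b"+8070465bdf3b9c6ad6a89c32e8162ef1\t\n"
mac = compute_mac(key, data)
```

A client that receives a response whose MAC does not verify should treat the update as untrusted and not apply it.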

Acceptable Usage in Clients

If this API is used in a client-side application there are some requirements to follow when asking the server for updates.

Client Update Frequency and Back-off

The first update request should happen at a random interval between 0 and 5 minutes after the client starts. The second update request should happen between 15 and 30 minutes later. After that, each update should occur once every 25 to 30 minutes. Be sure that your update request includes the correct last version of the list so that the server can provide incremental updates rather than the entire list.

Providing the data on the server for update requires significant resources. To help maintain a high quality of service, it may be necessary for the update server to ask the client to make less frequent requests. To handle this, the client must watch for HTTP timeouts or errors from the server and if too many errors occur, it should increase the time between requests. In particular, if the client receives an error during update, it should try again in one minute. If it receives three errors in a row, it should skip updates until at least 60 minutes have passed before trying again. If it then receives another (fourth) error, it should skip updates for the next 180 minutes and if it receives another (fifth) error, it should skip updates for the next 360 minutes. It may continue to check once every 360 minutes until the server responds with a success message. Once the server sends a successful HTTP reply, the client should reset its error counter. If the client has not received a successful response from the server in the last 30 minutes, the client should not show a warning to end users.
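
The schedule above can be expressed as a small function of two counters. This is a sketch under stated assumptions: next_update_delay and may_show_warnings are hypothetical names, requests_made counts update requests already sent since startup, and consecutive_errors is reset to zero by the caller after any successful reply.

```python
import random


def next_update_delay(requests_made, consecutive_errors, rng=random):
    """Return the number of minutes to wait before the next update request."""
    # Error back-off takes priority over the normal schedule.
    if consecutive_errors >= 5:
        return 360
    if consecutive_errors == 4:
        return 180
    if consecutive_errors == 3:
        return 60
    if consecutive_errors >= 1:
        return 1  # first and second error: retry in one minute
    # Normal schedule, randomized within the stated ranges.
    if requests_made == 0:
        return rng.uniform(0, 5)
    if requests_made == 1:
        return rng.uniform(15, 30)
    return rng.uniform(25, 30)


def may_show_warnings(minutes_since_last_success):
    """Warnings are allowed only with a successful update in the last 30 minutes."""
    return minutes_since_last_success <= 30
```

Passing a seeded random.Random instance as rng makes the schedule reproducible in tests.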

The GetKey request should only be called once per client to establish the secret key. Subsequent GetKey requests should only be issued in response to a pleaserekey message.

Throttling Clients

If there is a large spike in traffic that exceeds allocated capacity, clients may be throttled. When a client is throttled, the server will pretend that there are no new updates for some or all of the lists that the client requested. In other words, there is no special protocol for throttling, but a client may receive the data late if it is throttled. In this case, the client will not be able to tell that the data is stale so it may continue to show warnings.

End-user visible warnings

If you use the Google Safe Browsing API to warn users about risks from particular webpages, we require that you follow certain guidelines. These guidelines help protect both you and Google from misunderstandings by making clear that the page is not known with 100% certainty to be a phishing site or a distributor of malware, and that the warnings merely identify possible risk.

In your end-user visible warning, you may not lead users to believe that the page in question is, without a doubt, a phishing page or a page that distributes malware. When you refer to the page being identified or the potential risks it may pose to users, you must qualify the warning using terms such as: suspected, potentially, possible, likely, may be.
Your warning must enable the user to learn more by reviewing information at http://www.antiphishing.org/ (for phishing warnings) or http://www.stopbadware.org/ (for malware warnings).
When you show warnings for pages identified as risky by the Safe Browsing API, you must give attribution to Google by including the line "Advisory provided by Google," with a link to http://code.google.com/support/bin/answer.py?answer=70015. If your product also shows warnings based on other sources, you may not include the Google attribution in warnings derived from non-Google data.

Suggested phishing warning language

We encourage you to just copy this warning language in your product, or modify it slightly to fit your product.

Warning- Suspected phishing page. This page may be a forgery or imitation of another website, designed to trick users into sharing personal or financial information. Entering any personal information on this page may result in identity theft or other abuse. You can find out more about phishing from www.antiphishing.org.

Warning- Visiting this web site may harm your computer. This page appears to contain malicious code that could be downloaded to your computer without your consent. You can learn more about harmful web content including viruses and other malicious code and how to protect your computer at StopBadware.org.

Notice to Users About Phishing and Malware Protection

Our Terms of Service require that if you indicate to users that your service provides malware or phishing protection, you must also let them know that the protection is not perfect. This notice must be visible to them before they enable the protection, and it must let them know that there is a chance of both false positives (safe sites flagged as risky) and false negatives (risky sites not flagged). We suggest using the following language:

Google works to provide the most accurate and up-to-date phishing and malware information. However, it cannot guarantee that its information is comprehensive and error-free: some risky sites may not be identified, and some safe sites may be identified in error.

Reporting incorrect data

If you would like to help us improve our data, you can submit reports to us. We also encourage you to allow your users to send reports directly to us by including these URLs in your product.

Report phishing URLs that are not currently on our list

http://www.google.com/safebrowsing/report_phish/?continue=http%3A%2F%2Fwww.google.com%2Ftools%2Ffirefox%2Ftoolbar%2FFT2%2Fintl%2F%3Clang%3E%2Fsubmit_success.html&hl=en

Report URLs that are currently on our phishing list in error:

http://www.google.com/safebrowsing/report_error/?continue=http%3A%2F%2Fwww.google.com%2Ftools%2Ffirefox%2Ftoolbar%2FFT2%2Fintl%2F%3Clang%3E%2Fsubmit_success.html&hl=en

Report malware URLs that are not currently on our malware list

http://www.google.com/safebrowsing/report_badware/

Report URLs that are currently on our malware list in error:

http://www.stopbadware.org/home/reviewinfo

List Format

Both goog-black-hash and goog-malware-hash consist of a list of MD5 hashes. As mentioned above, the wire format for each list consists of tab-separated key/value pairs. For these lists the key is the hash and the value is always blank. The hashed values for both lists are suffix/prefix expressions.

Suffix/Prefix Expressions

A suffix/prefix expression consists of a host suffix (or full host) and a path prefix (or full path). Note that the path prefix consists of full path components, not partial path components.

Examples:

Regular expression:	http\:\/\/.*\.a\.b\/mypath\/.*
Suffix/Prefix expression:	a.b/mypath/
Regular expression:	http\:\/\/.*.c\.d\/full\/path\.html
Suffix/Prefix expression:	c.d/full/path.html
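
Since each list entry is the MD5 hash of a suffix/prefix expression, checking one expression against a local table amounts to hashing it and testing membership. The sketch below assumes that the hash is the lowercase hexadecimal MD5 digest of the expression text (which matches the 32-character keys shown in the update examples) and that the expression is encoded as UTF-8 before hashing; the exact input encoding is not stated in this guide. The function names are hypothetical.

```python
import hashlib


def expression_hash(expression):
    """Hex MD5 of a suffix/prefix expression (encoding is an assumption)."""
    return hashlib.md5(expression.encode("utf-8")).hexdigest()


def is_listed(expression, table):
    """True if the hashed expression is a key in the local table."""
    return expression_hash(expression) in table


# A toy table containing a single entry, for illustration only.
table = {expression_hash("a.b/mypath/"): ""}
```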

Performing Lookups

Currently all valid list types rely on suffix/prefix expressions, as described above. To perform a lookup for a given URL, the client will try to form different possible host suffix and path prefix combinations and see if they match each list. For these lookups, only the host and path components of the URL are used. The scheme, username, password, and port are disregarded. If query parameters are present in the URL, the client will also include a lookup with the full path and query parameters.

For the hostname, the client will try at most 5 different strings. They are:

the exact hostname in the URL
up to 4 hostnames formed by starting with the last 5 components and successively removing the leading component. The top-level domain can be skipped.

For the path, the client will also try at most 6 different strings. They are:

the exact path of the url, including query parameters
the exact path of the url, without query parameters
the 4 paths formed by starting at the root (/) and successively appending path components, including a trailing slash.

The following examples should help illustrate the lookup behavior:

For the URL http://a.b.c/1/2.html?param=1, the client will try these possible strings:

a.b.c/1/2.html?param=1
a.b.c/1/2.html
a.b.c/
a.b.c/1/
b.c/1/2.html?param=1
b.c/1/2.html
b.c/
b.c/1/

For the URL http://a.b.c.d.e.f.g/1.html, the client will try these possible strings:

a.b.c.d.e.f.g/1.html
a.b.c.d.e.f.g/
c.d.e.f.g/1.html
c.d.e.f.g/
d.e.f.g/1.html
d.e.f.g/
e.f.g/1.html
e.f.g/
f.g/1.html
f.g/

*(Note that b.c.d.e.f.g is skipped since we take only the last 5 hostname components and the full hostname.)
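
The host and path rules above can be sketched as two small generators whose outputs are combined. The function names are hypothetical, and the inputs are assumed to be an already canonicalized hostname, a path starting with "/", and an optional query string without the leading "?". The sketch reproduces the two examples in this section.

```python
def host_candidates(host):
    """Exact host, then up to 4 suffixes taken from the last 5 components."""
    parts = host.split(".")
    candidates = [host]
    start = max(len(parts) - 5, 0)
    # Stop before the bare top-level domain, which is skipped.
    for i in range(start, len(parts) - 1):
        suffix = ".".join(parts[i:])
        if suffix not in candidates:
            candidates.append(suffix)
    return candidates


def path_candidates(path, query=None):
    """Exact path (with and without query), then up to 4 directory prefixes."""
    candidates = []
    if query:
        candidates.append(path + "?" + query)
    candidates.append(path)
    components = [c for c in path.split("/") if c]
    # Only whole directories count as prefixes, not the final file name.
    dirs = components if path.endswith("/") else components[:-1]
    prefix = "/"
    prefixes = [prefix]
    for component in dirs:
        prefix += component + "/"
        prefixes.append(prefix)
    for p in prefixes[:4]:
        if p not in candidates:
            candidates.append(p)
    return candidates


def lookup_expressions(host, path, query=None):
    """All host suffix / path prefix combinations to look up."""
    return [
        h + p
        for h in host_candidates(host)
        for p in path_candidates(path, query)
    ]


first = lookup_expressions("a.b.c", "/1/2.html", "param=1")
second = lookup_expressions("a.b.c.d.e.f.g", "/1.html")
```

Each resulting expression would then be hashed and checked against every list the client holds.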

Canonicalization

Before lookup in any list, the URL must be canonicalized.

We assume that the client has parsed the URL and made it valid according to RFC 2396. If it's an international URL, use the ASCII punycode representation. The URL must include a path component, e.g. 'http://google.com/' must have a trailing slash.

To start, repeatedly URL-unescape the URL until it has no more hex-encodings.

To canonicalize the hostname, extract the hostname from the URL and then follow these steps:

Remove all leading and trailing dots.
Replace consecutive dots with a single dot.
If the hostname can be parsed as an IP address, it should be normalized to 4 dot-separated decimal values. The client should handle any legal IP address encoding, including octal, hex, and fewer than 4 components.
Lowercase the whole string.

To canonicalize the path:

The sequences "/../" and "/./" in the path should be resolved, by replacing "/./" with "/", and removing "/../" along with the preceding path component.
Runs of consecutive slashes should be replaced with a single slash character.

After performing these steps, percent-escape all characters in the URL which are <= ASCII 32, >= 127, or "%". The escapes should use uppercase hex characters.
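
The canonicalization steps above can be sketched as follows. This is an illustration under several assumptions, not a reference implementation: all function names are hypothetical, the host and path are passed in separately after the URL has been parsed, query strings are not handled, and a path that ends in "/." or "/.." is resolved the same way as the "/./" and "/../" sequences described above.

```python
import re
from urllib.parse import unquote_to_bytes


def unescape_fully(text):
    """Percent-decode repeatedly until the bytes stop changing."""
    data = text.encode("utf-8")
    while True:
        decoded = unquote_to_bytes(data)
        if decoded == data:
            return data
        data = decoded


def escape(data):
    """Escape bytes <= 32, >= 127 and '%' using uppercase hex."""
    return "".join(
        "%{:02X}".format(b) if b <= 32 or b >= 127 or b == 0x25 else chr(b)
        for b in data
    )


def parse_ip(host):
    """Return dotted-decimal text for an IPv4 literal, else None."""
    parts = host.split(".")
    if len(parts) > 4:
        return None
    values = []
    for part in parts:
        if re.fullmatch(r"0[xX][0-9a-fA-F]+", part):
            values.append(int(part, 16))
        elif re.fullmatch(r"0[0-7]+", part):
            values.append(int(part, 8))
        elif re.fullmatch(r"0|[1-9][0-9]*", part):
            values.append(int(part))
        else:
            return None
    *leading, last = values
    # The last component fills all remaining bytes of the address.
    if any(v > 255 for v in leading) or last >= 256 ** (4 - len(leading)):
        return None
    number = last
    for i, v in enumerate(leading):
        number += v << (8 * (3 - i))
    return ".".join(str((number >> shift) & 255) for shift in (24, 16, 8, 0))


def canonical_host(host):
    data = unescape_fully(host).strip(b".")
    data = re.sub(rb"\.{2,}", b".", data)
    ip = parse_ip(data.decode("latin-1"))
    return ip if ip is not None else escape(data.lower())


def canonical_path(path):
    data = unescape_fully(path)
    segments = data.split(b"/")
    kept = []
    for segment in segments:
        if segment == b"..":
            if kept:
                kept.pop()  # drop the preceding path component
        elif segment not in (b"", b"."):
            kept.append(segment)  # empty segments come from repeated slashes
    result = b"/" + b"/".join(kept)
    if kept and (data.endswith(b"/") or segments[-1] in (b".", b"..")):
        result += b"/"
    return escape(result)
```

Working on bytes rather than decoded text keeps characters at or above 127 intact until the final escaping step.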

