The Safe Browsing API is an experimental API that allows client applications to check URLs against Google's constantly-updated blacklists of suspected phishing and malware pages. Your client application can use the API to download an encrypted table for local, client-side lookups of URLs.
In addition to providing some background on the capabilities of the Safe Browsing API, this document provides examples for interacting with the API by sending HTTP messages to download lists or perform lookups. This document also includes information and examples on how to perform client-side lookups from a downloaded list.
This document is intended for programmers who want to use Google anti-phishing and anti-malware data to protect users from potentially malicious websites. It provides a series of examples of basic data API interactions, a set of guidelines for how the data may be accessed, and information on how the data can be used. Please note that the data format is likely to change in the future.
Google publishes phishing and malware data in two separate blacklists (goog-black-hash and goog-malware-hash). Each is a list of MD5 hash values. The client should keep a local copy of the lists and consult them for every URL the user visits. The client should store the lists as it receives them and make no attempt to convert a hashed list to plaintext. Ideally, clients should update their lists every 30 minutes by contacting Google to check for new data. Your application is not permitted to show warnings to end users unless it has requested an update in the last 30 minutes without receiving an error response. Also, if you do show warnings to end users, you must adhere to Google's guidelines for the warning text and provide appropriate attribution as discussed below under End User Visible Warnings.
We will limit the number of users you can support with a single API key. If you expect to have more than 10,000 users sending regular requests to the API, you must contact us by sending email to antiphish-malware-cap-req@google.com.
Lists are versioned with a major and minor number. The major version is currently 1, and is used to describe the wire format for serializing the lists (see below). The minor number indicates the version of the list. When we add or remove items from a list, we increment its minor version number.
The client should record which lists it knows about and their version numbers (-1 for no existing version). To request an update from the update server, the client should use GET with the desired list names and version numbers as a query parameter: version=type:major:minor[,type:major:minor]*. For example:
http://sb.google.com/safebrowsing/update?client=api&apikey=<yourkey>&version=goog-black-hash:1:432,goog-malware-hash:1:-1
The server responds with updates to all lists in the wire format. For each list the response includes either a completely new list or a diff between the client's version of the list and the most current version, whichever is smaller. If there is no change to a list since the last update, the server will not include it in the response.
First you need to request an API key, which will authenticate you as an API user. In order to obtain an API key, you must have a Google account. You may create a Google account or log in with your existing Google account and sign up for the API at http://www.google.com/safebrowsing/key_signup.html
Please note that if you violate the requirements detailed in the Acceptable Usage in Clients section, your key may be disabled for a period of time.
Next you should decide whether you wish to use a Message Authentication Code (MAC) for your list updates. A MAC allows you to verify the integrity of the data that you receive from the server. If you choose to use a MAC, you must first make a getkey request as described below. Otherwise, you only need to use the update request.
The getkey request may be used at client startup to create a shared secret key between the client and the server. The secret key is optional and can be used to authenticate list updates. To be secure, the getkey request uses SSL.
This is the URL for the getkey request:
https://sb-ssl.google.com/safebrowsing/getkey?client=api
The server responds with key-value pairs, in this format:
key:<value length>:value
In this case, the server will respond with a clientkey and wrappedkey.
For example:
clientkey:24:pOAblTUiZFkLSv3xRiXKKQ==
wrappedkey:24:MTqdJvrixHRGAyfebvaQWYda
The client key is a 16-byte, base-64 encoded random nonce, generated by the server when receiving the GetKey request. The wrapped key is the random nonce encrypted by a server key. The wrappedkey is opaque to the client and a server may implement any encryption algorithm it sees fit. The wrappedkey allows the server to reconstruct the client key without requiring per-client state. It is up to the server to include verification information in the wrapped key that might allow it to determine if decrypting it was successful. If the server key changes, the server can prepend pleaserekey to responses to tell the client to request a new client key.
The GetKey request should only be called once per client, as well as once per pleaserekey response.
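As an illustration of the key:<value length>:value format described above, here is a minimal Python sketch that parses a getkey response body into a dictionary and checks each declared length. The function name parse_getkey_response is hypothetical, and the sketch assumes one key-value pair per line, as in the example; it does not handle a pleaserekey message.

```python
def parse_getkey_response(body):
    """Parse lines of the form key:<value length>:value into a dict."""
    result = {}
    for line in body.splitlines():
        if not line:
            continue
        # Split on the first two colons only; the value is kept intact.
        key, length, value = line.split(":", 2)
        if len(value) != int(length):
            raise ValueError(f"length mismatch for {key!r}")
        result[key] = value
    return result


example = (
    "clientkey:24:pOAblTUiZFkLSv3xRiXKKQ==\n"
    "wrappedkey:24:MTqdJvrixHRGAyfebvaQWYda\n"
)
keys = parse_getkey_response(example)
```

The wrappedkey value is the one a client would later pass back as the wrkey parameter of an update request.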
The update request is used to ask for updates to the phishing or malware data. The client should provide the lists that it wants the server to update. The server either provides the full content of the current lists or incremental updates to bring the client's lists up to the current version.
This is the URL for the update request:
http://sb.google.com/safebrowsing/update?client=api
The update request takes three required parameters (client=api, apikey and version) and one optional parameter (wrkey). The apikey parameter should be the key you received by signing up for API usage (see the Getting Started section). The version parameter specifies the lists and versions that the client has, e.g. "version=goog-black-hash:1:432,goog-malware-hash:1:32". The optional wrkey parameter should be the wrapped key sent by the server in response to a getkey request and should only be used if the client wishes to receive a MAC. See the GetKey Request and Message Authentication Code for Updates sections for more information on the MAC.
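The request parameters can be assembled programmatically. The following Python sketch builds an update URL from a mapping of list names to minor versions. The function name build_update_url and the constant names are illustrative, not part of the API; leaving ':' and ',' unescaped is a choice made here to match the example URLs in this document.

```python
from urllib.parse import urlencode

UPDATE_URL = "http://sb.google.com/safebrowsing/update"
MAJOR_VERSION = 1  # wire-format version stated in this document


def build_update_url(versions, apikey, wrkey=None):
    """Build an update request URL.

    `versions` maps list name -> minor version (-1 if the client has no copy).
    `wrkey` is the optional wrapped key from a getkey response.
    """
    version = ",".join(
        f"{name}:{MAJOR_VERSION}:{minor}" for name, minor in versions.items()
    )
    params = [("client", "api"), ("apikey", apikey), ("version", version)]
    if wrkey is not None:
        params.append(("wrkey", wrkey))
    # Keep ':' and ',' literal so the URL matches the form shown above.
    return UPDATE_URL + "?" + urlencode(params, safe=":,")


url = build_update_url(
    {"goog-black-hash": 432, "goog-malware-hash": -1}, "KEY"
)
```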
The serialized form of the lists is called the wire format. This is the form of update responses and is a simple line-oriented protocol. Clients should ignore any malformed lines.
It consists of a sequence of sections each consisting of a header line such as [type major.minor [update]][[mac=<digest>]] followed by lines of data comprising the list described by the header. If the update token appears in the header line, the data following constitutes an update to the client's existing list. If not, the data specifies a full, new list and the client should discard any old data for that list. If the client provided a wrappedkey in the request, the response must include the MAC. Here are a few possible first-line header responses:
[goog-black-hash 1.372 update]
[goog-black-hash 1.372]
[goog-malware-hash 1.10][mac=iA5vLUidpXAPwfcAH9+8OQ==]
[goog-malware-hash 1.10 update][mac=iA5vLUidpXAPwfcAH9+8OQ==]
The header line will be followed by data lines which begin with a + or -. A plus indicates an addition to the table and is followed by a tab-separated key/value pair. A minus means to remove a key from the table and is followed by the key itself.
An example update response is:
[goog-black-hash 1.419 update]
-5a76ceebafdc7b72883e5c8212d0b046
+76fa3d25e1dd28913ff829143fec7aa3
+a1b2324852d1368fbe14df5920881a08
-b3c780524ac86cdfe51fe6709c49e8a6
...
[goog-malware-hash 1.201]
+000a8a2973c056d87ac25a6900f3a720
+01faf1b9baf0b4f3284cc3f56b9bafb7
+2f0cf74439cffbe0c988b034f6dd46ba
...
In this example, the client has some version of goog-black-hash prior to 419 and the server is telling the client to bring itself up to version 419 by applying the adds and deletes that follow. The second header indicates that a full, new table is replacing goog-malware-hash. This suggests that the client has some version of goog-malware-hash earlier than version 201, but the diff would be longer than the entire version 201 table, so it is sending a complete replacement.
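The wire format described above can be applied to local tables with a small line-oriented parser. This is a sketch only: the names HEADER_RE and apply_response are hypothetical, the regular expression is one reading of the header format, and MAC verification is left out. Lines that are neither headers nor +/- data lines are ignored, following the guidance that clients should ignore malformed lines.

```python
import re

# One reading of the header format: [type major.minor [update]][mac=<digest>]
HEADER_RE = re.compile(
    r"^\[(?P<name>\S+) (?P<major>\d+)\.(?P<minor>\d+)(?P<update> update)?\]"
    r"(?:\[mac=(?P<mac>[^\]]+)\])?$"
)


def apply_response(body, tables, versions):
    """Apply a wire-format response to `tables` (list name -> dict) in place."""
    current = None
    for line in body.splitlines():
        header = HEADER_RE.match(line.strip())
        if header:
            name = header["name"]
            if not header["update"]:
                tables[name] = {}  # full list: discard any old data
            current = tables.setdefault(name, {})
            versions[name] = int(header["minor"])
        elif current is not None and line.startswith("+"):
            # Addition: tab-separated key/value pair (value may be blank).
            key, _, value = line[1:].partition("\t")
            current[key] = value
        elif current is not None and line.startswith("-"):
            # Removal: the key itself follows the minus sign.
            current.pop(line[1:].strip(), None)
        # Anything else is treated as malformed and ignored.
```

A client would call apply_response with the body of each successful update and then send the recorded minor versions in its next request.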
If the client provided a wrapped key in the update request, the server also needs to compute a MAC for the response data that the client can use to verify the integrity of the lists. The MAC is computed from an MD5 digest over the following information: client_key|separator|table data|separator|client_key. The separator is the string :coolgoog: - that is, a colon followed by "coolgoog" followed by a colon. The resulting 128-bit MD5 digest is websafe base-64 encoded and provided via [mac=<encoded digest>] on the first line of a table update; see below.
Here is an example that you can use to verify the MAC algorithm in your server:
client key: "8eirwN1kTwCzgWA2HxTaRQ=="
A sample query to get table data using a MAC looks like this:
http://sb.google.com/safebrowsing/update?client=api&apikey=<yourkey>&version=goog-black-hash:1:179&wrkey=MTrILkq3LDJtWp8V8zHJaJc2
Below is a sample response including a correct MAC based on the keys provided to the server above.
[goog-black-hash 1.180 update][mac=dRalfTU+bXwUhlk0NCGJtQ==]
+8070465bdf3b9c6ad6a89c32e8162ef1	
+86fa593a025714f89d6bc8c9c5a191ac	
+bbbd7247731cbb7ec1b3a5814ed4bc9d	
*Note that there are tabs at the end of each line.
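The MAC construction can be sketched in Python as follows. Several details are assumptions rather than facts from this document: whether client_key is the base-64 string or its decoded bytes, and exactly which bytes of the response count as "table data", are not specified here, so both are passed in as raw bytes and left to the caller. The function names are hypothetical. Because the sample MAC above contains a '+' character, which is outside the websafe alphabet, the comparison normalises both alphabets; that normalisation is a choice made in this sketch.

```python
import base64
import hashlib
import hmac

SEPARATOR = b":coolgoog:"


def compute_mac(client_key, table_data):
    """MD5 over client_key|separator|table data|separator|client_key."""
    digest = hashlib.md5(
        client_key + SEPARATOR + table_data + SEPARATOR + client_key
    ).digest()
    # Websafe base-64 uses '-' and '_' in place of '+' and '/'.
    return base64.urlsafe_b64encode(digest).decode("ascii")


def mac_matches(client_key, table_data, received_mac):
    """Compare a received MAC with the locally computed one."""
    expected = compute_mac(client_key, table_data)
    # Assumption: accept either base-64 alphabet by normalising to websafe.
    received = received_mac.replace("+", "-").replace("/", "_")
    return hmac.compare_digest(expected, received)
```

To validate an implementation, compute the MAC for the sample response with the sample client key and compare it against the value in the header line, trying both interpretations of the key if the first does not match.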
If this API is used in a client-side application there are some requirements to follow when asking the server for updates.
The first update request should happen at a random interval between 0-5 minutes after the client starts. The second update request should happen between 15-30 minutes later. After that, each update should occur once every 25-30 minutes. Be sure that your update request includes the correct last version of the list so that the server can provide incremental updates rather than the entire list.
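A minimal sketch of this schedule in Python is shown below. The function name next_update_delay is hypothetical, and picking a uniformly random point inside each range is an assumption; the text only requires randomness for the first request.

```python
import random


def next_update_delay(updates_done):
    """Seconds to wait before the next update request.

    `updates_done` is the number of update requests made since startup.
    """
    if updates_done == 0:
        low, high = 0, 5      # first request: 0-5 minutes after startup
    elif updates_done == 1:
        low, high = 15, 30    # second request: 15-30 minutes later
    else:
        low, high = 25, 30    # afterwards: once every 25-30 minutes
    return random.uniform(low, high) * 60
```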
Providing the data on the server for update requires significant resources. To help maintain a high quality of service, it may be necessary for the update server to ask the client to make less frequent requests. To handle this, the client must watch for HTTP timeouts or errors from the server and if too many errors occur, it should increase the time between requests. In particular, if the client receives an error during update, it should try again in one minute. If it receives three errors in a row, it should skip updates until at least 60 minutes have passed before trying again. If it then receives another (fourth) error, it should skip updates for the next 180 minutes and if it receives another (fifth) error, it should skip updates for the next 360 minutes. It may continue to check once every 360 minutes until the server responds with a success message. Once the server sends a successful HTTP reply, the client should reset its error counter. If the client has not received a successful response from the server in the last 30 minutes, the client should not show a warning to end users.
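The back-off rules above map onto a small state machine. The class below is an illustrative sketch (the name UpdateBackoff and its methods are hypothetical); it tracks consecutive errors, returns the wait in minutes after each error, and reports whether warnings may currently be shown.

```python
class UpdateBackoff:
    """Track consecutive update errors and warning eligibility."""

    def __init__(self):
        self.consecutive_errors = 0
        self.last_success = None  # time (in seconds) of the last successful update

    def record_success(self, now):
        """A successful HTTP reply resets the error counter."""
        self.consecutive_errors = 0
        self.last_success = now

    def record_error(self):
        """Record an error and return the minutes to wait before retrying."""
        self.consecutive_errors += 1
        if self.consecutive_errors < 3:
            return 1      # first and second error: retry in one minute
        if self.consecutive_errors == 3:
            return 60     # three errors in a row
        if self.consecutive_errors == 4:
            return 180    # fourth error
        return 360        # fifth and later errors

    def may_show_warnings(self, now):
        """True only if an update succeeded within the last 30 minutes."""
        return self.last_success is not None and now - self.last_success <= 30 * 60
```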
The GetKey request should only be called once per client to establish the secret key. Subsequent GetKey requests should only be issued in response to a pleaserekey message.
If there is a large spike in traffic that exceeds allocated capacity, clients may be throttled. When a client is throttled, the server will pretend that there are no new updates for some or all of the lists that the client requested. In other words, there is no special protocol for throttling, but a client may receive the data late if it is throttled. In this case, the client will not be able to tell that the data is stale, so it may continue to show warnings.
If you use the Google Safe Browsing API to warn users about risks from particular webpages, we require that you follow certain guidelines. These guidelines help protect both you and Google from misunderstandings by making clear that the page is not known with 100% certainty to be a phishing site or a distributor of malware, and that the warnings merely identify possible risk.
We encourage you to just copy this warning language in your product, or modify it slightly to fit your product.
Warning- Suspected phishing page. This page may be a forgery or imitation of another website, designed to trick users into sharing personal or financial information. Entering any personal information on this page may result in identity theft or other abuse. You can find out more about phishing from www.antiphishing.org.
Warning- Visiting this web site may harm your computer. This page appears to contain malicious code that could be downloaded to your computer without your consent. You can learn more about harmful web content including viruses and other malicious code and how to protect your computer at StopBadware.org.
Our Terms of Service require that if you indicate to users that your service provides malware or phishing protection, you must also let them know that the protection is not perfect. This notice must be visible to them before they enable the protection, and it must let them know that there is a chance of both false positives (safe sites flagged as risky) and false negatives (risky sites not flagged). We suggest using the following language:
If you would like to help us improve our data, you can submit reports to us. We also encourage you to allow your users to send reports directly to us by including these URLs in your product.
http://www.google.com/safebrowsing/report_badware/
http://www.stopbadware.org/home/reviewinfo
Both goog-black-hash and goog-malware-hash consist of a list of MD5 hashes. As mentioned above, the wire format for each list consists of tab-separated key/value pairs. For these lists the key is the hash and the value is always blank. The hashed values for both lists are suffix/prefix expressions.
A suffix/prefix expression consists of a host suffix (or full host) and a path prefix (or full path). Note that the path prefix consists of full path components, not partial path components.
Examples:
Regular expression: http\:\/\/.*\.a\.b\/mypath\/.*
Suffix/Prefix expression: a.b/mypath/
Regular expression: http\:\/\/.*\.c\.d\/full\/path\.html
Suffix/Prefix expression: c.d/full/path.html
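Checking a suffix/prefix expression against a downloaded table then reduces to hashing the expression and testing membership. The sketch below assumes that a table key is the lowercase hexadecimal MD5 digest of the expression string, which is consistent with the 32-character keys in the sample responses but is not stated explicitly in this document. The function names are hypothetical.

```python
import hashlib


def expression_hash(expression):
    """Hex MD5 of a suffix/prefix expression (assumed key format)."""
    return hashlib.md5(expression.encode("utf-8")).hexdigest()


def matches_any(expressions, table):
    """True if any candidate expression hashes to a key in `table`."""
    return any(expression_hash(e) in table for e in expressions)


# Hypothetical table holding a single entry for a.b/mypath/
table = {expression_hash("a.b/mypath/"): ""}
hit = matches_any(["x.a.b/mypath/", "a.b/mypath/"], table)
```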
Currently all valid list types rely on suffix/prefix expressions, as described above. To perform a lookup for a given URL, the client will try to form different possible host suffix and path prefix combinations and see if they match each list. For these lookups, only the host and path components of the URL are used. The scheme, username, password, and port are disregarded. If query parameters are present in the URL, the client will also include a lookup with the full path and query parameters.
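The component handling described above can be sketched with Python's standard URL parser. The function name lookup_parts is hypothetical. The sketch only extracts the host and the path variants (with and without query parameters); it does not generate the host suffix and path prefix combinations, and it assumes the URL has already been canonicalized.

```python
from urllib.parse import urlsplit


def lookup_parts(url):
    """Return (host, paths) with scheme, username, password and port dropped."""
    parts = urlsplit(url)
    host = parts.hostname or ""  # hostname excludes userinfo and port
    path = parts.path or "/"     # a URL must include a path component
    paths = [path]
    if parts.query:
        # Also look up the full path together with its query parameters.
        paths.insert(0, path + "?" + parts.query)
    return host, paths
```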
For the hostname, the client will try at most 5 different strings. They are:
For the path, the client will also try at most 6 different strings. They are:
The following examples should help illustrate the lookup behavior:
For the URL http://a.b.c/1/2.html, the client will try these possible strings:
For the URL http://a.b.c.d.e.f.g/1.html, the client will try these possible strings:
Before lookup in any list, the URL must be canonicalized.
We assume that the client has parsed the URL and made it valid according to RFC 2396. If it's an international URL, use the ASCII punycode representation. The URL must include a path component, e.g. 'http://google.com/' must have a trailing slash.
To start, repeatedly URL-unescape the URL until it has no more hex-encodings.
To canonicalize the hostname, extract the hostname from the URL and then follow these steps:
To canonicalize the path:
After performing these steps, percent-escape all characters in the URL which are <= ASCII 32, >= 127, or "%". The escapes should use uppercase hex characters.
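The first and last canonicalization steps can be sketched in Python as below. The function names are hypothetical, and the hostname and path steps that sit between these two are not shown. The sketch works on bytes so that unescaped values at or above 127 are handled without guessing a character encoding.

```python
from urllib.parse import unquote_to_bytes


def unescape_fully(url_bytes):
    """Repeatedly URL-unescape until no hex-encodings remain."""
    while True:
        decoded = unquote_to_bytes(url_bytes)
        if decoded == url_bytes:
            return decoded
        url_bytes = decoded


def escape_canonical(url_bytes):
    """Percent-escape bytes <= 32, >= 127, and '%', using uppercase hex."""
    out = []
    for b in url_bytes:
        if b <= 32 or b >= 127 or b == 0x25:
            out.append("%{:02X}".format(b))
        else:
            out.append(chr(b))
    return "".join(out)
```

For instance, a doubly escaped sequence such as %2541 collapses to the single character A after two unescape passes, and a space left in the result is written back as %20 by the final escaping step.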