Query Reference

BigQuery queries are written using a variation of the standard SQL SELECT statement. BigQuery supports a wide variety of functions such as COUNT, arithmetic expressions, and string functions. This document details BigQuery's query syntax and functions.

Query syntax

All BigQuery queries are SELECT statements of this form:

SELECT expr1 [[AS] alias1] [, expr2 [[AS] alias2], ...]
    [agg_function(expr3) WITHIN expr4]
    [FROM [(FLATTEN(table_name1|(subselect1)] [, table_name2|(subselect2), ...)]
    [([INNER]|LEFT OUTER) JOIN table_2|(subselect2) [[AS] tablealias2]
      ON join_condition_1 [... AND join_condition_N ...]]
    [WHERE condition]
    [GROUP BY field1|alias1 [, field2|alias2, ...]]
    [HAVING condition]
    [ORDER BY field1|alias1 [DESC|ASC] [, field2|alias2 [DESC|ASC], ...]]
    [LIMIT n]
    ;

Note: Keywords are not case-sensitive. In this document, keywords such as SELECT are capitalized for illustration purposes.

SELECT clause

SELECT expr1 [AS alias1], expr2 [AS alias2], ...

The SELECT clause specifies the set of values to be returned by a query. Expressions (expr1, etc.) in the SELECT clause can be field names, literals, or functional expressions that operate on fields or literals. Expressions must be comma-separated.

The SELECT clause supports an AS section, which defines an alias for the field, literal, or functional expression. You can refer to an alias only in GROUP BY and ORDER BY clauses.

Notes:

If you use an aggregation function on any result, such as COUNT, you must use the GROUP BY clause to group all non-aggregated fields. For example:

SELECT word, corpus, COUNT(word)
  FROM publicdata:samples.shakespeare
  WHERE word CONTAINS "th" GROUP BY word, corpus; // Succeeds

SELECT word, corpus, COUNT(word)
  FROM publicdata:samples.shakespeare
  WHERE word CONTAINS "th" GROUP BY word; // INVALID! Didn't group by corpus

Example
This example defines an alias for the sorting clause:

SELECT word, LENGTH(word) AS len
  FROM publicdata:samples.shakespeare
  WHERE word CONTAINS 'th' ORDER BY len;

WITHIN clause

The WITHIN clause can be used as a SELECT field with the following syntax:

SELECT expr1 [WITHIN RECORD|node] [AS alias1], expr2 [WITHIN RECORD|node] [AS alias2], ...

The WITHIN keyword specifically works with aggregate functions to aggregate across children and repeated fields within records and nested fields.

When you specify the WITHIN keyword, you need to specify the scope over which you want to aggregate:

WITHIN RECORD: Aggregates data in the repeated values within the record, where a record represents an entire unique protocol buffer message, including children and repeated values.
WITHIN node_name: Aggregates data in the repeated values within the specified node, where a node is a parent node of the field in the aggregation function.

For more information and examples, see WITHIN in the developer's guide.

FROM clause

... FROM [project_name:]datasetId.tableId|(subselect_clause) [[AS] alias] ...

project_name is an optional name of the project that contains the dataset. If not specified, the query will assume it is in the current project. Note: if your project name includes a dash, you must surround the entire table reference with brackets, as in [my-dashed-project:dataset1.tableName].

subselect_clause is a nested SELECT clause. The subquery is evaluated and subquery results are treated just like a table. The result of the subquery must contain the appropriate columns and data required by the containing SELECT statement. If the query includes multiple subqueries or a combination or subqueries and tables, all of the tables and subqueries must contain all of the fields in the SELECT clause of the main query. The rules that apply to queries also apply to subqueries, except that subqueries do not end in a semicolon.

alias is primarily used in JOIN statements, where you may need to provide a name to a subselect clause. You should only refer to a field as alias.field in order to disambiguate between the two tables involved in a JOIN.

The table name must be qualified with the dataset and project IDs, unless you specify default dataset or project IDs as part of the query request.

Example

SELECT COUNT(*)
  FROM publicdata:samples.shakespeare
  WHERE word CONTAINS "th";

Note: Unlike many other SQL-based systems, BigQuery uses the comma syntax to indicate table unions, not joins. This means you can run a query over several tables with compatible schemas as follows:

// Find suspicious activity over several days
SELECT FORMAT_UTC_USEC(event.timestamp_in_usec) AS time, request_url
  FROM [applogs.events_20120501], [applogs.events_20120502], [applogs.events_20120503]
  WHERE event.username = 'root' AND NOT event.source_ip.is_internal;

FLATTEN clause

...(FLATTEN ([project_name:]datasetId.tableId, flattenField))..
...(FLATTEN (subselect_clause, flattenField))..

FLATTEN converts a repeated field into an optional field. Given one record with many values for a repeated field, FLATTEN unrolls it into many records, one record for each value of the (formerly) repeated field; any non-repeated fields become duplicated to fill out each of the new records formed. FLATTEN removes one level of nesting.

For more information and examples, see FLATTEN in the developer's guide.

JOIN clause

SELECT field_1 [..., field_n] FROM
   table_1 [[AS] alias_1]
 ([INNER]|LEFT OUTER) JOIN [EACH]
   table_2 [[AS] alias_2]
 ON join_condition_1 [... AND join_condition_n]

BigQuery supports two types of JOIN operations:

JOIN requires that the right-side table contains less than 8 MB of compressed data.
JOIN EACH allows join queries for tables of any size.

When possible, use JOIN without the EACH modifier for best performance. Use JOIN EACH when table sizes are too large for JOIN.

BigQuery supports INNER and LEFT OUTER joins. The default is INNER.

Example

// Simple JOIN of two tables
SELECT table1.id, table2.username
FROM   table1
JOIN   table2
       ON table1.name = table2.name AND table1.id = table2.customer_id;

WHERE clause

... WHERE condition ...

The WHERE clause, sometimes called the predicate, states the qualifying conditions for a query. Multiple conditions can be joined by boolean AND and OR clauses, optionally surrounded by (parentheses) to group them. The fields listed in a WHERE clause do not need to be listed in any SELECT clause.

Note: Aggregate functions cannot be used in the WHERE clause. Use HAVING if you need to use aggregate fields.

Example
The following example includes two clauses joined by an OR; either one must be satisfied for the row to be returned.

select word FROM publicdata:samples.shakespeare
  WHERE
    (word CONTAINS 'prais' AND word CONTAINS 'ing') OR
    (word CONTAINS 'laugh' AND word CONTAINS 'ed');

GROUP BY clause

... GROUP [EACH] BY field1|alias1, field2|alias2 ...

The GROUP BY clause allows you to group rows that have the same values for a given field. You can then perform aggregate functions on each of the groups. Grouping occurs after any selection or aggregation in the SELECT clause.

For example you can group the rows that have the same value for f1 and find the SUM of the f2 values for each group:

SELECT f1, SUM(f2)
  FROM ds.Table
  GROUP BY f1;

This type of aggregation is called group aggregation. Unlike scoped aggregation, group aggregation is supported by traditional relational databases.

The EACH parameter can be used when your dataset contains a large number of distinct values for the group keys. Use GROUP BY without the EACH parameter when possible to improve query performance.

Notes:

Fields in the GROUP BY clause must be listed in the SELECT clause:

SELECT f1, COUNT(f2) FROM ds.Table GROUP BY f1; // OK
SELECT f1, COUNT(f2) FROM ds.Table GROUP BY f1, f3; // INVALID! f3 not in SELECT clause

Non-aggregated fields in the SELECT clause must be listed in the GROUP BY clause:

SELECT f1, COUNT(f2), f3 FROM ds.Table GROUP BY f1; // INVALID! f3 not in GROUP BY clause

You cannot group by an aggregate value:

SELECT f1, COUNT(f2) as count FROM ds.Table GROUP BY count; // INVALID! Cannot group by aggregate

Grouping by float and double values is not supported, because the equality function for those types is not well-defined.
Because the system is interactive, queries that produce a large number of groups might fail. The use of the TOP function instead of GROUP BY might solve the problem.

HAVING clause

... HAVING condition ...

The HAVING clause states the qualifying conditions for a query. Multiple conditions can be joined by boolean AND and OR clauses, optionally surrounded by (parentheses) to group them. HAVING is similar to WHERE, but it supports aggregate fields.

Note that the HAVING clause can only refer to fields defined in your SELECT clause (if the field has an alias, you must use it; if it doesn't, use the aggregate field name instead).

Example


SELECT keyword, SUM(clicks)/SUM(impressions) ctr FROM ads
  WHERE impressions > 100
  GROUP BY keyword
  HAVING ctr > 0.1;

SELECT foo, SUM(bar) as boo FROM myTable GROUP BY foo HAVING boo>0;

ORDER BY clause

... ORDER BY field1|alias1 [DESC|ASC], field2|alias2 [DESC|ASC] ...

The ORDER BY clause sorts the results of a query in ascending or descending order of one or more fields. Use DESC (descending) or ASC (ascending) to specify the sort direction. ASC is the default.

You can sort by field names or by aliases from the SELECT clause. To sort by multiple fields or aliases, enter them as a comma-separated list. The results are sorted on the fields in the order in which they are listed.

LIMIT clause

... LIMIT num_rows ...

The LIMIT clause limits the number of rows in the returned result set. For example, the following query returns only 5 results:

SELECT COUNT(*), word FROM ds.Table WHERE word CONTAINS 'th' GROUP_BY word LIMIT 5;

Since a query can operate over very large number of rows, LIMIT is a good way to avoid long-running queries for simple queries where representative data is sufficient.

Notes:

The LIMIT clause will stop processing and return results when it satisfies your requirements. This can reduce processing time for some queries, but when you specify aggregate functions such as COUNT or ORDER BY clauses, the full result set must still be processed before returning results.
The same LIMIT clause can return different results when it shortcuts processing the full result set. This is because queries are handled by parallel processing, and the order in which parallel jobs return is not guaranteed.
The LIMIT clause cannot contain any functions; it takes only numeric constants.

Supported functions and operators

Most SELECT statement clauses support functions. Fields referenced in a function don't need to be listed in any SELECT clause. Therefore, the following query is valid, even though the clicks field is not displayed directly: SELECT country, SUM(clicks) FROM table GROUP BY country;

Aggregate functions

Aggregate functions return values that represent summaries of larger sets of data, which makes these functions particularly useful for analyzing logs. An aggregate function operates against a collection of values and returns a single value per table, group, or scope:

Table aggregation
Uses an aggregate function to summarize all qualifying rows in the table. For example:

SELECT COUNT(f1) FROM ds.Table;
Group aggregation
Uses an aggregate function and a GROUP BY clause that specifies a non-aggregated field to summarize rows by group. For example:

SELECT COUNT(f1) FROM ds.Table GROUP BY b1;

The TOP function represents a specialized case of group aggregation.
Scoped aggregation
This feature applies only to tables that have nested fields.
Uses an aggregate function and the WITHIN keyword to aggregate repeated values within a defined scope. For example:

SELECT COUNT(m1.f2) WITHIN RECORD FROM Table;

The scope can be RECORD, which corresponds to entire row, or a node (repeated field in a row). Aggregation functions operate over the values within the scope and return aggregated results for each record or node.

Note: Aggregate functions cannot be used directly in a WHERE clause. However, you can use an alias in a WHERE clause. For example:

If you want to apply a restriction to an aggregate function, use a subselect query, for example:

SELECT x, y, z, count(*) as cnt FROM foo WHERE cnt > 100   // Succeeds
SELECT x, y, z  FROM foo WHERE count(*)                    // Invalid

Syntax

Function	Description
`AVG(numeric_expr)`	Returns the average of the values for a group of rows computed by `numeric_expr`. Rows with a NULL value are not included in the calculation.
`CORR(numeric_expr, numeric_expr)`	Returns the Pearson correlation coefficient of a set of number pairs.
`COUNT(*)`	Returns the total number of values (NULL and non-NULL) in the scope of the function. Unless you are using `COUNT(*)` with the `TOP` function, it is better to explicitly specify the field to count.
`COUNT([DISTINCT] field [, n])`	Returns the total number of non-NULL values in the scope of the function. If you use the `DISTINCT` keyword, the function returns the number of unique values in the specified field. Note that the returned value for `DISTINCT` is a statistical estimate, not necessarily the exact value. When you use the `DISTINCT` keyword, you can also specify a value for `n` (`n > 0`). If the count is greater than `n`, the function returns a statistical approximation. If the count is less than `n`, the function returns an exact result. The default value for `n` is 1000. Large values for `n` might substantially increase query execution time. In general, you should use `COUNT([DISTINCT] field [, n])` instead of `COUNT(*)` because specifying the field removes confusion over what you are counting when repeated fields are present.
`COVAR_POP(numeric_expr1, numeric_expr2)`	Computes the population covariance of the values computed by `numeric_expr1` and `numeric_expr2`.
`COVAR_SAMP(numeric_expr1, numeric_expr2)`	Computes the sample covariance of the values computed by `numeric_expr1` and `numeric_expr2`.
`GROUP_CONCAT('str')`	Concatenates multiple strings into a single comma-delimited string, rather like SUM() for strings. Use this with a grouping statement and a field name for str to concatenate a list of all string values in a group into a single string. Example: To return a list of wikipedia articles with a column that is a concatenated list of all contributors to that article, run `SELECT title, GROUP_CONCAT(contributor_username) AS editors FROM publicdata:samples.wikipedia WHERE wp_namespace=1 AND title CONTAINS "Cheese" GROUP BY title;`
`LAST(field)`	Returns the last sequential value in the scope of the function. The `LAST` function works only with the `WITHIN` keyword.
`MAX(field)`	Returns the maximum value in the scope of the function.
`MIN(field)`	Returns the minimum value in the scope of the function.
`NEST(expr)`	Aggregates all values in the current aggregation scope into a repeated field. For example, the query `"SELECT x, NEST(y) FROM ... GROUP BY x"` returns one output record for each distinct `x` value, and contains a repeated field for all `y` values paired with `x` in the query input. The `NEST` function requires a `GROUP BY` clause. BigQuery automatically flattens query results, so if you use the `NEST` function on the top level query, the results won't contain repeated fields. Use the `NEST` function when using a subselect that produces intermediate results for immediate use by the same query.
`NTH(n, field)`	Returns the `n`th sequential value in the scope of the function, where `n` is a constant. The `NTH` function starts counting at 1, so there is no zeroth term. If the scope of the function has less than `n` values, the function returns `NULL`.
`QUANTILES(expr[, buckets])`	Compares approximate quantiles for the input expression. The number of quantiles computed is controlled with the optional `buckets` parameter. The default value of `buckets` is 100. If specified explicitly, `buckets` must be greater than or equal to 2. The fractional error per quantile is epsilon = 1 / `buckets`.
`STDDEV(numeric_expr)`	Returns the standard deviation of the values computed by `numeric_expr`. Rows with a NULL value are not included in the calculation. The `STDDEV` function is an alias for `STDDEV_SAMP`.
`STDDEV_POP(numeric_expr)`	Computes the population standard deviation of the value computed by `numeric_expr`. For more information about population versus sample standard deviation, see Standard deviation on Wikipedia.
`STDDEV_SAMP(numeric_expr)`	Computes the sample standard deviation of the value computed by `numeric_expr`. For more information about population versus sample standard deviation, see Standard deviation on Wikipedia.
`SUM(field)`	Returns the sum total of the values in the scope of the function. For use with numerical data types only.
`TOP(field, [max_records], [multiplier])`	Returns the top max_records records by frequency. See the TOP description below for details.
`VARIANCE(numeric_expr)`	Computes the variance of the values computed by `numeric_expr`. Rows with a NULL value are not included in the calculation. The `VARIANCE` function is an alias for `VAR_SAMP`.
`VAR_POP(numeric_expr)`	Computes the population variance of the values computed by `numeric_expr`. For more information about population versus sample standard deviation, see Standard deviation on Wikipedia.
`VAR_SAMP(numeric_expr)`	Computes the sample variance of the values computed by `numeric_expr`. For more information about population versus sample standard deviation, see Standard deviation on Wikipedia.

TOP() function

TOP is a function that is an alternative to the GROUP BY clause. It is used as simplified syntax for GROUP BY ... ORDER BY ... LIMIT .... Generally, the TOP function performs faster than the full ... GROUP BY ... ORDER BY ... LIMIT ... query, but may only return approximate results. The following is the syntax for the TOP function:

TOP(field|alias[, max_values][,multiplier]) ... COUNT(*)

When using TOP in a SELECT clause, you must include COUNT(*) as one of the fields.

A query that uses the TOP() function can only return two fields: the TOP field, and the COUNT(*) value.

field|alias: The field or alias to return.
max_values: [Optional] The maximum number of results to return. Default is 20.
multiplier: A numeric constant, expression, or field that is multiplied with max_values to specify how many results to return.

Examples

SELECT TOP(word, 10), COUNT(*)
  FROM ds.Table WHERE word CONTAINS "th";

SELECT word, cnt, cnt * 2, left(word, 3)
  FROM (SELECT TOP(word, 10) AS word, COUNT(*) AS cnt
  FROM ds.Table WHERE word CONTAINS "th");

The following queries compare using TOP() versus using GROUP BY...ORDER BY...LIMIT. The query returns, in order, the top 10 most frequently used words containing "th", and the number of documents the words was used in. The TOP query will execute much faster:

SELECT word, COUNT(*) AS cnt FROM ds.Table
  WHERE word CONTAINS 'th' GROUP BY word ORDER BY cnt DESC LIMIT 10;

SELECT TOP(word, 10), COUNT(*) FROM ds.Table WHERE word contains 'th';

Note: You must include COUNT(*) in the SELECT clause to use TOP.

Advanced examples

Scenario Description Example

Average and standard deviation grouped by condition

Scenario	Description	Example
Average and standard deviation grouped by condition	The following query returns the average and standard deviation of birth weights in Ohio in 2003, grouped by mothers who do and do not smoke.	SELECT cigarette_use, /* Finds average and standard deviation / AVG(weight_pounds) baby_weight, STDDEV(weight_pounds) baby_weight_stdev, AVG(mother_age) mother_age FROM [publicdata:samples.natality] WHERE year=2003 AND state='OH' / Group the result values by those / / who smoked and those who didn't. */ GROUP BY cigarette_use;
Filter query results using an aggregated value	In order to filter query results using an aggregated value (for example, filtering by the value of a `SUM`), use the `HAVING` function. `HAVING` compares a value to a result determined by an aggregation function, as opposed to `WHERE`, which operates on each row prior to aggregation.	SELECT state, /* If 'is_male' is True, return 'Male', / / otherwise return 'Female' / IF (is_male, 'Male', 'Female') AS sex, / The count value is aliased as 'cnt' / / and used in the HAVING clause below. / COUNT() AS cnt FROM [publicdata:samples.natality] WHERE state != '' GROUP BY state, sex HAVING cnt > 3000000 ORDER BY cnt DESC Returns: +-------+--------+---------+ \| state \| sex \| cnt \| +-------+--------+---------+ \| CA \| Male \| 7060826 \| \| CA \| Female \| 6733288 \| \| TX \| Male \| 5107542 \| \| TX \| Female \| 4879247 \| \| NY \| Male \| 4442246 \| \| NY \| Female \| 4227891 \| \| IL \| Male \| 3089555 \| +-------+--------+---------+

The following query returns the average and standard deviation of birth weights in Ohio in 2003, grouped by mothers who do and do not smoke.

SELECT
  cigarette_use,
  /* Finds average and standard deviation */
  AVG(weight_pounds) baby_weight,
  STDDEV(weight_pounds) baby_weight_stdev,
  AVG(mother_age) mother_age
FROM
  [publicdata:samples.natality]
WHERE
  year=2003 AND state='OH'
/* Group the result values by those */
/* who smoked and those who didn't.  */
GROUP BY
  cigarette_use;

Filter query results using an aggregated value

In order to filter query results using an aggregated value (for example, filtering by the value of a SUM), use the HAVING function. HAVING compares a value to a result determined by an aggregation function, as opposed to WHERE, which operates on each row prior to aggregation.

SELECT
  state,
  /* If 'is_male' is True, return 'Male', */
  /* otherwise return 'Female' */
  IF (is_male, 'Male', 'Female') AS sex,
  /* The count value is aliased as 'cnt' */
  /* and used in the HAVING clause below. */
  COUNT(*) AS cnt
FROM
  [publicdata:samples.natality]
WHERE
  state != ''
GROUP BY
  state, sex
HAVING
  cnt > 3000000
ORDER BY
  cnt DESC

Returns:

+-------+--------+---------+
| state |  sex   |   cnt   |
+-------+--------+---------+
| CA    | Male   | 7060826 |
| CA    | Female | 6733288 |
| TX    | Male   | 5107542 |
| TX    | Female | 4879247 |
| NY    | Male   | 4442246 |
| NY    | Female | 4227891 |
| IL    | Male   | 3089555 |
+-------+--------+---------+

Arithmetic operators

Arithmetic operators take numeric arguments and return a numeric result. Each argument can be a numeric literal or a numeric value returned by a query. If the arithmetic operation evaluates to an undefined result, the operation returns NULL.

Syntax

Operator	Description	Example
+	Addition	`SELECT 6 + (5 - 1);` Returns: 10
-	Subtraction	`SELECT 6 - (4 + 1);` Returns: 1
*	Multiplication	`SELECT 6 * (5 - 1);` Returns: 24
/	Division	`SELECT 6 / (2 + 2);` Returns: 1.5
%	Modulo	`SELECT 6 % (2 + 2);` Returns: 2

Bitwise operators

Bitwise operators operate at the level of individual bits and require numerical arguments. For more information about bitwise operations, see Bitwise operation.

Syntax

Operator	Description	Example
&	Bitwise AND	`SELECT (1 + 3) & 1` Returns: 0
\|	Bitwise OR	`SELECT 24 \| 12` Returns: 28
^	Bitwise XOR	`SELECT 1 ^ 0` Returns: 1
<<	Bitwise shift left	`SELECT 1 << (2 + 2)` Returns: 16
>>	Bitwise shift right	`SELECT (6 + 2) >> 2` Returns: 2
~	Bitwise NOT	`SELECT ~2` Returns: -3

Casting functions

Casting functions change the data type of a numeric expression. Casting functions are particularly useful for ensuring that arguments in a comparison function have the same data type.

Syntax

Function	Description
`BOOLEAN(<numeric_expr>)`	Returns `true` if `<numeric_expr>` is not 0 and not NULL. Returns `false` if `<numeric_expr>` is 0. Returns `NULL` if `<numeric_expr>` is NULL.
`FLOAT(expr)`	Returns `expr` as a double. The `expr` can be a string like `'45.78'`, but the function returns `NULL` for non-numeric values.
`HEX_STRING(numeric_expr)`	Returns `numeric_expr` as a hexidecimal string.
`INTEGER(expr)`	Returns `expr` as a 64-bit integer. This function expects an integer string, such as `'45'`, and returns `NULL` for non-integer values.
`STRING(numeric_expr)`	Returns `numeric_expr` as a string.

Comparison functions

Comparison functions return true or false, based on the following types of comparisons:

A comparison of two expressions.
A comparison of an expression or set of expressions to a specific criteria, such as being in a specified list, being NULL, or being a non-default optional value.

You can use either numeric or string expressions as arguments for comparison functions. (String constants must be enclosed in single or double quotes.) The expressions can be literals or values fetched by a query. Comparison functions are most often used as filtering conditions in WHERE clauses, but they can be used in other clauses.

Syntax

Function	Description
`expr1 = expr2`	Returns `true` if the expressions are equal.
`expr1 != expr2` `expr1 <> expr2`	Returns `true` if the expressions are not equal.
`expr1 > expr2`	Returns `true` if `expr1` is greater than `expr2`.
`expr1 < expr2`	Returns `true` if `expr1` is less than `expr2`.
`expr1 >= expr2`	Returns `true` if `expr1` is greater than or equal to `expr2`.
`expr1 <= expr2`	Returns `true` if `expr1` is less than or equal to `expr2`.
`expr1 BETWEEN expr2 AND expr3`	Returns `true` if the value of `expr1` is greater than or equal to `expr2`, and less than or equal to `expr3`.
`expr IS NULL`	Returns `true` if `expr` is NULL.
`expr IN(expr1, expr2, ...)`	Returns `true` if `expr` matches `expr1`, `expr2`, or any value in the parentheses. The `IN` keyword is an efficient shorthand for `(expr = expr1 \|\| expr = expr2 \|\| ...)`. The expressions used with the `IN` keyword must be constants and they must match the data type of `expr`.
`IFNULL(expr, null_default)`	If expr is not null, returns expr, otherwise returns null_default.
`IS_INF(numeric_expr)`	Returns `true` if numeric_expr is positive or negative infinity.
`IS_NAN(numeric_expr)`	Returns `true` if numeric_expr is the special `NaN` numeric value.
`IS_EXPLICITLY_DEFINED(expr)`	Returns `true` for optional fields if the value is explicitly defined.

Date and time functions

The following functions enable date and time manipulation for UNIX timestamps, date strings and TIMESTAMP data types. For more information about working with the TIMESTAMP data type, see Using TIMESTAMP.

Date and time functions that work with UNIX timestamps operate on UNIX time. Date and time functions return values based upon the UTC time zone.

Syntax

Function	Description	Example
`CURRENT_DATE()`	Returns a human-readable string of the current date in the format `%Y-%m-%d`.	`SELECT CURRENT_DATE();` Returns: 2013-02-01
`CURRENT_TIME()`	Returns a human-readable string of the server's current time in the format `%H:%M:%S`.	`SELECT CURRENT_TIME();` Returns: 01:32:56
`CURRENT_TIMESTAMP()`	Returns a TIMESTAMP data type of the server's current time in the format `%Y-%m-%d %H:%M:%S`.	`SELECT CURRENT_TIMESTAMP();` Returns: 2013-02-01 01:33:35 UTC
`DATE(<timestamp>,<date_format>)`	Returns a human-readable string of a TIMESTAMP data type in the format `%Y-%m-%d`.	`SELECT DATE(TIMESTAMP('2012-10-01 02:03:04'));` Returns: 2012-10-01
`DATE_ADD(<timestamp>,<interval>, <interval_units>)`	Adds the specified interval to a TIMESTAMP data type. Possible `interval_units` values include `YEAR`, `MONTH`, `DAY`, `HOUR`, `MINUTE`, and `SECOND`. If `interval` is a negative number, the interval is subtracted from the TIMESTAMP data type.	`SELECT DATE_ADD(TIMESTAMP("2012-10-01 02:03:04"), 5, "YEAR");` Returns: 2017-10-01 02:03:04 UTC `SELECT DATE_ADD(TIMESTAMP("2012-10-01 02:03:04"), -5, "YEAR");` Returns: 2007-10-01 02:03:04 UTC
`DATEDIFF(<timestamp1>,<timestamp2>)`	Returns the number of days between two TIMESTAMP data types.	`SELECT DATEDIFF(TIMESTAMP('2012-10-02 05:23:48'), TIMESTAMP('2011-06-24 12:18:35'));` Returns: 466
`DAY(<timestamp>)`	Returns the day of the month of a TIMESTAMP data type as an integer between 1 and 31, inclusively.	`SELECT DAY(TIMESTAMP('2012-10-02 05:23:48'));` Returns: 2
`DAYOFWEEK(<timestamp>)`	Returns the day of the week of a TIMESTAMP data type as an integer between 1 and 7, inclusively.	`SELECT DAYOFWEEK(TIMESTAMP("2012-10-01 02:03:04"));` Returns: 2
`DAYOFYEAR(<timestamp>)`	Returns the day of the year of a TIMESTAMP data type as an integer between 1 and 365, inclusively.	`SELECT DAYOFYEAR(TIMESTAMP("2012-10-01 02:03:04"));` Returns: 275
`FORMAT_UTC_USEC(<unix_timestamp>)`	Returns a human-readable string representation of a UNIX timestamp in the format `YYYY-MM-DD HH:MM:SS.uuuuuu`.	`SELECT FORMAT_UTC_USEC(1274259481071200);` Returns: 2010-05-19 08:58:01.071200
`HOUR(<timestamp>)`	Returns the hour of a TIMESTAMP data type as an integer between 0 and 23, inclusively.	`SELECT HOUR(TIMESTAMP('2012-10-02 05:23:48'));` Returns: 5
`MINUTE(<timestamp>)`	Returns the minutes of a TIMESTAMP data type as an integer between 0 and 59, inclusively.	`SELECT MINUTE(TIMESTAMP('2012-10-02 05:23:48'));` Returns: 23
`MONTH(<timestamp>)`	Returns the month of a TIMESTAMP data type as an integer between 1 and 12, inclusively.	`SELECT MONTH(TIMESTAMP('2012-10-02 05:23:48'));` Returns: 10
`MSEC_TO_TIMESTAMP(<expr>)`	Converts a UNIX timestamp in milliseconds to a TIMESTAMP data type.	`SELECT MSEC_TO_TIMESTAMP(1349053323000);` Returns: 2012-10-01 01:02:03 UTC `SELECT MSEC_TO_TIMESTAMP(1349053323000 + 1000)` Returns: 2012-10-01 01:02:04 UTC
`NOW()`	Returns a UNIX timestamp in microseconds.	`SELECT NOW();` Returns: 1359685811687920
`PARSE_UTC_USEC(<date_string>)`	Converts a date string to a UNIX timestamp in microseconds. `date_string` must have the format `YYYY-MM-DD HH:MM:SS[.uuuuuu]`. The fractional part of the second can be up to 6 digits long or can be omitted. TIMESTAMP_TO_USEC is an equivalent function that converts a TIMESTAMP data type argument instead of a date string.	`SELECT PARSE_UTC_USEC("2012-10-01 02:03:04");` Returns: 1349056984000000
`QUARTER(<timestamp>)`	Returns the quarter of the year of a TIMESTAMP data type as an integer between 1 and 4, inclusively.	`SELECT QUARTER(TIMESTAMP("2012-10-01 02:03:04"));` Returns: 4
`SEC_TO_TIMESTAMP(<expr>)`	Converts a UNIX timestamp in seconds to a TIMESTAMP data type.	`SELECT SEC_TO_TIMESTAMP(1355968987);` Returns: 2012-12-20 02:03:07 UTC `SELECT SEC_TO_TIMESTAMP(INTEGER(1355968984 + 3));` Returns: 2012-12-20 02:03:07 UTC
`SECOND(<timestamp>)`	Returns the seconds of a TIMESTAMP data type as an integer between 0 and 59, inclusively. During a leap second, the integer range is between 0 and 60, inclusively.	`SELECT SECOND(TIMESTAMP('2012-10-02 05:23:48'));` Returns: 48
`STRFTIME_UTC_USEC(<unix_timestamp>, <date_format_str>)`	Returns a human-readable date string in the format date_format_str. date_format_str can include date-related punctuation characters (such as / and -) and special characters accepted by the strftime function in C++ (such as %d for day of month). Use the `UTC_USEC_TO_<function_name>` functions if you plan to group query data by time intervals, such as getting all data for a certain month, because the functions are more efficient.	`SELECT STRFTIME_UTC_USEC(1274259481071200, "%Y-%m-%d");` Returns: 2010-05-19
`TIME(<timestamp>)`	Returns a human-readable string of a TIMESTAMP data type, in the format `%H:%M:%S`.	`SELECT TIME(TIMESTAMP('2012-10-01 02:03:04'));` Returns: 02:03:04
`TIMESTAMP(<date_string>)`	Convert a date string to a TIMESTAMP data type.	`SELECT TIMESTAMP("2012-10-01 01:02:03");` Returns: 2012-10-01 01:02:03 UTC
`TIMESTAMP_TO_MSEC(<timestamp>)`	Converts a TIMESTAMP data type to a UNIX timestamp in milliseconds.	`SELECT TIMESTAMP_TO_MSEC(TIMESTAMP("2012-10-01 01:02:03"));` Returns: 1349053323000
`TIMESTAMP_TO_SEC(<timestamp>)`	Converts a TIMESTAMP data type to a UNIX timestamp in seconds.	`SELECT TIMESTAMP_TO_SEC(TIMESTAMP("2012-10-01 01:02:03"));` Returns: 1349053323
`TIMESTAMP_TO_USEC(<timestamp>)`	Converts a TIMESTAMP data type to a UNIX timestamp in microseconds. PARSE_UTC_USEC is an equivalent function that converts a data string argument instead of a TIMESTAMP data type.	`SELECT TIMESTAMP_TO_USEC(TIMESTAMP("2012-10-01 01:02:03"));` Returns: 1349053323000000
`USEC_TO_TIMESTAMP(<expr>)`	Converts a UNIX timestamp in microseconds to a TIMESTAMP data type.	`SELECT USEC_TO_TIMESTAMP(1349053323000000);` Returns: 2012-10-01 01:02:03 UTC `SELECT USEC_TO_TIMESTAMP(1349053323000000 + 1000000)` Returns: 2012-10-01 01:02:04 UTC
`UTC_USEC_TO_DAY(<unix_timestamp>)`	Shifts a UNIX timestamp in microseconds to the beginning of the day it occurs in. For example, if `unix_timestamp` occurs on May 19th at 08:58, this function returns a UNIX timestamp for May 19th at 00:00 (midnight).	`SELECT UTC_USEC_TO_DAY(1274259481071200);` Returns: 1274227200000000
`UTC_USEC_TO_HOUR(<unix_timestamp>)`	Shifts a UNIX timestamp in microseconds to the beginning of the hour it occurs in. For example, if `unix_timestamp` occurs at 08:58, this function returns a UNIX timestamp for 08:00 on the same day.	`SELECT UTC_USEC_TO_HOUR(1274259481071200);` Returns: 1274256000000000
`UTC_USEC_TO_MONTH(<unix_timestamp>)`	Shifts a UNIX timestamp in microseconds to the beginning of the month it occurs in. For example, if `unix_timestamp` occurs on March 19th, this function returns a UNIX timestamp for March 1st of the same year.	`SELECT UTC_USEC_TO_MONTH(1274259481071200);` Returns: 1272672000000000
`UTC_USEC_TO_WEEK(<unix_timestamp>, <day_of_week>)`	Returns a UNIX timestamp in microseconds that represents a day in the week of the `unix_timestamp` argument. This function takes two arguments: a UNIX timestamp in microseconds, and a day of the week from 0 (Sunday) to 6 (Saturday). For example, if `unix_timestamp` occurs on Friday, 2008-04-11, and you set `day_of_week` to 2 (Tuesday), the function returns a UNIX timestamp for Tuesday, 2008-04-08.	`SELECT UTC_USEC_TO_WEEK(1207929480000000, 2) AS tuesday;` Returns: 1207612800000000
`UTC_USEC_TO_YEAR(<unix_timestamp>)`	Returns a UNIX timestamp in microseconds that represents the year of the `unix_timestamp` argument. For example, if `unix_timestamp` occurs in 2010, the function returns `1274259481071200`, the microsecond representation of `2010-01-01 00:00`.	`SELECT UTC_USEC_TO_YEAR(1274259481071200);` Returns: 1262304000000000
`YEAR(<timestamp>)`	Returns the year of a TIMESTAMP data type.	`SELECT YEAR(TIMESTAMP('2012-10-02 05:23:48'));` Returns: 2012

Advanced examples

Scenario Description Example

Convert integer timestamp results into human-readable format

Scenario	Description	Example
Convert integer timestamp results into human-readable format	The following query finds the top 5 moments in time in which the most Wikipedia revisions took place. In order to display results in a human-readable format, use BigQuery's `FORMAT_UTC_USEC()` function, which takes a timestamp, in microseconds, as an input. This query multiplies the Wikipedia POSIX format timestamps (in seconds) by 1000000 to convert the value into microseconds.	SELECT /* Multiply timestamp by 1000000 and convert / / into a more human-readable format. / TOP (FORMAT_UTC_USEC(timestamp 1000000), 5) AS top_revision_time, COUNT () AS revision_count FROM [publicdata:samples.wikipedia]; Returns:* +----------------------------+----------------+ \| top_revision_time \| revision_count \| +----------------------------+----------------+ \| 2002-02-25 15:51:15.000000 \| 20971 \| \| 2002-02-25 15:43:11.000000 \| 15955 \| \| 2010-01-14 15:52:34.000000 \| 3 \| \| 2009-12-31 19:29:19.000000 \| 3 \| \| 2009-12-28 18:55:12.000000 \| 3 \| +----------------------------+----------------+
Bucketing Results by Timestamp	It's useful to use date and time functions to group query results into buckets corresponding to particular years, months, or days. The following example uses the `UTC_USEC_TO_MONTH()` function to display how many characters each Wikipedia contributor uses in their revision comments per month.	SELECT contributor_username, /* Return the timestamp shifted to the * start of the month, formatted in * a human-readable format. Uses the * 'LEFT()' string function to return only * the first 7 characters of the formatted timestamp. / LEFT (FORMAT_UTC_USEC( UTC_USEC_TO_MONTH(timestamp 1000000)),7) AS month, SUM(LENGTH(comment)) as total_chars_used FROM [publicdata:samples.wikipedia] WHERE (contributor_username != '' AND contributor_username IS NOT NULL) AND timestamp > 1133395200 AND timestamp < 1157068800 GROUP BY contributor_username, month ORDER BY total_chars_used DESC; Returns (truncated): +--------------------------------+---------+-----------------------+ \| contributor_username \| month \| total_chars_used \| +--------------------------------+---------+-----------------------+ \| Kingbotk \| 2006-08 \| 18015066 \| \| SmackBot \| 2006-03 \| 7838365 \| \| SmackBot \| 2006-05 \| 5148863 \| \| Tawkerbot2 \| 2006-05 \| 4434348 \| \| Cydebot \| 2006-06 \| 3380577 \| etc ...

The following query finds the top 5 moments in time in which the most Wikipedia revisions took place. In order to display results in a human-readable format, use BigQuery's FORMAT_UTC_USEC() function, which takes a timestamp, in microseconds, as an input. This query multiplies the Wikipedia POSIX format timestamps (in seconds) by 1000000 to convert the value into microseconds.

SELECT
  /* Multiply timestamp by 1000000 and convert */
  /* into a more human-readable format. */
  TOP (FORMAT_UTC_USEC(timestamp * 1000000), 5) 
    AS top_revision_time,
  COUNT (*) AS revision_count
FROM
  [publicdata:samples.wikipedia];

Returns:

+----------------------------+----------------+
|     top_revision_time      | revision_count |
+----------------------------+----------------+
| 2002-02-25 15:51:15.000000 |          20971 |
| 2002-02-25 15:43:11.000000 |          15955 |
| 2010-01-14 15:52:34.000000 |              3 |
| 2009-12-31 19:29:19.000000 |              3 |
| 2009-12-28 18:55:12.000000 |              3 |
+----------------------------+----------------+

Bucketing Results by Timestamp

It's useful to use date and time functions to group query results into buckets corresponding to particular years, months, or days. The following example uses the UTC_USEC_TO_MONTH() function to display how many characters each Wikipedia contributor uses in their revision comments per month.

SELECT
  contributor_username,
  /* Return the timestamp shifted to the
   * start of the month, formatted in
   * a human-readable format. Uses the
   * 'LEFT()' string function to return only
   * the first 7 characters of the formatted timestamp.
   */
  LEFT (FORMAT_UTC_USEC(
    UTC_USEC_TO_MONTH(timestamp * 1000000)),7) 
    AS month,
  SUM(LENGTH(comment)) as total_chars_used
FROM
  [publicdata:samples.wikipedia]
WHERE
  (contributor_username != '' AND 
   contributor_username IS NOT NULL)
  AND timestamp > 1133395200
  AND timestamp < 1157068800
GROUP BY
  contributor_username, month
ORDER BY
  total_chars_used DESC;

Returns (truncated):

+--------------------------------+---------+-----------------------+
|      contributor_username      |  month  | total_chars_used      |
+--------------------------------+---------+-----------------------+
| Kingbotk                       | 2006-08 |              18015066 |
| SmackBot                       | 2006-03 |               7838365 |
| SmackBot                       | 2006-05 |               5148863 |
| Tawkerbot2                     | 2006-05 |               4434348 |
| Cydebot                        | 2006-06 |               3380577 |
etc ...

IP functions

IP functions convert IP addresses to and from human-readable form.

Syntax

Function	Description
`FORMAT_IP(integer_value)`	Converts 32 least significant bits of `integer_value` to human-readable IPv4 address string. For example, `FORMAT_IP(1)` will return string `'0.0.0.1'`.
`PARSE_IP(readable_ip)`	Converts a string representing IPv4 address to unsigned integer value. For example, `PARSE_IP('0.0.0.1')` will return `1`. If string is not a valid IPv4 address, `PARSE_IP` will return `NULL`.

IPAddress supports writing IPv4 and IPv6 addresses in packed strings, as 4- or 16-byte binary data in network byte order. The functions described below supports parsing the addresses to and from human readable form. These functions work only on string fields with IPs.

Function syntax Description

Function syntax	Description
`FORMAT_PACKED_IP(packed_ip)`	Returns a human-readable IP address, in the form `10.1.5.23` or `2620:0:1009:1:216:36ff:feef:3f`. Examples: `FORMAT_PACKED_IP('0123456789@ABCDE')` returns `'3031:3233:3435:3637:3839:4041:4243:4445'` `FORMAT_PACKED_IP('0123')` returns `'48.49.50.51'`
`PARSE_PACKED_IP(readable_ip)`	Returns an IP address in binary format packed in a string. If the input string is not a valid IPv4 or IPv6 address, `PARSE_PACKED_IP` will return `NULL`. Examples: `PARSE_PACKED_IP('48.49.50.51')` returns `'0123'` `PARSE_PACKED_IP('3031:3233:3435:3637:3839:4041:4243:4445')` returns `'0123456789@ABCDE'`

FORMAT_PACKED_IP(packed_ip)

Returns a human-readable IP address, in the form 10.1.5.23 or 2620:0:1009:1:216:36ff:feef:3f. Examples:

FORMAT_PACKED_IP('0123456789@ABCDE') returns '3031:3233:3435:3637:3839:4041:4243:4445'
FORMAT_PACKED_IP('0123') returns '48.49.50.51'

PARSE_PACKED_IP(readable_ip)

Returns an IP address in binary format packed in a string. If the input string is not a valid IPv4 or IPv6 address, PARSE_PACKED_IP will return NULL. Examples:

PARSE_PACKED_IP('48.49.50.51') returns '0123'
PARSE_PACKED_IP('3031:3233:3435:3637:3839:4041:4243:4445') returns '0123456789@ABCDE'

Logical operators

Logical operators perform binary or ternary logic on expressions. Binary logic returns true or false. Ternary logic accommodates NULL values and returns true, false, or NULL.

Syntax

Operator Description

Operator	Description
`expr AND expr`	Returns `true` if both expressions are true. Returns `false` if one or both of the expressions are false. Returns `NULL` if both expressions are NULL or one expression is true and the other is NULL.
`expr OR expr`	Returns `true` if one or both expressions are true. Returns `false` if both expressions are false. Returns `NULL` if both expressions are NULL or one expression is false and the other is NULL.
`NOT expr`	Returns `true` if the expression is false. Returns `false` if the expression if true. Returns `NULL` if the expression is NULL. You can use `NOT` with other functions as an negation operator. For example, `NOT IN(expr1, expr2)` or `IS NOT NULL`.

expr AND expr

Returns true if both expressions are true.
Returns false if one or both of the expressions are false.
Returns NULL if both expressions are NULL or one expression is true and the other is NULL.

expr OR expr

Returns true if one or both expressions are true.
Returns false if both expressions are false.
Returns NULL if both expressions are NULL or one expression is false and the other is NULL.

NOT expr

Returns true if the expression is false.
Returns false if the expression if true.
Returns NULL if the expression is NULL.

You can use NOT with other functions as an negation operator. For example, NOT IN(expr1, expr2) or IS NOT NULL.

Mathematical functions

Mathematical functions take numeric arguments and return a numeric result. Each argument can be a numeric literal or a numeric value returned by a query. If the mathematical function evaluates to an undefined result, the operation returns NULL.

Syntax

Function	Description
`ABS(numeric_expr)`	Returns the absolute value of the argument.
`ACOS(numeric_expr)`	Returns the arc cosine of the argument.
`ACOSH(numeric_expr)`	Returns the arc hyperbolic cosine of the argument.
`ASIN(numeric_expr)`	Returns the arc sine of the argument.
`ASINH(numeric_expr)`	Returns the arc hyperbolic sine of the argument.
`ATAN(numeric_expr)`	Returns the arc tangent of the argument.
`ATANH(numeric_expr)`	Returns the arc hyperbolic tangent of the argument.
`ATAN2(numeric_expr1, numeric_expr2)`	Returns the arc tangent of the two arguments.
`CEIL(numeric_expr)`	Rounds the argument up to the nearest whole number and returns the rounded value.
`COS(numeric_expr)`	Returns the cosine of the argument.
`COSH(numeric_expr)`	Returns the hyperbolic cosine of the argument.
`DEGREES(numeric_expr)`	Returns `numeric_expr`, converted from radians to degrees.
`FLOOR(numeric_expr)`	Rounds the argument down to the nearest whole number and returns the rounded value.
`LN(numeric_expr)` `LOG(numeric_expr)`	Returns the natural logarithm of the argument.
`LOG2(numeric_expr)`	Returns the Base-2 logarithm of the argument.
`LOG10(numeric_expr)`	Returns the Base-10 logarithm of the argument.
`PI()`	Returns the constant π. The `PI()` function requires parentheses to signify that it is a function, but takes no arguments in those parentheses. You can use `PI()` like a constant with mathematical and arithmetic functions.
`POW(numeric_expr1, numeric_expr2)`	Returns the result of raising `numeric_expr1` to the power of `numeric_expr2`.
`RADIANS(numeric_expr)`	Returns `numeric_expr`, converted from degrees to radians. (Note that π radians equals 180 degrees.)
`ROUND(numeric_expr)`	Rounds the argument either up or down to the nearest whole number and returns the rounded value.
`SIN(numeric_expr)`	Returns the sine of the argument.
`SINH(numeric_expr)`	Returns the hyperbolic sine of the argument.
`SQRT(numeric_expr)`	Returns the square root of the expression.
`TAN(numeric_expr)`	Returns the tangent of the argument.
`TANH(numeric_expr)`	Returns the hyperbolic tangent of the argument.

Advanced examples

Scenario Description Example

Bounding box query

Scenario	Description	Example
Bounding box query	The following query returns a collection of points within a rectangular bounding box centered around San Francisco (37.46, -122.50).	SELECT year, month, AVG(mean_temp) avg_temp, MIN(min_temperature) min_temp, MAX(max_temperature) max_temp FROM [weather_geo.table] WHERE /* Return values between a pair of / / latitude and longitude coordinates */ lat / 1000 > 37.46 AND lat / 1000 < 37.65 AND long / 1000 > -122.50 AND long / 1000 < -122.30 GROUP BY year, month ORDER BY year, month ASC;
Approximate Bounding Circle Query	Return a collection of up to 100 points within an approximated circle determined by the using the Spherical Law of Cosines, centered around Denver Colorado (39.73, -104.98). This query makes use of BigQuery's mathematical and trigonometric functions, such as `PI()`, `SIN()`, and `COS()`. Because the Earth isn't an absolute sphere, and longitude+latitude converges at the poles, this query returns an approximation that can be useful for many types of data.	SELECT distance, lat, long, temp FROM (SELECT ((ACOS(SIN(39.73756700 * PI() / 180) * SIN((lat/1000) * PI() / 180) + COS(39.73756700 * PI() / 180) * COS((lat/1000) * PI() / 180) * COS((-104.98471790 - (long/1000)) * PI() / 180)) * 180 / PI()) * 60 * 1.1515) AS distance, AVG(mean_temp) AS temp, AVG(lat/1000) lat, AVG(long/1000) long FROM [weather_geo.table] WHERE month=1 GROUP BY distance) WHERE distance < 100 ORDER BY distance ASC LIMIT 100;

The following query returns a collection of points within a rectangular bounding box centered around San Francisco (37.46, -122.50).

SELECT
  year, month,
  AVG(mean_temp) avg_temp,
  MIN(min_temperature) min_temp,
  MAX(max_temperature) max_temp
FROM
  [weather_geo.table]
WHERE
  /* Return values between a pair of */ 
  /* latitude and longitude coordinates */
  lat / 1000 > 37.46 AND 
  lat / 1000 < 37.65 AND 
  long / 1000 > -122.50 AND 
  long / 1000 < -122.30
GROUP BY
  year, month
ORDER BY
  year, month ASC;

Approximate Bounding Circle Query

Return a collection of up to 100 points within an approximated circle determined by the using the Spherical Law of Cosines, centered around Denver Colorado (39.73, -104.98). This query makes use of BigQuery's mathematical and trigonometric functions, such as PI(), SIN(), and COS().

Because the Earth isn't an absolute sphere, and longitude+latitude converges at the poles, this query returns an approximation that can be useful for many types of data.

SELECT
  distance, lat, long, temp
FROM
  (SELECT
    ((ACOS(SIN(39.73756700 * PI() / 180) *
           SIN((lat/1000) * PI() / 180) + 
           COS(39.73756700 * PI() / 180) *
           COS((lat/1000) * PI() / 180) * 
           COS((-104.98471790 -
           (long/1000)) * PI() / 180)) * 
           180 / PI()) * 60 * 1.1515)
      AS distance,
     AVG(mean_temp) AS temp,
     AVG(lat/1000) lat, AVG(long/1000) long
FROM
  [weather_geo.table]
WHERE
  month=1 GROUP BY distance)
WHERE
  distance < 100
ORDER BY
  distance ASC
LIMIT 100;

Regular expression functions

BigQuery provides regular expression support using the re2 library; see that documentation for its regular expression syntax.

Note that the regular expressions are global matches; to start matching at the beginning of a word you must use the ^ character.

Syntax

Function Description Example

REGEXP_MATCH('str', 'reg_exp')

Returns true if str matches the regular expression. For string matching without regular expressions, use CONTAINS instead of REGEXP_MATCH.

SELECT word, COUNT(word) AS count FROM publicdata:samples.shakespeare WHERE (REGEXP_MATCH(word,r'\w\w\'\w\w')) GROUP BY word ORDER BY count DESC LIMIT 3;

Returns:

word	count
ne'er	42
we'll	35
We'll	33

REGEXP_EXTRACT('str', 'reg_exp')

Returns the portion of str that matches the capturing group within the regular expression.

SELECT REGEXP_EXTRACT(word,r'(\w\w\'\w\w)') AS fragment FROM publicdata:samples.shakespeare GROUP BY fragment ORDER BY fragment LIMIT 3;

Returns:

fragment
null
Al'ce
As'es

REGEXP_REPLACE('orig_str', 'reg_exp', 'replace_str')

Returns a string where any substring of orig_str that matches reg_exp is replaced with replace_str. For example, REGEXP_REPLACE ('Hello', 'lo', 'p') returns Help.

Advanced examples

Scenario Description Example

Filter result set by regular expression match

Scenario	Description	Example
Filter result set by regular expression match	BigQuery's regular expression functions can be used to filter results in a `WHERE` clause, as well as to display results in the `SELECT`. The following example combines both of these regular expression use cases into a single query.	SELECT /* Replace underscores in the title with spaces / REGEXP_REPLACE(title, r'_', ' ') AS regexp_title, views FROM (SELECT title, SUM(views) as views FROM [bigquery-samples:wikimedia_pageviews.201201] WHERE NOT title CONTAINS ':' AND wikimedia_project='wp' AND language='en' / Match titles that start with 'G', / / end with 'e', and contain two 'o's / AND REGEXP_MATCH(title, r'^G.o.o.e$') GROUP BY title ORDER BY views DESC LIMIT 100)
Using regular expressions on integer or float data	While BigQuery's regular expression functions only work for string data, it's possible to use the `STRING()` function to cast integer or float data into string format. In this example, `STRING()` is used to cast the integer value `corpus_date` to a string, which is then altered by `REGEXP_REPLACE`.	SELECT corpus_date, /* Cast the corpus_date to a string value / REGEXP_REPLACE(STRING(corpus_date), '^16', 'Written in the sixteen hundreds, in the year \'' ) AS date_string FROM [publicdata:samples.shakespeare] / Cast the corpus_date to string, / / match values that begin with '16' */ WHERE REGEXP_MATCH(STRING(corpus_date), '^16') GROUP BY corpus_date, date_string ORDER BY date_string DESC LIMIT 5;

BigQuery's regular expression functions can be used to filter results in a WHERE clause, as well as to display results in the SELECT. The following example combines both of these regular expression use cases into a single query.

SELECT
  /* Replace underscores in the title with spaces  */
  REGEXP_REPLACE(title, r'_', ' ') AS regexp_title, views
  FROM
    (SELECT title, SUM(views) as views 
    FROM [bigquery-samples:wikimedia_pageviews.201201]
  WHERE
    NOT title CONTAINS ':'
    AND wikimedia_project='wp'
    AND language='en'
    /* Match titles that start with 'G', */
    /* end with 'e', and contain two 'o's  */
    AND REGEXP_MATCH(title, r'^G.*o.*o.*e$')
  GROUP BY
    title
  ORDER BY
    views DESC
  LIMIT 100)

Using regular expressions on integer or float data

While BigQuery's regular expression functions only work for string data, it's possible to use the STRING() function to cast integer or float data into string format. In this example, STRING() is used to cast the integer value corpus_date to a string, which is then altered by REGEXP_REPLACE.

SELECT
  corpus_date,
  /* Cast the corpus_date to a string value  */
  REGEXP_REPLACE(STRING(corpus_date),
    '^16',
    'Written in the sixteen hundreds, in the year \''
    ) AS date_string
FROM [publicdata:samples.shakespeare]
/* Cast the corpus_date to string, */
/* match values that begin with '16' */
WHERE
  REGEXP_MATCH(STRING(corpus_date), '^16')
GROUP BY
  corpus_date, date_string
ORDER BY
  date_string DESC
LIMIT 5;

String functions

String functions operate on string data. String constants must be enclosed with single or double quotes. String functions are case-sensitive by default and should use LATIN-1 encoding only (use UTF-8 encoding if necessary). You can append IGNORE CASE to the end of a query to enable case-insensitive matching. IGNORE CASE works only for LATIN-1 strings.

Wildcards are not supported in these functions; for regular expression functionality, use regular expression functions.

Syntax

Function	Description
`CONCAT('str1', 'str2')`	Returns the concatenation of two strings. Example: if `str1` is `Java` and `str2` is `Script`, `CONCAT` returns `JavaScript`.
`expr CONTAINS 'str'`	Returns `true` if `expr` contains the specified string argument. This is a case-sensitive comparison.
`LEFT('str', numeric_expr)`	Returns the leftmost numeric_expr characters of `str`. If the number is longer than str, the full string will be returned. Example: `LEFT('seattle', 3)` returns `sea`.
`LENGTH('str')`	Returns a numerical value for the length of the string. Example: if `str` is `'123456'`, `LENGTH` returns `6`.
`LOWER('str')`	Returns the original string with all characters in lower case. Works for LATIN-1 characters only.
`LPAD('str1', numeric_expr, 'str2')`	Returns the string `str2`, left-padded with the string `str1` to a length of `numeric_expr` characters. If the returned string would be longer than `const`, the return value is shortened to `const` characters. Example: `LPAD('ator', 7, 'senior')` returns `senator`.
`RIGHT('str', numeric_expr)`	Returns the righmost numeric_expr characters of `str`. If the number is longer than the string, it will return the whole string. Example: `RIGHT('kirkland', 4)` returns `land`.
`RPAD('str1', numeric_expr, 'str2')`	Returns the string `str2`, right-padded with the string `str1` to a length of `numeric_expr` characters. If the returned string would be longer than `const`, the return value is shortened to `const` characters. Example: `RPAD('west ', 12, 'seattleite')` returns `west seattle`.
`SUBSTR('str', const_from, const_len)`	Returns a substring of `str` that is up to `const_len` characters long, and that begins from the point in `str` that is specified by `const_from`. Counting starts at 1, so the first character in the string is in position 1 (not zero). If `const_from` is `5`, the substring begins with the 5th character from the left in `str`. If `const_from` is `-4`, the substring begins with the 4th character from the right in `str`. Example: `SUSBTR('awesome', -4, 4)` returns the substring `some`.
`UPPER('str')`	Returns the original string with all characters in upper case. Works for LATIN-1 characters only.

Escaping special characters in strings

To escape special characters, use one of the following methods:

Use'\xDD' notation, where '\x' is followed by the two-digit hex representation of the character.
Use an escaping slash in front of slashes, single quotes, and double quotes.
Use C-style sequences ('\a', '\b', '\f', '\n', '\r', '\t', and '\v') for other characters.

Some examples of escaping:

'this is a space: \x20'
'this string has \'single quote\' inside it'
'first line \n second line'
"double quotes are also ok"
'\070' -> ERROR: octal escaping is not supported

URL functions

Syntax

Function	Description
`HOST('url_str')`	Given a URL, returns the host name as a string. Example: HOST('http://www.google.com:80/index.html') returns 'www.google.com:80'
`DOMAIN('url_str')`	Given a URL, returns the domain as a string. Example: HOST('http://www.google.com:80/index.html') returns 'google.com'
`TLD('url_str')`	Given a URL, returns the top level domain plus any country domain in the URL. Example: TLD('http://www.google.com:80/index.html') returns '.com'. TLD('http://www.google.co.uk:80/index.html') returns '.co.uk'.

Advanced examples

Scenario Description Example

Parse domain names from URL data

Scenario	Description	Example
Parse domain names from URL data	This query uses the `DOMAIN()` function to return the most popular domains listed as repository homepages on GitHub. Note the use of HAVING to filter records using the result of the `DOMAIN()` function. This is a useful function to determine referrer information from URL data.	SELECT DOMAIN(repository_homepage) AS user_domain, COUNT() AS activity_count FROM [publicdata:samples.github_timeline] GROUP BY user_domain HAVING user_domain IS NOT NULL AND user_domain != '' ORDER BY activity_count DESC LIMIT 5; Returns:* +-----------------+----------------+ \| user_domain \| activity_count \| +-----------------+----------------+ \| github.com \| 281879 \| \| google.com \| 34769 \| \| khanacademy.org \| 17316 \| \| sourceforge.net \| 15103 \| \| mozilla.org \| 14091 \| +-----------------+----------------+ To look specifically at TLD information, use the `TLD()` function. This example displays the top TLDs that are not in a list of common examples. SELECT TLD(repository_homepage) AS user_tld, COUNT() AS activity_count FROM [publicdata:samples.github_timeline] GROUP BY user_tld HAVING / Only consider TLDs that are NOT NULL / / or in our list of common TLDs / user_tld IS NOT NULL AND NOT user_tld IN ('','.com','.net','.org','.info','.edu') ORDER BY activity_count DESC LIMIT 5; Returns:* +----------+----------------+ \| user_tld \| activity_count \| +----------+----------------+ \| .de \| 22934 \| \| .io \| 17528 \| \| .me \| 13652 \| \| .fr \| 12895 \| \| .co.uk \| 9135 \| +----------+----------------+

This query uses the DOMAIN() function to return the most popular domains listed as repository homepages on GitHub. Note the use of HAVING to filter records using the result of the DOMAIN() function. This is a useful function to determine referrer information from URL data.

SELECT
  DOMAIN(repository_homepage) AS user_domain,
  COUNT(*) AS activity_count
FROM
  [publicdata:samples.github_timeline]
GROUP BY
  user_domain
HAVING
  user_domain IS NOT NULL AND user_domain != ''
ORDER BY
  activity_count DESC
LIMIT 5;

Returns:

+-----------------+----------------+
|   user_domain   | activity_count |
+-----------------+----------------+
| github.com      |         281879 |
| google.com      |          34769 |
| khanacademy.org |          17316 |
| sourceforge.net |          15103 |
| mozilla.org     |          14091 |
+-----------------+----------------+

To look specifically at TLD information, use the TLD() function. This example displays the top TLDs that are not in a list of common examples.

SELECT
  TLD(repository_homepage) AS user_tld,
  COUNT(*) AS activity_count
FROM
  [publicdata:samples.github_timeline]
GROUP BY
  user_tld
HAVING
  /* Only consider TLDs that are NOT NULL */
  /* or in our list of common TLDs */
  user_tld IS NOT NULL AND NOT user_tld 
  IN ('','.com','.net','.org','.info','.edu')
ORDER BY
  activity_count DESC
LIMIT 5;

Returns:

+----------+----------------+
| user_tld | activity_count |
+----------+----------------+
| .de      |          22934 |
| .io      |          17528 |
| .me      |          13652 |
| .fr      |          12895 |
| .co.uk   |           9135 |
+----------+----------------+

Window functions

Window functions enable calculations on a specific partition, or "window", of a result set. Each window function expects an OVER clause that specifies the partition, in the following syntax:

OVER ([PARTITION BY <expr>] [ORDER BY <expr>])

PARTITION BY is always optional. ORDER BY is optional in some cases, but certain window functions, such as rank() or dense_rank(), require the clause.

All instances of an OVER clause in a SELECT statement must match.

JOIN EACH and GROUP EACH BY clauses can't be used on the output of window functions. Window functions can't be used to generate large query results.

The query examples associated with each of the following window functions are based upon the publicdata:samples.shakespeare dataset.

Syntax

Function Description Example

CUME_DIST()

Returns a double that indicates the cumulative distribution of a value in a group of values, calculated using the formula <number of rows preceding or tied with the current row> / <total rows>. Tied values return the same cumulative distribution value.

This window function requires ORDER BY in the OVER clause.

SELECT word, word_count, CUME_DIST() OVER (PARTITION BY corpus ORDER BY word_count DESC) cume_dist, FROM [publicdata:samples.shakespeare] WHERE corpus='othello' and length(word) > 10 LIMIT 5

Returns:

word	word_count	cume_dist
handkerchief	29	0.2
satisfaction	5	0.4
displeasure	4	0.8
instruments	4	0.8
circumstance	3	1.0

DENSE_RANK()

Returns the integer rank of a value in a group of values. The rank is calculated based on comparisons with other values in the group.

Tied values display as the same rank. The rank of the next value is incremented by 1. For example, if two values tie for rank 2, the next ranked value is 3. If you prefer a gap in the ranking list, use rank().

This window function requires ORDER BY in the OVER clause.

SELECT word, word_count, DENSE_RANK() OVER (PARTITION BY corpus ORDER BY word_count DESC) dense_rank, FROM [publicdata:samples.shakespeare] WHERE corpus='othello' and length(word) > 10 LIMIT 5

Returns:

word	word_count	dense_rank
handkerchief	29	1
satisfaction	5	2
displeasure	4	3
instruments	4	3
circumstance	3	4

LAG(<expr>, <offset>, <default_value>)

Returns the value of <expr> for the row located <offset> rows before the current row. If the row doesn't exist, <default_value> returns.

SELECT word, word_count, LAG(word, 1) OVER (PARTITION BY corpus ORDER BY word_count DESC) lag, FROM [publicdata:samples.shakespeare] WHERE corpus='othello' and length(word) > 10 LIMIT 5

Returns:

word	word_count	lag
handkerchief	29	null
satisfaction	5	handkerchief
displeasure	4	satisfaction
instruments	4	displeasure
circumstance	3	instruments

LEAD(<expr>, <offset>, <default_value>)

Returns the value of <expr> for the row located <offset> rows after the current row. If the row doesn't exist, <default_value> returns.

SELECT word, word_count, LEAD(word, 1) OVER (PARTITION BY corpus ORDER BY word_count DESC) lead, FROM [publicdata:samples.shakespeare] WHERE corpus='othello' and length(word) > 10 LIMIT 5

Returns:

word	word_count	lead
handkerchief	29	satisfaction
satisfaction	5	displeasure
displeasure	4	instruments
instruments	4	circumstance
circumstance	3	null

NTILE(<num_buckets>)

Divides a sequence of rows into <num_buckets> buckets and assigns a corresponding bucket number, as an integer, with each row. The ntile() function assigns the bucket numbers as equally as possible and returns a value from 1 to <num_buckets> for each row.

SELECT word, word_count, NTILE(2) OVER (PARTITION BY corpus ORDER BY word_count DESC) ntile, FROM [publicdata:samples.shakespeare] WHERE corpus='othello' and length(word) > 10 LIMIT 5

Returns:

word	word_count	ntile
handkerchief	29	1
satisfaction	5	1
displeasure	4	1
instruments	4	2
circumstance	3	2

PERCENT_RANK()

Returns the rank of the current row, relative to the other rows in the partition. Returned values range between 0 and 1, inclusively. The first value returned is 0.0.

This window function requires ORDER BY in the OVER clause.

SELECT word, word_count, PERCENT_RANK() OVER (PARTITION BY corpus ORDER BY word_count DESC) p_rank, FROM [publicdata:samples.shakespeare] WHERE corpus='othello' and length(word) > 10 LIMIT 5

Returns:

word	word_count	p_rank
handkerchief	29	0.0
satisfaction	5	0.25
displeasure	4	0.5
instruments	4	0.5
circumstance	3	1.0

PERCENTILE_CONT(<percentile>)

Returns values that are based upon linear interpolation between the values of the group, after ordering them per the ORDER BY clause.

<percentile> must be between 0 and 1.

This window function requires ORDER BY in the OVER clause.

SELECT word, word_count, PERCENTILE_CONT(0.5) OVER (PARTITION BY corpus ORDER BY word_count DESC) p_cont, FROM [publicdata:samples.shakespeare] WHERE corpus='othello' and length(word) > 10 LIMIT 5

Returns:

word	word_count	p_cont
handkerchief	29	4
satisfaction	5	4
displeasure	4	4
instruments	4	4
circumstance	3	4

PERCENTILE_DISC(<percentile>)

Returns the value with the smallest cumulative distribution that's greater or equal to <percentile>.

<percentile> must be between 0 and 1.

This window function requires ORDER BY in the OVER clause.

SELECT word, word_count, PERCENTILE_DISC(0.5) OVER (PARTITION BY corpus ORDER BY word_count DESC) p_disc, FROM [publicdata:samples.shakespeare] WHERE corpus='othello' and length(word) > 10 LIMIT 5

Returns:

word	word_count	p_disc
handkerchief	29	4
satisfaction	5	4
displeasure	4	4
instruments	4	4
circumstance	3	4

RANK()

Returns the integer rank of a value in a group of values. The rank is calculated based on comparisons with other values in the group.

Tied values display as the same rank. The rank of the next value is incremented according to how many tied values occurred before it. For example, if two values tie for rank 2, the next ranked value is 4, not 3. If you prefer no gaps in the ranking list, use dense_rank().

This window function requires ORDER BY in the OVER clause.

SELECT word, word_count, RANK() OVER (PARTITION BY corpus ORDER BY word_count DESC) rank, FROM [publicdata:samples.shakespeare] WHERE corpus='othello' and length(word) > 10 LIMIT 5

Returns:

word	word_count	rank
handkerchief	29	1
satisfaction	5	2
displeasure	4	3
instruments	4	3
circumstance	3	5

RATIO_TO_REPORT(<column>)

Returns the ratio of each value to the sum of the values, as a double between 0 and 1.

SELECT word, word_count, RATIO_TO_REPORT(word_count) OVER (PARTITION BY corpus ORDER BY word_count DESC) r_to_r, FROM [publicdata:samples.shakespeare] WHERE corpus='othello' and length(word) > 10 LIMIT 5

Returns:

word	word_count	r_to_r
handkerchief	29	0.6444444444444445
satisfaction	5	0.1111111111111111
displeasure	4	0.08888888888888889
instruments	4	0.08888888888888889
circumstance	3	0.06666666666666667

ROW_NUMBER()

Returns the current row number of the query result, starting with 1.

SELECT word, word_count, ROW_NUMBER() OVER (PARTITION BY corpus ORDER BY word_count DESC) row_num, FROM [publicdata:samples.shakespeare] WHERE corpus='othello' and length(word) > 10 LIMIT 5

Returns:

word	word_count	row_num
handkerchief	29	1
satisfaction	5	2
displeasure	4	3
instruments	4	4
circumstance	3	5

Other functions

Syntax

Function	Description
`CASE WHEN when_expr1 THEN then_expr1 WHEN when_expr2 THEN then_expr2 ... ELSE else_expr END`	Use CASE to choose among two or more alternate expressions in your query. WHEN expressions must be boolean, and all the expressions in THEN clauses and ELSE clause must be compatible types.
`HASH(expr)`	Computes and returns a 64-bit hash value of the bytes of `expr` as defined by the CityHash library. Any string or integer expression is supported and the function respects `IGNORE CASE` for strings, returning case invariant values.
`IF(condition, true_return, false_return)`	Returns either `true_return` or `false_return`, depending on whether `condition` is true or false. The return values can be literals or field-derived values, but they must be the same data type. Field-derived values do not need to be included in the `SELECT` clause.
`POSITION(field)`	For repeated fields, returns the sequential position of the value within the repeated values.

Advanced examples

Scenario Description Example

Bucketing results into categories using conditionals

Scenario	Description	Example
Bucketing results into categories using conditionals	The following query uses a `CASE/WHEN` block to bucket results into "region" categories based on a list of states. If the state does not appear as an option in one of the `WHEN` statements, the state value will default to "None."	SELECT CASE WHEN state IN ('WA', 'OR', 'CA', 'AK', 'HI', 'ID', 'MT', 'WY', 'NV', 'UT', 'CO', 'AZ', 'NM') THEN 'West' WHEN state IN ('OK', 'TX', 'AR', 'LA', 'TN', 'MS', 'AL', 'KY', 'GA', 'FL', 'SC', 'NC', 'VA', 'WV', 'MD', 'DC', 'DE') THEN 'South' WHEN state IN ('ND', 'SD', 'NE', 'KS', 'MN', 'IA', 'MO', 'WI', 'IL', 'IN', 'MI', 'OH') THEN 'Midwest' WHEN state IN ('NY', 'PA', 'NJ', 'CT', 'RI', 'MA', 'VT', 'NH', 'ME') THEN 'Northeast' ELSE 'None' END as region, average_mother_age, average_father_age, state, year FROM (SELECT year, state, SUM(mother_age)/COUNT(mother_age) as average_mother_age, SUM(father_age)/COUNT(father_age) as average_father_age FROM publicdata:samples.natality WHERE father_age < 99 GROUP BY year, state) ORDER BY year LIMIT 5; Returns: +--------+--------------------+--------------------+-------+------+ \| region \| average_mother_age \| average_father_age \| state \| year \| +--------+--------------------+--------------------+-------+------+ \| South \| 24.342600163532296 \| 27.683769419460344 \| AR \| 1969 \| \| West \| 25.185041908446163 \| 28.268214055448098 \| AK \| 1969 \| \| West \| 24.780776677578217 \| 27.831181063905248 \| CA \| 1969 \| \| West \| 25.005834769924412 \| 27.942978384829598 \| AZ \| 1969 \| \| South \| 24.541730952905738 \| 27.686430093306885 \| AL \| 1969 \| +--------+--------------------+--------------------+-------+------+
Simulating a Pivot Table	Use conditional statements to organize the results of a subselect query into rows and columns. In the example below, results from a search for most revised Wikipedia articles that start with the value 'Google' are organized into columns where the revision counts are displayed if they meet various criterea.	SELECT page_title, /* Populate these columns as True or False, / / depending on the condition / IF (page_title CONTAINS 'search', INTEGER(total), 0) AS search, IF (page_title CONTAINS 'Earth' OR page_title CONTAINS 'Maps', INTEGER(total), 0) AS geo, FROM / Subselect to return top revised Wikipedia articles / / containing 'Google', followed by additional text. / (SELECT TOP (title, 5) as page_title, COUNT () as total FROM [publicdata:samples.wikipedia] WHERE REGEXP_MATCH (title, r'^Google.+') AND wp_namespace = 0 ); Returns: +---------------+--------+------+ \| page_title \| search \| geo \| +---------------+--------+------+ \| Google search \| 4261 \| 0 \| \| Google Earth \| 0 \| 3874 \| \| Google Chrome \| 0 \| 0 \| \| Google Maps \| 0 \| 2617 \| \| Google bomb \| 0 \| 0 \| +---------------+--------+------+
Using HASH to select a random sample of your data	Some queries can provide a useful result using random subsampling of the result set. To retrieve a random sampling of values, use the `HASH` function to return results in which the modulo "n" of the hash equals zero. For example, the following query will find the `HASH()` of the "title" value, and then checks if that value modulo "2" is zero. This should result in about 50% of the values being labeled as "sampled." To sample fewer values, increase the value of the modulo operation from "2" to something larger.	SELECT title, HASH(title) AS hash_value, IF(HASH(title) % 2 == 0, 'True', 'False') AS included_in_sample FROM [publicdata:samples.wikipedia] WHERE wp_namespace = 0 LIMIT 5;

The following query uses a CASE/WHEN block to bucket results into "region" categories based on a list of states. If the state does not appear as an option in one of the WHEN statements, the state value will default to "None."

SELECT
  CASE
    WHEN state IN ('WA', 'OR', 'CA', 'AK', 'HI', 'ID',
                   'MT', 'WY', 'NV', 'UT', 'CO', 'AZ', 'NM')
      THEN 'West'
    WHEN state IN ('OK', 'TX', 'AR', 'LA', 'TN', 'MS', 'AL',
                   'KY', 'GA', 'FL', 'SC', 'NC', 'VA', 'WV', 
                   'MD', 'DC', 'DE')
      THEN 'South'
    WHEN state IN ('ND', 'SD', 'NE', 'KS', 'MN', 'IA',
                   'MO', 'WI', 'IL', 'IN', 'MI', 'OH')
      THEN 'Midwest'
    WHEN state IN ('NY', 'PA', 'NJ', 'CT',
                   'RI', 'MA', 'VT', 'NH', 'ME')
      THEN 'Northeast'
    ELSE 'None'
  END as region,
  average_mother_age,
  average_father_age,
  state, year
FROM
  (SELECT
     year, state,
     SUM(mother_age)/COUNT(mother_age) as average_mother_age,
     SUM(father_age)/COUNT(father_age) as average_father_age
   FROM
     publicdata:samples.natality
   WHERE
     father_age < 99
   GROUP BY
     year, state)
ORDER BY
  year
LIMIT 5;

Returns:

+--------+--------------------+--------------------+-------+------+
| region | average_mother_age | average_father_age | state | year |
+--------+--------------------+--------------------+-------+------+
| South  | 24.342600163532296 | 27.683769419460344 | AR    | 1969 |
| West   | 25.185041908446163 | 28.268214055448098 | AK    | 1969 |
| West   | 24.780776677578217 | 27.831181063905248 | CA    | 1969 |
| West   | 25.005834769924412 | 27.942978384829598 | AZ    | 1969 |
| South  | 24.541730952905738 | 27.686430093306885 | AL    | 1969 |
+--------+--------------------+--------------------+-------+------+

Simulating a Pivot Table

Use conditional statements to organize the results of a subselect query into rows and columns. In the example below, results from a search for most revised Wikipedia articles that start with the value 'Google' are organized into columns where the revision counts are displayed if they meet various criterea.

SELECT
  page_title,
  /* Populate these columns as True or False, */
  /*  depending on the condition */
  IF (page_title CONTAINS 'search', 
      INTEGER(total), 0) AS search,
  IF (page_title CONTAINS 'Earth' OR 
      page_title CONTAINS 'Maps', INTEGER(total), 0) AS geo,
FROM
  /* Subselect to return top revised Wikipedia articles */
  /* containing 'Google', followed by additional text. */
  (SELECT
    TOP (title, 5) as page_title,
    COUNT (*) as total
   FROM
     [publicdata:samples.wikipedia]
   WHERE
     REGEXP_MATCH (title, r'^Google.+') AND wp_namespace = 0
  );

Returns:

+---------------+--------+------+
|  page_title   | search | geo  |
+---------------+--------+------+
| Google search |   4261 |    0 |
| Google Earth  |      0 | 3874 |
| Google Chrome |      0 |    0 |
| Google Maps   |      0 | 2617 |
| Google bomb   |      0 |    0 |
+---------------+--------+------+

Using HASH to select a random sample of your data

Some queries can provide a useful result using random subsampling of the result set. To retrieve a random sampling of values, use the HASH function to return results in which the modulo "n" of the hash equals zero.

For example, the following query will find the HASH() of the "title" value, and then checks if that value modulo "2" is zero. This should result in about 50% of the values being labeled as "sampled." To sample fewer values, increase the value of the modulo operation from "2" to something larger.

SELECT
  title,
  HASH(title) AS hash_value,
  IF(HASH(title) % 2 == 0, 'True', 'False') 
    AS included_in_sample
FROM
  [publicdata:samples.wikipedia]
WHERE
  wp_namespace = 0
LIMIT 5;

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 3.0 License, and code samples are licensed under the Apache 2.0 License. For details, see our Site Policies.

Google BigQuery Feedback on this document

Query Reference

Contents

Query syntax

SELECT clause

WITHIN clause

FROM clause

FLATTEN clause

JOIN clause

WHERE clause

GROUP BY clause

HAVING clause

ORDER BY clause

LIMIT clause

Supported functions and operators

Aggregate functions

Syntax

TOP() function

Advanced examples

Arithmetic operators

Syntax

Bitwise operators

Syntax

Casting functions

Syntax

Comparison functions

Syntax

Date and time functions

Syntax

Advanced examples

IP functions

Syntax

Logical operators

Syntax

Mathematical functions

Syntax

Advanced examples

Regular expression functions

Syntax

Advanced examples

String functions

Syntax

URL functions

Syntax

Advanced examples

Window functions

Syntax

Other functions

Syntax

Advanced examples

Authentication required

Signing you in...

Google BigQuery

Feedback on this document