The document describes how to run synchronous and asynchronous queries from the BigQuery API.
Contents
- Overview
- Additional limits
- Synchronous queries
- Asynchronous queries
- Interactive and batch queries
- Query caching
- Returning large query results
- Table decorators
Overview
Queries are written in BigQuery's SQL dialect. BigQuery supports both synchronous and asynchronous query methods. Both methods are handled by a job, but the "synchronous" method exposes a timeout value that waits until the job has finished before returning.
All query results are saved to a table, which can be either persistent or temporary:
- A persistent table can be a new or existing table in any dataset in which you have WRITE privileges
- A temporary table is a randomly named table saved in a special dataset; The table has a lifetime of approximately 24 hours. Temporary tables are not available for sharing, and are not visible using any of the standard list or other table manipulation methods.
The following table summarizes the possible combinations of query type and results table lifetime:
Results Table Lifetime | Query Type |
|
---|---|---|
Synchronous |
Asynchronous |
|
Permanent Table | Not available in a single call. You can call jobs.insert and specify a results table, then call jobs.getQueryResults on that job, which will wait for the query to finish, then return results. details... | Create a query job and specify the table into which store the results. This can be a new table, or an existing table that you can either overwrite or append results to. details...
|
Temporary Table | Data stored in a temporary table available only to the current user. Table lifetime is about 24 hours. details...
|
Create a query job but do not specify an output table. Results are stored in a temporary table with a lifetime of about 24 hours. details...
|
Additional limits
The following additional limits apply for querying data.
- Maximum tables per query: 1,000
- Maximum query length: 256 KB
Synchronous queries
All synchronous query results are saved to an anonymous table with a lifetime of about 24 hours. If you want to save the data for longer than that, re-run the query asynchronously and specify a permanent destination table. Synchronous queries are handled by a job, as all queries are, and will appear in your jobs history.
- Call jobs.query with
your query string. This method takes an optional timeout period; if set to 0 or not specified, the method will default to 10 seconds.
- If the query returns within the specified timeout period, the method will return the first page of results. For additional results, call jobs.getQueryResults as described next.
- If the query timeout expires before the query has finished, the method will return
jobComplete=false
,, and you must call jobs.getQueryResults as described next. The query job will continue to run even after the timeout period until it finishes, either successfully or because an error has occurred.
- [If necessary] Call jobs.getQueryResults to page through additional results, to get results from a query that exceeded its timeout, or to examine any query results until the temporary table is deleted. This method lets you specify a start row, and also takes a timeout that behaves the same as the jobs.query timeout to allow waiting if the job is not yet complete.
Synchronous Query Examples
Java
This sample uses the Google APIs Client Library for Java.
/** * Runs a synchronous BigQuery query and displays the result. * * @param bigquery An authorized BigQuery client * @param projectId The current project id * @param query A String containing a BigQuery SQL statement * @param out A PrintStream for output, normally System.out */ static void runQueryRpcAndPrint( Bigquery bigquery, String projectId, String query, PrintStream out) throws IOException { QueryRequest queryRequest = new QueryRequest().setQuery(query); QueryResponse queryResponse = bigquery.jobs().query(projectId, queryRequest).execute(); if (queryResponse.getJobComplete()) { printRows(queryResponse.getRows(), out); if (null == queryResponse.getPageToken()) { return; } } // This loop polls until results are present, then loops over result pages. String pageToken = null; while (true) { GetQueryResultsResponse queryResults = bigquery.jobs() .getQueryResults(projectId, queryResponse.getJobReference().getJobId()) .setPageToken(pageToken).execute(); if (queryResults.getJobComplete()) { printRows(queryResults.getRows(), out); pageToken = queryResults.getPageToken(); if (null == pageToken) { return; } } } } private static void printRows(List<TableRow> rows, PrintStream out) { if (rows != null) { for (TableRow row : rows) { for (TableCell cell : row.getF()) { // Data.isNull() is the recommended way to check for the 'null object' in TableCell. out.printf("%s, ", Data.isNull(cell.getV()) ? "null" : cell.getV()); } out.println(); } } }
Python
This sample uses the Google APIs Client Library for Python.
# Run a synchronous query, save the results to a table, overwriting the # existing data, and print the first page of results. # Default timeout is to wait until query finishes. def runSyncQuery (service, projectId, datasetId, timeout=0): try: print 'timeout:%d' % timeout jobCollection = service.jobs() queryData = {'query':'SELECT word,count(word) AS count FROM publicdata:samples.shakespeare GROUP BY word;', 'timeoutMs':timeout} queryReply = jobCollection.query(projectId=projectId, body=queryData).execute() jobReference=queryReply['jobReference'] # Timeout exceeded: keep polling until the job is complete. while(not queryReply['jobComplete']): print 'Job not yet complete...' queryReply = jobCollection.getQueryResults( projectId=jobReference['projectId'], jobId=jobReference['jobId'], timeoutMs=timeout).execute() # If the result has rows, print the rows in the reply. if('rows' in queryReply): print 'has a rows attribute' printTableData(queryReply, 0) currentRow = len(queryReply['rows']) # Loop through each page of data while('rows' in queryReply and currentRow < queryReply['totalRows']): queryReply = jobCollection.getQueryResults( projectId=jobReference['projectId'], jobId=jobReference['jobId'], startIndex=currentRow).execute() if('rows' in queryReply): printTableData(queryReply, currentRow) currentRow += len(queryReply['rows']) except AccessTokenRefreshError: print ("The credentials have been revoked or expired, please re-run" "the application to re-authorize") except HttpError as err: print 'Error in runSyncQuery:', pprint.pprint(err.content) except Exception as err: print 'Undefined error' % err
Asynchronous queries
Asynchronous queries are run by calling bigquery.jobs.insert and including the configuration.query
property in the request. The method returns immediately with a job ID, and you must either request the job periodically and check status, or call jobs.getQueryResults, which will return when the job is complete.
Asynchronous queries are always saved to a table: this can be a new table, an existing table, or a temporary table. You can choose whether to append or overwrite data in an existing table, and whether to create a new table if none exists by that name.
- Create a query job by calling bigquery.jobs.insert. If saving results to a permanent table, populate the
destinationTable
property with target table information; if saving the results to a temporary table, omit that property. The method returns immediately. Save the jobId returned in the response - Check job status by calling bigquery.jobs.get with
the jobId by the insert request. Check the
status.state
value to see if the job has completed successfully; if not, examinestatus.errorResult
to learn why the job failed. - Once the job reports a successful result, browse through the table data of the results table that you specified. If you got an error message, examine the error and rerun as appropriate.
Tip: An alternative to steps 2 and 3, polling job status and calling tabledata.list, is to call jobs.getQueryResults(jobId) with no timeout, which will wait until the job is done and then return the query results at any row index that you like.
Asynchronous Query Examples
Java
This sample uses the Google APIs Client Library for Java.
static void runQueryJobAndPrint( Bigquery bigquery, String projectId, String query, PrintStream out) throws IOException, InterruptedException { Job queuedJob = startQuery(bigquery, projectId, query, createJobId(query)); pollUntilQueryDone(bigquery, queuedJob.getJobReference(), 1); printAllRows(bigquery, queuedJob, out); } private static void printAllRows(Bigquery bigquery, Job queuedJob, PrintStream out) throws IOException { String pageToken = null; while (true) { // queuedJob.getConfiguration().getQuery().getDestinationTable() points to a temporary table // automatically created to hold results. TableDataList tableDataList = bigquery.tabledata().list( queuedJob.getConfiguration().getQuery().getDestinationTable().getProjectId(), queuedJob.getConfiguration().getQuery().getDestinationTable().getDatasetId(), queuedJob.getConfiguration().getQuery().getDestinationTable().getTableId()) .setPageToken(pageToken).execute(); printRows(tableDataList.getRows(), out); pageToken = tableDataList.getPageToken(); if (null == pageToken) { break; } } } /** * Creates and starts a job that queries a table. * * @param bigquery an authorized BigQuery client * @param projectId a String containing the current Project ID * @param query the actual query string * @param jobId unique identifier for the job being inserted */ private static Job startQuery(Bigquery bigquery, String projectId, String query, String jobId) throws IOException { System.out.format("\nInserting Query Job: %s\n", query); JobConfigurationQuery queryConfig = new JobConfigurationQuery().setQuery(query); JobConfiguration config = new JobConfiguration().setQuery(queryConfig); Job job = new Job().setId(jobId).setConfiguration(config); Job queuedJob = bigquery.jobs().insert(projectId, job).execute(); System.out.format("\nJob ID of Query Job is: %s\n", jobId); return queuedJob; } /** * Polls BigQuery until the job's status is "DONE". * * @param bigquery an authorized BigQuery client * @param jobReference a JobReference, containing the job and project id * @param pauseSeconds seconds to wait between poll attempts */ private static void pollUntilQueryDone( Bigquery bigquery, JobReference jobReference, int pauseSeconds) throws IOException, InterruptedException { // Variables that track the total query time long startTime = System.currentTimeMillis(); long elapsedTime; while (true) { Job pollJob = bigquery.jobs().get(jobReference.getProjectId(), jobReference.getJobId()).execute(); elapsedTime = System.currentTimeMillis() - startTime; System.out.format("Job status (%dms) %s: %s\n", elapsedTime, jobReference.getJobId(), pollJob.getStatus().getState()); if (pollJob.getStatus().getState().equals("DONE")) { System.out.format( "Job done, processed %s bytes.\n", pollJob.getStatistics().getTotalBytesProcessed()); return; } // Pause execution for pauseSeconds before polling job status again, to // reduce unnecessary calls to the BigQuery API and lower overall // application bandwidth. Thread.sleep(pauseSeconds * 1000); } }
Python
This sample uses the Google APIs Client Library for Python.
def runAsyncQuery (service, projectId): try: jobCollection = service.jobs() queryString = 'SELECT corpus FROM publicdata:samples.shakespeare GROUP BY corpus;' jobData = { 'configuration': { 'query': { 'query': queryString, } } } insertResponse = jobCollection.insert(projectId=projectId, body=jobData).execute() # Get query results. Results will be available for about 24 hours. currentRow = 0 queryReply = jobCollection.getQueryResults( projectId=projectId, jobId=insertResponse['jobReference']['jobId'], startIndex=currentRow).execute() while(('rows' in queryReply) and currentRow < queryReply['totalRows']): printTableData(queryReply, currentRow) currentRow += len(queryReply['rows']) queryReply = jobCollection.getQueryResults( projectId=projectId, jobId=queryReply['jobReference']['jobId'], startIndex=currentRow).execute() except HttpError as err: print 'Error in runAsyncTempTable:', pprint.pprint(err.resp) except Exception as err: print 'Undefined error' % err
Interactive and batch queries
By default, BigQuery runs interactive queries, meaning that the query is executed as soon as possible. Interactive queries count towards your concurrent rate limit and your daily rate limit but, alternatively, you can also choose to run your queries as batch queries. Batch queries are executed at a lower priority than interactive queries, taking from 30 minutes to 3 hours after the query has been issued before it is executed. However, batch queries do not count towards your concurrent rate limit and are charged at a lower rate than standard queries. If you have queries that aren't time sensitive, e.g. if you regularly run a set of standard queries to calculate daily aggregates, you can use batch querying to save money and quota. This can also be ideal if you tend to run queries overnight, so you do not mind the possible 3 hour delay in running these queries.
To use batch querying in your applications, include the priority
field in your code. For example, use the priority
field for asynchronous queries to batch them like so:
Java
This sample uses the Google APIs Client Library for Java.
static void runBatchedQueryAndPrint( Bigquery bigquery, String projectId, String query, PrintStream out) throws IOException, InterruptedException { Job queuedJob = startBatchQuery(bigquery, projectId, query, createJobId(query)); // Thread.sleep(30 * 60 * 1000); // Wait 30 minutes before pollling for a batched job. pollUntilQueryDone(bigquery, queuedJob.getJobReference(), 300); printAllRows(bigquery, queuedJob, out); } /** * A job id is an unique identifier for jobs submitted to BigQuery. It can be automatically * created, but it is a recommended practice to set them manually for posterior tracking, and * preventing duplicate jobs. */ private static String createJobId(String query) { return String.format("job_%d_%d", System.currentTimeMillis(), query.hashCode()); } /** * Creates and starts a job that queries a table, with BATCH priority. * * @param bigquery an authorized BigQuery client * @param projectId a String containing the current Project ID * @param query the actual query string * @param jobId unique identifier for the job being inserted */ private static Job startBatchQuery( Bigquery bigquery, String projectId, String query, String jobId) throws IOException { System.out.format("\nInserting Batched Query Job: %s\n", query); JobConfigurationQuery queryConfig = new JobConfigurationQuery().setQuery(query); queryConfig.setPriority("BATCH"); // Set query priority to batch. JobConfiguration config = new JobConfiguration().setQuery(queryConfig); Job job = new Job().setId(jobId).setConfiguration(config); Job queuedJob = bigquery.jobs().insert(projectId, job).execute(); System.out.format("\nJob ID of Query Job is: %s\n", jobId); return queuedJob; }
Python
This sample uses the Google APIs Client Library for Python.
def runAsyncQueryBatch(service, projectId): try: jobCollection = service.jobs() queryString = 'SELECT corpus FROM publicdata:samples.shakespeare GROUP BY corpus;' jobData = { 'configuration': { 'query': { 'query': queryString, 'priority': 'BATCH' # Set priority to BATCH } } } insertResponse = jobCollection.insert(projectId=projectId, body=jobData).execute() import time while True: status = jobCollection.get(projectId=projectId, jobId=insertResponse['jobReference']['jobId']).execute() currentStatus = status['status']['state'] if 'DONE' == currentStatus: currentRow = 0 queryReply = jobCollection.getQueryResults( projectId=projectId, jobId=insertResponse['jobReference']['jobId'], startIndex=currentRow).execute() while(('rows' in queryReply) and currentRow < queryReply['totalRows']): printTableData(queryReply, currentRow) currentRow += len(queryReply['rows']) queryReply = jobCollection.getQueryResults( projectId=projectId, jobId=queryReply['jobReference']['jobId'], startIndex=currentRow).execute() else: print 'Waiting for the query to complete...' print 'Current status: ' + currentStatus print time.ctime() time.sleep(10) except HttpError as err: print 'Error in runAsyncTempTable:', pprint.pprint(err.resp) except Exception as err: print 'Undefined error: %s' % err
When you run these samples, you will notice that your job status remains PENDING
for at least 30 minutes. To see how an interactive query might work, change the priority
setting to INTERACTIVE
or remove it altogether.
Query caching
BigQuery caches query results on a best-effort basis for increased performance. You aren't charged for cached queries, but cached queries are subject to the same quota policies as non-cached queries. BigQuery caches all queries that don't specify a destination table, including both interactive and batch queries.
Opting out of query caching
If you don't want your query to pull from the cache, you can take one of the following steps:
-
If using the BigQuery web UI, click the Enable Options button and ensure that Use Cached Results is unchecked.
- If using the BigQuery API, set the
useQueryCache
property tofalse
.
Ensuring cached query results
If you use the jobs.insert() function to run a query, you can ensure that the job returns a query result from the cache, if it exists, by setting the createDisposition property of the job configuration to CREATE_NEVER
.
If the query result doesn't exist in the cache, a NOT_FOUND
error returns.
Checking to see if BigQuery returned a cached result
There are two ways to determine if BigQuery returned a cached result:
-
If using the BigQuery web UI, the result string does not contain information about the number of processed bytes, and displays the word "cached".
-
If using the BigQuery API, the
cacheHit
property in the query result is set totrue
.
Returning large query results
Normally, queries have a maximum response size. If you plan to run a query that might return larger results, you can set allowLargeResults
to true
in your job configuration.
Queries that return large results will take longer to execute, even if the result set is small, and are subject to additional limitations:
- You must specify a destination table.
- You can't specify a top-level
ORDER BY
clause. - Window functions can't return large query results.
Table decorators
Normally, BigQuery performs a full column scan when running a query. You can use table decorators to perform a more cost-effective query of a subset of your data.
BigQuery supports the following table decorators:
Table decorator | Description |
---|---|
@<time> |
References a snapshot of the table at |
@-<time> |
References table data added in the last |
@<time1>-<time2> |
References table data added between |
@<time>- |
References data added to the table after |