"Resources exceeded" during query execution in a daily batch job

Date: 2017-10-07 19:02:29

Tags: python resources google-bigquery

The BigQuery Python client is already configured to use standard SQL:

query_job = self.client.run_async_query(str(uuid.uuid4()), query_str)
query_job.use_query_cache = True   # enable the query cache
query_job.use_legacy_sql = False   # use standard SQL

However, partway through the batch job the queries start failing with the 400 error below, complaining that resources were exceeded during execution. The query is fairly simple: fetch the rows in a 30-minute range, ordered by time, from a daily partitioned table (roughly 40 million rows per day, 15-20 GB of data in total). Since each query covers a 30-minute range, the same query runs 48 times to cover one day. Each query returns 500k to 1.5 million rows, a few hundred MB of data. The query below initially executes fine, but after only 10-20 iterations the "Resources exceeded" error pops up.
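The batching described above (48 half-hour slices per day) can be sketched as follows; `half_hour_windows` is a hypothetical helper, not part of the original code:

```python
from datetime import datetime, timedelta

def half_hour_windows(day):
    """Yield (start, end) pairs covering `day` in 48 half-hour slices.

    Hypothetical helper mirroring the batching described above.
    """
    start = datetime(day.year, day.month, day.day)
    step = timedelta(minutes=30)
    for i in range(48):
        lo = start + i * step
        yield lo, lo + step

windows = list(half_hour_windows(datetime(2017, 9, 16)))
# windows[31] is the 15:30:00 - 16:00:00 slice used in the sample query
```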

Could any BigQuery experts or developers who have hit the same problem offer some hints on what might be going wrong here? Much appreciated!

Roy

SELECT
  user_id,
  client_ip,
  url,
  req_ts,
  req_body,
  resp_body,
  status
FROM
  xxxx.table
WHERE
  DATE(_PARTITIONTIME) = '2017-09-16'
  AND req_ts >= '2017-09-16 15:30:00'
  AND req_ts < '2017-09-16 16:00:00'
ORDER BY
  req_ts
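One way the per-window query string might be assembled (a sketch; the table name and bounds are the placeholders from the example above, and `build_query` is a hypothetical helper):

```python
# Template for the per-window query; xxxx.table is the placeholder table
# name from the question.
QUERY_TEMPLATE = """SELECT user_id, client_ip, url, req_ts, req_body, resp_body, status
FROM xxxx.table
WHERE DATE(_PARTITIONTIME) = '{day}'
  AND req_ts >= '{start}'
  AND req_ts < '{end}'
ORDER BY req_ts"""

def build_query(day, start, end):
    # Hypothetical helper; in production, BigQuery query parameters are
    # preferable to string formatting (avoids quoting/injection issues).
    return QUERY_TEMPLATE.format(day=day, start=start, end=end)

sql = build_query("2017-09-16", "2017-09-16 15:30:00", "2017-09-16 16:00:00")
```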



  File "../datastore/bigquery.py", line 202, in sendQuery
    query_job.result()  # Wait for job to complete
  File "/usr/local/lib/python2.7/dist-packages/google/cloud/bigquery/job.py", line 492, in result
    return super(_AsyncJob, self).result(timeout=timeout)
  File "/usr/local/lib/python2.7/dist-packages/google/api/core/future/polling.py", line 104, in result
    self._blocking_poll(timeout=timeout)
  File "/usr/local/lib/python2.7/dist-packages/google/api/core/future/polling.py", line 84, in _blocking_poll
    retry_(self._done_or_raise)()
  File "/usr/local/lib/python2.7/dist-packages/google/api/core/retry.py", line 258, in retry_wrapped_func
    on_error=on_error,
  File "/usr/local/lib/python2.7/dist-packages/google/api/core/retry.py", line 175, in retry_target
    return target()
  File "/usr/local/lib/python2.7/dist-packages/google/api/core/future/polling.py", line 62, in _done_or_raise
    if not self.done():
  File "/usr/local/lib/python2.7/dist-packages/google/cloud/bigquery/job.py", line 1301, in done
    self._query_results = self._client.get_query_results(self.name)
  File "/usr/local/lib/python2.7/dist-packages/google/cloud/bigquery/client.py", line 196, in get_query_results
    method='GET', path=path, query_params=extra_params)
  File "/usr/local/lib/python2.7/dist-packages/google/cloud/_http.py", line 293, in api_request
    raise exceptions.from_http_response(response)
BadRequest: 400 GET https://www.googleapis.com/bigquery/v2/projects/fluted-house-161501/queries/ab8534f8-fe52-448c-84fe-b8702ee7b87c?maxResults=0: Resources exceeded during query execution: The query could not be executed in the allotted memory.

1 answer:

Answer 0 (score: 3)

The problem is in the ORDER BY, which forces the entire result set to be moved to a single worker for the final sort before the results are output. If the result set is large enough, this frequently causes "Resources exceeded during query execution".

The recommendation here is to add a LIMIT with some reasonable number. In that case, partial sorting happens on all the workers and the final sort still runs on a single node, but the result set is now small enough. Alternatively, simply remove the ORDER BY and sort on the client side.
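A minimal sketch of the second option (drop ORDER BY from the SQL and sort locally). In the real code, `rows` would come from `query_job.result()`; the `(req_ts, user_id)` tuples here are stand-in data for illustration:

```python
# Stand-in rows as (req_ts, user_id) tuples; real rows would come from
# query_job.result() after removing ORDER BY from the query.
rows = [
    ("2017-09-16 15:45:10", "u2"),
    ("2017-09-16 15:31:02", "u1"),
    ("2017-09-16 15:59:59", "u3"),
]

# Sort by req_ts on the client; ISO-style timestamp strings sort
# correctly as plain strings.
rows_sorted = sorted(rows, key=lambda r: r[0])
```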

See Order query operations to maximize performance for more on ORDER BY; in particular, check the second paragraph.