Bad Request error when querying data from BigQuery in a loop

Time: 2018-06-13 13:30:08

Tags: python google-bigquery

I am querying data from BigQuery in a loop, using the get_data_from_bq method shown below:

def get_data_from_bq(product_ids):
    format_strings = ','.join([("\"" + str(_id) + "\"") for _id in product_ids])
    query = "select productId, eventType, count(*) as count from [xyz:xyz.abc] where productId in (" + format_strings + ") and eventTime > CAST(\"" + time_thresh +"\" as DATETIME) group by eventType, productId order by productId;"
    query_job = bigquery_client.query(query, job_config=job_config)
    return query_job.result()

The first query (iteration) returns the correct data, but every subsequent query throws the exception below:

    results = query_job.result()
  File "/home/ishank/.local/lib/python2.7/site-packages/google/cloud/bigquery/job.py", line 2415, in result
    super(QueryJob, self).result(timeout=timeout)
  File "/home/ishank/.local/lib/python2.7/site-packages/google/cloud/bigquery/job.py", line 660, in result
    return super(_AsyncJob, self).result(timeout=timeout)
  File "/home/ishank/.local/lib/python2.7/site-packages/google/api_core/future/polling.py", line 120, in result
    raise self._exception
google.api_core.exceptions.BadRequest: 400 Cannot explicitly modify anonymous table xyz:_bf4dfedaed165b3ee62d8a9efa.anon1db6c519_b4ff_dbc67c17659f

Edit 1: Below is a sample query that throws the above exception. The same query runs without problems in the BigQuery console.

select productId, eventType, count(*) as count from [xyz:xyz.abc] where productId in ("168561","175936","161684","161681","161686") and eventTime > CAST("2018-05-30 11:21:19" as DATETIME) group by eventType, productId order by productId;

2 answers:

Answer 0: (score: 6)

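I had exactly the same problem. The problem is not the query itself; most likely you are reusing the same QueryJobConfig. When you run a query, unless you set a destination, BigQuery stores the result in an anonymous table that is recorded in the QueryJobConfig object. If you reuse that configuration, BigQuery tries to store the new result in the same anonymous table, which causes the error. To be honest, I don't particularly like this behavior.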

You should rewrite your code so that each query gets its own new QueryJobConfig instead of reusing a shared one.
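A minimal sketch of that change, based on the function from the question. The answer's original code block is not preserved here, so this is a reconstruction under assumptions, not the answerer's exact code: the parameters client, make_job_config and time_thresh are hypothetical additions (the question uses module-level globals), and make_job_config stands for any callable that returns a new job configuration.

```python
def get_data_from_bq(client, make_job_config, product_ids, time_thresh):
    """Run the question's query with a job config built for this call only."""
    # Quote each id so the IN (...) list matches the query in the question.
    format_strings = ",".join('"' + str(_id) + '"' for _id in product_ids)
    query = (
        "select productId, eventType, count(*) as count "
        "from [xyz:xyz.abc] "
        "where productId in (" + format_strings + ") "
        'and eventTime > CAST("' + time_thresh + '" as DATETIME) '
        "group by eventType, productId order by productId;"
    )
    # Hypothetical factory: a new config object per query, so nothing from an
    # earlier job (such as its anonymous destination table) is carried over.
    job_config = make_job_config()
    query_job = client.query(query, job_config=job_config)
    return query_job.result()
```

With the google-cloud-bigquery package, client would be a bigquery.Client() and make_job_config could be bigquery.QueryJobConfig itself, or a small function that also sets use_legacy_sql = True on the new object, which the [xyz:xyz.abc] table syntax suggests the original configuration did.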

Hope this helps!

Answer 1: (score: 1)

Edit:

Federico Bertola is right about the solution and about the temporary table that BigQuery writes to; see this link.

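When I last ran my sample code against a public table I did not get the error, but today I can reproduce it, so the symptom may be intermittent. I can confirm that Federico's suggestion resolves the error.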

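You can get the "super(QueryJob, self).result(timeout=timeout)" error when a parameter in the query string is missing its quotes. The format_strings parameter in your query looks like it may have a similar mistake. You can fix this by making sure there are escaped quotes around the parameter: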

(" + myparam + ")

should be written as

(\"" + myparam + "\")

You should check every query string that uses parameters, and start from a simpler query such as

select productId, eventType, count(*) as count from `xyz:xyz.abc`

and then extend the query step by step.
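The quoting advice above can be checked in isolation before any query is sent. The sketch below is only an illustration: quote_in_list is a hypothetical helper, not part of any BigQuery library, and it handles only plain values (it rejects quotes and backslashes instead of escaping them).

```python
def quote_in_list(values):
    """Return values as a comma-separated list of double-quoted literals."""
    quoted = []
    for value in values:
        text = str(value)
        # Refuse characters that would break out of the quoted literal.
        if '"' in text or "\\" in text:
            raise ValueError("unsupported character in value: %r" % text)
        quoted.append('"' + text + '"')
    return ",".join(quoted)


# Start with the simple query and inspect the string before adding clauses.
product_ids = ["168561", "175936", "161684"]
query = (
    "select productId, eventType, count(*) as count from [xyz:xyz.abc] "
    "where productId in (" + quote_in_list(product_ids) + ")"
)
```

Printing query at this point shows whether each id is wrapped in quotes, as in the sample query in the question.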

For the record, this worked for me:

from google.cloud import bigquery
client = bigquery.Client()
job_config = bigquery.QueryJobConfig()

def get_data_from_bq(myparam):  
    query = "SELECT word, SUM(word_count) as count FROM `publicdata.samples.shakespeare` WHERE word IN (\""+myparam+"\") GROUP BY word;"
    query_job = client.query(query, job_config=job_config) 
    return query_job.result()

mypar = "raisin"
x = 1
while (x<9):
    iterator = get_data_from_bq(mypar)
    print "==%d iteration==" % x
    x += 1