BigQuery Storage API cannot read from temporary tables created by ordered (ORDER BY) queries

Date: 2019-05-18 14:27:42

Tags: api google-bigquery storage

Google's BigQuery Storage API can read from temporary tables created by basic queries that involve only SELECT, FROM, and WHERE.

What I am seeing is that when you use an ORDER BY clause to retrieve an ordered set of rows, the BigQuery Storage API cannot read the temporary table that gets created.

Take a look at the code example below.

Let's take the following query:

sql = """SELECT name FROM `bigquery-public-data.usa_names.usa_1910_current` LIMIT 1000"""

If you run it with this BigQuery Python API code:

from google.cloud import bigquery

bq_client = bigquery.Client("myproject")  ## << Change to your project

query_job = bq_client.query(
    sql,
    location='US')  

project_id = query_job.destination.project
dataset_id = query_job.destination.dataset_id
table_id = query_job.destination.table_id

print("Destination table: " + project_id + "." + dataset_id + "." + table_id)

...then you get the destination table.
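For reference, a query job without an explicit destination writes to a cached temporary table in a hidden anonymous dataset, so the printed line looks roughly like this (the names here are illustrative, not real identifiers):

Destination table: myproject._b1c2d3e4f5a6b7c8.anon0123456789abcdef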

You can then pass this destination table to the BigQuery Storage API to fetch the results via RPC:


from google.cloud import bigquery_storage_v1beta1

client = bigquery_storage_v1beta1.BigQueryStorageClient()

table_ref = bigquery_storage_v1beta1.types.TableReference()
table_ref.project_id = project_id
table_ref.dataset_id = dataset_id
table_ref.table_id = table_id

read_options = bigquery_storage_v1beta1.types.TableReadOptions()
read_options.selected_fields.append("name")

parent = "projects/{}".format(project_id)
session = client.create_read_session(
    table_ref, parent, read_options=read_options
)  # API request.

reader = client.read_rows(
    bigquery_storage_v1beta1.types.StreamPosition(stream=session.streams[0])
)

rows = reader.rows(session)

This works fine.
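To actually pull the data down, iterate the rows the reader yields, for example:

names = [row["name"] for row in rows]
print(len(names))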

Now change the query in sql = <yourquery> to

sql = """SELECT name FROM `bigquery-public-data.usa_names.usa_1910_current` ORDER BY name ASC LIMIT 1000"""

and you get the following error from the BigQuery Storage API part of the code:

Table 'myproject:mydataset.temptable' has a storage format that is not supported.

This implies that the ORDER BY clause in the query adds some kind of complexity that makes the resulting temporary table unreadable for the Storage API.

Questions: 1) Any ideas on how to work around this, or is this a real limitation of the Storage API? 2) If ORDER BY causes problems, what is the full range of queries that create temporary tables unreadable by the Storage API?
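For reference, one commonly suggested workaround is to give the query job an explicit destination table, so the result is written as a regular table the Storage API can read. A minimal sketch, assuming a dataset mydataset that you own (the table name is a placeholder):

job_config = bigquery.QueryJobConfig(
    destination=bq_client.dataset("mydataset").table("ordered_results")
)
query_job = bq_client.query(sql, location='US', job_config=job_config)
query_job.result()  # wait for completion; the result is now an ordinary table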

1 Answer:

Answer 0 (score: 0)

We can read temporary tables created by queries with ORDER BY, JOIN, etc. using bigquery_storage.BigQueryReadClient. Below is working code; I created the temporary table using a JOIN.

from google.cloud import bigquery
from google.cloud import bigquery_storage
from google.cloud.bigquery_storage import types

# credentials and your_project_id are assumed to be defined elsewhere
bqclient = bigquery.Client(credentials=credentials, project=your_project_id)
client = bigquery_storage.BigQueryReadClient(credentials=credentials)

try:
    import fastavro  # required to decode the AVRO-encoded rows
except ImportError:
    fastavro = None

sql = """SELECT s.id, s.name, d.dept  FROM sbx-test.EMP.emp01 s join sbx-test.EMP.dept d 
on s.id = d.id"""
query_job = bqclient.query(sql)

project_id = query_job.destination.project
dataset_id = query_job.destination.dataset_id
table_id = query_job.destination.table_id

table = "projects/{}/datasets/{}/tables/{}".format(
    project_id, dataset_id, table_id
)
requested_session = types.ReadSession()
requested_session.table = table
requested_session.data_format = types.DataFormat.AVRO

requested_session.read_options.selected_fields = ["name", "dept"]

parent = "projects/{}".format(project_id)
session = client.create_read_session(
    parent=parent,
    read_session=requested_session,
    max_stream_count=1,
)
reader = client.read_rows(session.streams[0].name)
rows = reader.rows(session)

names = set()
depts = set()
for row in rows:
    names.add(row["name"])
    depts.add(row["dept"])
    
print("Got unique employees {} and departments {}".format(names, depts))