Google's BigQuery Storage API can read from the temporary tables created by basic queries that involve only SELECT, FROM and WHERE.
What I am seeing is that when you retrieve an ordered set of rows with an ORDER BY clause, the temporary table that gets created cannot be read by the BigQuery Storage API.
Look at the code examples below.
Take the following query:
sql = """SELECT name FROM `bigquery-public-data.usa_names.usa_1910_current` LIMIT 1000"""
If you run it with this BigQuery Python API code:
from google.cloud import bigquery

bq_client = bigquery.Client("myproject")  # << Change to your project
query_job = bq_client.query(
    sql,
    location='US')
query_job.result()  # Wait for the query to finish so the temporary table exists.
project_id = query_job.destination.project
dataset_id = query_job.destination.dataset_id
table_id = query_job.destination.table_id
print("Destination table: " + project_id + "." + dataset_id + "." + table_id)
...then you get the destination table.
You can now pass this destination table to the BigQuery Storage API to fetch the results over RPC:
from google.cloud import bigquery_storage_v1beta1

client = bigquery_storage_v1beta1.BigQueryStorageClient()
table_ref = bigquery_storage_v1beta1.types.TableReference()
table_ref.project_id = project_id
table_ref.dataset_id = dataset_id
table_ref.table_id = table_id
read_options = bigquery_storage_v1beta1.types.TableReadOptions()
read_options.selected_fields.append("name")
parent = "projects/{}".format(project_id)
session = client.create_read_session(
    table_ref, parent, read_options=read_options
)  # API request.
reader = client.read_rows(
    bigquery_storage_v1beta1.types.StreamPosition(stream=session.streams[0])
)
rows = reader.rows(session)
This works fine.
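As an aside (a minimal sketch, not part of the original report; it assumes pandas and fastavro are installed), the same stream can be materialized as a DataFrame instead of iterating the rows manually:

df = reader.to_dataframe(session)  # Consumes the stream; skip this if you iterate `rows` instead.
print(df.head())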
Now change the query in sql = <yourquery> to:
sql = """SELECT name FROM `bigquery-public-data.usa_names.usa_1910_current` ORDER BY name ASC LIMIT 1000"""
and you will get the following error from the BigQuery Storage API part of the code:
Table 'myproject:mydataset.temptable' has a storage format that is not supported.
This means that the ORDER BY clause in the query adds some kind of complexity that makes the temporary table unreadable for the Storage API.
Questions: 1) Any ideas on how to fix this, or is it a hard limitation of the Storage API? 2) If ORDER BY causes problems, what is the full scope of queries that create temporary tables the Storage API cannot read?
Answer 0 (score: 0)
You can read temporary tables created by queries with ORDER BY, JOIN, etc. using bigquery_storage.BigQueryReadClient. Below is working code; the temporary table here is created with a JOIN.
from google.cloud import bigquery
from google.cloud.bigquery_storage import BigQueryReadClient
from google.cloud.bigquery_storage import types

try:
    import fastavro  # Needed to decode the AVRO-encoded rows.
except ImportError:
    fastavro = None

# `credentials` and `your_project_id` are your own auth object and project ID.
bqclient = bigquery.Client(credentials=credentials, project=your_project_id)
client = BigQueryReadClient(credentials=credentials)

sql = """SELECT s.id, s.name, d.dept
FROM `sbx-test.EMP.emp01` s
JOIN `sbx-test.EMP.dept` d
ON s.id = d.id"""
query_job = bqclient.query(sql)
query_job.result()  # Wait for the query so the temporary destination table exists.
project_id = query_job.destination.project
dataset_id = query_job.destination.dataset_id
table_id = query_job.destination.table_id
table = "projects/{}/datasets/{}/tables/{}".format(
    project_id, dataset_id, table_id
)
requested_session = types.ReadSession()
requested_session.table = table
requested_session.data_format = types.DataFormat.AVRO
requested_session.read_options.selected_fields = ["name", "dept"]
parent = "projects/{}".format(project_id)
session = client.create_read_session(
    parent=parent,
    read_session=requested_session,
    max_stream_count=1,
)
reader = client.read_rows(session.streams[0].name)
rows = reader.rows(session)
names = set()
depts = set()
for row in rows:
    names.add(row["name"])
    depts.add(row["dept"])
print("Got unique employees {} and departments {}".format(names, depts))