I have roughly 4 GB of data stored in Google BigQuery, in the following format:
uuid | entity_name | property | value
---------------------------------------------------------------
abc | Person | first_name | John
def | Person | age | 45
abc | Person | age | 26
def | Person | first_name | Mary
...
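I want to get paginated results ordered by uuid. However, according to the documentation, ORDER BY and GROUP BY cannot be used when the "allowLargeResults" flag is set to true, and querying a table this large of course requires that flag. Is there a workaround for this situation? I tried sorting on the client side, but after the first few pages are fetched successfully it raises the error "An existing connection was forcibly closed by the remote host".
Here is my query job: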
import uuid

query = 'SELECT * FROM [Users.Events] ORDER BY uuid'
query_request = {
    'jobReference': {
        'projectId': project_id,
        # The REST API field is named jobId (not job_id).
        'jobId': str(uuid.uuid4())
    },
    'configuration': {
        'query': {
            'query': query,
            'priority': 'BATCH' if BATCH_QUERY else 'INTERACTIVE',
            'allowLargeResults': True,
            # allowLargeResults requires a destination table.
            'destinationTable': {
                'projectId': project_id,
                'datasetId': 'CrunchBase',
                'tableId': 'AllProperties_query'
            },
            'createDisposition': 'CREATE_IF_NEEDED',
            'writeDisposition': 'WRITE_TRUNCATE',
        }
    }
}
# Submit the job, then wait for it to finish.
query_job = service.jobs().insert(
    projectId=project_id,
    body=query_request).execute(num_retries=2)
poll_job(service, query_job)
Result:
RuntimeError: {u'reason': u'resourcesExceeded', u'message': u'Resources exceeded during query execution.', u'location': u'query'}
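The poll_job helper called above is not shown in the post. As a rough sketch only, assuming it polls jobs().get until the job state is DONE and raises the job's errorResult (which would explain the RuntimeError above), it might look like this. The interval_s parameter is made up for illustration.

```python
import time


def poll_job(service, job, interval_s=1.0):
    """Block until the BigQuery job is DONE; raise if it reports an error.

    Hypothetical reconstruction: the original helper is not in the post.
    """
    ref = job['jobReference']
    while True:
        # Re-fetch the job resource and look at its status.
        status = service.jobs().get(
            projectId=ref['projectId'],
            jobId=ref['jobId']).execute(num_retries=2)['status']
        if status['state'] == 'DONE':
            if 'errorResult' in status:
                # e.g. {'reason': 'resourcesExceeded', ...}
                raise RuntimeError(status['errorResult'])
            return
        time.sleep(interval_s)
```

With a helper like this, the failure surfaces only once the job finishes, so the resourcesExceeded error comes from the query itself (the global ORDER BY), not from the polling.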
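Edit: tried sorting within partitions
If I could figure out how to partition by entity_name and order by uuid, I could solve the problem, but the following query does not work: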
SELECT
uuid, entity_name, property, value
OVER
(PARTITION BY entity_name ORDER BY uuid) AS entities
FROM [CrunchBase.AllProperties];
Result:
Query Failed
Error: Missing function in Analytic Expression at: 1.15 - 1.70
Answer 0 (score: 2)
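To answer the question in your edit: you need to actually specify an analytic function to apply over that ordered partition. Since you only want the current value for each row, you can use LEAD(x, 0).
For your query, you could write it like this: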
SELECT
uuid, entity_name,
LEAD(property, 0) OVER (PARTITION BY entity_name ORDER BY uuid) AS cur_property,
LEAD(value, 0) OVER (PARTITION BY entity_name ORDER BY uuid) AS cur_value,
FROM [CrunchBase.AllProperties]
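To see why LEAD(x, 0) just returns each row's own value, here is a small pure-Python imitation of the query above, run on the sample rows from the question. It is only an illustration of the semantics, not how BigQuery executes it; the lead and ordered_partitions helpers are made up for this sketch.

```python
from itertools import groupby
from operator import itemgetter

# Sample rows from the question: (uuid, entity_name, property, value)
rows = [
    ("abc", "Person", "first_name", "John"),
    ("def", "Person", "age", "45"),
    ("abc", "Person", "age", "26"),
    ("def", "Person", "first_name", "Mary"),
]


def lead(values, offset):
    """For each position, return the value `offset` rows ahead (None past the end)."""
    return [values[i + offset] if i + offset < len(values) else None
            for i in range(len(values))]


def ordered_partitions(rows):
    # PARTITION BY entity_name: group the rows by their entity_name.
    by_entity = sorted(rows, key=itemgetter(1))
    out = []
    for _, part in groupby(by_entity, key=itemgetter(1)):
        # ORDER BY uuid inside the partition. Python's sort is stable;
        # BigQuery makes no promise about the order of rows with equal uuid.
        part = sorted(part, key=itemgetter(0))
        cur_property = lead([r[2] for r in part], 0)  # offset 0 = current row
        cur_value = lead([r[3] for r in part], 0)
        out.extend((r[0], r[1], p, v)
                   for r, p, v in zip(part, cur_property, cur_value))
    return out


result = ordered_partitions(rows)
for row in result:
    print(row)
```

With an offset of 0 the function is an identity over the ordered partition, so the output contains the same rows as the input, grouped by entity_name and sorted by uuid within each group.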