How do I sort a large table in BigQuery?

Asked: 2016-03-09 07:24:13

Tags: sql google-bigquery

I have about 4 GB of data stored in Google BigQuery, in the following format:

   uuid    |   entity_name    |    property    |    value   
---------------------------------------------------------------
  abc      |   Person         |   first_name   |  John
  def      |   Person         |   age          |  45
  abc      |   Person         |   age          |  26
  def      |   Person         |   first_name   |  Mary
...

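I want to get paginated results ordered by uuid. However, according to the documentation, ORDER BY and GROUP BY cannot be used when the "allowLargeResults" flag is set to true, and that flag is of course required to query a table this large. Is there a workaround? I tried sorting on the client side, but after the first few pages are fetched successfully it fails with the error "An existing connection was forcibly closed by the remote host".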

Here is my query job:

import uuid

query = 'SELECT * FROM [Users.Events] ORDER BY uuid'

query_request = {
    'jobReference': {
        'projectId': project_id,
        'jobId': str(uuid.uuid4())  # client-generated unique job ID
    },
    'configuration': {
        'query': {
            'query': query,
            'priority': 'BATCH' if BATCH_QUERY else 'INTERACTIVE',
            'allowLargeResults' : True,
            'destinationTable': {
                'projectId': project_id,
                'datasetId': 'CrunchBase',
                'tableId': 'AllProperties_query'
            },
            'createDisposition': 'CREATE_IF_NEEDED',
            'writeDisposition': 'WRITE_TRUNCATE',
        }
    }
}

query_job = service.jobs().insert(
    projectId=project_id,
    body=query_request).execute(num_retries=2)

poll_job(service, query_job)

Result:

RuntimeError: {u'reason': u'resourcesExceeded', u'message': u'Resources exceeded during query execution.', u'location': u'query'}
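
The poll_job helper called above is not shown in the question. A minimal sketch of what it might look like follows, assuming it polls jobs().get until the job state is DONE and raises RuntimeError with the job's errorResult, which would match the error printed above. The function body, the interval_seconds parameter, and the retry count are assumptions, not the asker's actual code.

```python
import time


def poll_job(service, job, interval_seconds=1.0):
    """Wait until the job is DONE; raise RuntimeError if it ended in error.

    Hypothetical reconstruction: `service` is a BigQuery v2 client built with
    google-api-python-client, `job` is the response of jobs().insert().
    """
    ref = job['jobReference']
    while True:
        # Fetch the current job resource and look at its status block.
        status = service.jobs().get(
            projectId=ref['projectId'],
            jobId=ref['jobId']).execute(num_retries=2)['status']
        if status['state'] == 'DONE':
            # A finished job carries errorResult only when it failed.
            if 'errorResult' in status:
                raise RuntimeError(status['errorResult'])
            return
        time.sleep(interval_seconds)
```

With a helper like this, a resourcesExceeded failure surfaces only once the job has finished, not when it is inserted.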

Edit: trying to sort within partitions

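If I could figure out how to partition by entity_name and order by uuid within each partition, I could solve the problem, but the following query does not work: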

SELECT
  uuid, entity_name, property, value
OVER
  (PARTITION BY entity_name ORDER BY uuid) AS entities
FROM [CrunchBase.AllProperties];

Result:

Query Failed
Error: Missing function in Analytic Expression at: 1.15 - 1.70

1 answer:

Answer 0 (score: 2)

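To answer the question in your edit: you need to actually specify an analytic function to apply over that ordered partition. Since you only want the current value for each row, you can use LEAD(x, 0), which returns the value of x from the row zero positions ahead, that is, the current row.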

For your query, you could write:

SELECT
  uuid, entity_name,
  LEAD(property, 0) OVER (PARTITION BY entity_name ORDER BY uuid) AS cur_property,
  LEAD(value, 0) OVER (PARTITION BY entity_name ORDER BY uuid) AS cur_value,
FROM [CrunchBase.AllProperties]
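
To make the semantics concrete, here is a small pure-Python sketch that emulates LEAD(col, 0) OVER (PARTITION BY entity_name ORDER BY uuid) on the sample rows from the question. It does not call BigQuery. The helper names lead and lead_zero_over_partition are made up for this illustration, and listing the partitions in sorted name order is a choice of the sketch, not something the query guarantees.

```python
from collections import defaultdict

# Sample rows in the question's layout: (uuid, entity_name, property, value).
rows = [
    ("abc", "Person", "first_name", "John"),
    ("def", "Person", "age", "45"),
    ("abc", "Person", "age", "26"),
    ("def", "Person", "first_name", "Mary"),
]


def lead(values, index, offset):
    """Return the value `offset` rows after `index`, or None past the end."""
    target = index + offset
    return values[target] if 0 <= target < len(values) else None


def lead_zero_over_partition(rows):
    """Emulate LEAD(col, 0) OVER (PARTITION BY entity_name ORDER BY uuid)."""
    # PARTITION BY entity_name: group the rows by their second column.
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[1]].append(row)

    result = []
    for entity_name in sorted(partitions):
        # ORDER BY uuid inside the partition. Python's sort is stable, so rows
        # with equal uuid keep their input order; SQL makes no such promise.
        ordered = sorted(partitions[entity_name], key=lambda r: r[0])
        properties = [r[2] for r in ordered]
        values = [r[3] for r in ordered]
        for i, row in enumerate(ordered):
            # With offset 0, LEAD yields the current row's own value.
            result.append(
                (row[0], row[1], lead(properties, i, 0), lead(values, i, 0)))
    return result


result = lead_zero_over_partition(rows)
```

Here result holds the same four rows as the input, with both "abc" rows ahead of both "def" rows inside the single "Person" partition: an offset of 0 leaves every value unchanged and only the window's ordering is applied.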