Group by in shards: cardinality too high or query times out

Asked: 2015-06-11 09:52:55

Tags: google-bigquery

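I am running GROUP EACH BY and JOIN EACH queries on huge tables. Because the cardinality of the grouping key is too high, I "shard" the query like this (simplified example):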

SELECT key FROM
  (SELECT key FROM [table] WHERE ABS(HASH(key)) % 2 = 0),
  (SELECT key FROM [table] WHERE ABS(HASH(key)) % 2 = 1)

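The problem now is that my tables need so many shards to keep the grouping key cardinality per shard small enough that the combined query becomes too slow and times out.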

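I could work around this by running each shard as a separate query and storing the intermediate results in temporary tables. But I would really like to solve this without creating extra tables (which would incur extra cost).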
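
For reference, a minimal sketch of that workaround, written in Python rather than the CoffeeScript I actually use. It only builds one standalone query string per shard, based on the simplified example above, and pairs each with a destination table name. The function name `build_shard_jobs`, the `tmp_shard_` prefix, and the table name are hypothetical, and submitting the jobs to BigQuery is deliberately left out.

```python
# Sketch: build one standalone legacy-SQL query per shard instead of a single
# comma-joined (UNION ALL) query. All names below are hypothetical placeholders.

def build_shard_jobs(shards, source_table="table", dest_prefix="tmp_shard_"):
    """Return a list of (destination_table, sql) pairs, one per shard."""
    jobs = []
    for shard in range(shards):
        # Same shard predicate as in the simplified example above.
        sql = (
            "SELECT key FROM [{table}] "
            "WHERE ABS(HASH(key)) % {shards} = {shard}"
        ).format(table=source_table, shards=shards, shard=shard)
        jobs.append(("{}{}".format(dest_prefix, shard), sql))
    return jobs


if __name__ == "__main__":
    # Print what would be run; each query would be submitted as its own job
    # writing to its own intermediate table.
    for dest, sql in build_shard_jobs(4):
        print(dest, "<-", sql)
```

Each shard then stays small enough to finish on its own, at the price of one intermediate table per shard, which is exactly the cost I am trying to avoid.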

Any suggestions?

Here is one of the queries that has this problem (I use CoffeeScript to generate the sharded query):

"SELECT * FROM " +
(
    """
    (SELECT
        key, NEST(things) as things, FIRST(category) as category
    FROM
        (SELECT
            things, key, category, events,
            RATIO_TO_REPORT(events) OVER (PARTITION BY key) AS presence
        FROM (
            SELECT
                a.key as key, a.category as category,
                a.things as things, a.events as events
            FROM [table1] a
            JOIN EACH (
                SELECT key FROM [table2]
                WHERE things BETWEEN 2 AND 10 AND ABS(HASH(key)) % #{shards} = #{shard}
            ) b
            ON a.key = b.key
            WHERE ABS(HASH(a.key)) % #{shards} = #{shard}))
    WHERE presence > 0.1 AND ABS(HASH(key)) % #{shards} = #{shard}
    GROUP EACH BY key
    HAVING COUNT(things) > 1)
    """ for shard in [0..(shards-1)]
).join ','

0 answers:

No answers