Question

My query is as follows:

SELECT
    e.event_id,
    (
        SELECT
            event_id
        FROM atomic.events
        WHERE
            domain_userid = e.domain_userid
        ORDER BY collector_tstamp
        LIMIT 1
    ) AS parent_event_id
FROM snowplow_intermediary.events_enriched e
LIMIT 1

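I am trying to find the first event for each user. This is fairly fast, about 5 s. If I look the user up by user_ipaddress instead of domain_userid, it is much slower: after more than 300 s it still has not finished.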

SELECT
    e.event_id,
    (
        SELECT
            event_id
        FROM atomic.events
        WHERE
            user_ipaddress = e.user_ipaddress
        ORDER BY collector_tstamp
        LIMIT 1
    ) AS parent_event_id
FROM snowplow_intermediary.events_enriched e
LIMIT 1

The data types are domain_userid varchar(36) encode runlength and user_ipaddress varchar(45) encode runlength.

Here are the EXPLAIN plans for the two queries:

https://gist.github.com/mortenstarfly/4ce3be9b3a19aac2601a

https://gist.github.com/mortenstarfly/2008b0f737259df30695

I would really like to speed up the second query. Any suggestions?

Answer 1

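This is probably because of your sort key. If your data is sorted by user ID, the first query retrieves its rows quickly because Redshift knows from the zone maps which blocks can contain the requested domain_userid and can skip the rest, so your I/O is very low. If user_ipaddress is not part of the sort key, the second query cannot skip blocks this way and has to scan far more of the table.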
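
One way to check this diagnosis, as a minimal sketch: the first query below lists the encoding and sort key position of the relevant columns of atomic.events using the pg_table_def catalog view (it only shows tables whose schema is on the search_path, hence the SET). The CREATE TABLE statement is a hypothetical illustration rather than something taken from the question: the table name snowplow_intermediary.events_by_ip is invented here, and it assumes you are able to keep a narrow copy of the events sorted by user_ipaddress and collector_tstamp so that zone maps can prune blocks for the IP lookup.

```sql
-- pg_table_def only lists tables whose schema is on the search_path.
SET search_path TO '$user', public, atomic;

-- sortkey = 0 means the column is not part of the sort key;
-- any nonzero value gives its position in the sort key.
SELECT "column", type, encoding, distkey, sortkey
FROM pg_table_def
WHERE schemaname = 'atomic'
  AND tablename = 'events'
  AND "column" IN ('domain_userid', 'user_ipaddress', 'collector_tstamp');

-- Hypothetical lookup table (name is an assumption): only the three
-- columns the subquery needs, sorted so that rows for one IP address
-- sit together in timestamp order.
CREATE TABLE snowplow_intermediary.events_by_ip
SORTKEY (user_ipaddress, collector_tstamp)
AS
SELECT user_ipaddress, collector_tstamp, event_id
FROM atomic.events;
```

The subquery in the second query could then read from this table instead of atomic.events. Whether that helps enough would need to be confirmed with EXPLAIN and a timed run.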

Redshift slow subquery (if it contains certain columns)
