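We are running 2 queries on 2 different pairs of tables. Each query uses a few joins to retrieve the latest versions of the data, and both the tables and the queries have the same format and purpose. The only difference between the queries is a string variable inside the REGEXP_MATCH pattern (the regular expressions have the same format, just a different core string). Everything else is identical, apart from the data the tables contain.
On one pair of tables the query processes almost 2 GB of data in 20-50 seconds. On the other pair, the same query with a different REGEX parameter (on the same column) processes 250 MB and takes more than 100 seconds (sometimes even 500 to 1000+ seconds). Both queries are executed in interactive mode, without cached results.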
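What could be causing this, and how can it be fixed?
Given that the queries are essentially the same, how can the smaller tables need so much more processing time than the much larger ones?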
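Sorry for the mess below; I tried to make it as readable as possible. In short: the query is meant to build a user funnel based on the users' events. The data is live, so both users and events can have updated versions. The steps involved are as follows:
If you need any additional details, please let me know. I will try to make everything as clear as possible.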
SELECT Count(*) as count
FROM
(
  SELECT final._nid as _nid
  FROM (
    -- Start of events funnel
    SELECT did.user as user
    FROM (
      -- Get updated events
      SELECT events.user as user, events.createdOn as createdOn
      FROM [shop1_events] as events
      JOIN EACH (
        SELECT session, createdOn, MAX(updatedOn) as updatedOn
        FROM [shop1_events]
        GROUP EACH BY session, createdOn) as latest_events
      ON events.session = latest_events.session
        AND events.createdOn = latest_events.createdOn
        AND events.updatedOn = latest_events.updatedOn
      -- Regex for categories (concatenated categories)
      WHERE ((REGEXP_MATCH(events.category_a , r"([\:\^]100000453[\:\^]|^100000453$|^100000453[\^\:]|[\^\:]100000453$)"))) AND events.type = 10006) as did
    -- Exclude the following events:
    LEFT OUTER JOIN EACH (
      -- Get updated events
      SELECT events.user as user, events.createdOn as createdOn
      FROM [shop1_events] as events
      JOIN EACH (
        SELECT session, createdOn, MAX(updatedOn) as updatedOn
        FROM [shop1_events]
        GROUP EACH BY session, createdOn) as latest_events
      ON events.session = latest_events.session
        AND events.createdOn = latest_events.createdOn
        AND events.updatedOn = latest_events.updatedOn
      -- Regex for categories
      WHERE ((REGEXP_MATCH(events.category_a , r"([\:\^]100000485[\:\^]|^100000485$|^100000485[\^\:]|[\^\:]100000485$)"))) AND events.type = 10006) as step_not_0
    ON did.user = step_not_0.user
    WHERE step_not_0.user IS NULL) as funnel
  JOIN EACH (
    -- Join with users
    SELECT all._nid as _nid
    FROM [shop1_users] as all
    JOIN EACH (
      -- Get updated users
      SELECT _nid, MAX(updatedOn) as updatedOn
      FROM [shop1_users]
      GROUP EACH BY _nid) as latest
    ON all._nid = latest._nid AND all.updatedOn = latest.updatedOn
  ) as final
  ON final._nid = funnel.user
  GROUP EACH BY _nid) as counting;
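Answer 0 (score: 0)
Maybe this will help:
Instead of matching with the full regular expression, you can extract the significant value and compare it with yours.
Example:
Replace this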
REGEXP_MATCH(events.category_a, r"([\:\^]100000485[\:\^]|^100000485$|^100000485[\^\:]|[\^\:]100000485$)")
with this:
REGEXP_REPLACE(events.category_a, r"\D*(\d+)\D*", r"\1") = "100000485"
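One caveat worth checking before swapping the expressions: the comment in the question says category_a holds concatenated categories, and a replace that strips every non-digit would glue several IDs into one long number. The sketch below uses Python's `re` module rather than BigQuery, on the assumption that both engines treat these simple patterns the same way. The sample values and the `COMPACT` pattern are illustrative and not taken from the original posts. It compares the original four-alternative pattern, a shorter pattern that expresses the same boundaries, and the replace-and-compare approach.

```python
import re

CATEGORY_ID = "100000485"  # example ID from the question

# Pattern exactly as written in the question (four alternatives).
ORIGINAL = re.compile(
    r"([\:\^]100000485[\:\^]|^100000485$|^100000485[\^\:]|[\^\:]100000485$)"
)
# Hypothetical compact form: the ID bounded on each side by either
# the start/end of the string or a ':' / '^' separator.
COMPACT = re.compile(r"(^|[:^])100000485([:^]|$)")


def digits_only(value):
    # Mimics the suggested REGEXP_REPLACE: every run of digits is kept,
    # every non-digit around it is dropped.
    return re.sub(r"\D*(\d+)\D*", r"\1", value)


# Made-up sample values for category_a.
SAMPLES = [
    "100000485",             # single category
    "100000453:100000485",   # two categories joined by ':'
    "100000485^100000453",   # two categories joined by '^'
    "7:100000485:9",         # ID in the middle
    "1100000485",            # longer number that merely contains the ID
    "100000453:1000004850",  # ID followed by an extra digit
]

for value in SAMPLES:
    original = bool(ORIGINAL.search(value))
    compact = bool(COMPACT.search(value))
    replaced = digits_only(value) == CATEGORY_ID
    print(f"{value!r}: original={original} compact={compact} replace_equals={replaced}")
```

On these samples the compact pattern agrees with the original everywhere, while the replace-and-compare form only agrees when the column holds a single ID. If category_a can contain several IDs, the shorter pattern may be the safer simplification.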
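Answer 1 (score: 0)
The amount of data processed only tells you how much data was read in the first stage. The cost of the query is not necessarily related to that number. For example, a JOIN or WHERE clause may eliminate most records in one case because nothing matches, but keep them in the other. Or one of the tables may be skewed (a particular JOIN key occurs very often), which makes the query run slowly. Many factors affect performance.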
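To make the skew point concrete, here is a minimal sketch in Python with made-up keys. It relies only on the fact that an equi-join emits, for each key, the product of that key's row counts on both sides. The function name `join_output_rows` and the data are hypothetical, and whether anything like this happens in the asker's tables (for example in the join on `did.user = step_not_0.user`) is not known from the question.

```python
from collections import Counter


def join_output_rows(left_keys, right_keys):
    """Number of rows an inner equi-join emits: sum of left_count * right_count per key."""
    left = Counter(left_keys)
    right = Counter(right_keys)
    return sum(count * right[key] for key, count in left.items())


# Hypothetical larger input: 10,000 rows per side, every key occurs once.
even_left = [f"user{i}" for i in range(10_000)]
even_right = [f"user{i}" for i in range(10_000)]

# Hypothetical smaller input: 2,000 rows per side, but one key ("bot")
# accounts for half of the rows on each side.
skew_left = ["bot"] * 1_000 + [f"user{i}" for i in range(1_000)]
skew_right = ["bot"] * 1_000 + [f"user{i}" for i in range(1_000)]

print("even:", join_output_rows(even_left, even_right))
print("skewed:", join_output_rows(skew_left, skew_right))
```

The skewed input is five times smaller, yet its join emits about a hundred times more rows, and all of the extra rows belong to a single key, so they tend to land on a single worker.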
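What you can do: after running the query, click the Explanation button and check, for each case, which step takes most of the time and how many rows that step processes. These values give a much deeper insight into query performance than the number of bytes read from the original tables.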
P.S. The query may also benefit from a small rewrite called filter pushdown, going from:
SELECT events.user as user, events.createdOn as createdOn
FROM [shop1_events] as events
JOIN EACH (...) as latest_events
ON ...
WHERE <condition-for-events-table>
to
SELECT events.user as user, events.createdOn as createdOn
FROM (SELECT <fields-used> FROM [shop1_events]
WHERE <condition-for-events-table>) as events
JOIN EACH (...) as latest_events
ON ...
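As a small check that the two shapes return the same rows, the sketch below runs both against an in-memory SQLite database from Python. This is standard SQL rather than BigQuery legacy SQL (no JOIN EACH, no bracketed table names), and the table contents are invented for illustration. The only filter is `type = 10006`, standing in for `<condition-for-events-table>`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (user TEXT, session TEXT, createdOn INTEGER,
                     updatedOn INTEGER, type INTEGER);
INSERT INTO events VALUES
  ('u1', 's1', 1, 10, 10006),
  ('u1', 's1', 1, 20, 10006),  -- newer version of the same event
  ('u2', 's2', 2, 10, 10006),
  ('u3', 's3', 3, 10, 99999);  -- other event type, should be filtered out
""")

# Latest version of each event, as in the question's inner subquery.
LATEST = """
SELECT session, createdOn, MAX(updatedOn) AS updatedOn
FROM events
GROUP BY session, createdOn
"""

JOIN_CONDITION = """
  ON ev.session = latest_events.session
 AND ev.createdOn = latest_events.createdOn
 AND ev.updatedOn = latest_events.updatedOn
"""

# Original shape: join first, filter afterwards.
filter_after_join = f"""
SELECT ev.user, ev.createdOn
FROM events AS ev
JOIN ({LATEST}) AS latest_events
{JOIN_CONDITION}
WHERE ev.type = 10006
ORDER BY ev.user
"""

# Pushed-down shape: filter the events table before it enters the join.
filter_pushed_down = f"""
SELECT ev.user, ev.createdOn
FROM (SELECT user, session, createdOn, updatedOn
      FROM events
      WHERE type = 10006) AS ev
JOIN ({LATEST}) AS latest_events
{JOIN_CONDITION}
ORDER BY ev.user
"""

rows_after = conn.execute(filter_after_join).fetchall()
rows_pushed = conn.execute(filter_pushed_down).fetchall()
print(rows_after)
print(rows_pushed)
```

Both queries return the same rows here. The difference is that in the second form only the rows that pass the filter take part in the join, which is where the potential saving comes from. How much it helps in BigQuery depends on how selective the condition is.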