I have a table
CREATE TABLE StatsFull (
Timestamp Int32,
Uid String,
ErrorCode Int32,
Name String,
Version String,
Date Date MATERIALIZED toDate(Timestamp),
Time DateTime MATERIALIZED toDateTime(Timestamp)
) ENGINE = MergeTree() PARTITION BY toMonday(Date)
ORDER BY Time SETTINGS index_granularity = 8192
I need to get the top 100 Names or the top 100 ErrorCodes by number of unique Uids.
The obvious query is:
SELECT Name, uniq(Uid) AS cnt FROM StatsFull
WHERE Time > subtractDays(toDate(now()), 1)
GROUP BY Name ORDER BY cnt DESC LIMIT 100
But the data is too large, so I created an AggregatingMergeTree, since I don't need to filter the data by hour (filtering by date is enough).
CREATE MATERIALIZED VIEW StatsAggregated (
Date Date,
ProductName String,
ErrorCode Int32,
Name String,
Version String,
UniqUsers AggregateFunction(uniq, String)
) ENGINE = AggregatingMergeTree() PARTITION BY toMonday(Date)
ORDER BY
(
Date,
ProductName,
ErrorCode,
Name,
Version
) SETTINGS index_granularity = 8192 AS
SELECT
Date,
ProductName,
ErrorCode,
Name,
Version,
uniqState(Uid) AS UniqUsers
FROM
StatsFull
GROUP BY
Date,
ProductName,
ErrorCode,
Name,
Version
My current query is:
SELECT Name FROM StatsAggregated
WHERE Date > subtractDays(toDate(now()), 1)
GROUP BY Name
ORDER BY uniqMerge(UniqUsers) DESC LIMIT 100
The query works fine, but the number of rows per day keeps growing, and the query has become too greedy with memory. So I am looking for some optimization.
I found the function topK(N)(column), which returns an array of the most frequent values in the specified column, but that is not what I need.
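To illustrate why it does not fit: topK ranks values by raw row frequency, not by the number of distinct Uids per value. A minimal sketch against the StatsFull table above:
/* returns the ~100 most frequent Names by row count, ignoring Uid entirely */
SELECT topK(100)(Name)
FROM StatsFull
WHERE Time > subtractDays(toDate(now()), 1)
/* whereas the goal is the top 100 Names ranked by count of distinct Uids */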
Answer 0 (score: 1)
I would suggest the following:
Use uniqCombined / uniqCombined64 instead of uniq (a sketch of this swap follows the example query below).
Reduce the number of dimensions in the aggregated view (it looks like ProductName and Version can be omitted).
CREATE MATERIALIZED VIEW StatsAggregated (
Date Date,
Name String,
ErrorCode Int32,
UniqUsers AggregateFunction(uniq, String)
) ENGINE = AggregatingMergeTree()
PARTITION BY toMonday(Date)
ORDER BY (Date, Name, ErrorCode) AS
SELECT Date, Name, ErrorCode, uniqState(Uid) AS UniqUsers
FROM StatsFull
GROUP BY Date, Name, ErrorCode;
SELECT Name, uniqMerge(UniqUsers) AS uniqUsers
FROM StatsAggregated
WHERE Date > subtractDays(toDate(now()), 1)
  AND ErrorCode = 0 /* apply any other conditions to narrow the result set as much as possible */
GROUP BY Name
HAVING uniqUsers > 12345 /* <-- 12345 is a 'heuristic' number that you evaluate based on your data */
ORDER BY uniqUsers DESC LIMIT 100
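A minimal sketch of the first suggestion, swapping uniq for uniqCombined (the view name StatsAggregatedCombined is hypothetical; everything else mirrors the view above):
/* same structure, but the state is stored with uniqCombined instead of uniq */
CREATE MATERIALIZED VIEW StatsAggregatedCombined (
Date Date,
Name String,
ErrorCode Int32,
UniqUsers AggregateFunction(uniqCombined, String)
) ENGINE = AggregatingMergeTree()
PARTITION BY toMonday(Date)
ORDER BY (Date, Name, ErrorCode) AS
SELECT Date, Name, ErrorCode, uniqCombinedState(Uid) AS UniqUsers
FROM StatsFull
GROUP BY Date, Name, ErrorCode;

/* the merge function must match the state function */
SELECT Name, uniqCombinedMerge(UniqUsers) AS uniqUsers
FROM StatsAggregatedCombined
WHERE Date > subtractDays(toDate(now()), 1)
GROUP BY Name
ORDER BY uniqUsers DESC LIMIT 100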
/* Raw-table */
CREATE TABLE StatsFull (
/* .. */
) ENGINE = MergeTree()
PARTITION BY toMonday(Date)
SAMPLE BY xxHash32(Uid) /* <-- */
ORDER BY (Time, xxHash32(Uid))
/* Applying sampling to the raw table can speed up short-term queries (periods of a few hours, etc.) */
SELECT Name, uniq(Uid) AS cnt
FROM StatsFull
SAMPLE 0.05 /* <-- */
WHERE Time > subtractHours(now(), 6) /* <-- hours-period */
GROUP BY Name
ORDER BY cnt DESC LIMIT 100
/* Aggregated-table */
CREATE MATERIALIZED VIEW StatsAggregated (
Date Date,
ProductName String,
ErrorCode Int32,
Name String,
Version String,
UniqUsers AggregateFunction(uniq, String)
) ENGINE = AggregatingMergeTree()
PARTITION BY toMonday(Date)
SAMPLE BY intHash32(toInt32(Date)) /* <-- not sure this is a good choice */
ORDER BY (intHash32(toInt32(Date)), ProductName, ErrorCode, Name, Version) AS
SELECT /* .. */ FROM StatsFull GROUP BY /* .. */
/* Applying sampling to the aggregated table can speed up long-term queries (periods of several weeks or months, etc.) */
SELECT Name
FROM StatsAggregated
SAMPLE 0.1 /* <-- */
WHERE Date > subtractMonths(toDate(now()), 3) /* <-- months-period */
GROUP BY Name
ORDER BY uniqMerge(UniqUsers) DESC LIMIT 100
Splitting the data into several parts (shards) also makes distributed processing possible.
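A minimal sketch of what that could look like, assuming a cluster named stats_cluster is already defined in the server config and a StatsFull table of the same structure exists on every shard (the cluster name, database name and sharding key here are assumptions):
/* the Distributed table itself stores no data, it only routes queries to the shards */
CREATE TABLE StatsFullDistributed AS StatsFull
ENGINE = Distributed(stats_cluster, default, StatsFull, xxHash32(Uid));

/* the same top-100 query, now fanned out across shards and merged on the initiator */
SELECT Name, uniq(Uid) AS cnt
FROM StatsFullDistributed
WHERE Time > subtractDays(toDate(now()), 1)
GROUP BY Name
ORDER BY cnt DESC LIMIT 100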
Answer 1 (score: 0)
If you need to transpose an array into rows, you can use arrayJoin.
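A minimal example of arrayJoin (the literal array is just for illustration):
/* produces three rows: 1, 2, 3 */
SELECT arrayJoin([1, 2, 3]) AS value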