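I have a service that runs a set of calculations for a project. Users can trigger this calculation several times a day. Each calculation produces some interesting metrics (let's call them A, B, and C).
I report these metrics to a logging service, with a separate log message for each metric. The log messages look like this: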
date | calculationID1 | projectID1 | metricA | valueA
date | calculationID1 | projectID1 | metricB | valueB
date | calculationID1 | projectID1 | metricC | valueC
date | calculationID2 | projectID2 | metricA | valueA
date | calculationID2 | projectID2 | metricB | valueB
date | calculationID2 | projectID2 | metricC | valueC
date | calculationID3 | projectID1 | metricA | valueA
date | calculationID3 | projectID1 | metricB | valueB
date | calculationID3 | projectID1 | metricC | valueC
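In this example, the project with ID 1 ran twice on this particular date. In my analytics backend I have a Hive cluster for analyzing this data, and I want to produce a table that contains the last reported metrics of each project for a given date: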
date | calculationID3 | projectID1 | valueA | valueB | valueC
date | calculationID2 | projectID2 | valueA | valueB | valueC
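Obviously, this calculation is very expensive because I do a lot of joins. My company has a strict logging format, which is why I create one log message per value. Should I instead create a single log message that contains all the metrics, to simplify reporting?
Can someone point me to best practices for this kind of problem?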
Answer 0 (score: 0)
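If we use a DB whose SQL supports the PIVOT clause, we can use the following query (written in Oracle syntax) to collect the data from the log report.
The same result can be obtained without PIVOT, but that route requires a lot of copy-pasting and juggling, and since you are "pragmatic with implementation", I think we don't need to talk about those dirty things.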
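For readers whose engine has no PIVOT clause, the non-PIVOT route mentioned above can be sketched as conditional aggregation: one `MAX(CASE WHEN ...)` column per metric. This is only an illustration, not the answer's own method. It runs the SQL through Python's built-in `sqlite3` so that it is self-contained, and the sample rows are the ones from the `data_table` CTE in the answer's query. The function name `latest_metrics`, the column aliases, and the join against a `MAX(stamp)` subquery (used here in place of `rank()`) are assumptions of this sketch.

```python
import sqlite3

# Sample rows copied from the data_table CTE in the answer's query.
ROWS = [
    ("2015-01-01 00:00:01", "calc_ID_1", "project_WHITE", "metric_A", 11),
    ("2015-01-01 00:00:02", "calc_ID_1", "project_WHITE", "metric_B", 21),
    ("2015-01-01 00:00:03", "calc_ID_1", "project_WHITE", "metric_C", 31),
    ("2015-01-01 00:01:04", "calc_ID_2", "project_WHITE", "metric_A", 12),
    ("2015-01-01 00:01:05", "calc_ID_2", "project_WHITE", "metric_B", 22),
    ("2015-01-01 00:01:06", "calc_ID_2", "project_WHITE", "metric_C", 32),
    ("2015-01-01 00:00:11", "calc_ID_3", "project_BLACK", "metric_A", 41),
    ("2015-01-01 00:00:12", "calc_ID_3", "project_BLACK", "metric_B", 51),
    ("2015-01-01 00:00:13", "calc_ID_3", "project_BLACK", "metric_C", 61),
    ("2015-01-01 00:01:14", "calc_ID_4", "project_BLACK", "metric_A", 42),
    ("2015-01-01 00:01:15", "calc_ID_4", "project_BLACK", "metric_B", 52),
    ("2015-01-01 00:01:16", "calc_ID_4", "project_BLACK", "metric_C", 62),
]

# One MAX(CASE ...) per metric stands in for the PIVOT clause.
# The subquery keeps only the latest row per (project, metric, day).
QUERY = """
SELECT date(t.stamp) AS day,
       t.calculation_id,
       t.project_id,
       MAX(CASE WHEN t.metric_name = 'metric_A' THEN t.metric_value END) AS metric_a,
       MAX(CASE WHEN t.metric_name = 'metric_B' THEN t.metric_value END) AS metric_b,
       MAX(CASE WHEN t.metric_name = 'metric_C' THEN t.metric_value END) AS metric_c
FROM data_table t
JOIN (
    SELECT project_id, metric_name, MAX(stamp) AS last_stamp
    FROM data_table
    GROUP BY project_id, metric_name, date(stamp)
) latest
  ON  latest.project_id  = t.project_id
  AND latest.metric_name = t.metric_name
  AND latest.last_stamp  = t.stamp
GROUP BY date(t.stamp), t.calculation_id, t.project_id
ORDER BY t.project_id
"""


def latest_metrics(rows):
    """Load the log rows into an in-memory table and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE data_table (stamp TEXT, calculation_id TEXT, "
        "project_id TEXT, metric_name TEXT, metric_value INTEGER)"
    )
    conn.executemany("INSERT INTO data_table VALUES (?, ?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


for row in latest_metrics(ROWS):
    print(row)
```

The cost of this approach is that every new metric needs another `MAX(CASE ...)` line, which is the copy-pasting referred to above. The SQL avoids window functions and vendor-specific clauses, so it should port to other engines with little more than a change of the date function.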
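To see what is happening in the PIVOT query below, you can run it in 3 steps, one of which is the query without PIVOT (just remove the PIVOT keyword and the rest of the code after it):
WITH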
    data_table (stamp, calculation_ID, project_ID, metric_name, metric_value) AS (
        select timestamp '2015-01-01 00:00:01', 'calc_ID_1', 'project_WHITE', 'metric_A', 11 from dual union all
        select timestamp '2015-01-01 00:00:02', 'calc_ID_1', 'project_WHITE', 'metric_B', 21 from dual union all
        select timestamp '2015-01-01 00:00:03', 'calc_ID_1', 'project_WHITE', 'metric_C', 31 from dual union all
        select timestamp '2015-01-01 00:01:04', 'calc_ID_2', 'project_WHITE', 'metric_A', 12 from dual union all
        select timestamp '2015-01-01 00:01:05', 'calc_ID_2', 'project_WHITE', 'metric_B', 22 from dual union all
        select timestamp '2015-01-01 00:01:06', 'calc_ID_2', 'project_WHITE', 'metric_C', 32 from dual union all
        select timestamp '2015-01-01 00:00:11', 'calc_ID_3', 'project_BLACK', 'metric_A', 41 from dual union all
        select timestamp '2015-01-01 00:00:12', 'calc_ID_3', 'project_BLACK', 'metric_B', 51 from dual union all
        select timestamp '2015-01-01 00:00:13', 'calc_ID_3', 'project_BLACK', 'metric_C', 61 from dual union all
        select timestamp '2015-01-01 00:01:14', 'calc_ID_4', 'project_BLACK', 'metric_A', 42 from dual union all
        select timestamp '2015-01-01 00:01:15', 'calc_ID_4', 'project_BLACK', 'metric_B', 52 from dual union all
        select timestamp '2015-01-01 00:01:16', 'calc_ID_4', 'project_BLACK', 'metric_C', 62 from dual
    )
SELECT *
FROM (
    select trunc(stamp) AS day,
           calculation_id,
           project_id,
           metric_name,
           metric_value
    from (
        -- take only the last log row for (project_ID, metric_name) for every given day
        select t.*,
               rank() OVER (PARTITION BY project_ID, metric_name, trunc(stamp)
                            ORDER BY stamp DESC) calculation_rank
        from data_table t
    ) where calculation_rank = 1
)
PIVOT (
    -- aggregate function is required here,
    -- and SUM can be replaced with something more relevant to custom logic
    SUM(metric_value)
    FOR metric_name IN ('metric_A' AS "Metric A",
                        'metric_B' AS "Metric B",
                        'metric_C' AS "Metric C")
);
Result:
DAY | CALCULATION_ID | PROJECT_ID | Metric A | Metric B | Metric C
------------------------------------------------------------------------------
2015-01-01 | calc_ID_4 | project_BLACK | 42 | 52 | 62
2015-01-01 | calc_ID_2 | project_WHITE | 12 | 22 | 32
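In this query calculation_ID is redundant (I only used it to make the code clearer for the reader). You can still use this information to check the integrity of the logged data, for example by exploring whether rows with the same calculation_ID correspond to metrics from the same group/time period.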
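One possible form of such an integrity check is sketched below, again through Python's built-in `sqlite3` so that it can be run as is. It lists every calculation_ID whose rows span more than one project or do not carry the expected number of distinct metrics. The function name `inconsistent_calculations`, the expected count of 3, and the incomplete sample row set are assumptions made for illustration; they are not taken from the real logs.

```python
import sqlite3

EXPECTED_METRICS = 3  # assumption: every calculation reports metric_A, metric_B, metric_C

# Flag calculation IDs that touch several projects or miss a metric.
CHECK = """
SELECT calculation_id,
       COUNT(DISTINCT project_id)  AS projects,
       COUNT(DISTINCT metric_name) AS metrics
FROM data_table
GROUP BY calculation_id
HAVING COUNT(DISTINCT project_id) <> 1
    OR COUNT(DISTINCT metric_name) <> ?
ORDER BY calculation_id
"""


def inconsistent_calculations(rows, expected_metrics=EXPECTED_METRICS):
    """Return (calculation_id, project count, metric count) for suspicious IDs."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE data_table (stamp TEXT, calculation_id TEXT, "
        "project_id TEXT, metric_name TEXT, metric_value INTEGER)"
    )
    conn.executemany("INSERT INTO data_table VALUES (?, ?, ?, ?, ?)", rows)
    bad = conn.execute(CHECK, (expected_metrics,)).fetchall()
    conn.close()
    return bad


# Hypothetical sample: calc_ID_2 is missing its metric_C row.
SAMPLE = [
    ("2015-01-01 00:00:01", "calc_ID_1", "project_WHITE", "metric_A", 11),
    ("2015-01-01 00:00:02", "calc_ID_1", "project_WHITE", "metric_B", 21),
    ("2015-01-01 00:00:03", "calc_ID_1", "project_WHITE", "metric_C", 31),
    ("2015-01-01 00:01:04", "calc_ID_2", "project_WHITE", "metric_A", 12),
    ("2015-01-01 00:01:05", "calc_ID_2", "project_WHITE", "metric_B", 22),
]

print(inconsistent_calculations(SAMPLE))
```

An empty result means every calculation_ID belongs to exactly one project and carries the full set of metrics, in which case keeping calculation_ID in the pivoted output is safe.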