How do I handle multiple related log messages?

Asked: 2015-04-16 10:03:26

Tags: sql logging hive analytics pivot-table

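I have a service that calculates a bunch of things for a project. Users can trigger this calculation several times a day. Each calculation produces a few interesting metrics (let's call them A, B and C).

I report these metrics to a logging service, with a separate log message for each metric. The log messages look like this: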

date  |  calculationID1  |  projectID1  |  metricA  |  valueA
date  |  calculationID1  |  projectID1  |  metricB  |  valueB
date  |  calculationID1  |  projectID1  |  metricC  |  valueC
date  |  calculationID2  |  projectID2  |  metricA  |  valueA
date  |  calculationID2  |  projectID2  |  metricB  |  valueB
date  |  calculationID2  |  projectID2  |  metricC  |  valueC
date  |  calculationID3  |  projectID1  |  metricA  |  valueA
date  |  calculationID3  |  projectID1  |  metricB  |  valueB
date  |  calculationID3  |  projectID1  |  metricC  |  valueC

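In this example, the project with ID 1 ran twice on this particular date. In my analytics backend I have a Hive cluster for analyzing this data, and I want to produce a table containing the last reported metrics for each project on a given date: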

date  |  calculationID3  |  projectID1  |  valueA  |  valueB  |  valueC
date  |  calculationID2  |  projectID2  |  valueA  |  valueB  |  valueC

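Obviously this computation is very expensive, because I do a lot of joins. My company has a strict logging format, which is why I create one log message per value. Should I instead create a single log message that contains all the metrics, to simplify reporting?

Can anyone point me to best practices for this kind of problem?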

1 answer:

Answer 0 (score: 0)

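If we use a database that supports the PIVOT clause in SQL, we can use the following query to gather the data from the log report. The query is written in Oracle syntax (note `from dual` and `trunc`).

The same result can be obtained without PIVOT, but that way requires a lot of copy-paste and juggling, and since you are "pragmatic with implementation", I suppose we don't need to talk about that dirty stuff.

To see what happens in the query, you can go through 3 steps:

  • Run the query without PIVOT (just remove the PIVOT keyword and the rest of the code after it)
  • Then run it as is
  • Compare the results of the first and second steps to see how rows are transposed into columns
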
WITH
  data_table (stamp, calculation_ID, project_ID, metric_name, metric_value) as ( select

      timestamp '2015-01-01 00:00:01', 'calc_ID_1', 'project_WHITE', 'metric_A', 11 from dual union all select
      timestamp '2015-01-01 00:00:02', 'calc_ID_1', 'project_WHITE', 'metric_B', 21 from dual union all select
      timestamp '2015-01-01 00:00:03', 'calc_ID_1', 'project_WHITE', 'metric_C', 31 from dual union all select
      timestamp '2015-01-01 00:01:04', 'calc_ID_2', 'project_WHITE', 'metric_A', 12 from dual union all select
      timestamp '2015-01-01 00:01:05', 'calc_ID_2', 'project_WHITE', 'metric_B', 22 from dual union all select
      timestamp '2015-01-01 00:01:06', 'calc_ID_2', 'project_WHITE', 'metric_C', 32 from dual union all select

      timestamp '2015-01-01 00:00:11', 'calc_ID_3', 'project_BLACK', 'metric_A', 41 from dual union all select
      timestamp '2015-01-01 00:00:12', 'calc_ID_3', 'project_BLACK', 'metric_B', 51 from dual union all select
      timestamp '2015-01-01 00:00:13', 'calc_ID_3', 'project_BLACK', 'metric_C', 61 from dual union all select
      timestamp '2015-01-01 00:01:14', 'calc_ID_4', 'project_BLACK', 'metric_A', 42 from dual union all select
      timestamp '2015-01-01 00:01:15', 'calc_ID_4', 'project_BLACK', 'metric_B', 52 from dual union all select
      timestamp '2015-01-01 00:01:16', 'calc_ID_4', 'project_BLACK', 'metric_C', 62 from dual       
  )
SELECT *
  FROM (
      select trunc(stamp) AS day,
             calculation_id,
             project_id,
             metric_name,
             metric_value
       from (
         select t.*,
                rank() OVER (PARTITION BY project_ID, metric_name, trunc(stamp) ORDER BY stamp DESC) calculation_rank
         from data_table t
        -- take only the last log row for (project_ID, metric_name) for every given day
      ) where calculation_rank = 1
)
PIVOT (
  -- aggregate function is required here,
  -- and SUM can be replaced with something more relevant to custom logic
  SUM(metric_value)
  FOR
  metric_name IN ('metric_A' AS "Metric A",
                  'metric_B' AS "Metric B",
                  'metric_C' AS "Metric C")
);

Result:

 DAY        | CALCULATION_ID | PROJECT_ID    | Metric A | Metric B | Metric C
------------------------------------------------------------------------------
 2015-01-01 | calc_ID_4      | project_BLACK | 42       | 52       | 62
 2015-01-01 | calc_ID_2      | project_WHITE | 12       | 22       | 32
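
For engines without a PIVOT clause, the same reshaping can be approximated with conditional aggregation (`MAX(CASE WHEN ...)`). The following is only a minimal sketch, not a tuned Hive query: it uses Python's built-in `sqlite3` module as a stand-in engine so it can be run anywhere, the function name `last_metrics_per_project` is made up for illustration, and `date(stamp)` is SQLite's way of truncating to a day, so the date handling would need adapting for another engine. The sample rows repeat the ones from the query above.

```python
import sqlite3

# Same sample rows as the data_table CTE above:
# (stamp, calculation_id, project_id, metric_name, metric_value)
ROWS = [
    ("2015-01-01 00:00:01", "calc_ID_1", "project_WHITE", "metric_A", 11),
    ("2015-01-01 00:00:02", "calc_ID_1", "project_WHITE", "metric_B", 21),
    ("2015-01-01 00:00:03", "calc_ID_1", "project_WHITE", "metric_C", 31),
    ("2015-01-01 00:01:04", "calc_ID_2", "project_WHITE", "metric_A", 12),
    ("2015-01-01 00:01:05", "calc_ID_2", "project_WHITE", "metric_B", 22),
    ("2015-01-01 00:01:06", "calc_ID_2", "project_WHITE", "metric_C", 32),
    ("2015-01-01 00:00:11", "calc_ID_3", "project_BLACK", "metric_A", 41),
    ("2015-01-01 00:00:12", "calc_ID_3", "project_BLACK", "metric_B", 51),
    ("2015-01-01 00:00:13", "calc_ID_3", "project_BLACK", "metric_C", 61),
    ("2015-01-01 00:01:14", "calc_ID_4", "project_BLACK", "metric_A", 42),
    ("2015-01-01 00:01:15", "calc_ID_4", "project_BLACK", "metric_B", 52),
    ("2015-01-01 00:01:16", "calc_ID_4", "project_BLACK", "metric_C", 62),
]

QUERY = """
SELECT date(t.stamp) AS day,
       t.project_id,
       -- one CASE per metric: this is the copy-paste part that PIVOT hides
       MAX(CASE WHEN t.metric_name = 'metric_A' THEN t.metric_value END) AS metric_a,
       MAX(CASE WHEN t.metric_name = 'metric_B' THEN t.metric_value END) AS metric_b,
       MAX(CASE WHEN t.metric_name = 'metric_C' THEN t.metric_value END) AS metric_c
FROM data_table t
JOIN (
    -- timestamp of the last log row per (project, metric, day)
    SELECT project_id, metric_name, date(stamp) AS day, MAX(stamp) AS last_stamp
    FROM data_table
    GROUP BY project_id, metric_name, date(stamp)
) l
  ON  l.project_id  = t.project_id
  AND l.metric_name = t.metric_name
  AND l.last_stamp  = t.stamp
GROUP BY date(t.stamp), t.project_id
ORDER BY t.project_id
"""


def last_metrics_per_project(rows):
    """Load rows into an in-memory table and return one row per (day, project)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE data_table ("
        "stamp TEXT, calculation_id TEXT, project_id TEXT, "
        "metric_name TEXT, metric_value INTEGER)"
    )
    conn.executemany("INSERT INTO data_table VALUES (?, ?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in last_metrics_per_project(ROWS):
        print(row)
```

The sketch leaves `calculation_id` out of the output, and it still needs one join to find the last row per metric; every additional metric costs one more `CASE` line rather than one more join.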

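In this query calculation_ID is redundant (I used it only to make the code clearer for the reader). You can still use it to check the integrity of the logged data, for example to verify that rows with the same calculation_ID really belong to the same group of metrics and the same time period.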
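
One way such an integrity check could look is sketched below. It is an illustration under assumptions, not part of the query above: it again uses Python's `sqlite3` as a stand-in engine, the names `find_mixed_calculations` and `SAMPLE_ROWS` are hypothetical, and the sample data is deliberately broken (calc_ID_2 never logged metric_C). The check flags every (day, project) whose "last" metric rows come from more than one calculation_ID.

```python
import sqlite3

# Deliberately incomplete data: calc_ID_2 has no metric_C row,
# so the last metric_C for project_WHITE still comes from calc_ID_1.
SAMPLE_ROWS = [
    ("2015-01-01 00:00:01", "calc_ID_1", "project_WHITE", "metric_A", 11),
    ("2015-01-01 00:00:02", "calc_ID_1", "project_WHITE", "metric_B", 21),
    ("2015-01-01 00:00:03", "calc_ID_1", "project_WHITE", "metric_C", 31),
    ("2015-01-01 00:01:04", "calc_ID_2", "project_WHITE", "metric_A", 12),
    ("2015-01-01 00:01:05", "calc_ID_2", "project_WHITE", "metric_B", 22),
    ("2015-01-01 00:00:11", "calc_ID_3", "project_BLACK", "metric_A", 41),
    ("2015-01-01 00:00:12", "calc_ID_3", "project_BLACK", "metric_B", 51),
    ("2015-01-01 00:00:13", "calc_ID_3", "project_BLACK", "metric_C", 61),
]

CHECK_QUERY = """
SELECT date(t.stamp) AS day,
       t.project_id,
       COUNT(DISTINCT t.calculation_id) AS distinct_calculations
FROM data_table t
JOIN (
    -- timestamp of the last log row per (project, metric, day)
    SELECT project_id, metric_name, date(stamp) AS day, MAX(stamp) AS last_stamp
    FROM data_table
    GROUP BY project_id, metric_name, date(stamp)
) l
  ON  l.project_id  = t.project_id
  AND l.metric_name = t.metric_name
  AND l.last_stamp  = t.stamp
GROUP BY date(t.stamp), t.project_id
-- keep only groups whose last rows mix several calculations
HAVING COUNT(DISTINCT t.calculation_id) > 1
ORDER BY t.project_id
"""


def find_mixed_calculations(rows):
    """Return (day, project_id, n) where the last metric rows span n > 1 calculations."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE data_table ("
        "stamp TEXT, calculation_id TEXT, project_id TEXT, "
        "metric_name TEXT, metric_value INTEGER)"
    )
    conn.executemany("INSERT INTO data_table VALUES (?, ?, ?, ?, ?)", rows)
    result = conn.execute(CHECK_QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    print(find_mixed_calculations(SAMPLE_ROWS))
```

An empty result means that, for every project and day, the pivoted row is built from a single calculation; any row returned points at a calculation that did not report all of its metrics.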