BigQuery: how do I get the top 20 correlations in a table?

Asked: 2017-04-02 22:13:39

Tags: sql google-bigquery amazon-redshift correlation bigdata

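I have 100,000 time series files, each with 2 columns: date and value. I am going to create a table in Google BigQuery and append all the time series to it, so that each append adds rows with 3 columns: time_series_name, date, value. In the end I will have 3 columns and millions of rows. Given a time_series_name, what code should I use to get the top 20 correlated time series? I think I have to do some GROUP BY (time_series_name), then compute the correlation of this time_series_name with the other series, and then sort the results in descending order. Is that right? What query would do this?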

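1 answer:

Answer 0: (score: 2)

Try the following.

It assumes your table is named all_time_series, has the fields time_series_name, dt, and value, and is built following the logic you described in your question.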

  
#standardSQL
-- Distinct series names present in the table.
WITH series AS (
  SELECT DISTINCT time_series_name 
  FROM all_time_series
),
-- Every unordered pair of series; the < condition avoids self-pairs and mirrored duplicates.
pairs AS (
  SELECT 
    series1.time_series_name AS time_series_1, 
    series2.time_series_name AS time_series_2,
    CONCAT(series1.time_series_name, ' - ', series2.time_series_name) AS pair_name 
  FROM series AS series1
  JOIN series AS series2
  ON series1.time_series_name < series2.time_series_name
) 
-- Pearson correlation per pair, computed over the dates both series share.
SELECT pair_name, CORR(value1, value2) AS correlation
FROM (
  SELECT pair_name, a1.dt AS dt, a1.value AS value1, a2.value AS value2
  FROM pairs AS p
  JOIN all_time_series AS a1 
    ON p.time_series_1 = a1.time_series_name
  JOIN all_time_series AS a2 
    ON p.time_series_2 = a2.time_series_name
    AND a1.dt = a2.dt
)
GROUP BY pair_name
ORDER BY correlation DESC
LIMIT 20
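
The query above ranks every pair of series, while the question asks for the series most correlated with one given time_series_name. A minimal sketch of that narrower case follows. It is not part of the original answer: it assumes the same all_time_series table and field names, and the literal 'a' is only a placeholder for the series of interest.

```sql
#standardSQL
-- Sketch: top 20 series most correlated with one chosen series.
-- 'a' is a placeholder name; replace it with the series you care about.
SELECT
  other.time_series_name AS other_series,
  CORR(target.value, other.value) AS correlation
FROM all_time_series AS target
JOIN all_time_series AS other
  ON target.dt = other.dt                                  -- align the two series by date
  AND other.time_series_name != target.time_series_name    -- skip the series itself
WHERE target.time_series_name = 'a'
GROUP BY other_series
ORDER BY correlation DESC
LIMIT 20
```

Because only one series sits on the left side of the join, this avoids building all pairs, which for 100,000 series would be roughly 5 billion. Ordering by ABS(correlation) instead would rank by strength regardless of sign, if strongly negative correlations also count as "correlated" for your purposes.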

You can test the above with dummy data, as follows:

#standardSQL
WITH all_time_series AS (
  SELECT 'a' AS time_series_name, '2016-01-01' AS dt, 1 AS value UNION ALL
  SELECT 'a', '2016-01-02', 2 UNION ALL
  SELECT 'a', '2016-01-03', 3 UNION ALL

  SELECT 'b', '2016-01-01', 1 UNION ALL
  SELECT 'b', '2016-01-02', 2 UNION ALL
  SELECT 'b', '2016-01-03', 3 UNION ALL

  SELECT 'c', '2016-01-01', 5 UNION ALL
  SELECT 'c', '2016-01-02', 6 UNION ALL
  SELECT 'c', '2016-01-03', 7 UNION ALL

  SELECT 'd', '2016-01-01', 6 UNION ALL
  SELECT 'd', '2016-01-02', 2 UNION ALL
  SELECT 'd', '2016-01-03', 3
),
series AS (
  SELECT DISTINCT time_series_name 
  FROM all_time_series
),
pairs AS (
  SELECT 
    series1.time_series_name AS time_series_1, 
    series2.time_series_name AS time_series_2,
    CONCAT(series1.time_series_name, ' - ', series2.time_series_name) AS pair_name 
  FROM series AS series1
  JOIN series AS series2
  ON series1.time_series_name < series2.time_series_name
) 
SELECT pair_name, CORR(value1, value2) AS correlation
FROM (
  SELECT pair_name, a1.dt AS dt, a1.value AS value1, a2.value AS value2
  FROM pairs AS p
  JOIN all_time_series AS a1 
    ON p.time_series_1 = a1.time_series_name
  JOIN all_time_series AS a2 
    ON p.time_series_2 = a2.time_series_name
    AND a1.dt = a2.dt
)
GROUP BY pair_name
ORDER BY correlation DESC
LIMIT 2
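
As a hand check on what this test query should return (this arithmetic is not part of the original answer): CORR computes the Pearson coefficient. In the dummy data, b has the same values as a and c equals a plus 4, so the pairs a - b, a - c, and b - c each have a correlation of 1, up to floating-point rounding. For a = (1, 2, 3) and d = (6, 2, 3), the coefficient works out as:

```latex
\begin{aligned}
\bar{a} &= 2, \qquad \bar{d} = \tfrac{11}{3} \\
\sum_i (a_i - \bar{a})(d_i - \bar{d})
  &= (-1)\left(\tfrac{7}{3}\right) + (0)\left(-\tfrac{5}{3}\right) + (1)\left(-\tfrac{2}{3}\right) = -3 \\
\sum_i (a_i - \bar{a})^2 &= 2, \qquad
\sum_i (d_i - \bar{d})^2 = \tfrac{49 + 25 + 4}{9} = \tfrac{26}{3} \\
r_{a,d} &= \frac{-3}{\sqrt{2 \cdot \tfrac{26}{3}}} \approx -0.72
\end{aligned}
```

The pairs b - d and c - d give the same value, since b and c are shifted copies of a. With LIMIT 2 the query should therefore return two of the three pairs tied at a correlation of 1; which two is not determined by the ORDER BY.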