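I have 100,000 time series files, each with 2 columns: date and value. I am going to create a table in Google BigQuery and append all the time series to it, so that each append adds rows with 3 columns: time_series_name, date, value. In the end I will have 3 columns and millions of rows. Given a time_series_name, what code should I use to get the top 20 most correlated time series? I think I have to do some GROUP BY on time_series_name, then compute the correlation of this time_series_name with the others, then sort the results in descending order. Is that right? What query can do this?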
Answer 0: (score: 2)
Try the following. It assumes your table is named all_time_series, with fields time_series_name, dt and value, and it follows the logic you described in your question.
#standardSQL
WITH series AS (
  SELECT DISTINCT time_series_name
  FROM all_time_series
),
pairs AS (
  SELECT
    series1.time_series_name AS time_series_1,
    series2.time_series_name AS time_series_2,
    CONCAT(series1.time_series_name, ' - ', series2.time_series_name) AS pair_name
  FROM series AS series1
  JOIN series AS series2
    ON series1.time_series_name < series2.time_series_name
)
SELECT pair_name, CORR(value1, value2) AS correlation
FROM (
  SELECT pair_name, a1.dt AS dt, a1.value AS value1, a2.value AS value2
  FROM pairs AS p
  JOIN all_time_series AS a1
    ON p.time_series_1 = a1.time_series_name
  JOIN all_time_series AS a2
    ON p.time_series_2 = a2.time_series_name
    AND a1.dt = a2.dt
)
GROUP BY pair_name
ORDER BY correlation DESC
LIMIT 20
You can test the above with dummy data, as follows:
#standardSQL
WITH all_time_series AS (
  SELECT 'a' AS time_series_name, '2016-01-01' AS dt, 1 AS value UNION ALL
  SELECT 'a', '2016-01-02', 2 UNION ALL
  SELECT 'a', '2016-01-03', 3 UNION ALL
  SELECT 'b', '2016-01-01', 1 UNION ALL
  SELECT 'b', '2016-01-02', 2 UNION ALL
  SELECT 'b', '2016-01-03', 3 UNION ALL
  SELECT 'c', '2016-01-01', 5 UNION ALL
  SELECT 'c', '2016-01-02', 6 UNION ALL
  SELECT 'c', '2016-01-03', 7 UNION ALL
  SELECT 'd', '2016-01-01', 6 UNION ALL
  SELECT 'd', '2016-01-02', 2 UNION ALL
  SELECT 'd', '2016-01-03', 3
),
series AS (
  SELECT DISTINCT time_series_name
  FROM all_time_series
),
pairs AS (
  SELECT
    series1.time_series_name AS time_series_1,
    series2.time_series_name AS time_series_2,
    CONCAT(series1.time_series_name, ' - ', series2.time_series_name) AS pair_name
  FROM series AS series1
  JOIN series AS series2
    ON series1.time_series_name < series2.time_series_name
)
SELECT pair_name, CORR(value1, value2) AS correlation
FROM (
  SELECT pair_name, a1.dt AS dt, a1.value AS value1, a2.value AS value2
  FROM pairs AS p
  JOIN all_time_series AS a1
    ON p.time_series_1 = a1.time_series_name
  JOIN all_time_series AS a2
    ON p.time_series_2 = a2.time_series_name
    AND a1.dt = a2.dt
)
GROUP BY pair_name
ORDER BY correlation DESC
LIMIT 2
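The queries above rank every pair of series, which for 100,000 series means roughly five billion pairs. For the case in the question, where one time_series_name is given, a narrower variant might be enough. The following is only a sketch under the same assumptions (table all_time_series with fields time_series_name, dt and value). The literal 'a' is a placeholder for the series you are interested in, not something from your data.

```sql
#standardSQL
-- Sketch: top 20 series most correlated with one given series.
-- 'a' is a placeholder target name; replace it in both places below.
WITH target AS (
  -- Rows of the given series only
  SELECT dt, value
  FROM all_time_series
  WHERE time_series_name = 'a'
)
SELECT
  o.time_series_name,
  CORR(t.value, o.value) AS correlation
FROM target AS t
JOIN all_time_series AS o
  ON o.dt = t.dt                    -- align the two series on date
WHERE o.time_series_name != 'a'     -- skip the target itself
GROUP BY o.time_series_name
ORDER BY correlation DESC
LIMIT 20
```

Note that ORDER BY correlation DESC ranks by signed correlation, so strongly negatively correlated series end up at the bottom. If you care about strength regardless of sign, ordering by ABS(correlation) is one option.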