Given a dataset with 100k rows and 100 columns, how can BigQuery's CORR() be used to find the correlation between rows?
The schema is:
id:integer, feature1:float, feature2:float, ..., feature100:float
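Edit: this is not a rolling-window time-series correlation problem. Each row is an observation of 100 features, and I want to use BigQuery to find the top N most similar observations for each row.
Answer 0 (score: 4)
Do you want to find the correlation between each column and every other column?
That would look like this: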
SELECT CORR(col1, col2), CORR(col1, col3), CORR(col1, col4),..., CORR(col99, col100)
FROM [mytable]
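Writing this out could take a long time (unless you automate it). As an alternative, consider a different schema in which everything lives in 3 columns. The transformation would run as follows: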
SELECT colname, value, rowid FROM
(SELECT 'col1' AS colname, col1 AS value, rowid FROM [mytable]),
(SELECT 'col2' AS colname, col2 AS value, rowid FROM [mytable]),
(SELECT 'col3' AS colname, col3 AS value, rowid FROM [mytable]),
...
(SELECT 'col100' AS colname, col100 AS value, rowid FROM [mytable])
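Typing 100 nearly identical subqueries by hand is error-prone, so the query text can be generated instead. The following is a minimal Python sketch, not part of the original answer. It assumes the columns are literally named col1 through col100 and that the table has a rowid column; the helper name build_unpivot_query is hypothetical.

```python
def build_unpivot_query(table, columns, id_column="rowid"):
    """Return legacy-SQL text that reshapes wide columns into (colname, value, rowid)."""
    subqueries = [
        "(SELECT '{c}' AS colname, {c} AS value, {i} FROM [{t}])".format(
            c=col, i=id_column, t=table
        )
        for col in columns
    ]
    # In BigQuery legacy SQL, a comma between subqueries in FROM acts as UNION ALL.
    header = "SELECT colname, value, {i} FROM\n".format(i=id_column)
    return header + ",\n".join(subqueries)


# Assumed column names: col1 .. col100 (adjust to the real schema).
columns = ["col{}".format(n) for n in range(1, 101)]
query = build_unpivot_query("mytable", columns)
print(query)
```

The printed text can be pasted into the query editor and its result saved as the new 3-column table.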
With this schema, you can compute the correlation for every combination of columns with a much simpler query:
SELECT CORR(a.value, b.value) corr, a.colname, b.colname
FROM [my_new_table] a
JOIN EACH [my_new_table] b
ON a.rowid=b.rowid
WHERE a.colname>b.colname
GROUP BY a.colname, b.colname
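(This is what I did in the post that @Tjorriemorrie linked: http://googlecloudplatform.blogspot.mx/2013/09/introducing-corr-to-google-bigquery.html)
Note that the first query may be more complex to write than the last one, but I suspect it will take less time to run, because no shuffling is needed.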
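If the wide query is preferred for its runtime, its text can also be generated rather than typed. This is a rough Python sketch under the same assumptions as before (columns named col1 through col100); build_wide_corr_query and the corr_<a>_<b> output aliases are hypothetical names. With 100 columns it emits 4,950 CORR expressions, and whether a statement that large fits within BigQuery's query limits is not something the answer states, so it may need to be split into batches of column pairs.

```python
from itertools import combinations


def build_wide_corr_query(table, columns):
    """Return legacy-SQL text with one CORR() expression per unordered column pair."""
    expressions = [
        "CORR({a}, {b}) AS corr_{a}_{b}".format(a=a, b=b)
        for a, b in combinations(columns, 2)
    ]
    return "SELECT\n  " + ",\n  ".join(expressions) + "\nFROM [{t}]".format(t=table)


# Assumed column names: col1 .. col100 (adjust to the real schema).
columns = ["col{}".format(n) for n in range(1, 101)]
wide_query = build_wide_corr_query("mytable", columns)
print(wide_query.count("CORR("))  # number of column pairs
```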
Since the question asks about rows, the initial transformation would be similar, but slightly different:
SELECT column, value, rowid FROM
(SELECT 'c1' column, c1 AS value, rowid FROM [mytable]),
(SELECT 'c2' column, c2 AS value, rowid FROM [mytable]),
(SELECT 'c3' column, c3 AS value, rowid FROM [mytable])
The correlation between rows is then computed as follows:
SELECT CORR(a.value, b.value), a.rowid, b.rowid
FROM [my_new_table] a
JOIN EACH [my_new_table] b
ON a.column=b.column
WHERE a.rowid < b.rowid
GROUP BY a.rowid, b.rowid
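The query above returns one correlation per pair of rows; the question additionally asks for the top N most similar rows for each row. To make the intended result concrete, here is a small Python sketch that computes the same Pearson coefficient between rows of a toy dataset and keeps the N best matches per row. The data and the function names are made up for illustration, and this is a way to check results on a sample, not a replacement for the BigQuery query. Note that for 100k rows there are roughly 5 × 10^9 row pairs, so the full self-join is large.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences, or None if undefined."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # a constant row has no defined correlation
    return cov / sqrt(var_x * var_y)


def top_n_similar(rows, n):
    """Map each row id to the ids of its n most correlated other rows."""
    result = {}
    for row_id, features in rows.items():
        scored = []
        for other_id, other_features in rows.items():
            if other_id == row_id:
                continue
            r = pearson(features, other_features)
            if r is not None:
                scored.append((r, other_id))
        # Highest correlation first; ties broken by the smaller id.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        result[row_id] = [other_id for _, other_id in scored[:n]]
    return result


# Toy data: row id -> feature values (4 features instead of 100).
rows = {
    1: [1.0, 2.0, 3.0, 4.0],
    2: [2.0, 4.0, 6.0, 8.5],
    3: [4.0, 3.0, 2.0, 1.0],
    4: [1.0, 3.0, 2.0, 4.0],
}
top = top_n_similar(rows, 2)
print(top[1])
```

Unlike the query, which keeps only pairs with a.rowid < b.rowid, this sketch scores each pair in both directions so that every row gets its own ranked list.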
Answer 1 (score: 1)
From the documentation on aggregate functions:
CORR(numeric_expr, numeric_expr) Returns the Pearson correlation coefficient of a set of number pairs.
I suggest you take a look at the blog post: http://googlecloudplatform.blogspot.com/2013/09/introducing-corr-to-google-bigquery.html
SELECT CORR(a.data, b.data) corr, a.room room, count(*) c
FROM (
SELECT
TIME(USEC_TO_TIMESTAMP(INTEGER(Timestamp / 60000000) * 60000000)) time, AVG(DATA) data, room
FROM [io_sensor_data.moscone_io13]
WHERE
DATE(USEC_TO_TIMESTAMP(Timestamp- 8*60*60000000)) = '2013-05-16'
AND sensortype='temperature'
GROUP EACH BY time, room) a
JOIN EACH (
SELECT
TIME(USEC_TO_TIMESTAMP(INTEGER(Timestamp / 60000000) * 60000000)) time, AVG(data) data, room
FROM [io_sensor_data.moscone_io13]
WHERE
DATE(USEC_TO_TIMESTAMP(Timestamp- 8*60*60000000)) = '2013-05-17'
AND sensortype='temperature'
GROUP EACH BY time, room) b
ON a.time=b.time AND a.room = b.room
GROUP EACH BY room
HAVING
corr IS NOT NULL
AND c > 800
ORDER EACH BY corr DESC
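It looks like you select the subsets a and b with the help of WHERE clauses and combine them using the join. I don't know which values or time frames you want to correlate, but you should be able to construct the query accordingly. I hope that helps.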