How do I use BigQuery correlation based on multiple columns?

Time: 2014-08-31 04:20:57

Tags: google-bigquery

Given a dataset of 100k rows and 100 columns, how can I use BigQuery's CORR() to find the correlation between rows?

The schema is:

id:integer, feature1:float, feature2:float, ..., feature100:float

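Edit: This is not a rolling-window time series correlation problem. Each row is an observation of 100 features, and I want to use BigQuery to find the top N most similar observations for each row.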

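2 answers:

Answer 0: (score: 4)

Do you want to find the correlation between each column and every other column?

That would look something like this: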

SELECT CORR(col1, col2), CORR(col1, col3), CORR(col1, col4),..., CORR(col99, col100)
FROM [mytable]
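With 100 columns the elided list above has 100 * 99 / 2 = 4,950 CORR() expressions, so it is easier to generate than to type. The following is a minimal sketch of such a generator, not part of the original answer; the function name `build_corr_query` and the `corr_colA_colB` output aliases are assumptions made for illustration.

```python
from itertools import combinations


def build_corr_query(columns, table="[mytable]"):
    """Build a legacy SQL SELECT with one CORR() per unordered column pair."""
    # combinations() yields each pair once, so (col1, col2) appears
    # but (col2, col1) does not.
    exprs = [
        f"CORR({a}, {b}) AS corr_{a}_{b}"  # alias is a hypothetical naming choice
        for a, b in combinations(columns, 2)
    ]
    return "SELECT\n  " + ",\n  ".join(exprs) + f"\nFROM {table}"


# Column names col1..col100 follow the naming used in the query above.
columns = [f"col{i}" for i in range(1, 101)]
query = build_corr_query(columns)
print(query[:200])
```

The generated text can be pasted into the query editor or submitted through whichever client you already use.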

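This could take a long time to write (unless you automate it). As an alternative, consider a different schema in which everything is stored in 3 columns. The transformation would run as follows: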

SELECT colname, value, rowid FROM
(SELECT 'col1' AS colname, col1 AS value, rowid FROM [mytable]),
(SELECT 'col2' AS colname, col2 AS value, rowid FROM [mytable]),
(SELECT 'col3' AS colname, col3 AS value, rowid FROM [mytable]),
...
(SELECT 'col100' AS colname, col100 AS value, rowid FROM [mytable])
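The transformation above repeats one subselect per column, which makes it a good candidate for generation. The sketch below builds the query text under the assumption that every subselect should emit the column name, the column value as `value`, and `rowid`; the function name `build_unpivot_query` and its parameters are hypothetical. It relies on legacy SQL treating comma-separated subselects in FROM as a UNION ALL.

```python
def build_unpivot_query(columns, table="[mytable]", id_column="rowid"):
    """Build a legacy SQL query that reshapes a wide table into
    (colname, value, rowid) triples, one subselect per source column."""
    subselects = [
        f"  (SELECT '{c}' AS colname, {c} AS value, {id_column} FROM {table})"
        for c in columns
    ]
    # In legacy SQL, a comma between tables or subselects means UNION ALL.
    return f"SELECT colname, value, {id_column} FROM\n" + ",\n".join(subselects)


# Column names col1..col100 follow the naming used in the query above.
unpivot_query = build_unpivot_query([f"col{i}" for i in range(1, 101)])
print(unpivot_query[:200])
```

The result would then be saved to a destination table (called my_new_table in the queries that follow).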

With this schema, you can compute the correlation for every combination of columns with a much simpler query:

SELECT CORR(a.value, b.value) corr, a.colname, b.colname
FROM [my_new_table] a
JOIN EACH [my_new_table] b
ON a.rowid=b.rowid
WHERE a.colname>b.colname
GROUP BY a.colname, b.colname

(This is what I did for the article that @Tjorriemorrie linked: http://googlecloudplatform.blogspot.mx/2013/09/introducing-corr-to-google-bigquery.html)

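Note that the first query may be more complex to write than the last one, but I suspect it will take less time to run, because no shuffling is needed.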

Since the question asks about rows, the initial transformation would be similar, but slightly different:

SELECT column, value, rowid FROM
  (SELECT 'c1' column, c1 AS value, rowid FROM [mytable]),
  (SELECT 'c2' column, c2 AS value, rowid FROM [mytable]),
  (SELECT 'c3' column, c3 AS value, rowid FROM [mytable]) 

The correlation between rows is then computed as follows:

SELECT CORR(a.value, b.value), a.rowid, b.rowid
FROM [my_new_table] a
JOIN EACH [my_new_table] b
ON a.column=b.column
WHERE a.rowid < b.rowid
GROUP BY a.rowid, b.rowid
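The query above returns one correlation per pair of rows but does not rank them, and the question asks for the top N most similar observations per row. As a way to check results on a small sample, here is a minimal in-memory sketch of the same idea in plain Python. It is a reference for validation, not a replacement for the BigQuery query, and the names `pearson` and `top_n_similar` as well as the sample data are made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences, or None if undefined."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # a constant row has no defined correlation
    return cov / math.sqrt(var_x * var_y)


def top_n_similar(rows, n):
    """rows maps rowid -> list of feature values.
    Returns rowid -> list of (other_rowid, correlation), best first."""
    result = {}
    for rowid, features in rows.items():
        scored = []
        for other_id, other_features in rows.items():
            if other_id == rowid:
                continue  # skip the trivial self-correlation
            r = pearson(features, other_features)
            if r is not None:
                scored.append((other_id, r))
        scored.sort(key=lambda pair: pair[1], reverse=True)
        result[rowid] = scored[:n]
    return result


# Hypothetical sample: three observations with four features each.
rows = {
    1: [1.0, 2.0, 3.0, 4.0],
    2: [2.0, 4.0, 6.0, 8.5],
    3: [4.0, 3.0, 2.0, 1.0],
}
top = top_n_similar(rows, 1)
print(top)
```

This brute-force loop compares every row with every other row, so it is only practical for samples; at 100k rows there are roughly 5 billion row pairs, which is also worth keeping in mind when estimating the size of the self-join.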

Answer 1: (score: 1)

From the aggregate functions:

CORR(numeric_expr, numeric_expr)    Returns the Pearson correlation coefficient of a set of number pairs.
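For reference, the Pearson correlation coefficient of n number pairs (x_i, y_i) is the standard textbook quantity below; the symbols are introduced here for illustration and do not come from the BigQuery documentation line quoted above.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}},
\qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,\quad
\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i
```

The value ranges from -1 to 1 and is undefined when either series is constant.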

I suggest you take a look at the blog post: http://googlecloudplatform.blogspot.com/2013/09/introducing-corr-to-google-bigquery.html

SELECT CORR(a.data, b.data) corr, a.room room, count(*) c
FROM (
  SELECT
    TIME(USEC_TO_TIMESTAMP(INTEGER(Timestamp / 60000000) * 60000000)) time, AVG(DATA) data, room
  FROM [io_sensor_data.moscone_io13]
  WHERE
    DATE(USEC_TO_TIMESTAMP(Timestamp- 8*60*60000000)) = '2013-05-16'
    AND sensortype='temperature'
    GROUP EACH BY time, room) a
JOIN EACH (
  SELECT
    TIME(USEC_TO_TIMESTAMP(INTEGER(Timestamp / 60000000) * 60000000)) time, AVG(data) data, room
  FROM [io_sensor_data.moscone_io13]
  WHERE
    DATE(USEC_TO_TIMESTAMP(Timestamp- 8*60*60000000)) = '2013-05-17'
    AND sensortype='temperature'
    GROUP EACH BY time, room) b
  ON a.time=b.time AND a.room = b.room
  GROUP EACH BY room
HAVING
  corr IS NOT NULL
  AND c > 800
  ORDER EACH BY corr DESC

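It seems that you select the two subsets a and b with the help of the WHERE clause and then combine them with a join. I don't know which values or time frames you want to correlate, but you should be able to construct the query accordingly. I hope that helps.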