Detecting outliers with standard deviation in BigQuery

Time: 2018-10-25 07:33:02

Tags: statistics google-bigquery standard-deviation

I currently have a table in BigQuery that contains some outliers.

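Sample table:

port  qty  datetime
TCP1  13   2018/06/11 11:20:23
UDP2  15   2018/06/11 11:24:24
TCP3  14   2018/06/11 11:24:27
TCP1  2    2018/06/11 11:24:26
UDP2  15   2018/06/11 11:35:32
TCP3  13   2018/06/11 11:45:23
TCP3  14   2018/06/11 11:54:22
TCP3  30   2018/06/11 11:55:33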

I would like to use SQL and standard deviation to filter out the outliers across the ports on 2018/06/11.

Expected result:

port  qty  datetime
TCP1  2    2018/06/11 11:24:26
TCP3  30   2018/06/11 11:55:33

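I did some research and found that standard deviation can help filter out outliers. However, I don't know how to write the SQL query to make this work. Any help would be appreciated.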

(This is the closest thread I could find on this topic: Using BigQuery to find outliers with standard deviation results combined with WHERE clause)

1 answer:

Answer 0: (score: 2)

The example below is for BigQuery Standard SQL. For each date it computes the mean and standard deviation of qty, then returns the rows whose qty falls outside the mean plus or minus 1.5 standard deviations:

#standardSQL
WITH stats AS (
  -- Per-day lower and upper bounds: mean -/+ 1.5 standard deviations
  SELECT DATE(PARSE_TIMESTAMP('%Y/%m/%d %T', datetime)) dt,
    AVG(qty) - 1.5 * STDDEV(qty) down,
    AVG(qty) + 1.5 * STDDEV(qty) up
  FROM `project.dataset.table`
  GROUP BY dt
)
-- Keep only the rows whose qty lies outside that day's bounds
SELECT port, qty, datetime
FROM `project.dataset.table`
JOIN stats
ON dt = DATE(PARSE_TIMESTAMP('%Y/%m/%d %T', datetime))
WHERE NOT qty BETWEEN down AND up

You can test and play with the above using the dummy data from your question:

#standardSQL
WITH `project.dataset.table` AS (
  -- Dummy data from the question, standing in for the real table
  SELECT 'TCP1' port, 13 qty, '2018/06/11 11:20:23' datetime UNION ALL
  SELECT 'UDP2', 15, '2018/06/11 11:24:24' UNION ALL
  SELECT 'TCP3', 14, '2018/06/11 11:24:27' UNION ALL
  SELECT 'TCP1', 2 , '2018/06/11 11:24:26' UNION ALL
  SELECT 'UDP2', 15, '2018/06/11 11:35:32' UNION ALL
  SELECT 'TCP3', 13, '2018/06/11 11:45:23' UNION ALL
  SELECT 'TCP3', 14, '2018/06/11 11:54:22' UNION ALL
  SELECT 'TCP3', 30, '2018/06/11 11:55:33'
), stats AS (
  SELECT DATE(PARSE_TIMESTAMP('%Y/%m/%d %T', datetime)) dt,
    AVG(qty) - 1.5 * STDDEV(qty) down,
    AVG(qty) + 1.5 * STDDEV(qty) up
  FROM `project.dataset.table`
  GROUP BY dt
)
SELECT port, qty, datetime
FROM `project.dataset.table`
JOIN stats
ON dt = DATE(PARSE_TIMESTAMP('%Y/%m/%d %T', datetime))
WHERE NOT qty BETWEEN down AND up

with the result:

port  qty  datetime
TCP1  2    2018/06/11 11:24:26
TCP3  30   2018/06/11 11:55:33
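
To see why those thresholds single out particular rows, the arithmetic can be reproduced outside BigQuery. The following is a minimal Python sketch, not part of the original answer: the helper names `bounds` and `find_outliers` are hypothetical, and it assumes the same per-date grouping and the same 1.5 multiplier as the query. It uses `statistics.stdev`, the sample standard deviation, which is what BigQuery's `STDDEV` (an alias of `STDDEV_SAMP`) computes.

```python
from statistics import mean, stdev

# Dummy rows copied from the query above: (port, qty, datetime)
rows = [
    ("TCP1", 13, "2018/06/11 11:20:23"),
    ("UDP2", 15, "2018/06/11 11:24:24"),
    ("TCP3", 14, "2018/06/11 11:24:27"),
    ("TCP1", 2, "2018/06/11 11:24:26"),
    ("UDP2", 15, "2018/06/11 11:35:32"),
    ("TCP3", 13, "2018/06/11 11:45:23"),
    ("TCP3", 14, "2018/06/11 11:54:22"),
    ("TCP3", 30, "2018/06/11 11:55:33"),
]


def bounds(qtys, k=1.5):
    """Return (down, up) = mean -/+ k sample standard deviations."""
    avg = mean(qtys)
    sd = stdev(qtys)  # sample standard deviation, like STDDEV_SAMP
    return avg - k * sd, avg + k * sd


def find_outliers(rows, k=1.5):
    """Group rows by date and keep those whose qty is outside the bounds."""
    by_date = {}
    for row in rows:
        day = row[2].split(" ")[0]  # date part of 'YYYY/MM/DD HH:MM:SS'
        by_date.setdefault(day, []).append(row)

    outliers = []
    for group in by_date.values():
        qtys = [qty for _, qty, _ in group]
        if len(qtys) < 2:
            # stdev is undefined for one value; the SQL yields NULL bounds
            # there, so such rows are not returned either.
            continue
        down, up = bounds(qtys, k)
        # Same test as: WHERE NOT qty BETWEEN down AND up (inclusive range)
        outliers.extend(r for r in group if not down <= r[1] <= up)
    return outliers


for port, qty, ts in find_outliers(rows):
    print(port, qty, ts)
```

On the dummy data the mean is 14.5 and the sample standard deviation is about 7.58, so the bounds are roughly 3.13 and 25.87, and only the rows with qty 2 and 30 fall outside them. Note that the grouping here is by date only, as in the query; grouping by port as well would give each port its own bounds and could flag different rows.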