BQ SQL solution for comparing rows based on variance

Time: 2017-09-22 19:55:17

Tags: sql google-bigquery bigdata gcp

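I'm trying to compare retail item price data in BigQuery (~2-3B rows, depending on the time period and retailer), with the goal of identifying meaningful price differences. For example, $1.99 vs. $2.00 is not meaningful, but $1.99 vs. $2.50 is. "Meaningful" is quantified as a 2% difference between prices.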
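As a minimal sketch of that rule, the check can be written as a relative difference against a reference price. The function name `isMeaningful` is hypothetical, and two details are assumptions the question does not spell out: the difference is measured relative to the lower (reference) price, and the comparison is strictly greater than 2%.

```javascript
// Hypothetical helper: true when `price` is more than `threshold`
// (2% by default) above `reference`, measured relative to `reference`.
// Assumption: strict ">" comparison, as in the answer's UDF.
function isMeaningful(reference, price, threshold = 0.02) {
  return (price - reference) / reference > threshold;
}

console.log(isMeaningful(1.99, 2.00)); // false (about 0.5%)
console.log(isMeaningful(1.99, 2.50)); // true  (about 25.6%)
```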

A sample dataset for one item looks like this:

ITEM       Price($)  Meaningful (This is the column I'm trying to flag) 
Apple      $1.99     Y (lowest price would always be flagged)
Apple      $2.00     N ($1.99 v $2.00)
Apple      $2.01     N ($1.99 v $2.01)  Still using $1.99 for comparison
Apple      $2.50     Y ($1.99 v $2.50)  Still using $1.99 for comparison
Apple      $2.56     Y ($2.50 v $2.56)  Now using $2.50 as new comp. price
Apple      $2.62     Y ($2.56 v $2.62)  Now using $2.56 as new comp. price

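I was hoping to solve this using only SQL window functions (LEAD, LAG, PARTITION BY, etc.), comparing the current row's price with the next row's. However, this breaks down as soon as I hit a non-meaningful price, because I always want to compare the next value against the most recent meaningful price (see the $2.50 row in the example above, which is compared with $1.99 rather than with the $2.01 in the row before it).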
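To make the difference concrete, here is a rough sketch in plain JavaScript contrasting a LAG-style comparison (previous row) with a scan that remembers the last flagged price. Both function names and the `creeping` price list are hypothetical; the list is made up so that every single step stays under 2% while the cumulative drift does not.

```javascript
const THRESHOLD = 0.02; // 2%, taken from the question

// LAG-style: compare each price with the row directly before it.
function flagAgainstPreviousRow(prices) {
  return prices.map((price, i) =>
    i === 0 || (price - prices[i - 1]) / prices[i - 1] > THRESHOLD ? 'Y' : 'N');
}

// Stateful scan: compare each price with the last price that was flagged 'Y'.
function flagAgainstLastMeaningful(prices) {
  let reference = null;
  return prices.map((price) => {
    if (reference === null || (price - reference) / reference > THRESHOLD) {
      reference = price; // this price becomes the new comparison price
      return 'Y';
    }
    return 'N';
  });
}

// Hypothetical prices (not from the question), sorted ascending.
const creeping = [1.00, 1.015, 1.03, 1.045];
console.log(flagAgainstPreviousRow(creeping).join(' '));    // Y N N N
console.log(flagAgainstLastMeaningful(creeping).join(' ')); // Y N Y N
```

On the Apple sample from the question the two approaches happen to agree (Y N N Y Y Y), so a LAG-based query can look correct on that data and still miss slowly creeping prices like the ones above.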

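My questions:

  • Can this be solved with SQL alone in BigQuery? (For example, is there some creative SQL logic I'm overlooking, such as bucketing based on the difference amount?)
  • What programming options do I have, given that I can't use stored procedures in BQ? Python / DataFrames in GCP Datalab? BQ UDFs?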

1 Answer:

Answer 0 (score: 1)

The following is for BigQuery Standard SQL:

  
#standardSQL
CREATE TEMPORARY FUNCTION x(prices ARRAY<FLOAT64>)
RETURNS ARRAY<STRUCT<price FLOAT64, flag STRING>>
LANGUAGE js AS """
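  // Flag each price (sorted ascending) against the last price flagged 'Y'.
  var result = [];
  var last = 0;
  var flag = '';
  for (var i = 0; i < prices.length; i++) {
    if (i == 0) {
      // The lowest price is always flagged.
      last = prices[i];
      flag = 'Y';
    } else if ((prices[i] - last) / last > 0.02) {
      // More than 2% above the reference: flag it and make it the new reference.
      last = prices[i];
      flag = 'Y';
    } else {
      flag = 'N';
    }
    result.push({price: prices[i], flag: flag});
  }
  return result;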
""";
SELECT item, rec.* 
FROM (
  SELECT item, ARRAY_AGG(price ORDER BY price) AS prices
  FROM `yourTable`
  GROUP BY item
), UNNEST(x(prices) ) AS rec
-- ORDER BY item, price  

You can test / play with this using the dummy data from your question:

#standardSQL
CREATE TEMPORARY FUNCTION x(prices ARRAY<FLOAT64>)
RETURNS ARRAY<STRUCT<price FLOAT64, flag STRING>>
LANGUAGE js AS """
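  // Flag each price (sorted ascending) against the last price flagged 'Y'.
  var result = [];
  var last = 0;
  var flag = '';
  for (var i = 0; i < prices.length; i++) {
    if (i == 0) {
      // The lowest price is always flagged.
      last = prices[i];
      flag = 'Y';
    } else if ((prices[i] - last) / last > 0.02) {
      // More than 2% above the reference: flag it and make it the new reference.
      last = prices[i];
      flag = 'Y';
    } else {
      flag = 'N';
    }
    result.push({price: prices[i], flag: flag});
  }
  return result;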
""";
WITH `yourTable` AS (
  SELECT 'Apple' AS item, 1.99 AS price UNION ALL
  SELECT 'Apple', 2.00 UNION ALL
  SELECT 'Apple', 2.01 UNION ALL
  SELECT 'Apple', 2.50 UNION ALL
  SELECT 'Apple', 2.56 UNION ALL
  SELECT 'Apple', 2.62 
)
SELECT item, rec.* 
FROM (
  SELECT item, ARRAY_AGG(price ORDER BY price) AS prices
  FROM `yourTable`
  GROUP BY item
), UNNEST(x(prices) ) AS rec
ORDER BY item, price    

The result is as follows:

item    price   flag     
----    -----   ----
Apple   1.99    Y    
Apple   2.0     N    
Apple   2.01    N    
Apple   2.5     Y    
Apple   2.56    Y    
Apple   2.62    Y
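
For checking the logic outside BigQuery, the overall shape of the query (group by item, sort prices ascending, flag, flatten back to rows) can be approximated in plain JavaScript. This is only a sketch under assumptions: `flagPrices` and `flagRows` are hypothetical names standing in for the UDF and for the GROUP BY / ARRAY_AGG / UNNEST steps, and the input rows are the Apple prices from the question in shuffled order.

```javascript
// Stand-in for the UDF: flag ascending prices against the last flagged price.
function flagPrices(prices, threshold = 0.02) {
  let last = null;
  return prices.map((price) => {
    const meaningful = last === null || (price - last) / last > threshold;
    if (meaningful) last = price;
    return { price, flag: meaningful ? 'Y' : 'N' };
  });
}

// Stand-in for GROUP BY item + ARRAY_AGG(price ORDER BY price) + UNNEST.
function flagRows(rows) {
  const byItem = new Map();
  for (const { item, price } of rows) {
    if (!byItem.has(item)) byItem.set(item, []);
    byItem.get(item).push(price);
  }
  const out = [];
  for (const [item, prices] of byItem) {
    prices.sort((a, b) => a - b); // numeric ascending, like ORDER BY price
    for (const rec of flagPrices(prices)) out.push({ item, ...rec });
  }
  return out;
}

// Apple prices from the question, deliberately out of order.
const rows = [2.50, 1.99, 2.62, 2.01, 2.56, 2.00]
  .map((price) => ({ item: 'Apple', price }));
const flagged = flagRows(rows);
for (const r of flagged) console.log(r.item, r.price, r.flag);
```

The flags come out as Y, N, N, Y, Y, Y in ascending price order, matching the result table above.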