Left join on the latest date in Google BigQuery

Time: 2018-12-11 14:29:53

Tags: sql google-bigquery

I have two tables, both with a timestamp and some more data:

Table A

| name | timestamp           | a_data |
| ---- | ------------------- | ------ |
| 1    | 2018-01-01 11:10:00 | a      |
| 2    | 2018-01-01 12:20:00 | b      |
| 3    | 2018-01-01 13:30:00 | c      |

Table B

| name | timestamp           | b_data |
| ---- | ------------------- | ------ |
| 1    | 2018-01-01 11:00:00 | w      |
| 2    | 2018-01-01 12:00:00 | x      |
| 3    | 2018-01-01 13:00:00 | y      |
| 3    | 2018-01-01 13:10:00 | y      |
| 3    | 2018-01-01 13:10:00 | z      |

What I want to do is:

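  1. For each row in Table A, LEFT JOIN the most recent record in Table B that is earlier than that row.
  2. When more than one record qualifies, take the last one.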

Target result

| name | timestamp           | a_data | b_data |
| ---- | ------------------- | ------ | ------ |
| 1    | 2018-01-01 11:10:00 | a      | w      |
| 2    | 2018-01-01 12:20:00 | b      | x      |
| 3    | 2018-01-01 13:30:00 | c      | z      | <-- note z, not y

I think this involves a subquery, but I can't get it to work in BigQuery. What I have so far:

SELECT a.a_data, b.b_data
FROM `table_a` AS  a  

LEFT JOIN `table_b` AS b 
ON a.name = b.name

WHERE a.timestamp = (
  SELECT max(timestamp) from `table_b` as sub
  WHERE sub.name = b.name
  AND sub.timestamp < a.timestamp
)

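On my actual dataset, which is a very small test set (under 2 MB), the query runs but never finishes. Any pointers appreciated.

4 answers:

Answer 0: (score: 2)

You can try a subquery in the SELECT list.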

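SELECT a.*, (
    -- Note: MAX(b.b_data) returns the largest b_data among the earlier rows,
    -- which is the latest row only when b_data happens to sort that way.
    SELECT MAX(b.b_data)
    FROM `table_b` AS b
    WHERE a.name = b.name
      AND b.timestamp < a.timestamp
) AS b_data
FROM `table_a` AS a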

Edit

Or you can try the ROW_NUMBER window function in a subquery.

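SELECT name, timestamp, a_data, b_data
FROM (
    SELECT a.*, b.b_data,
           -- b.name equals a.name inside a partition, so it cannot break ties;
           -- b.b_data DESC is an assumed tie-breaker that picks z over y.
           ROW_NUMBER() OVER (PARTITION BY a.name ORDER BY b.timestamp DESC, b.b_data DESC) AS rn
    FROM `table_a` AS a
    LEFT JOIN `table_b` AS b ON a.name = b.name AND b.timestamp < a.timestamp
) t1
WHERE rn = 1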
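
For anyone who wants to run this without creating tables, here is a minimal self-contained sketch of the ROW_NUMBER approach. The sample rows are copied from the question into inline CTEs. Two things are my assumptions rather than part of the answer: partitioning by both a.name and a.timestamp (so that table_a may hold several rows per name), and using b.b_data DESC as the tie-breaker between the two 13:10:00 rows, since the tables carry no other column that defines which of them is "last".

```sql
#standardSQL
-- Inline copies of the question's sample tables.
WITH table_a AS (
  SELECT 1 AS name, TIMESTAMP '2018-01-01 11:10:00' AS timestamp, 'a' AS a_data UNION ALL
  SELECT 2, TIMESTAMP '2018-01-01 12:20:00', 'b' UNION ALL
  SELECT 3, TIMESTAMP '2018-01-01 13:30:00', 'c'
),
table_b AS (
  SELECT 1 AS name, TIMESTAMP '2018-01-01 11:00:00' AS timestamp, 'w' AS b_data UNION ALL
  SELECT 2, TIMESTAMP '2018-01-01 12:00:00', 'x' UNION ALL
  SELECT 3, TIMESTAMP '2018-01-01 13:00:00', 'y' UNION ALL
  SELECT 3, TIMESTAMP '2018-01-01 13:10:00', 'y' UNION ALL
  SELECT 3, TIMESTAMP '2018-01-01 13:10:00', 'z'
)
SELECT name, timestamp, a_data, b_data
FROM (
  SELECT
    a.*,
    b.b_data,
    -- Rank the earlier B rows for each A row, newest first.
    -- Assumption: (name, timestamp) identifies a row of table_a.
    ROW_NUMBER() OVER (
      PARTITION BY a.name, a.timestamp
      ORDER BY b.timestamp DESC, b.b_data DESC  -- assumed tie-breaker
    ) AS rn
  FROM table_a AS a
  LEFT JOIN table_b AS b
    ON a.name = b.name AND b.timestamp < a.timestamp
)
WHERE rn = 1
ORDER BY name
```

Because the timestamp condition sits in the ON clause rather than in WHERE, a row of table_a with no earlier row in table_b is still returned, with b_data as NULL.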

Answer 1: (score: 1)

In BigQuery, arrays are often an efficient way to solve this kind of problem:

SELECT a.a_data, b.b_data
FROM `table_a` a LEFT JOIN
     (SELECT b.name,
             -- OFFSET is zero-based, so the single element kept by LIMIT 1 is at OFFSET(0)
             ARRAY_AGG(b.b_data ORDER BY b.timestamp DESC LIMIT 1)[OFFSET(0)] as b_data
      FROM `table_b` b 
      GROUP BY b.name
     ) b
     ON a.name = b.name;
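
Note that the query above picks the latest table_b row per name without comparing it to a.timestamp. If the "earlier than the A row" condition matters, one possible variant is to join first and aggregate afterwards. This is only a sketch under a few assumptions of mine: the columns name, timestamp and a_data together identify a row of table_a, b_data DESC is an acceptable tie-breaker for equal timestamps, and b_data itself is never NULL in table_b.

```sql
#standardSQL
SELECT
  a.name,
  a.timestamp,
  a.a_data,
  -- Sort the earlier B rows newest first and keep only the first element.
  -- IGNORE NULLS drops the NULL produced by an unmatched LEFT JOIN row, and
  -- SAFE_OFFSET(0) then yields NULL instead of an out-of-bounds error.
  -- Caveat: IGNORE NULLS would also skip a genuine NULL b_data value.
  ARRAY_AGG(
    b.b_data IGNORE NULLS
    ORDER BY b.timestamp DESC, b.b_data DESC  -- assumed tie-breaker
    LIMIT 1
  )[SAFE_OFFSET(0)] AS b_data
FROM `table_a` AS a
LEFT JOIN `table_b` AS b
  ON a.name = b.name AND b.timestamp < a.timestamp
GROUP BY a.name, a.timestamp, a.a_data
```

On the sample data from the question this should pair a with w, b with x and c with z.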

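Answer 2: (score: 1)

The following is for BigQuery Standard SQL and does not require listing all columns on both sides, only name and timestamp. So it will work for any number of columns in the two tables (assuming there is no ambiguity in column names other than the two mentioned above).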

#standardSQL
SELECT a.*, b.* EXCEPT (name, timestamp)
FROM (
  SELECT 
    ANY_VALUE(a) a, 
    ARRAY_AGG(b ORDER BY b.timestamp DESC LIMIT 1)[SAFE_OFFSET(0)] b
  FROM `project.dataset.table_a` a
  LEFT JOIN `project.dataset.table_b` b
  USING (name)
  WHERE a.timestamp > b.timestamp  -- note: this filter also drops table_a rows that have no earlier table_b row
  GROUP BY TO_JSON_STRING(a)
)

Answer 3: (score: 1)

This is a common situation where you cannot simply GROUP BY and take the minimum. I suggest the following:

SELECT *
FROM table_a as a inner join (SELECT name, min(timestamp) as timestamp
                              FROM table_b group by 1) as b 
on (a.timestamp = b.timestamp and a.name = b.name)

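This way you restrict the result to the minimum value found in table b.

You can also use a WITH statement to achieve the same thing in a more readable way: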

WITH min_b as (
SELECT name, 
min(timestamp) as timestamp
FROM table_b group by 1
)
SELECT *
FROM table_a as a inner join min_b 
on (a.timestamp = min_b.timestamp and a.name = min_b.name) 

Let me know if it works!