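BigQuery: how to do a semi left join?

Time: 2018-08-17 04:38:32

Tags: google-bigquery

I couldn't come up with a good title for this question. Sorry about that.

I have two tables, A and B. Both have a timestamp, and they share a common ID. Here are the schemas of the two tables: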

Table A:
========
a_id int,
common_id int,
ts timestamp
...

Table B:
========
b_id int,
common_id int,
ts timestamp,
temperature int

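Table A is more like device data, with a row each time the device changes state. Table B is more IoT data, containing the device temperature every minute or so.

What I want to do is create a table C from these two tables. Table C is essentially table A plus the temperature from table B that is closest in time.

How can I do this in BigQuery SQL alone? The temperature information doesn't need to be exact.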

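2 answers:

Answer 0: (score: 2)

The option below (for BigQuery Standard SQL) assumes that, in addition to temperature, you also need all the other values from the matching row in table b.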

#standardSQL
SELECT 
  ARRAY_AGG(
    STRUCT(a_id, a.common_id, a.ts, b_id, b.ts AS b_ts, temperature) 
    ORDER BY ABS(TIMESTAMP_DIFF(a.ts, b.ts, SECOND)) 
    LIMIT 1
  )[SAFE_OFFSET(0)].*
FROM `project.dataset.table_a` a 
LEFT JOIN `project.dataset.table_b` b
ON a.common_id = b.common_id 
AND ABS(TIMESTAMP_DIFF(a.ts, b.ts, MINUTE)) < 30
GROUP BY TO_JSON_STRING(a)

I smoke-tested it with the dummy data generated below:

#standardSQL
WITH `project.dataset.table_a` AS ( 
  SELECT CAST(1000000 * RAND() AS INT64) a_id, common_id, ts
  FROM UNNEST(GENERATE_TIMESTAMP_ARRAY('2018-01-01 00:00:00', '2018-01-01 23:59:59', INTERVAL 45*60 + 27 SECOND)) ts
  CROSS JOIN UNNEST(GENERATE_ARRAY(1, 10)) common_id
), `project.dataset.table_b` AS ( 
  SELECT CAST(1000000 * RAND() AS INT64) b_id, common_id, ts, CAST(60 + 40 * RAND() AS INT64) temperature 
  FROM UNNEST(GENERATE_TIMESTAMP_ARRAY('2018-01-01 00:00:00', '2018-01-01 23:59:59', INTERVAL 1 MINUTE)) ts
  CROSS JOIN UNNEST(GENERATE_ARRAY(1, 10)) common_id
) 
SELECT 
  ARRAY_AGG(
    STRUCT(a_id, a.common_id, a.ts, b_id, b.ts AS b_ts, temperature) 
    ORDER BY ABS(TIMESTAMP_DIFF(a.ts, b.ts, SECOND)) 
    LIMIT 1
  )[SAFE_OFFSET(0)].*
FROM `project.dataset.table_a` a 
LEFT JOIN `project.dataset.table_b` b
ON a.common_id = b.common_id 
AND ABS(TIMESTAMP_DIFF(a.ts, b.ts, MINUTE)) < 30
GROUP BY TO_JSON_STRING(a)  

Here are a few rows of the output as an example:

Row a_id    common_id ts                        b_id    b_ts                    temperature  
1   276623  1         2018-01-01 00:00:00 UTC   166995  2018-01-01 00:00:00 UTC     74   
2   218354  1         2018-01-01 00:45:27 UTC   464901  2018-01-01 00:45:00 UTC     87   
3   265634  1         2018-01-01 01:30:54 UTC   565385  2018-01-01 01:31:00 UTC     87   
4   758075  1         2018-01-01 02:16:21 UTC   55894   2018-01-01 02:16:00 UTC     84   
5   306355  1         2018-01-01 03:01:48 UTC   844429  2018-01-01 03:02:00 UTC     92   
6   348502  1         2018-01-01 03:47:15 UTC   375859  2018-01-01 03:47:00 UTC     90   
7   774920  1         2018-01-01 04:32:42 UTC   438164  2018-01-01 04:33:00 UTC     61   

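Here, table_b holds per-minute temperatures for each of 10 devices over the whole day of '2018-01-01', and table_a holds status changes for the same 10 devices on the same day, one every 45 minutes 27 seconds. a_id and b_id are just random numbers, roughly in the range 0 to 999999.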
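One way to sanity-check the smoke test is to wrap the query and aggregate over its output. The sketch below reuses the dummy-data generators from this answer; the CTE name matched and the output columns row_count, unmatched and max_gap_seconds are illustrative names, not part of the original answer. Because table_b has a reading every minute, the largest gap should come out at no more than 30 seconds, with no unmatched rows.

```sql
#standardSQL
WITH `project.dataset.table_a` AS (
  SELECT CAST(1000000 * RAND() AS INT64) a_id, common_id, ts
  FROM UNNEST(GENERATE_TIMESTAMP_ARRAY('2018-01-01 00:00:00', '2018-01-01 23:59:59', INTERVAL 45*60 + 27 SECOND)) ts
  CROSS JOIN UNNEST(GENERATE_ARRAY(1, 10)) common_id
), `project.dataset.table_b` AS (
  SELECT CAST(1000000 * RAND() AS INT64) b_id, common_id, ts, CAST(60 + 40 * RAND() AS INT64) temperature
  FROM UNNEST(GENERATE_TIMESTAMP_ARRAY('2018-01-01 00:00:00', '2018-01-01 23:59:59', INTERVAL 1 MINUTE)) ts
  CROSS JOIN UNNEST(GENERATE_ARRAY(1, 10)) common_id
), matched AS (
  -- Same nearest-in-time query as in the answer, unchanged.
  SELECT
    ARRAY_AGG(
      STRUCT(a_id, a.common_id, a.ts, b_id, b.ts AS b_ts, temperature)
      ORDER BY ABS(TIMESTAMP_DIFF(a.ts, b.ts, SECOND))
      LIMIT 1
    )[SAFE_OFFSET(0)].*
  FROM `project.dataset.table_a` a
  LEFT JOIN `project.dataset.table_b` b
  ON a.common_id = b.common_id
  AND ABS(TIMESTAMP_DIFF(a.ts, b.ts, MINUTE)) < 30
  GROUP BY TO_JSON_STRING(a)
)
SELECT
  COUNT(*) AS row_count,                                       -- one row per table_a row
  COUNTIF(b_id IS NULL) AS unmatched,                          -- table_a rows with no reading in the window
  MAX(ABS(TIMESTAMP_DIFF(ts, b_ts, SECOND))) AS max_gap_seconds -- worst-case distance to the chosen reading
FROM matched
```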

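Note: the ABS(TIMESTAMP_DIFF(a.ts, b.ts, MINUTE)) < 30 condition in the JOIN clause controls the period within which the closest ts is searched for (in case some IoT entries are missing from table_b).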
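To see the effect of that window, here is a minimal sketch on hypothetical toy data (the two-row tables and their values are made up for illustration). Device 1 has a reading 5 minutes away from its status change, while device 2 only has one 2 hours away, which falls outside the 30-minute window.

```sql
#standardSQL
WITH table_a AS (
  SELECT 1 AS a_id, 1 AS common_id, TIMESTAMP('2018-01-01 10:00:00') AS ts UNION ALL
  SELECT 2, 2, TIMESTAMP('2018-01-01 10:00:00')
), table_b AS (
  SELECT 11 AS b_id, 1 AS common_id, TIMESTAMP('2018-01-01 10:05:00') AS ts, 70 AS temperature UNION ALL
  SELECT 12, 2, TIMESTAMP('2018-01-01 12:00:00'), 80   -- 2 hours away: outside the window
)
SELECT
  ARRAY_AGG(
    STRUCT(a_id, a.common_id, a.ts, b_id, b.ts AS b_ts, temperature)
    ORDER BY ABS(TIMESTAMP_DIFF(a.ts, b.ts, SECOND))
    LIMIT 1
  )[SAFE_OFFSET(0)].*
FROM table_a a
LEFT JOIN table_b b
ON a.common_id = b.common_id
AND ABS(TIMESTAMP_DIFF(a.ts, b.ts, MINUTE)) < 30   -- widen or narrow the search window here
GROUP BY TO_JSON_STRING(a)
```

Because this is a LEFT JOIN, the row with a_id 2 should still appear in the result, with b_id, b_ts and temperature set to NULL, while a_id 1 picks up temperature 70. Raising the bound (for example to 180) would let a_id 2 match the reading at 12:00.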

Answer 1: (score: 1)

Measure the closest time with TIMESTAMP_DIFF(a.ts, b.ts, SECOND), and take its absolute value to get the closest in either direction:

WITH a AS (
  SELECT 1 id, TIMESTAMP('2018-01-01 11:01:00') ts
  UNION ALL SELECT 1, TIMESTAMP('2018-01-02 10:00:00')
  UNION ALL SELECT 2, TIMESTAMP('2018-01-02 10:00:00')
), b AS (
  SELECT 1 id, TIMESTAMP('2018-01-01 12:01:00') ts, 43 temp
  UNION ALL SELECT 1, TIMESTAMP('2018-01-01 12:06:00'), 47
)
SELECT *,
  (SELECT temp FROM b
   WHERE a.id = b.id
   ORDER BY ABS(TIMESTAMP_DIFF(a.ts, b.ts, SECOND))
   LIMIT 1) temp
FROM a
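The role of the absolute value can be checked in isolation. This small sketch uses the timestamps of the first rows of a and b from this answer; the column aliases a_ts, b_ts, signed_diff, reversed_diff and distance are illustrative.

```sql
#standardSQL
SELECT
  TIMESTAMP_DIFF(a_ts, b_ts, SECOND) AS signed_diff,      -- negative: a_ts is earlier than b_ts
  TIMESTAMP_DIFF(b_ts, a_ts, SECOND) AS reversed_diff,    -- positive: same gap, opposite sign
  ABS(TIMESTAMP_DIFF(a_ts, b_ts, SECOND)) AS distance     -- direction-independent ordering key
FROM (
  SELECT TIMESTAMP('2018-01-01 11:01:00') AS a_ts,
         TIMESTAMP('2018-01-01 12:01:00') AS b_ts
)
```

The two timestamps are one hour apart, so signed_diff is -3600, reversed_diff is 3600, and distance is 3600. Ordering by distance treats readings before and after the status change the same way.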
