Question

I have two tables, each with a user_id column and a group_name column.

For example:

table1:

| user_id | group_name1|
------------------------
|    1    |   'groupA' |
|    1    |   'groupB' |
|    2    |   'groupA' |
|    1    |   'groupA' |
------------------------


table2:

| user_id | group_name2|
------------------------
|    1    |   'groupL' |
|    2    |   'groupL' |
|    3    |   'groupL' |
|    4    |   'groupN' |
|    1    |   'groupN' |
|    3    |   'groupN' |
------------------------

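I am trying to build a distribution of how many times the users in table2 show up in table1, computed within each pair of groups.

For the example above, I would get: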

| times_show_up | number_of_users | group_name1 | group_name2 |
---------------------------------------------------------------
|      0        |       1         |    groupA   |    groupL   |
|      1        |       1         |    groupA   |    groupL   |
|      2        |       1         |    groupA   |    groupL   |
|      0        |       2         |    groupB   |    groupL   |
|      1        |       1         |    groupB   |    groupL   |
|      2        |       0         |    groupB   |    groupL   |
|      0        |       2         |    groupA   |    groupN   |
|      1        |       0         |    groupA   |    groupN   |
|      2        |       1         |    groupA   |    groupN   |
|      0        |       2         |    groupB   |    groupN   |
|      1        |       1         |    groupB   |    groupN   |
|      2        |       0         |    groupB   |    groupN   |
---------------------------------------------------------------

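To explain a few rows as further examples:

For row 1, the user with user_id = 3 in groupL shows up 0 times in groupA. For row 2, the user with user_id = 2 in groupL shows up once in groupA. For row 3, the user with user_id = 1 in groupL shows up twice in groupA.

Although a user shows up at most 2 times in this example, in the real data the maximum is an arbitrarily large number that I do not know in advance.

The other group combinations work the same way, assuming I filled everything in correctly.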
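
As a sanity check on the expected output, here is a minimal Python sketch that computes the same distribution in memory from the sample rows. The function name count_distribution and the list-of-tuples layout are assumptions made for illustration. It is only meant to pin down the semantics on small data, not to scale to tens of millions of rows.

```python
from collections import Counter

# Sample rows from the question: (user_id, group_name)
table1 = [(1, "groupA"), (1, "groupB"), (2, "groupA"), (1, "groupA")]
table2 = [(1, "groupL"), (2, "groupL"), (3, "groupL"),
          (4, "groupN"), (1, "groupN"), (3, "groupN")]


def count_distribution(t1, t2):
    """Map (times_show_up, group_name1, group_name2) -> number_of_users."""
    # How often each user appears in each table1 group.
    appearances = Counter(t1)  # key: (user_id, group_name1)
    groups1 = sorted({group for _, group in t1})
    distribution = Counter()
    # Deduplicate table2 so each user counts once per group_name2.
    for user_id, group2 in set(t2):
        for group1 in groups1:
            # A missing key yields 0, which is exactly the zero-count case.
            times = appearances[(user_id, group1)]
            distribution[(times, group1, group2)] += 1
    return distribution


dist = count_distribution(table1, table2)
# Print in the same order as the expected table: group_name2, group_name1, times.
for (times, group1, group2), users in sorted(
        dist.items(), key=lambda item: (item[0][2], item[0][1], item[0][0])):
    print(times, users, group1, group2)
```

Combinations with no users (for example times_show_up = 2 for groupB and groupL) are simply absent from the result, which corresponds to omitting the rows where number_of_users is 0.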

I came up with a query that does all of this except count the 0s:

    SELECT 
        COUNT(user_id) AS num_users,
        times_show_up,
        group_name1,
        group_name2
    FROM
    (
        SELECT 
            user_id, 
            COUNT(*) AS times_show_up,
            group_name1, 
            group_name2
        FROM
            table1
        RIGHT JOIN
            (SELECT DISTINCT user_id, group_name2 FROM table2)
        USING(user_id)
        GROUP BY user_id, group_name1, group_name2
    )
    GROUP BY times_show_up, group_name1, group_name2

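Unfortunately, this does not return the 0 counts in the times_show_up column, and I have not found a solution that avoids a large number of subqueries. One possible approach is to run a separate subquery per group combination that fetches only the 0s, and then UNION those rows onto the rest of the result. However, because the number of groups is very large, I want to avoid any approach that needs a subquery for every possible group1, group2 combination.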
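
To see where the zeros go, the sketch below runs a port of the query above against the sample data. It uses an in-memory SQLite database as a stand-in for BigQuery, and rewrites the RIGHT JOIN as the equivalent LEFT JOIN starting from table2, because older SQLite versions lack RIGHT JOIN; both choices are assumptions of this sketch, not part of the original query.

```python
import sqlite3

# In-memory SQLite database standing in for BigQuery (assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (user_id INTEGER, group_name1 TEXT);
    CREATE TABLE table2 (user_id INTEGER, group_name2 TEXT);
    INSERT INTO table1 VALUES (1, 'groupA'), (1, 'groupB'), (2, 'groupA'), (1, 'groupA');
    INSERT INTO table2 VALUES (1, 'groupL'), (2, 'groupL'), (3, 'groupL'),
                              (4, 'groupN'), (1, 'groupN'), (3, 'groupN');
""")

# Same logic as the question's query, with the join direction flipped.
query = """
    SELECT times_show_up, COUNT(user_id) AS num_users, group_name1, group_name2
    FROM (
        SELECT t2.user_id AS user_id,
               COUNT(*) AS times_show_up,
               t1.group_name1 AS group_name1,
               t2.group_name2 AS group_name2
        FROM (SELECT DISTINCT user_id, group_name2 FROM table2) AS t2
        LEFT JOIN table1 AS t1 ON t1.user_id = t2.user_id
        GROUP BY t2.user_id, t1.group_name1, t2.group_name2
    )
    GROUP BY times_show_up, group_name1, group_name2
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Users of table2 with no match in table1 come back as a single row with group_name1 set to NULL, and COUNT(*) counts that row as 1. A user who appears in groupA but not in groupB also never produces a groupB row at all, so no (group_name1, 0) row can come out of this join.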

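Some constraints: PARTITION BY tends to run out of memory on this dataset, so I would like to avoid it. Updated requirement: a CROSS JOIN at the per-user level (that is, cross joining table1 directly with table2 without grouping the rows first) also does not work, because each table has tens of millions of rows.

Finally, rows where number_of_users would be 0 do not have to show up (it is fine if they do, since they can be removed with a simple WHERE, but they are not required if omitting them helps the query).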

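Update:

I was able to come up with a query that generates the zeros and needs only one subquery per group_name1, rather than one per group_name1, group_name2 combination. I am adding it to the question in case someone has a less query-heavy answer, because the number of groups in table1 could still exceed 20, which means 20+ queries combined via UNION ALL.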

Answer 1

Below is a version for BigQuery Standard SQL, which ended up being relatively simple:

#standardSQL
SELECT times_show_up, 
  COUNT(DISTINCT user_id) number_of_users, 
  group_name1, group_name2
FROM (
  SELECT COUNTIF(a.user_id = b.user_id) times_show_up, 
    b.user_id, 
    group_name1, group_name2
  FROM table1 a
  CROSS JOIN table2 b
  GROUP BY user_id, group_name1, group_name2
)
GROUP BY times_show_up, group_name1, group_name2
-- ORDER BY group_name2, group_name1, times_show_up

Applied to the sample data from your question, the result is:

Row times_show_up   number_of_users group_name1 group_name2  
1   0               1               groupA      groupL   
2   1               1               groupA      groupL   
3   2               1               groupA      groupL   
4   0               2               groupB      groupL   
5   1               1               groupB      groupL   
6   0               2               groupA      groupN   
7   2               1               groupA      groupN   
8   0               2               groupB      groupN   
9   1               1               groupB      groupN

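... rows where number_of_users would be 0 do not have to show up

Note: I followed this rule, since it looks like you plan to eliminate such rows anyway if they appear in the result.

Update for: ... each table has tens of millions of rows.

Try the "optimized" version below: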

#standardSQL
SELECT times_show_up, 
  COUNT(DISTINCT user_id) number_of_users, 
  group_name1, group_name2
FROM (
  SELECT SUM(IF(a.user_id = b.user_id, cnt, 0)) times_show_up, 
    b.user_id, 
    group_name1, group_name2
  FROM (SELECT user_id, group_name1, COUNT(1) cnt FROM table1 GROUP BY user_id, group_name1) a
  CROSS JOIN (SELECT DISTINCT user_id, group_name2 FROM table2) b
  GROUP BY user_id, group_name1, group_name2
)
GROUP BY times_show_up, group_name1, group_name2

I have no relevant data to test with, so I cannot say whether this will help with your specific data.
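
For readers who want to check the pre-aggregated version against the sample data, here is a minimal sketch that runs a port of it in an in-memory SQLite database. SQLite is only a stand-in for BigQuery here, and IF(...) is replaced by the portable CASE WHEN form; both are assumptions of this sketch. The ORDER BY from the commented-out line of the first query is enabled so the output order is deterministic.

```python
import sqlite3

# In-memory SQLite database standing in for BigQuery (assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (user_id INTEGER, group_name1 TEXT);
    CREATE TABLE table2 (user_id INTEGER, group_name2 TEXT);
    INSERT INTO table1 VALUES (1, 'groupA'), (1, 'groupB'), (2, 'groupA'), (1, 'groupA');
    INSERT INTO table2 VALUES (1, 'groupL'), (2, 'groupL'), (3, 'groupL'),
                              (4, 'groupN'), (1, 'groupN'), (3, 'groupN');
""")

query = """
    SELECT times_show_up,
           COUNT(DISTINCT user_id) AS number_of_users,
           group_name1, group_name2
    FROM (
        -- Sum the pre-aggregated count only where the user ids match;
        -- every other pairing contributes 0, which yields the zero rows.
        SELECT SUM(CASE WHEN a.user_id = b.user_id THEN a.cnt ELSE 0 END) AS times_show_up,
               b.user_id AS user_id,
               a.group_name1 AS group_name1,
               b.group_name2 AS group_name2
        FROM (SELECT user_id, group_name1, COUNT(*) AS cnt
              FROM table1 GROUP BY user_id, group_name1) AS a
        CROSS JOIN (SELECT DISTINCT user_id, group_name2 FROM table2) AS b
        GROUP BY b.user_id, a.group_name1, b.group_name2
    )
    GROUP BY times_show_up, group_name1, group_name2
    ORDER BY group_name2, group_name1, times_show_up
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

On the sample data this returns the same nine rows as the result table shown earlier in this answer. The cross join still pairs every distinct (user_id, group_name1) row with every distinct (user_id, group_name2) row, so the intermediate size grows with the product of those two counts.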

Answer 2

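Here is the strategy.

Generate the rows using cross join.
For this, use select distinct to get the groups.
Use a derived table to generate times_show_up.
Aggregate table1 and table2.
Join them together.

Here is the query: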

select g1.group_name1, g2.group_name2, tsu.times_show_up,
       coalesce(t12.cnt, 0) as num_users
from (select distinct group_name1 from table1) g1 cross join
     (select distinct group_name2 from table2) g2 cross join
     (select 0 as times_show_up union all
      select 1 union all
      select 2
     ) tsu left join
     (select t1.group_name1, t2.group_name2, count(*) as cnt
      from table1 t1 join
           table2 t2
           on t2.user_id = t1.user_id
      group by t1.group_name1, t2.group_name2
     ) t12
     on t12.group_name1 = g1.group_name1 and
        t12.group_name2 = g2.group_name2 and
        t12.cnt = tsu.times_show_up;

If your data really does have duplicates, you may want to use count(distinct user_id) instead of count(*) in the subquery.

Answer 3

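@Mikhail Berlyant's answer meets the original requirements of my question. Unfortunately, because it relies on a cross join at the user_id level and there are tens of millions of user IDs, it takes too long to finish for my particular use case. I am therefore posting the answer below, which is faster but requires an additional subquery for every group in table1 (though not for every combination of group1 and group2). That makes the query less concise, and if the number of groups is very large it may exceed BigQuery's query size limit.

If you can generate the per-group subqueries programmatically and you have few groups with millions of users, this approach is preferable. @Mikhail Berlyant's answer should work better when there are more groups with fewer users, or when the query cannot be generated programmatically and you would have to write one subquery per group by hand.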

SELECT * FROM
    (SELECT 
        times_show_up,
        COUNT(user_id) AS num_users,
        group_name1,
        group_name2
    FROM
    (
        SELECT 
            user_id, 
            COUNT(*) AS times_show_up,
            group_name1, 
            group_name2
        FROM
            table1
        INNER JOIN
            (SELECT DISTINCT user_id, group_name2 FROM table2) t2
        USING(user_id)
        GROUP BY user_id, group_name1, group_name2
    ) t1
    GROUP BY times_show_up, group_name1, group_name2) t9
    # Each subsequent query being UNIONed corresponds to a group in table 1
    UNION ALL
    (SELECT
       0 AS times_show_up,
       SUM(CASE WHEN t1.user_id IS NULL 
           THEN 1 ELSE 0 END) AS num_users,
       'groupA' AS group_name1,
       group_name2
     FROM
       table2
     LEFT JOIN
       (SELECT user_id FROM table1 WHERE group_name1 = 'groupA') t1
     USING(user_id)
     GROUP BY group_name2)
     UNION ALL
    (SELECT
       0 AS times_show_up,
       SUM(CASE WHEN t1.user_id IS NULL 
           THEN 1 ELSE 0 END) AS num_users,
       'groupB' AS group_name1,
       group_name2
     FROM
       table2
     LEFT JOIN
       (SELECT user_id FROM table1 WHERE group_name1 = 'groupB') t1
     USING(user_id)
     GROUP BY group_name2)
     -- ORDER BY group_name1, group_name2, times_show_up
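
Since this approach needs one zero-count branch per table1 group, the branches can be generated rather than written by hand. The following is a minimal Python sketch of that idea. The names ZERO_BRANCH, zero_branch and zero_branches are hypothetical, and restricting group names to letters, digits and underscores is a simplification of this sketch to sidestep string-literal quoting. The list of groups is assumed to come from something like SELECT DISTINCT group_name1 FROM table1.

```python
import re

# Template mirroring one zero-count branch of the query above.
ZERO_BRANCH = """(SELECT
   0 AS times_show_up,
   SUM(CASE WHEN t1.user_id IS NULL
       THEN 1 ELSE 0 END) AS num_users,
   '{group}' AS group_name1,
   group_name2
 FROM
   table2
 LEFT JOIN
   (SELECT user_id FROM table1 WHERE group_name1 = '{group}') t1
 USING(user_id)
 GROUP BY group_name2)"""


def zero_branch(group_name1):
    """Return the zero-count subquery for a single table1 group."""
    # Only plain names are accepted so the value can be inlined safely
    # (a restriction of this sketch, not of BigQuery).
    if not re.fullmatch(r"[A-Za-z0-9_]+", group_name1):
        raise ValueError(f"unsupported group name: {group_name1!r}")
    return ZERO_BRANCH.format(group=group_name1)


def zero_branches(group_names):
    """Join one zero-count subquery per group with UNION ALL."""
    return "\nUNION ALL\n".join(zero_branch(name) for name in group_names)


print(zero_branches(["groupA", "groupB"]))
```

The generated text would be appended after the first UNION ALL in the query above, following the main subquery aliased t9.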

Count rows that do not match between two tables within groups
