BigQuery: find the most common value when another column is the same

Time: 2018-09-17 14:52:37

Tags: sql group-by google-bigquery

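I want to add a column New_Family_id and fill it with the most common FamilyId among rows that share the same ProductTitleNL. The desired output looks like this: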

Row GlobalId            ProductTitleNL      FamilyId            New_Family_id
1   9200000005045711    ! at Gun Point...   9200000005045710    9200000011427871
2   9200000003809684    ! at Gun Point...   9200000011427871    9200000011427871
3   9200000011427872    ! at Gun Point...   9200000011427871    9200000011427871
4   1001004011099420    Russian Dat         34388968            34388968
5   1001004011099421    Russian Dat         35434738            34388968
6   9200000000530359    !!Nos Vemos!        9200000000530358    9200000000530358
7   9200000000530343    !!Nos Vemos!        9200000000530342    9200000000530358

I have tried several GROUP BY approaches, but none of them worked.

This is what I have so far:

SELECT a.GlobalId, a.ProductTitleNL, a.FamilyId, a.Language, b.aantal_T
FROM table1 as a

JOIN (SELECT ProductTitleNL, COUNT(ProductTitleNL) as aantal_T
FROM table1
Group by ProductTitleNL
HAVING aantal_T >= 2) b
ON a.ProductTitleNL = b.ProductTitleNL

Group by a.GlobalId, a.ProductTitleNL, a.FamilyId, a.Language, b.aantal_T
Order by a.ProductTitleNL;

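Thanks in advance for your help!

1 answer:

Answer 0 (score: 1)

The query below is for BigQuery Standard SQL. The inner query uses ARRAY_AGG as an analytic function to attach to every row an array of all FamilyId values that share its ProductTitleNL. The outer query then unnests that array, groups by value, and keeps the value with the highest count.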

#standardSQL
SELECT * EXCEPT(ids), 
  (SELECT id FROM UNNEST(ids) id GROUP BY id ORDER BY COUNT(1) DESC LIMIT 1) New_Family_id
FROM (
  SELECT *, ARRAY_AGG(FamilyId) OVER(PARTITION BY ProductTitleNL) ids
  FROM `project.dataset.table`
)

You can test and play with it using the dummy data from your question:

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 9200000005045711 GlobalId, '! at Gun Point...' ProductTitleNL, 9200000005045710 FamilyId UNION ALL
  SELECT 9200000003809684, '! at Gun Point...', 9200000011427871 UNION ALL
  SELECT 9200000011427872, '! at Gun Point...', 9200000011427871 UNION ALL
  SELECT 1001004011099420, 'Russian Dat', 34388968 UNION ALL
  SELECT 1001004011099421, 'Russian Dat', 35434738 UNION ALL
  SELECT 9200000000530359, '!!Nos Vemos!', 9200000000530358 UNION ALL
  SELECT 9200000000530343, '!!Nos Vemos!', 9200000000530342 
)
SELECT * EXCEPT(ids), 
  (SELECT id FROM UNNEST(ids) id GROUP BY id ORDER BY COUNT(1) DESC LIMIT 1) New_Family_id
FROM (
  SELECT *, ARRAY_AGG(FamilyId) OVER(PARTITION BY ProductTitleNL) ids
  FROM `project.dataset.table`
)   

Result:

Row GlobalId            ProductTitleNL      FamilyId            New_Family_id
1   9200000005045711    ! at Gun Point...   9200000005045710    9200000011427871
2   9200000003809684    ! at Gun Point...   9200000011427871    9200000011427871
3   9200000011427872    ! at Gun Point...   9200000011427871    9200000011427871
4   9200000000530359    !!Nos Vemos!        9200000000530358    9200000000530358
5   9200000000530343    !!Nos Vemos!        9200000000530342    9200000000530358
6   1001004011099420    Russian Dat         34388968            34388968
7   1001004011099421    Russian Dat         35434738            34388968
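
One caveat: for 'Russian Dat' and '!!Nos Vemos!' each FamilyId occurs exactly once, so there is no single most common value. ORDER BY COUNT(1) DESC LIMIT 1 then picks one of the tied values arbitrarily, and the choice may differ between runs. If a reproducible result matters, a secondary sort key can be added. The sketch below assumes the smallest FamilyId should win a tie; that rule is an illustration only and is not stated in the question.

```sql
#standardSQL
-- Sample rows copied from the question.
WITH `project.dataset.table` AS (
  SELECT 9200000005045711 GlobalId, '! at Gun Point...' ProductTitleNL, 9200000005045710 FamilyId UNION ALL
  SELECT 9200000003809684, '! at Gun Point...', 9200000011427871 UNION ALL
  SELECT 9200000011427872, '! at Gun Point...', 9200000011427871 UNION ALL
  SELECT 1001004011099420, 'Russian Dat', 34388968 UNION ALL
  SELECT 1001004011099421, 'Russian Dat', 35434738 UNION ALL
  SELECT 9200000000530359, '!!Nos Vemos!', 9200000000530358 UNION ALL
  SELECT 9200000000530343, '!!Nos Vemos!', 9200000000530342
)
SELECT * EXCEPT(ids),
  -- Highest count first; on a tie, the smallest FamilyId wins
  -- (assumed tie-break rule, not part of the original question).
  (SELECT id
   FROM UNNEST(ids) id
   GROUP BY id
   ORDER BY COUNT(1) DESC, id
   LIMIT 1) New_Family_id
FROM (
  -- Attach to each row the array of all FamilyIds sharing its title.
  SELECT *, ARRAY_AGG(FamilyId) OVER(PARTITION BY ProductTitleNL) ids
  FROM `project.dataset.table`
)
ORDER BY ProductTitleNL, GlobalId
```

Under this rule the result for '!!Nos Vemos!' would be 9200000000530342 rather than the 9200000000530358 shown in the question, so pick a tie-break key that matches what you actually want, for example a timestamp or priority column if your table has one.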