BigQuery returns duplicate rows and a wrong count

Time: 2017-09-28 17:16:36

Tags: google-bigquery

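I am running a BigQuery query directly in the web UI, and the results come back duplicated (there is one extra copy of every row), giving 120 results. I also tested select count(*) with the same statement and still got 120. Even when I download the results as a CSV file to local disk, the data is still duplicated. I have looked around but could not find any useful pointers. Any suggestions?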

  1. id 1
  2. name John
  3. state VA
  4. keyword aging
  5. title
  6. budget_start 2016-01-30
  7. budget_end 2018-03-31
  8. total_cost 250000.0

id is required; the other columns are nullable. budget_start and budget_end are of type date, total_cost is a float, and the remaining columns are strings.

2 Answers:

Answer 0: (score: 1)

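From your query, it is apparent that you are using BigQuery Legacy SQL. A peculiarity of Legacy SQL is that its output gets flattened: if a result row contains nested (repeated) values, it is expanded into one output row per value.

See the example below: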

#legacySQL
SELECT id, NEST(x) AS xs
FROM 
(SELECT 1 AS id, 2 AS x),
(SELECT 1 AS id, 3 AS x),
(SELECT 1 AS id, 4 AS x),
(SELECT 2 AS id, 5 AS x),
(SELECT 2 AS id, 6 AS x)
GROUP BY id  

It produces two rows, as shown below:

Row id  xs  
1   1   [2,3,4]  
2   2   [5,6]  

You can check this by running the query with a destination table and then previewing that table.

Now, if you run the same query in the Web UI (in Legacy SQL), you will get 5 rows instead of the "expected" 2 rows:

Row id  xs   
1   1   2    
2   1   3    
3   1   4    
4   2   5    
5   2   6      

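Please note: flattening happens only at the final, outermost level of the query; subqueries are not flattened. For example, the query below gives you count = 2, as you would expect: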

#legacySQL
SELECT COUNT(1) AS cnt FROM (
  SELECT id, NEST(x) AS xs
  FROM 
  (SELECT 1 AS id, 2 AS x),
  (SELECT 1 AS id, 3 AS x),
  (SELECT 1 AS id, 4 AS x),
  (SELECT 2 AS id, 5 AS x),
  (SELECT 2 AS id, 6 AS x)
  GROUP BY id
)  


Row cnt  
1   2    

So, to address this, I recommend that you migrate to BigQuery Standard SQL.

See the equivalent example in BigQuery Standard SQL:

#standardSQL
WITH `yourTable` AS (
  SELECT 1 AS id, [2,3,4] AS xs UNION ALL
  SELECT 2, [5,6]
)
SELECT * FROM `yourTable`

The output is just two rows, as one would expect:

Row id  xs   
1   1   2    
        3    
        4    
2   2   5    
        6    
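
The Standard SQL example above starts from rows that already hold arrays. As a minimal sketch of how the Legacy SQL NEST(x) ... GROUP BY id query might be migrated, the version below builds the arrays with ARRAY_AGG. The CTE name `pairs` and the extra `input_rows` column are illustrative assumptions, not part of the original query.

```sql
#standardSQL
-- Sketch only: `pairs` is a hypothetical inline table holding the same
-- five (id, x) rows used in the Legacy SQL example.
WITH pairs AS (
  SELECT 1 AS id, 2 AS x UNION ALL
  SELECT 1, 3 UNION ALL
  SELECT 1, 4 UNION ALL
  SELECT 2, 5 UNION ALL
  SELECT 2, 6
)
SELECT
  id,
  ARRAY_AGG(x ORDER BY x) AS xs,  -- one array per id, the role NEST(x) played
  COUNT(*) AS input_rows          -- how many input rows were collapsed into this id
FROM pairs
GROUP BY id
ORDER BY id
```

In Standard SQL the result is not flattened, so this should return one row per id, with the individual values kept inside the xs array.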

Answer 1: (score: 0)

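Many thanks to Mikhail for the insightful suggestion! I actually found the problem: I had imported the same table from Google Storage twice (I found some errors on the first import, corrected them, and loaded again). That left me with a table containing duplicated content; I thought the first load had been replaced, but the second load was actually appended to it, and I did not realize it.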
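
For anyone who wants to check for this situation, a query along the following lines can show whether a table was loaded more than once. It is only a sketch: `yourDataset.yourTable` is a placeholder name, and it assumes that id is meant to be unique in the source data, which the question does not state explicitly.

```sql
#standardSQL
-- Sketch only: `yourDataset.yourTable` is a placeholder, and id is
-- assumed to be unique in the source file.
SELECT
  id,
  COUNT(*) AS copies  -- 2 would mean the row was loaded twice
FROM `yourDataset.yourTable`
GROUP BY id
HAVING COUNT(*) > 1   -- keep only ids that occur more than once
ORDER BY id
```

If every id comes back with copies = 2, the table was most likely appended to rather than replaced by the second load.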