BigQuery: union result depends on subselect

Date: 2015-05-27 17:25:19

Tags: google-bigquery

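I have two tables (light and enhanced) with slightly different schemas. The second table (enhanced) has an additional nullable field t. I want to group by two fields (d, p) and get, for each group, the row counts from the first and the second table.

To do this, I use a union of subselects with a star: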

select
  d
  , p
  ,count(1) - count(t) light_cnt
  ,count(t) enhanced_cnt
from 
  (select * from light)
  , (select * from enhanced)
group by
  d, p

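But this query returns wrong counts (roughly twice the expected values). It happens only when I group by two fields; grouping by a single field works fine. When I wrap the union in another subselect, it works correctly: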

select
  d
  , p
  ,count(1) - count(t) light_cnt
  ,count(t) enhanced_cnt
from
  (select * from 
      (select * from light)
      , (select * from enhanced)
  )
group by
  d, p

Is this a bug, or am I doing something wrong?

EDIT:

I have reproduced the same broken behavior using where instead of group by:
select count(1) from enhanced where p = 124

returns 292

select count(1) from light where p = 124

returns 12512

select count(1)
from (select * from light), (select * from enhanced)
where p = 124

returns 12804, which is correct (12512 + 292), whereas

select count(1), count(t)
from (select * from light), (select * from enhanced)
where p = 124

returns 24527, 501 ... very strange. It looks like a bug.

Workaround:

select count(1), count(t)
from (select * from (select * from light), (select * from enhanced))
where p = 124

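returns 12804, 292. Correct.

Both light and enhanced have complex schemas inherited from Avro, with records and repeated fields. For simplicity, the fields p and t are written in abbreviated form in the selects above. The real paths are p -> record.record.record.id (the leaf is an integer) and t -> record.time (the leaf is an integer). None of the records on the paths to p and t are repeated. All of them are nullable.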

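1 Answer:

Answer 0 (score: 2)

See "Difference in statistics from Google Analytics Report and BigQuery Data in Hive table". You are probably hitting a similar problem, because you are flattening data from 2 repeated columns, which produces n * m results.

A query that demonstrates this: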

SELECT col, x FROM (
  SELECT "wrong" col, SUM(totals.pageviews) x
  FROM (FLATTEN ([google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910], hits))
), (
  SELECT "correct" col, SUM(totals.pageviews) x
  FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
)

col     x    
wrong   2262     
correct 249 
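
The n * m effect can be modeled outside BigQuery. The following is a minimal Python sketch, not BigQuery's actual implementation: the record layout and the helper name flatten_both are hypothetical, and the sketch only assumes that flattening two independent repeated fields pairs every value of one with every value of the other, so that any scalar on the parent row is repeated once per pair.

```python
from itertools import product

def flatten_both(record):
    """Toy model: pair every 'hits' value with every 'dims' value.

    An empty repeated field is treated as a single NULL (None) so the
    parent row does not disappear.
    """
    hits = record["hits"] or [None]
    dims = record["dims"] or [None]
    return [
        {"id": record["id"], "pageviews": record["pageviews"], "hit": h, "dim": d}
        for h, d in product(hits, dims)
    ]

# One hypothetical session with 3 hits and 2 custom dimensions.
record = {"id": 1, "pageviews": 5, "hits": [10, 11, 12], "dims": ["a", "b"]}

rows = flatten_both(record)
row_count = len(rows)                        # n * m rows instead of 1
summed = sum(r["pageviews"] for r in rows)   # scalar repeated on every row

print(row_count, summed)
```

In this model a single session turns into 3 * 2 rows, and a sum over the session-level pageviews value is inflated by the same factor, which is the same kind of discrepancy as "wrong" versus "correct" above.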

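Here is a query that shows what you are experiencing. All the resulting numbers should be the same, but they are not: the results depend on the FLATTEN() choices that BigQuery makes. If you want correct numbers, state explicitly how BigQuery should flatten the tables before counting: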

SELECT a, b, c    FROM (
SELECT COUNT(1)  a
FROM (SELECT * FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] ), (SELECT * FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] )
),(
SELECT COUNT(customDimensions.index) b
FROM (SELECT * FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] ), (SELECT * FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] )
),(
SELECT COUNT(hits.hitNumber) c
FROM (SELECT * FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] ), (SELECT * FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] )
),(
SELECT COUNT(1)  a, COUNT(customDimensions.index) b, COUNT(hits.hitNumber) c
FROM (SELECT * FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] ), (SELECT * FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] )
),(
SELECT COUNT(1)  a, COUNT(customDimensions.index) b
FROM (SELECT * FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] ), (SELECT * FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] )
),(
SELECT COUNT(1)  a, COUNT(hits.hitNumber) c
FROM (SELECT * FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] ), (SELECT * FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] )
),(
SELECT COUNT(customDimensions.index) b, COUNT(hits.hitNumber) c
FROM (SELECT * FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] ), (SELECT * FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] )
),(
SELECT COUNT(1) a, COUNT(customDimensions.index) b, COUNT(hits.hitNumber) c
FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910], [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910])

Row a   b   c    
1   126          
2       510      
3           724  
4   126 102 766  
5   726 510      
6   126     724  
7       102 766  
8   126 102 724
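
Why the counts differ from row to row can be illustrated with a small Python sketch. This is a toy model under stated assumptions, not a reproduction of BigQuery's planner: the sample sessions and the helpers flatten and count are hypothetical, and the model only assumes that a count is taken over whatever rows exist after one chosen repeated field has been flattened.

```python
def flatten(records, field):
    """Toy FLATTEN: emit one row per value of the repeated field.

    An empty repeated field yields a single row with None in that field.
    """
    out = []
    for rec in records:
        for value in (rec[field] or [None]):
            row = dict(rec)
            row[field] = value
            out.append(row)
    return out

def count(rows, field=None):
    """COUNT(1) when field is None, otherwise COUNT(field) (non-NULL only)."""
    if field is None:
        return len(rows)
    return sum(1 for r in rows if r[field] is not None)

# Two hypothetical sessions with repeated fields of different lengths.
sessions = [
    {"id": 1, "dims": ["a", "b"], "hits": [1, 2, 3]},
    {"id": 2, "dims": [], "hits": [1]},
]

unflattened = count(sessions)                  # one row per session
by_dims = count(flatten(sessions, "dims"))     # rows after flattening dims
by_hits = count(flatten(sessions, "hits"))     # rows after flattening hits

print(unflattened, by_dims, by_hits)
```

The same COUNT(1) gives three different answers depending on which repeated field, if any, was flattened first. Choosing the flattening explicitly, as suggested above, removes that ambiguity.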