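I have two tables (light and enhanced) with slightly different schemas. The second table (enhanced) has an additional nullable field, t.
I want to count the rows from each table, grouped by two fields (d, p).
To do this, I use a comma union of subqueries with star selects: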
select
d
, p
,count(1) - count(t) light_cnt
,count(t) enhanced_cnt
from
(select * from light)
, (select * from enhanced)
group by
d, p
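But this query returns wrong counts (roughly double). It only happens when I group by two fields; grouping by a single field works fine. When I wrap the union in another subquery, it works correctly: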
select
d
, p
,count(1) - count(t) light_cnt
,count(t) enhanced_cnt
from
(select * from
(select * from light)
, (select * from enhanced)
)
group by
d, p
Is this a bug, or am I doing something wrong?
I also replaced the group by with a where clause and saw the same broken behavior:
select count(1) from enhanced where p = 124
returns 292
select count(1) from light where p = 124
returns 12512
select count(1)
from (select * from light), (select * from enhanced)
where p = 124
returns 12804, which is correct, whereas
select count(1), count(t)
from (select * from light), (select * from enhanced)
where p = 124
returns 24527, 501, which is very strange. It looks like a bug.
Workaround:
select count(1), count(t)
from (select * from (select * from light), (select * from enhanced))
where p = 124
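returns 12804, 292. Correct.
Both tables, light and enhanced, have complex schemas inherited from Avro, with records and repeated fields. For simplicity, the fields p and t are abbreviated in the selects above. The real paths are p -> record.record.record.id (the leaf is an integer) and t -> record.time (the leaf is an integer). None of the records on the paths to p and t are repeated. All of them are nullable.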
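Answer 0 (score: 2)
See "Difference in statistics from Google Analytics Report and BigQuery Data in Hive table". You may be hitting a similar problem, because you are flattening data from 2 repeated columns: this produces n * m results.
A query that demonstrates this: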
SELECT col, x FROM (
SELECT "wrong" col, SUM(totals.pageviews) x
FROM (FLATTEN ([google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910], hits))
), (
SELECT "correct" col, SUM(totals.pageviews) x
FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
)
col x
wrong 2262
correct 249
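As a rough sketch of why the flattened total is inflated: after FLATTEN(..., hits) each session appears once per hit, so the session-level totals.pageviews is added once per hit. One way to undo that is to collapse the flattened rows back to one row per session before summing. This assumes that fullVisitorId and visitId together identify a session in this table, which is not stated above, and it has not been verified against this dataset:

```sql
SELECT SUM(pageviews) x
FROM (
  -- Collapse the flattened rows back to one row per session.
  -- The session-level value is repeated on every flattened hit row,
  -- so MAX() recovers a single copy of it.
  SELECT fullVisitorId, visitId, MAX(totals.pageviews) pageviews
  FROM (FLATTEN([google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910], hits))
  GROUP BY fullVisitorId, visitId
)
```

If the assumption holds, x should line up with the "correct" figure rather than the "wrong" one.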
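Here is a query that shows what you are experiencing. The numbers in each column should all be the same, but they are not: the result depends on the FLATTEN() choices BigQuery makes. If you want correct numbers, state explicitly how BigQuery should flatten the tables before counting: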
SELECT a, b, c FROM (
SELECT COUNT(1) a
FROM (SELECT * FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] ), (SELECT * FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] )
),(
SELECT COUNT(customDimensions.index) b
FROM (SELECT * FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] ), (SELECT * FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] )
),(
SELECT COUNT(hits.hitNumber) c
FROM (SELECT * FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] ), (SELECT * FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] )
),(
SELECT COUNT(1) a, COUNT(customDimensions.index) b, COUNT(hits.hitNumber) c
FROM (SELECT * FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] ), (SELECT * FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] )
),(
SELECT COUNT(1) a, COUNT(customDimensions.index) b
FROM (SELECT * FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] ), (SELECT * FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] )
),(
SELECT COUNT(1) a, COUNT(hits.hitNumber) c
FROM (SELECT * FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] ), (SELECT * FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] )
),(
SELECT COUNT(customDimensions.index) b, COUNT(hits.hitNumber) c
FROM (SELECT * FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] ), (SELECT * FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] )
),(
SELECT COUNT(1) a, COUNT(customDimensions.index) b, COUNT(hits.hitNumber) c
FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910], [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910])
Row  a     b     c
1    126   null  null
2    null  510   null
3    null  null  724
4    126   102   766
5    726   510   null
6    126   null  724
7    null  102   766
8    126   102   724
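A minimal sketch of making the flattening explicit, assuming the intent is to count at the hit level: flatten each side of the union on hits and select only the field needed, so that both aggregates are computed over the same row shape. It only combines forms already used above, FROM (FLATTEN(...)) and a comma union of sub-selects, and it has not been run against this dataset:

```sql
SELECT COUNT(1) a, COUNT(hits.hitNumber) c
FROM
  -- Each side of the union is flattened on hits explicitly, so the
  -- row shape no longer depends on which columns the outer query reads.
  (SELECT hits.hitNumber
   FROM (FLATTEN([google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910], hits))),
  (SELECT hits.hitNumber
   FROM (FLATTEN([google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910], hits)))
```

With this shape, COUNT(1) counts flattened hit rows rather than sessions. That is the point of the exercise: the unit being counted is chosen by the query instead of being left to BigQuery.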