Hive returns too many rows in create table as select

Time: 2015-02-10 09:30:30

Tags: hadoop duplicates hive create-table

Unwanted duplicates are showing up in a simple query.

Our data:

  • table xx contains unique uuid1 values
  • the result contains (in some cases) duplicate uuid1 values
  • we have 256 uuid1 shards

    create table xx ( uuid1shard char(2), uuid1 char(36), timestamp timestamp, uuid2shard char(2), uuid2 char(36), col001 .. col190 int / bigint / char, payload varchar(100) )

Our findings:

Works OK:

select
row_number() over (partition by uuid2shard, uuid2 order by (timestamp, uuid2shard, uuid2))
,*
from (select * from xx where uuid2shard < 192) t

Works somewhat (count(*) > count(distinct uuid2)):

select
row_number() over (partition by uuid2shard, uuid2 order by (timestamp, uuid2shard, uuid2))
,*
from (select * from xx where uuid2shard < 192) t

The result (~800M rows) contains about 8000 duplicate rows (completely identical) when it is saved via create table.

Everything is fine with a plain select:

select count(*), count(distinct uuid2) from 
(
select
row_number() over (partition by uuid2shard, uuid2 order by (timestamp, uuid2shard, uuid2))
,*
from xx
) t

Works again, just omitting the remaining 190 columns:

select
row_number() over (partition by uuid2shard, uuid2 order by (timestamp, uuid2shard, uuid2))
, uuid1shard, uuid1, timestamp, uuid2shard, uuid2, smallpayload
from xx 

Root cause pinned down (but what is the source of all the duplicates?):

create table yy as --the CREATE, yes, the problem is probably in the create statement
select
row_number() over (partition by uuid2shard, uuid2 order by (timestamp, uuid2shard, uuid2))
,*
from xx

It looks like a 'size' problem, or something has to be set to make this work correctly.
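
To narrow down where the extra rows come from, one option is to compare the two tables directly. The queries below are only a sketch: they assume that yy is the table written by the create table statement above and that it still carries the uuid1shard and uuid1 columns from xx. Since uuid1 is unique in xx, any uuid1 that appears more than once in yy must have been introduced while the table was written.

```sql
-- Sketch only: assumes yy was produced by the CREATE TABLE ... AS SELECT above
-- and kept the uuid1shard / uuid1 columns of xx.

-- 1) Row counts of source and target; they should be equal
--    if the create statement added nothing.
SELECT COUNT(*) AS xx_rows FROM xx;
SELECT COUNT(*) AS yy_rows FROM yy;

-- 2) uuid1 values that occur more than once in yy.
--    uuid1 is unique in xx, so every row returned here is an unexpected copy.
SELECT uuid1shard, uuid1, COUNT(*) AS copies
FROM yy
GROUP BY uuid1shard, uuid1
HAVING COUNT(*) > 1
LIMIT 100;
```

If the duplicated uuid1 values cluster in a few shards, that would point at particular tasks or output files rather than at the query itself.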

Our environment:

A small one ;) 16 nodes, 384 VCores, etc.

0 answers:

No answers