Data skew when joining on columns of different data types

Asked: 2016-06-18 11:39:31

Tags: performance hive

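Today I read an article about Hive tuning. One passage reads as follows:

Scenario: the user_id field in the user table is INT, while the user_id field in the log table holds both string and int values. When the two tables are joined on user_id, the default hash operation distributes rows by the int id, and this will cause all records of the string type id assigned to reducer.

Solution: convert the numeric type to the string type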

select * from users a
left outer join logs b
on a.user_id = cast(b.user_id as string)

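Can anyone explain the passage above in more detail? I really cannot understand what the author is describing. Why would "this will cause all records of the string type id assigned to reducer"? What is actually happening? Thanks in advance!

1 answer:

Answer 0 (score: 0)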

For starters, you did not copy and paste / transcribe the original properly. Here is the more likely wording:

this will cause all records of the string type id assigned to a single reducer.

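This would happen because, without an explicit cast, the implicit conversion of the string ids to int is probably turning every non-numeric id into the same value (0, or possibly NULL). The hash partitioning then puts all of those ids into the same partition, so a single reducer ends up receiving every one of them.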
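
To make the mechanism concrete, here is a minimal Python sketch. It is a toy simulation, not Hive code: the reducer count, the `crc32`-based partitioner, the sample ids, and the choice of `None` as the collapsed value are all assumptions made for illustration, and every function name is hypothetical. It only shows that once many distinct ids are coerced to one value they all hash to the same reducer, while the same ids kept as strings spread out.

```python
import zlib
from collections import Counter

NUM_REDUCERS = 4  # hypothetical reducer count, chosen for the demo


def coerce_to_int(value):
    """Mimic a lossy string-to-int conversion.

    Assumption: every id that cannot be parsed collapses to one shared value.
    """
    try:
        return int(value)
    except ValueError:
        return None


def reducer_for(key):
    """Toy hash partitioner: equal keys always land on the same reducer."""
    return zlib.crc32(str(key).encode("utf-8")) % NUM_REDUCERS


def load_per_reducer(keys):
    """Count how many rows each reducer would receive."""
    return Counter(reducer_for(k) for k in keys)


# Made-up log ids: a few numeric ones plus many non-numeric ones.
log_ids = [str(i) for i in range(20)] + [f"guest-{i}" for i in range(1000)]

# Keys coerced to int: all non-numeric ids share one key, hence one reducer.
skewed = load_per_reducer(coerce_to_int(v) for v in log_ids)

# Keys compared as strings: distinct ids stay distinct and spread out.
balanced = load_per_reducer(log_ids)

print("coerced to int:", sorted(skewed.items()))
print("kept as string:", sorted(balanced.items()))
```

In the first case one reducer receives at least the 1000 non-numeric rows, which is the skew the article describes. Joining on string keys avoids the collapse, which is the point of the cast in the quoted solution.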