Hive output creates one huge part file and many small part files

Time: 2015-09-09 20:01:29

Tags: hadoop hive hiveql

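I am currently having a problem with joins in Hive. Joining two large tables produces one huge file and many small files in the output. Here is my scenario in detail: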

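  • Besides other columns of type string, table A has the columns x_address, y_address, and z_address.
  • The columns x_address, y_address, and z_address in table A hold hash keys of addresses.
  • Table A has about 110 million records.
  • Table B is a lookup table that provides the latitude and longitude for a given address.
  • The key in table B (address_key) is the hash key of the address.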

My HQL is:

INSERT OVERWRITE TABLE FINAL_TABLE    
SELECT 
    a.x_address_full, 
    b1.lat,
    b1.long,
    a.y_address_full, 
    b2.lat,
    b2.long,
    a.z_address_full, 
    b3.lat,
    b3.long
    FROM 
    TABLE_A a 
    LEFT JOIN TABLE_B b1 ON a.x_address = b1.address_key
    LEFT JOIN TABLE_B b2 ON a.y_address = b2.address_key
    LEFT JOIN TABLE_B b3 ON a.z_address = b3.address_key

In the File Browser (Hue), I see that about 50 part files were created.

-rwxr-xr-x   3 user user 103488475533 2015-09-08 20:18 FINAL_TABLE/000000_0
-rwxr-xr-x   3 user user     18887004 2015-09-08 16:43 FINAL_TABLE/000001_0
-rwxr-xr-x   3 user user     16806648 2015-09-08 16:43 FINAL_TABLE/000002_0
-rwxr-xr-x   3 user user     17759878 2015-09-08 16:43 FINAL_TABLE/000003_0
-rwxr-xr-x   3 user user     19229971 2015-09-08 16:43 FINAL_TABLE/000004_0
-rwxr-xr-x   3 user user     17361505 2015-09-08 16:43 FINAL_TABLE/000005_0
-rwxr-xr-x   3 user user     20935119 2015-09-08 16:43 FINAL_TABLE/000006_0
-rwxr-xr-x   3 user user     18525756 2015-09-08 16:43 FINAL_TABLE/000007_0
-rwxr-xr-x   3 user user     18155867 2015-09-08 16:43 FINAL_TABLE/000008_0
-rwxr-xr-x   3 user user     18388192 2015-09-08 16:43 FINAL_TABLE/000009_0
-rwxr-xr-x   3 user user     17352032 2015-09-08 16:43 FINAL_TABLE/000010_0
-rwxr-xr-x   3 user user     20586196 2015-09-08 16:43 FINAL_TABLE/000011_0
-rwxr-xr-x   3 user user     19026628 2015-09-08 16:43 FINAL_TABLE/000012_0
-rwxr-xr-x   3 user user     18492712 2015-09-08 16:43 FINAL_TABLE/000013_0
-rwxr-xr-x   3 user user     20525139 2015-09-08 16:43 FINAL_TABLE/000014_0
-rwxr-xr-x   3 user user     18767626 2015-09-08 16:43 FINAL_TABLE/000015_0
-rwxr-xr-x   3 user user     18759833 2015-09-08 16:43 FINAL_TABLE/000016_0
-rwxr-xr-x   3 user user     17625431 2015-09-08 16:43 FINAL_TABLE/000017_0
-rwxr-xr-x   3 user user     17589284 2015-09-08 16:43 FINAL_TABLE/000018_0
-rwxr-xr-x   3 user user     19635568 2015-09-08 16:43 FINAL_TABLE/000019_0
-rwxr-xr-x   3 user user     18782632 2015-09-08 16:43 FINAL_TABLE/000020_0
-rwxr-xr-x   3 user user     18468366 2015-09-08 16:43 FINAL_TABLE/000021_0
-rwxr-xr-x   3 user user     19348518 2015-09-08 16:43 FINAL_TABLE/000022_0
-rwxr-xr-x   3 user user     19132130 2015-09-08 16:43 FINAL_TABLE/000023_0
-rwxr-xr-x   3 user user     19661123 2015-09-08 16:43 FINAL_TABLE/000024_0

Note: all tables are AvroSerDe-based.

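Based on my analysis so far, this looks like it may be caused by skew in the x_address, y_address, or z_address fields, where a single value occurs far more often than any other.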
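To check the skew hypothesis, a query along these lines could show how the rows are distributed over the join key. This is only a sketch: it reuses the TABLE_A and x_address names from the query above, the LIMIT of 20 is an arbitrary choice, and it would need to be repeated for y_address and z_address.

```sql
-- Sketch: count rows per join key in TABLE_A, most frequent keys first.
-- NULL keys form their own group, so a large NULL count would show up here too.
SELECT
    x_address,
    COUNT(*) AS row_count
FROM TABLE_A
GROUP BY x_address
ORDER BY row_count DESC
LIMIT 20;
```

If one key (or NULL) accounts for most of the roughly 110 million rows, then as far as I understand all of those rows would be sent to the same reducer, which would be consistent with 000000_0 being about 103 GB while the other part files are around 18 MB each.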

Any ideas?

0 answers:

No answers yet