Hive query returns a Cartesian product instead of an inner join

Time: 2014-07-29 16:21:52

Tags: sql hive hiveql

This is the kind of query I am sending to Hive:

SELECT BigTable.nicefield,LargeTable.* 
FROM LargeTable INNER JOIN BigTable 
    ON (
        LargeTable.joinfield1of4 = BigTable.joinfield1of4 
        AND LargeTable.joinfield2of4 = BigTable.joinfield2of4 
    )   
WHERE LargeTable.joinfield3of4=20140726 AND LargeTable.joinfield4of4=15 AND BigTable.joinfield3of4=20140726 AND BigTable.joinfield4of4=15
    AND LargeTable.filterfiled1of2=123456
    AND LargeTable.filterfiled2of2=98765
    AND LargeTable.joinfield2of4=12 
    AND LargeTable.joinfield1of4='iwanttolikehive'       

It returns 2418025 rows. The problem is that

SELECT *  
FROM LargeTable 
WHERE joinfield3of4=20140726 AND joinfield4of4=15
    AND filterfiled1of2=123456 
    AND filterfiled2of2=98765
    AND joinfield2of4=12 
    AND joinfield1of4='iwanttolikehive'

returns 1555 rows, and so does the following:

SELECT *  
FROM BigTable 
WHERE joinfield3of4=20140726 AND joinfield4of4=15
    AND joinfield2of4=12 
    AND joinfield1of4='iwanttolikehive'

Note that 1555 ^ 2 = 2418025, so the join appears to pair every filtered LargeTable row with every filtered BigTable row.

1 answer:

Answer 0: (score: 2)

It turns out the correct version of the query is:

SELECT bt.nicefield,LargeTable.* 
FROM LargeTable INNER JOIN 
    (
    SELECT nicefield, joinfield1of4,joinfield2of4, count(*) as rows
    FROM BigTable
    WHERE joinfield3of4=20140726 AND joinfield4of4=15
    GROUP BY nicefield, joinfield1of4,joinfield2of4
    ) bt 
    ON (
        LargeTable.joinfield1of4 = bt.joinfield1of4 
        AND LargeTable.joinfield2of4 = bt.joinfield2of4 
    )   
WHERE LargeTable.joinfield3of4=20140726 AND LargeTable.joinfield4of4=15
    AND LargeTable.filterfiled1of2=123456
    AND LargeTable.filterfiled2of2=98765
    AND LargeTable.joinfield2of4=12 
    AND LargeTable.joinfield1of4='iwanttolikehive'

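The problem was that, in the original query, the join against BigTable matched duplicates: BigTable holds many rows with the same join-key values, so each matching LargeTable row was returned once per duplicate. Grouping BigTable in a subquery collapses those duplicates before the join.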
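The effect can be reproduced on a small scale. The following is only a sketch: it uses Python's built-in sqlite3 module rather than Hive, the tables are cut down to the two join columns that matter, and the `payload` column, the value `'nice'`, and the row count `n` are made-up placeholders. It assumes, as the numbers in the question suggest, that every filtered row on both sides carries the same join key.

```python
import sqlite3

# Hypothetical miniature tables: names mirror the question, the data is made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE LargeTable (joinfield1of4 TEXT, joinfield2of4 INTEGER, payload INTEGER);
    CREATE TABLE BigTable   (joinfield1of4 TEXT, joinfield2of4 INTEGER, nicefield TEXT);
""")

n = 5  # stands in for the 1555 rows seen in the question

# n LargeTable rows, all with the same join key.
conn.executemany(
    "INSERT INTO LargeTable VALUES ('iwanttolikehive', 12, ?)",
    [(i,) for i in range(n)],
)
# n BigTable rows that repeat the same join key and the same nicefield.
conn.executemany(
    "INSERT INTO BigTable VALUES ('iwanttolikehive', 12, ?)",
    [("nice",) for _ in range(n)],
)

# Plain inner join: every LargeTable row matches every duplicate BigTable row.
plain_join = conn.execute("""
    SELECT COUNT(*)
    FROM LargeTable lt
    INNER JOIN BigTable bt
        ON lt.joinfield1of4 = bt.joinfield1of4
       AND lt.joinfield2of4 = bt.joinfield2of4
""").fetchone()[0]

# Same join, but BigTable is first collapsed to one row per key and nicefield.
dedup_join = conn.execute("""
    SELECT COUNT(*)
    FROM LargeTable lt
    INNER JOIN (
        SELECT nicefield, joinfield1of4, joinfield2of4
        FROM BigTable
        GROUP BY nicefield, joinfield1of4, joinfield2of4
    ) bt
        ON lt.joinfield1of4 = bt.joinfield1of4
       AND lt.joinfield2of4 = bt.joinfield2of4
""").fetchone()[0]

print(plain_join, dedup_join)  # n * n rows versus n rows
```

Note that the grouped subquery only brings the result back to one row per LargeTable row when nicefield is identical across the duplicates. If a join key maps to several distinct nicefield values, each of them still produces its own output row.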

So this was not a bug; the query just needed to be read carefully. I hope this helps!