HQL query runs fine, but running it through PySpark makes it run indefinitely

Asked: 2019-03-28 18:42:45

Tags: python pyspark hql

I have an HQL query that runs fine when I execute it in DBeaver against my Hadoop instance (database/table names removed):

select * from (select DISTINCT UPPER(CONCAT(CONCAT(trim(lm.OriginCity),', '),trim(lm.OriginState))) as OriginCitySt
              from <db1>.<table1> lm 
              LEFT JOIN <db2>.<table2> lt on trim(split(lt.lane, '-')[0]) = UPPER(CONCAT(CONCAT(trim(lm.OriginCity),', '),trim(lm.OriginState)))
              WHERE lm.origincountry = 'US'
              AND lt.lane IS NULL) a
union all
              select * from (select distinct UPPER(CONCAT(CONCAT(trim(lm.DestinationCity),', '),trim(lm.DestinationState))) as DestCitySt
              from <db1>.<table1> lm 
              LEFT JOIN <db2>.<table2> lt on trim(split(lt.lane, '-')[1]) = UPPER(CONCAT(CONCAT(trim(lm.DestinationCity),', '),trim(lm.DestinationState)))
              WHERE lm.origincountry = 'US'
              AND lt.lane IS NULL) b

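I have an application on a Linux box that uses PySpark to connect to Hive and run this query, but when it does, the job hangs indefinitely on a single line of output (the original post showed that line in a screenshot).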

When I remove the LEFT JOIN from the query, so that it looks like this:

select * from (select DISTINCT UPPER(CONCAT(CONCAT(trim(lm.OriginCity),', '),trim(lm.OriginState))) as OriginCitySt
              from <db1>.<table1> lm
              WHERE lm.origincountry = 'US') a
union all
          select * from (select distinct UPPER(CONCAT(CONCAT(trim(lm.DestinationCity),', '),trim(lm.DestinationState))) as DestCitySt
          from <db1>.<table1> lm 
          WHERE lm.origincountry = 'US') b

it runs fine. So I know the join is the problem, and I'm fairly sure it is the trim(split(lt.lane, '-')[0]) part, but the question is: why?
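The post does not establish the cause, so the following is only a sketch of one way to investigate, not a confirmed fix. It rebuilds the query so that both join keys are computed in subqueries before the join, and swaps the LEFT JOIN ... IS NULL pattern for Spark SQL's LEFT ANTI JOIN, which returns the same unmatched rows. The helper names (city_state_key, unmatched_branch, build_query, explain_query) and the table names passed in are hypothetical; the spark argument is assumed to be an existing SparkSession with Hive support. Whether this changes the runtime is an assumption to be checked against the plan that explain prints.

```python
def city_state_key(city_col, state_col):
    """SQL expression for the 'CITY, ST' key, copied from the original query."""
    return f"UPPER(CONCAT(CONCAT(trim({city_col}), ', '), trim({state_col})))"


def unmatched_branch(lm_table, lt_table, city_col, state_col, lane_part, alias):
    """Build one branch of the UNION ALL with both join keys precomputed.

    lane_part is 0 for the origin half of lt.lane and 1 for the destination half.
    """
    key = city_state_key(f"lm.{city_col}", f"lm.{state_col}")
    return f"""
SELECT DISTINCT m.city_st AS {alias}
FROM (SELECT {key} AS city_st
      FROM {lm_table} lm
      WHERE lm.origincountry = 'US') m
LEFT ANTI JOIN (SELECT DISTINCT trim(split(lt.lane, '-')[{lane_part}]) AS city_st
                FROM {lt_table} lt
                WHERE lt.lane IS NOT NULL) l
  ON m.city_st = l.city_st
""".strip()


def build_query(lm_table, lt_table):
    """Combine the origin and destination branches, as in the original query."""
    origin = unmatched_branch(
        lm_table, lt_table, "OriginCity", "OriginState", 0, "OriginCitySt")
    dest = unmatched_branch(
        lm_table, lt_table, "DestinationCity", "DestinationState", 1, "DestCitySt")
    return f"{origin}\nUNION ALL\n{dest}"


def explain_query(spark, sql):
    """Print the parsed, analyzed, optimized and physical plans without running the job."""
    spark.sql(sql).explain(True)
```

Calling explain_query(spark, build_query("db1.table1", "db2.table2")) with the real table names prints the plan without executing anything. Running it on both the original and the rewritten query shows which join strategy Spark chose for each, and the physical plan is where a nested-loop or cartesian join would show up if the planner picked one.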

0 answers:

No answers.