Merge query from PySpark failing

Asked: 2017-11-27 14:49:56

Tags: apache-spark hive pyspark apache-spark-sql pyspark-sql

I am running a merge query from PySpark, but the keyword "merge" is not recognized by Spark.

17/11/27 14:39:34 ERROR JobScheduler: Error running job streaming job 1511793570000 ms.1
org.apache.spark.SparkException: An exception was raised by Python:
Traceback (most recent call last):
  File "/usr/hdp/2.6.1.0-
129/spark2/python/lib/pyspark.zip/pyspark/streaming/util.py", line 65, in call
r = self.func(t, *rdds)
  File "/usr/hdp/2.6.1.0-129/spark2/python/lib/pyspark.zip/pyspark/streaming/dstream.py", line 159, in <lambda>
func = lambda t, rdd: old_func(rdd)
  File "/usr/repos/dataconnect/connect/spark/stream_kafka_consumer.py", line 66, in sendRecord
COLUMNS='sub.id, sub.name, sub.age'))
  File "/usr/hdp/2.6.1.0-129/spark2/python/lib/pyspark.zip/pyspark/sql/context.py", line 384, in sql
return self.sparkSession.sql(sqlQuery)
  File "/usr/hdp/2.6.1.0-129/spark2/python/lib/pyspark.zip/pyspark/sql/session.py", line 545, in sql
return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
  File "/usr/hdp/2.6.1.0-129/spark2/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
answer, self.gateway_client, self.target_id, self.name)
  File "/usr/hdp/2.6.1.0-129/spark2/python/lib/pyspark.zip/pyspark/sql/utils.py", line 73, in deco
raise ParseException(s.split(': ', 1)[1], stackTrace)

 ParseException: u"\nmismatched input 'merge' expecting {'(', 'SELECT', 'FROM', 'ADD', 'DESC', 'WITH', 'VALUES', 'CREATE', 'TABLE', 'INSERT', 'DELETE', 'DESCRIBE', 'EXPLAIN', 'SHOW', 'USE', 'DROP', 'ALTER', 'MAP', 'SET', 'RESET', 'START', 'COMMIT', 'ROLLBACK', 'REDUCE', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'DFS', 'TRUNCATE', 'ANALYZE', 'LIST', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'EXPORT', 'IMPORT', 'LOAD'}(line 1, pos 0)\n\n== SQL ==\nmerge into customer_partitioned using (select  case when all_updates.age <> customer_partitioned.age then 1       else 0     end as delete_flag,     all_updates.id as match_key,     all_updates.* from    all_updates left join customer_partitioned   on all_updates.id = customer_partitioned.id      union all     select 0, null, all_updates.*     from all_updates, customer_partitioned where     all_updates.id = customer_partitioned.id ) sub on customer_partitioned.id = sub.match_key when matched and delete_flag=1 then delete when matched and delete_flag=0 then   update set name=sub.name when not matched then   insert values(sub.id, sub.name, sub.age);\n^^^\n"

I can copy that query directly into the Hive view and it runs fine.

merge into customer_partitioned
using (
  select
    case when all_updates.age <> customer_partitioned.age then 1 else 0 end as delete_flag,
    all_updates.id as match_key,
    all_updates.*
  from all_updates
  left join customer_partitioned on all_updates.id = customer_partitioned.id
  union all
  select 0, null, all_updates.*
  from all_updates, customer_partitioned
  where all_updates.id = customer_partitioned.id
) sub
on customer_partitioned.id = sub.match_key
when matched and delete_flag=1 then delete
when matched and delete_flag=0 then update set name=sub.name
when not matched then insert values(sub.id, sub.name, sub.age);

My code looks like this:

from pyspark.sql import HiveContext
sqlcontext = HiveContext(sc)
sql = 'merge into customer_partitioned using (select  case when all_updates.age <> customer_partitioned.age then 1       else 0     end as delete_flag,     all_updates.id as match_key,     all_updates.* from    all_updates left join customer_partitioned   on all_updates.id = customer_partitioned.id      union all     select 0, null, all_updates.*     from all_updates, customer_partitioned where     all_updates.id = customer_partitioned.id ) sub on customer_partitioned.id = sub.match_key when matched and delete_flag=1 then delete when matched and delete_flag=0 then   update set name=sub.name when not matched then   insert values(sub.id, sub.name, sub.age);'
sqlcontext.sql(sql)

1 Answer:

Answer 0 (score: 1)

I can copy that query directly into the Hive view and it runs fine.

Spark is not Hive (even with Hive support enabled). Its query language is intended to implement a subset of the SQL:2003 standard and preserves only partial compatibility with HQL.
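For illustration, a minimal sketch of what "Hive support" actually enables on Spark 2.x (the app name below is arbitrary): it wires Spark to the Hive metastore and SerDes, but the SQL text is still handled by Spark's own parser, which is why the ParseException above is raised by Spark before Hive is ever involved.

from pyspark.sql import SparkSession

# Hive support provides access to the Hive metastore, SerDes and UDFs,
# but SQL statements are still parsed and planned by Spark SQL itself.
spark = SparkSession.builder \
    .appName("merge-example") \
    .enableHiveSupport() \
    .getOrCreate()

# Statements in Spark's grammar (SELECT, INSERT, CREATE, ...) work:
spark.sql("select count(*) from customer_partitioned").show()

# Hive-only syntax such as MERGE is rejected by Spark's parser,
# producing the ParseException shown in the question:
# spark.sql("merge into customer_partitioned using ...")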

As a result, many Hive features are not supported, including MERGE, as well as updates and fine-grained inserts.
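Since MERGE is not available here, one workaround is to express the same upsert logic with DataFrame operations and rewrite the target table. A minimal sketch under these assumptions: both tables are visible to Spark with columns id, name and age, and customer_partitioned_merged is a hypothetical staging table (Spark will not overwrite a table it is reading from in the same job):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

updates = spark.table("all_updates")
target = spark.table("customer_partitioned")

# Target rows with no matching update are kept unchanged.
untouched = target.join(updates, "id", "left_anti").select("id", "name", "age")

# Matched rows whose age changed are dropped (the MERGE's delete branch);
# matched rows with the same age take the new name; unmatched updates are inserted.
joined = updates.join(target.select("id", F.col("age").alias("old_age")), "id", "left")
kept_updates = joined.filter(
    F.col("old_age").isNull() | (F.col("age") == F.col("old_age"))
).select("id", "name", "age")

merged = untouched.union(kept_updates)

# Write to a staging table (hypothetical name) and swap it in afterwards;
# overwriting customer_partitioned directly would fail because it is read above.
merged.write.mode("overwrite").saveAsTable("customer_partitioned_merged")

Unlike Hive's ACID MERGE, this rewrites the table wholesale, so it is only practical when the target is small enough to rebuild.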

TL;DR: Just because you can do something in Hive doesn't mean you can do the same thing in Spark SQL.