Implementing a Hive UNION in PySpark

Time: 2018-02-07 11:21:28

Tags: pyspark apache-spark-sql pyspark-sql

I am trying to read SQL from a file and run it in a PySpark job. The SQL is structured as follows:

select <statements>
sort by rand()
limit 333333 
UNION ALL
select <statements>
sort by rand()
limit 666666

This is the error I get when I run it:

  

pyspark.sql.utils.ParseException: u"\nmismatched input 'UNION' expecting {<EOF>, '.', '[', 'OR', 'AND', 'IN', NOT, 'BETWEEN', 'LIKE', RLIKE, 'IS', EQ, '<=>', '<>', '!=', '<', LTE, '>', GTE, '+', '-', '*', '/', '%', 'DIV', '&', '|', '^'}

Is this because UNION ALL / UNION is not supported by Spark SQL, or is it a parsing problem?

1 Answer:

Answer 0: (score: 1)

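Both PySpark and Hive support UNION in SQL statements. I was able to run the following Hive statement, in which each SELECT is wrapped in parentheses: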

(SELECT * from x ORDER BY rand() LIMIT 50)
UNION
(SELECT * from y ORDER BY rand() LIMIT 50)
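The parentheses seem to be what matters here: in Spark SQL's grammar, ORDER BY / SORT BY and LIMIT appear to attach to the query as a whole, so a bare LIMIT followed directly by UNION ALL is rejected by the parser, while a parenthesized branch is accepted as a subquery. Below is a minimal sketch of building such a query from separate SELECT strings in plain Python. The helper name `union_all` and the table names `x` and `y` are placeholders for illustration, not part of any Spark API.

```python
def union_all(branches, distinct=False):
    """Join SELECT statements with UNION ALL (or UNION if distinct=True).

    Each branch is wrapped in parentheses so that its own
    SORT BY / ORDER BY / LIMIT clause stays inside that branch.
    """
    if not branches:
        raise ValueError("need at least one SELECT statement")
    keyword = "UNION" if distinct else "UNION ALL"
    # Strip whitespace and a trailing semicolon before wrapping each branch.
    wrapped = ["(" + b.strip().rstrip(";").strip() + ")" for b in branches]
    return ("\n" + keyword + "\n").join(wrapped)


# Hypothetical branches mirroring the shape of the query in the question.
query = union_all([
    "SELECT * FROM x SORT BY rand() LIMIT 333333",
    "SELECT * FROM y SORT BY rand() LIMIT 666666",
])
print(query)
```

The resulting string can then be passed to `spark.sql(query)` on an existing SparkSession.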

In PySpark you can also do this:

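# Sample each table separately, then combine the two results.
df1 = spark.sql('SELECT * FROM x ORDER BY rand() LIMIT 50')
df2 = spark.sql('SELECT * FROM y ORDER BY rand() LIMIT 50')
# DataFrame.union keeps duplicates (like UNION ALL); add .distinct() for SQL UNION semantics.
df = df1.union(df2)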
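Since the question reads its SQL from a file, here is a small sketch of that step. It assumes the file holds a single statement and that `spark` is an existing SparkSession passed in by the caller. The function name `run_sql_file` and the path in the usage comment are hypothetical.

```python
def run_sql_file(spark, path):
    """Read one SQL statement from `path` and run it with spark.sql()."""
    with open(path, encoding="utf-8") as f:
        query = f.read().strip()
    # Some Spark versions reject a trailing semicolon, so drop it if present.
    query = query.rstrip(";").strip()
    if not query:
        raise ValueError("no SQL found in " + path)
    return spark.sql(query)


# Usage, assuming a SparkSession named `spark` and a hypothetical file:
# df = run_sql_file(spark, "query.sql")
```

Taking the session as a parameter keeps the helper free of any global state, so it can be exercised with a stand-in object before running it on a cluster.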