I am trying to read SQL from a file and run it in a PySpark job. The SQL is structured as follows:
select <statements>
sort by rand()
limit 333333
UNION ALL
select <statements>
sort by rand()
limit 666666
This is the error I get when I run it:
pyspark.sql.utils.ParseException: u"\nmismatched input 'UNION' expecting {<EOF>, '.', '[', 'OR', 'AND', 'IN', NOT, 'BETWEEN', 'LIKE', RLIKE, 'IS', EQ, '<=>', '<>', '!=', '<', LTE, '>', GTE, '+', '-', '*', '/', '%', 'DIV', '&', '|', '^'}
Is this because UNION ALL / UNION is not supported by Spark SQL, or is it a parsing problem?
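Answer 0 (score: 1)
Both PySpark and Hive support UNION in SQL statements, so this is a parsing problem: a trailing sort or LIMIT clause belongs to the query as a whole, which is why the parser does not expect UNION right after LIMIT. Wrapping each SELECT in parentheses keeps its ordering and limit local to that branch. I was able to run the following Hive statement: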
(SELECT * from x ORDER BY rand() LIMIT 50)
UNION
(SELECT * from y ORDER BY rand() LIMIT 50)
In PySpark you can also do it this way:
df1=spark.sql('SELECT * from x ORDER BY rand() LIMIT 50')
df2=spark.sql('SELECT * from y ORDER BY rand() LIMIT 50')
df=df1.union(df2)
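If the SQL is assembled in Python before being handed to spark.sql, the parenthesized form used in the answer can be generated from separate SELECT strings. The following is only a minimal sketch: the helper name `parenthesize_union_all` is hypothetical, the table names `x` and `y` are borrowed from the answer, and the call to `spark.sql` is left as a comment because it assumes an existing SparkSession named `spark`.

```python
def parenthesize_union_all(branches):
    """Join SELECT statements with UNION ALL, wrapping each one in parentheses.

    Hypothetical helper for illustration; it only builds a string and does
    not validate the SQL itself.
    """
    # Drop surrounding whitespace and any trailing semicolon from each branch.
    cleaned = [b.strip().rstrip(";").strip() for b in branches]
    if not cleaned or any(not b for b in cleaned):
        raise ValueError("every branch must be a non-empty SELECT statement")
    # Parentheses keep each branch's SORT BY / LIMIT local to that branch.
    return "\nUNION ALL\n".join("({})".format(b) for b in cleaned)


query = parenthesize_union_all([
    "SELECT * FROM x SORT BY rand() LIMIT 333333",
    "SELECT * FROM y SORT BY rand() LIMIT 666666",
])
print(query)

# Assuming a live SparkSession named `spark` (not created in this sketch):
# df = spark.sql(query)
```

Note that DataFrame.union keeps duplicate rows, so df1.union(df2) behaves like SQL UNION ALL rather than UNION; call .distinct() on the result if UNION semantics are needed.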