I am trying to run the SQL query below in Spark using Java:
Dataset<Row> perIDDf = sparkSession.read().format("jdbc")
        .option("url", connection)
        .option("dbtable", "CI_PER_PER")
        .load();
perIDDf.createOrReplaceTempView("CI_PER_PER");
Dataset<Row> perPerDF = sparkSession.sql("select per_id1,per_id2 " +
"from CI_PER_PER " +
"start with per_id1='2001822000' " +
"connect by prior per_id1=per_id2");
perPerDF.show(10,false);
I am getting the error:
Exception in thread "main" org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input 'with' expecting <EOF>(line 1, pos 45)
== SQL ==
select per_id1,per_id2 from CI_PER_PER start with per_id1='2001822000' connect by prior per_id1=per_id2
---------------------------------------------^^^
at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:239)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:115)
at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:69)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:638)
at com.tfmwithspark.TestMaterializedView.main(TestMaterializedView.java:127)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Basically, I am trying to use a hierarchical query in Spark.
Is it not supported?
Spark version: 2.3.0
Answer 0 (score: 1)
Hierarchical queries are not currently supported by Spark, and neither is recursion in queries; WITH is supported only in its most limited, non-recursive form.
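To illustrate, here is a minimal sketch reusing the CI_PER_PER temp view from your question: a plain (non-recursive) CTE parses fine in Spark 2.3, while the recursive form is rejected just like START WITH / CONNECT BY:

// A non-recursive CTE is accepted by Spark 2.3:
Dataset<Row> flat = sparkSession.sql(
    "WITH roots AS (SELECT per_id1, per_id2 FROM CI_PER_PER " +
    "               WHERE per_id1 = '2001822000') " +
    "SELECT * FROM roots");
flat.show();

// ...but the recursive form (the ANSI equivalent of CONNECT BY)
// fails with a ParseException in Spark 2.3:
// WITH RECURSIVE tree AS (...) SELECT * FROM tree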
You can approximate it, but it is arduous. Here is one approach, although I do not really recommend it: http://sqlandhadoop.com/how-to-implement-recursive-queries-in-spark/
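The idea in that post, roughly, is to expand the hierarchy one level per iteration with repeated self-joins until no new rows turn up. A minimal Java sketch of that pattern against the question's CI_PER_PER view (the column aliasing and the except() cycle guard are my assumptions, not the linked article's exact code):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.col;

// Seed with the START WITH row(s).
Dataset<Row> frontier = sparkSession.sql(
    "SELECT per_id1, per_id2 FROM CI_PER_PER WHERE per_id1 = '2001822000'");
Dataset<Row> result = frontier;
Dataset<Row> all = sparkSession.table("CI_PER_PER");

// CONNECT BY PRIOR per_id1 = per_id2: a child's per_id2 equals its parent's per_id1.
while (frontier.count() > 0) {
    Dataset<Row> next = all
        .join(frontier.select(col("per_id1").as("prior_per_id1")),
              all.col("per_id2").equalTo(col("prior_per_id1")))
        .select(all.col("per_id1"), all.col("per_id2"));
    // except() drops rows already collected, so a cyclic hierarchy cannot loop forever.
    frontier = next.except(result);
    result = result.union(frontier);
}
result.show(10, false);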
Answer 1 (score: 1)
A PR has been raised for this; check this.
In the meantime, a workaround along these lines solves the problem:
parent_query = """
SELECT asset_id as parent_id FROM {0}.{1}
where name = 'ROOT'
""".format(db_name,table_name)
parent_df = spark.sql(parent_query)
final_df = parent_df
child_query = """
SELECT parent_id as parent_to_drop,asset_id
FROM
{0}.{1}
""".format(db_name,table_name)
child_df = spark.sql(child_query)
count = 1
while count > 0:
join_df = child_df.join(parent_df,(child_df.parent_to_drop == parent_df.parent_id)) \
.drop("parent_to_drop") \
.drop("parent_id") \
.withColumnRenamed("asset_id","parent_id")
count = join_df.count()
final_df = final_df.union(join_df)
parent_df = join_df
print("----------final-----------")
print(final_df.count())
final_df.show()
Result:
----------final-----------
8
+---------+
|parent_id|
+---------+
| 0|
| 1|
| 5|
| 2|
| 7|
| 4|
| 3|
| 6|
+---------+
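Note that the loop only terminates when a level produces no rows, so this workaround assumes the hierarchy is acyclic; with a cycle in the data, join_df.count() never reaches zero and the loop runs forever. Subtracting already-visited ids from join_df (e.g. with subtract()) before the union would guard against that.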