I have the following code:
import org.apache.spark.sql.streaming.Trigger

val jdf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "join_test")
  .option("startingOffsets", "earliest")
  .load()

jdf.createOrReplaceTempView("table")

val resultdf = spark.sql("select * from table as x inner join table as y on x.offset=y.offset")

resultdf.writeStream
  .outputMode("append")
  .format("console")
  .option("truncate", false)
  .trigger(Trigger.ProcessingTime(1000))
  .start()
and I get the following exception:
org.apache.spark.sql.AnalysisException: cannot resolve '`x.offset`' given input columns: [x.value, x.offset, x.key, x.timestampType, x.topic, x.timestamp, x.partition]; line 1 pos 50;
'Project [*]
+- 'Join Inner, ('x.offset = 'y.offset)
:- SubqueryAlias x
: +- SubqueryAlias table
: +- StreamingRelation DataSource(org.apache.spark.sql.SparkSession@15f3f9cf,kafka,List(),None,List(),None,Map(startingOffsets -> earliest, subscribe -> join_test, kafka.bootstrap.servers -> localhost:9092),None), kafka, [key#28, value#29, topic#30, partition#31, offset#32L, timestamp#33, timestampType#34]
+- SubqueryAlias y
+- SubqueryAlias table
+- StreamingRelation DataSource(org.apache.spark.sql.SparkSession@15f3f9cf,kafka,List(),None,List(),None,Map(startingOffsets -> earliest, subscribe -> join_test, kafka.bootstrap.servers -> localhost:9092),None), kafka, [key#28, value#29, topic#30, partition#31, offset#32L, timestamp#33, timestampType#34]
I changed the code to this:
import org.apache.spark.sql.streaming.Trigger

val jdf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "join_test")
  .option("startingOffsets", "earliest")
  .load()

val jdf1 = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "join_test")
  .option("startingOffsets", "earliest")
  .load()

jdf.createOrReplaceTempView("table")
jdf1.createOrReplaceTempView("table1")

val resultdf = spark.sql("select * from table inner join table1 on table.offset=table1.offset")

resultdf.writeStream
  .outputMode("append")
  .format("console")
  .option("truncate", false)
  .trigger(Trigger.ProcessingTime(1000))
  .start()
This works. However, I don't believe it is the solution I'm looking for: I want to perform the self-join in raw SQL without making a second copy of the DataFrame as the code above does. Is there another way?
Answer 0 (score: 4)
This is a known issue that will be fixed in 2.4.0; see https://issues.apache.org/jira/browse/SPARK-23406. For now, you can work around it by avoiding a join of the same DataFrame object with itself.
Answer 1 (score: 1)
Instead of the SQL syntax, you can use the DataFrame API's join function:
jdf.as("df1").join(jdf.as("df2"), $"df1.offset" === $"df2.offset", "inner")
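Put together, a minimal end-to-end sketch of this alias-based self-join might look like the following. It assumes the same local Kafka broker and `join_test` topic from the question, and the application name is illustrative; it is not a definitive implementation.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

// Illustrative app name; any name works here.
val spark = SparkSession.builder.appName("self-join-sketch").getOrCreate()
import spark.implicits._

// Read the Kafka topic once; the aliases below let us refer to the
// same DataFrame under two names without creating a second copy.
val jdf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "join_test")
  .option("startingOffsets", "earliest")
  .load()

// Self-join through the DataFrame API instead of SQL temp views.
val resultdf = jdf.as("df1")
  .join(jdf.as("df2"), $"df1.offset" === $"df2.offset", "inner")

resultdf.writeStream
  .outputMode("append")
  .format("console")
  .option("truncate", false)
  .trigger(Trigger.ProcessingTime(1000))
  .start()
```

Note that on Spark versions affected by SPARK-23406 this may still hit the same analyzer issue, since both sides of the join come from one DataFrame object; on 2.4.0 and later it should resolve cleanly.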