Question

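I understand that JOINs between two different streaming DataFrames are not supported in Spark 2.2.0, but I am trying to do a self-join, so there is only one stream. Here is my code: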

val jdf = spark
    .readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "join_test")
    .option("startingOffsets", "earliest")
    .load();

jdf.printSchema

This prints the following:

root
 |-- key: binary (nullable = true)
 |-- value: binary (nullable = true)
 |-- topic: string (nullable = true)
 |-- partition: integer (nullable = true)
 |-- offset: long (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- timestampType: integer (nullable = true)

Now, after reading a related SO post, I run the join query below:

jdf.as("jdf1").join(jdf.as("jdf2"), $"jdf1.key" === $"jdf2.key")

I get the following exception:

org.apache.spark.sql.AnalysisException: cannot resolve '`jdf1.key`' given input columns: [timestamp, value, partition, timestampType, topic, offset, key];;
'Join Inner, ('jdf1.key = 'jdf2.key)
:- SubqueryAlias jdf1
:  +- StreamingRelation DataSource(org.apache.spark.sql.SparkSession@f662b5,kafka,List(),None,List(),None,Map(startingOffsets -> earliest, subscribe -> join_test, kafka.bootstrap.servers -> localhost:9092),None), kafka, [key#243, value#244, topic#245, partition#246, offset#247L, timestamp#248, timestampType#249]
+- SubqueryAlias jdf2
   +- StreamingRelation DataSource(org.apache.spark.sql.SparkSession@f662b5,kafka,List(),None,List(),None,Map(startingOffsets -> earliest, subscribe -> join_test, kafka.bootstrap.servers -> localhost:9092),None), kafka, [key#243, value#244, topic#245, partition#246, offset#247L, timestamp#248, timestampType#249]

Answer 1

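I don't think it makes any difference whether we try to join the same streaming DataFrame with itself or two different streaming DataFrames; either way, it is not supported.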

There are two ways to achieve it.

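First, you can join a static DataFrame with a streaming DataFrame: read the data once as a batch, and then read it again as a streaming DataFrame. Second, you can use Kafka Streams, which provides joins over streaming data.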
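The first option could look roughly like the sketch below. It reuses the broker address and topic from the question; the object name `StreamStaticJoinSketch`, the helper `joinOnKey`, and the `static_key` / `static_value` column names are made up for illustration. It assumes the `spark-sql-kafka-0-10` package is on the classpath, and it casts `key` and `value` to strings purely to keep the join condition simple. Note that the batch side is a bounded read of the topic, so this only approximates a true self-join of the live stream.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object StreamStaticJoinSketch {

  // Hypothetical helper: joins a streaming DataFrame with a static one on `key`.
  // The static side's columns are renamed so that no aliases are needed
  // in the join condition.
  def joinOnKey(streamDf: DataFrame, staticDf: DataFrame): DataFrame = {
    val left = streamDf.select(
      col("key").cast("string").as("key"),
      col("value").cast("string").as("value"))
    val right = staticDf.select(
      col("key").cast("string").as("static_key"),
      col("value").cast("string").as("static_value"))
    left.join(right, left("key") === right("static_key")).drop("static_key")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("stream-static-join").getOrCreate()

    // Streaming read of the topic, as in the question.
    val streamDf = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "join_test")
      .option("startingOffsets", "earliest")
      .load()

    // Batch (static) read of the same topic; this is a bounded snapshot.
    val staticDf = spark.read
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "join_test")
      .option("startingOffsets", "earliest")
      .option("endingOffsets", "latest")
      .load()

    // Stream-static inner join, written to the console for inspection.
    joinOnKey(streamDf, staticDf)
      .writeStream
      .format("console")
      .outputMode("append")
      .start()
      .awaitTermination()
  }
}
```

Renaming the static side's columns instead of relying on `as("jdf1")` / `as("jdf2")` sidesteps the alias resolution error shown in the question.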

Does Spark 2.2.0 support streaming self-joins?
