I have a Scala-based Structured Streaming program that needs to execute a Python-based model.
In an older version of Spark (1.6.x), I did this by converting the DStream to an RDD and then invoking the rdd.pipe method.
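For context, a minimal sketch of that legacy DStream pattern (the batch interval and app name here are illustrative, not the exact values I used):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("DStreamPipeExample")
val ssc = new StreamingContext(conf, Seconds(10))

// Each micro-batch is exposed as an RDD, so rdd.pipe is legal here:
// every element is written to the script's stdin as one line, and every
// line the script prints to stdout becomes an element of the result.
val lines = ssc.textFileStream("/Users/user/Desktop/spark_tutorial/")
val piped = lines.transform(rdd => rdd.pipe("/Users/user/Desktop/test.py"))

piped.print()
ssc.start()
ssc.awaitTermination()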
However, this approach does not work with Structured Streaming. It fails with the following error:
Queries with streaming sources must be executed with writeStream.start()
The code snippet is as follows:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql._
import org.apache.spark.sql.streaming.OutputMode

val sourceDF = spark.readStream.option("header", "true").schema(schema).csv("/Users/user/Desktop/spark_tutorial/")
// Fails on the next line: calling .rdd on a streaming Dataset forces batch execution of a streaming plan
val rdd: RDD[String] = sourceDF.rdd.map(row => row.mkString(","))
val pipedRDD: RDD[String] = rdd.pipe("/Users/user/Desktop/test.py")
val rowRDD: RDD[Row] = pipedRDD.map(row => Row.fromSeq(row.split(",")))
val newSchema = <code to create new schema>
val newDF = spark.createDataFrame(rowRDD, newSchema)
val query = newDF.writeStream.format("console").outputMode(OutputMode.Append()).start()
query.awaitTermination()
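For what it's worth, the same chain appears to be legal in batch mode (spark.read instead of spark.readStream), so it seems to be the streaming source that trips the check. A rough sketch of that batch variant, using the same schema and script path:

val batchDF = spark.read.option("header", "true").schema(schema).csv("/Users/user/Desktop/spark_tutorial/")
val batchPiped: RDD[String] = batchDF.rdd.map(_.mkString(",")).pipe("/Users/user/Desktop/test.py")
batchPiped.take(10).foreach(println)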
Exception stack trace:
19/01/22 00:10:00 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
FileSource[/Users/user/Desktop/spark_tutorial/]
    at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:374)
    at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:37)
    at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:35)
    at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
    at scala.collection.immutable.List.foreach(List.scala:392)
    at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
    at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForBatch(UnsupportedOperationChecker.scala:35)
    at org.apache.spark.sql.execution.QueryExecution.assertSupported(QueryExecution.scala:51)
    at org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:62)
    at org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:60)
    at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:66)
    at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:66)
    at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:72)
    at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:68)
    at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:77)
    at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:77)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
    at org.apache.spark.sql.Dataset.rdd$lzycompute(Dataset.scala:2975)
    at org.apache.spark.sql.Dataset.rdd(Dataset.scala:2973)
    at Test$.main(Test.scala:20)
    at Test.main(Test.scala)
Any suggestions?