Executing a Python-based model from a Scala-based Spark Structured Streaming program

Time: 2019-01-22 08:37:03

Tags: scala apache-spark rdd spark-structured-streaming apache-spark-dataset

I have a Scala-based Structured Streaming program that needs to execute a Python-based model.

In an older version of Spark (1.6.x), I used to do this by converting the DStream to an RDD and then invoking the rdd.pipe method.
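For reference, a minimal sketch of that legacy pattern (the socket source, batch interval, and variable names here are illustrative, not from the original program):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Spark 1.6-style DStreams: each micro-batch RDD is piped through an
// external script, which reads lines on stdin and writes lines to stdout.
val conf = new SparkConf().setAppName("PipeModel")
val ssc = new StreamingContext(conf, Seconds(10))

val lines = ssc.socketTextStream("localhost", 9999) // illustrative source
val scored = lines.transform(rdd => rdd.pipe("/Users/user/Desktop/test.py"))
scored.print()

ssc.start()
ssc.awaitTermination()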

However, this approach does not work with Structured Streaming. It fails with the following error:

  Queries with streaming sources must be executed with writeStream.start()

The code snippet is as follows:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql._
import org.apache.spark.sql.streaming.OutputMode

val sourceDF = spark.readStream
  .option("header", "true")
  .schema(schema)
  .csv("/Users/user/Desktop/spark_tutorial/")

// This is where it fails: .rdd forces batch execution of a streaming source
val rdd: RDD[String] = sourceDF.rdd.map(row => row.mkString(","))
val pipedRDD: RDD[String] = rdd.pipe("/Users/user/Desktop/test.py")

val rowRDD: RDD[Row] = pipedRDD.map(row => Row.fromSeq(row.split(",")))

val newSchema = <code to create new schema>

val newDF = spark.createDataFrame(rowRDD, newSchema)
val query = newDF.writeStream.format("console").outputMode(OutputMode.Append()).start()
query.awaitTermination()
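Not part of the original post, but one direction a reader might explore: Spark 2.4 added DataStreamWriter.foreachBatch, which hands each micro-batch to user code as an ordinary batch DataFrame, on which .rdd and pipe are legal. A sketch under that assumption, reusing sourceDF from the snippet above (schema handling and output sink omitted):

// Spark 2.4+ sketch: process each micro-batch as a plain batch DataFrame.
val query = sourceDF.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    val piped = batchDF.rdd
      .map(_.mkString(","))
      .pipe("/Users/user/Desktop/test.py") // same stdin/stdout contract as rdd.pipe
    piped.take(20).foreach(println)        // e.g. inspect the model's output
  }
  .start()
query.awaitTermination()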

The exception stack trace:

19/01/22 00:10:00 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
FileSource[/Users/user/Desktop/spark_tutorial/]
    at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:374)
    at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:37)
    at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:35)
    at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
    at scala.collection.immutable.List.foreach(List.scala:392)
    at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
    at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForBatch(UnsupportedOperationChecker.scala:35)
    at org.apache.spark.sql.execution.QueryExecution.assertSupported(QueryExecution.scala:51)
    at org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:62)
    at org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:60)
    at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:66)
    at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:66)
    at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:72)
    at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:68)
    at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:77)
    at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:77)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
    at org.apache.spark.sql.Dataset.rdd$lzycompute(Dataset.scala:2975)
    at org.apache.spark.sql.Dataset.rdd(Dataset.scala:2973)
    at Test$.main(Test.scala:20)
    at Test.main(Test.scala)

Any suggestions?

0 Answers:

There are no answers.