Getting the error "Queries with streaming sources must be executed with writeStream.start()" on Spark Structured Streaming

Date: 2018-12-27 11:03:36

Tags: apache-spark apache-spark-sql spark-structured-streaming

I am running into a problem when executing Spark SQL on top of Spark Structured Streaming. Please find the error below.

Here is my code:

 import org.apache.spark.sql.SparkSession
 import org.apache.spark.sql.types.StructType

 object sparkSqlIntegration {
    def main(args: Array[String]) {
     val spark = SparkSession
         .builder
         .appName("StructuredStreaming")
         .master("local[*]")
         .config("spark.sql.warehouse.dir", "file:///C:/temp") // Necessary to work around a Windows bug in Spark 2.0.0; omit if you're not on Windows.
         .config("spark.sql.streaming.checkpointLocation", "file:///C:/checkpoint")
         .getOrCreate()

       setupLogging()
         val userSchema = new StructType().add("name", "string").add("age", "integer")
       // Create a stream of text files dumped into the logs directory
       val rawData =  spark.readStream.option("sep", ",").schema(userSchema).csv("file:///C:/Users/R/Documents/spark-poc-centri/csvFolder")

       // Must import spark.implicits for conversion to DataSet to work!
       import spark.implicits._
      rawData.createOrReplaceTempView("updates")
       val sqlResult= spark.sql("select * from updates")
       println("sql results here")
       sqlResult.show()
       println("Otheres")
       val query = rawData.writeStream.outputMode("append").format("console").start()

       // Keep going until we're stopped.
       query.awaitTermination()

       spark.stop()

    }
 }

During execution I get the following error. Since I am new to streaming, can anyone tell me how to execute Spark SQL queries on Spark Structured Streaming?

2018-12-27 16:02:40 INFO  BlockManager:54 - Initialized BlockManager: BlockManagerId(driver, LAPTOP-5IHPFLOD, 6829, None)
2018-12-27 16:02:41 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6731787b{/metrics/json,null,AVAILABLE,@Spark}
sql results here
Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
FileSource[file:///C:/Users/R/Documents/spark-poc-centri/csvFolder]
    at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:374)
    at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:37)
    at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:35)
    at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
    at scala.collection.immutable.List.foreach(List.scala:392)


1 Answer:

Answer 0 (score: 0)

You do not need these lines:

import spark.implicits._
rawData.createOrReplaceTempView("updates")
val sqlResult= spark.sql("select * from updates")
println("sql results here")
sqlResult.show()
println("Otheres")

Most importantly, select * is not needed. When the DataFrame is printed to the console, you will already see all of its columns, so there is also no need to register a temporary view just to give it a name.

And when you use format("console"), you do not need .show().
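As a minimal sketch, your pipeline can be reduced to reading the CSV stream and writing it straight to the console (path and schema reused from your snippet; adjust to your environment):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType

val spark = SparkSession.builder
  .appName("StructuredStreaming")
  .master("local[*]")
  .getOrCreate()

val userSchema = new StructType().add("name", "string").add("age", "integer")

// Streaming DataFrame over the CSV folder
val rawData = spark.readStream
  .option("sep", ",")
  .schema(userSchema)
  .csv("file:///C:/Users/R/Documents/spark-poc-centri/csvFolder")

// No show() and no temp view needed: the console sink prints every micro-batch
val query = rawData.writeStream
  .outputMode("append")
  .format("console")
  .start()

query.awaitTermination()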


Refer to the Spark examples for reading from a network socket and outputting to the console:

val words = // omitted ... some Streaming DataFrame

// Generating a running word count
val wordCounts = words.groupBy("value").count()

// Start running the query that prints the running counts to the console
val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()

Takeaway: use DataFrame operations such as .select() and .groupBy() rather than raw SQL.
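For instance, a running count per age over the rawData stream from your code can be expressed purely with DataFrame operations (a sketch; note that a streaming aggregation written to the console needs the "complete" or "update" output mode):

// Running count of rows per age, expressed with DataFrame operations instead of SQL
val ageCounts = rawData
  .select("name", "age")
  .groupBy("age")
  .count()

val query = ageCounts.writeStream
  .outputMode("complete")   // aggregations without a watermark cannot use "append"
  .format("console")
  .start()

query.awaitTermination()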


Alternatively, you can use Spark Streaming (the DStream API), as shown in those examples, which requires a foreachRDD over each stream batch, converting it to a DataFrame that you can then query:

/** Case class for converting RDD to DataFrame */
case class Record(word: String)

val words = // omitted ... some DStream

// Convert RDDs of the words DStream to DataFrame and run SQL query
words.foreachRDD { (rdd: RDD[String], time: Time) =>
  // Get the singleton instance of SparkSession
  val spark = SparkSessionSingleton.getInstance(rdd.sparkContext.getConf)
  import spark.implicits._

  // Convert RDD[String] to RDD[case class] to DataFrame
  val wordsDataFrame = rdd.map(w => Record(w)).toDF()

  // Creates a temporary view using the DataFrame
  wordsDataFrame.createOrReplaceTempView("words")

  // Do word count on table using SQL and print it
  val wordCountsDataFrame =
    spark.sql("select word, count(*) as total from words group by word")
  println(s"========= $time =========")
  wordCountsDataFrame.show()
}

ssc.start()
ssc.awaitTermination()
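For completeness, a minimal sketch of the setup this snippet assumes: a StreamingContext (ssc) and a words DStream, here fed from a socket as in Spark's SqlNetworkWordCount example (the host, port and batch interval are placeholders). The SparkSessionSingleton helper used above is also defined in that example.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf().setMaster("local[*]").setAppName("SqlNetworkWordCount")
val ssc = new StreamingContext(sparkConf, Seconds(2))

// DStream of words read from a TCP socket; localhost:9999 is just a placeholder
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))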