Spark Streaming writeStream problem

Asked: 2020-03-10 04:34:07

Tags: json schema spark-streaming

I am trying to create a schema dynamically from the JSON records in a text file, since each record can have a different schema. Below is my code.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.sql.functions.{lit, schema_of_json, from_json, col}

object streamingexample {
  def main(args: Array[String]): Unit = {
    val spark:SparkSession = SparkSession.builder()
      .master("local[*]")
      .appName("SparkByExamples")
      .getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")
    import spark.implicits._
    val df1 = spark.readStream.textFile("C:\\Users\\sheol\\Desktop\\streaming")
    val newdf11=df1
    val json_schema = newdf11.select("value").collect().map(x => x.get(0)).mkString(",")
    val df2 = df1.select(from_json($"value", schema_of_json(json_schema)).alias("value_new"))
    val df3 = df2.select($"value_new.*")
    df3.printSchema()
    df3.writeStream
      .option("truncate", "false")
      .format("console")
      .start()
      .awaitTermination()
  }
}

I am getting the error below. Please help me fix the code; I have tried a lot and cannot figure it out.

Error: Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;

Sample data:

{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}

2 Answers:

Answer 0 (score: 1)

As you already know, this statement in your code is what causes the problem:

val json_schema = newdf11.select("value").collect().map(x => x.get(0)).mkString(",")

You can obtain the JSON schema in a way similar to the following:

val dd: DataFrame = spark.read.json("C:\\Users\\sheol\\Desktop\\streaming")
dd.show()
// you can also use val df1 = spark.readStream.textFile(yourfile)

val json_schema = dd.schema.json
println(json_schema)

Result:

+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

{"type":"struct","fields":[{"name":"age","type":"long","nullable":true,"metadata":{}},{"name":"name","type":"string","nullable":true,"metadata":{}}]}

You can refine this further for your requirements; I will leave that to you.
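A minimal sketch (my addition, not part of the original answer) of how the inferred schema could be fed back into the streaming query. It assumes the same directory path as the question; `DataType.fromJson` parses the JSON schema string back into a `StructType`, which `from_json` accepts directly, avoiding the illegal `collect()` on a streaming DataFrame:

```scala
import org.apache.spark.sql.types.{DataType, StructType}
import org.apache.spark.sql.functions.from_json
import spark.implicits._

// Infer the schema once with a batch read of the same directory...
val staticDf = spark.read.json("C:\\Users\\sheol\\Desktop\\streaming")
val schema: StructType =
  DataType.fromJson(staticDf.schema.json).asInstanceOf[StructType]

// ...then reuse it to parse the streaming source.
val streamDf = spark.readStream.textFile("C:\\Users\\sheol\\Desktop\\streaming")
val parsed = streamDf
  .select(from_json($"value", schema).alias("value_new"))
  .select($"value_new.*")
```

(If you do not need the schema as a JSON string, `staticDf.schema` can of course be passed to `from_json` directly.)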

Answer 1 (score: 0)

This exception occurs because you are trying to access data from the stream before the stream has been started. The problem is around df3.printSchema(); make sure any such call happens only after the stream has started.
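As a sketch of the ordering this answer describes (my addition; it assumes `df3` was built with an explicit schema, so no `collect()` on the stream is needed):

```scala
// Start the query first, keep a handle to it...
val query = df3.writeStream
  .format("console")
  .option("truncate", "false")
  .start()

// ...then inspect the stream. printSchema() only reads metadata,
// so it is safe once the schema is known up front.
df3.printSchema()

query.awaitTermination()
```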