Receiving a JSON array from Kafka into a DataFrame in Spark

Date: 2018-12-16 09:24:25

Tags: json scala apache-spark spark-streaming-kafka

I am writing a Spark application in Scala using Spark Structured Streaming that receives some data from Kafka in JSON format. The application can receive a single JSON object or multiple objects formatted this way:

[{"key1":"value1","key2":"value2"},{"key1":"value1","key2":"value2"},...,{"key1":"value1","key2":"value2"}]

I tried to define a StructType:

var schema = StructType(
                  Array(
                        StructField("key1",DataTypes.StringType),
                        StructField("key2",DataTypes.StringType)
             ))

but it doesn't work. This is my actual code for parsing the JSON and getting it into a DataFrame:

var data = (this.stream).getStreamer().load()
  .selectExpr("CAST (value AS STRING) as json")
  .select(from_json($"json",schema=schema).as("data"))

Can anyone help me? Thanks!
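As a quick illustration (not from the original post): the reason a bare StructType does not fit is that the payload's top level is a JSON array, not a single object. A minimal plain-Python check on a sample payload (the values are assumed, matching the format above) makes this visible:

```python
import json

# Sample payload shaped like the Kafka messages described above (assumed values).
payload = '[{"key1":"value1","key2":"value2"},{"key1":"value3","key2":"value4"}]'

# The top-level JSON value parses to a Python list, not a dict:
# a struct-only schema has nothing to bind the outer array to.
parsed = json.loads(payload)
print(type(parsed).__name__)  # list
print(len(parsed))            # 2
```

This is why the answers below either parse the array explicitly or wrap the schema in an ArrayType.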

3 Answers:

Answer 0 (score: 0)

Since your input string is a JSON array, one approach is to write a UDF that parses the array, and then explode the parsed array. Below is the complete code with each step explained. It is written for batch, but it can be used for streaming with minimal changes.

import org.apache.spark.sql.SparkSession
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

object JsonParser{

  //case class to parse the incoming JSON String
  case class JSON(key1: String, key2: String)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.
      builder().
      appName("JSON").
      master("local").
      getOrCreate()

    import spark.implicits._
    import org.apache.spark.sql.functions._

    //sample JSON array String coming from kafka
    val str = Seq("""[{"key1":"value1","key2":"value2"},{"key1":"value3","key2":"value4"}]""")

    //UDF to parse JSON array String
    val jsonConverter = udf { jsonString: String =>
      val mapper = new ObjectMapper()
      mapper.registerModule(DefaultScalaModule)
      mapper.readValue(jsonString, classOf[Array[JSON]])
    }

    val df = str.toDF("json") //json String column
      .withColumn("array", jsonConverter($"json")) //parse the JSON Array
      .withColumn("json", explode($"array")) //explode the Array
      .drop("array") //drop unwanted columns
      .select("json.*") //explode the JSON to separate columns

    //display the DF
    df.show()
    //+------+------+
    //|  key1|  key2|
    //+------+------+
    //|value1|value2|
    //|value3|value4|
    //+------+------+

  }
}
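Outside Spark, the UDF-plus-explode pipeline above boils down to two steps: parse the array string into records, then flatten it to one record per row. A plain-Python sketch of that same logic (no Spark; the function name is illustrative):

```python
import json

def parse_and_explode(json_array_str):
    # Parse the JSON array string (what the UDF does with Jackson)...
    records = json.loads(json_array_str)
    # ...then emit one (key1, key2) tuple per element
    # (what explode + select("json.*") does in the DataFrame).
    return [(r["key1"], r["key2"]) for r in records]

rows = parse_and_explode('[{"key1":"value1","key2":"value2"},{"key1":"value3","key2":"value4"}]')
print(rows)  # [('value1', 'value2'), ('value3', 'value4')]
```

The Spark version does the same thing, only distributed across the rows of the DataFrame.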

Answer 1 (score: 0)

This worked well for me on Spark 3.0.0 with Scala 2.12.10. I use schema_of_json to get the schema of the data in a format suitable for from_json, then apply explode and the * operator in the last step of the chain to expand it accordingly.


Use the resulting string as your schema: 'array<struct<key1:string,key2:string>>', as shown below:

// TO KNOW THE SCHEMA
scala> val str = Seq("""[{"key1":"value1","key2":"value2"},{"key1":"value3","key2":"value4"}]""")
str: Seq[String] = List([{"key1":"value1","key2":"value2"},{"key1":"value3","key2":"value4"}])

scala> val df = str.toDF("json")
df: org.apache.spark.sql.DataFrame = [json: string]

scala> df.show()
+--------------------+
|                json|
+--------------------+
|[{"key1":"value1"...|
+--------------------+

scala> val schema = df.select(schema_of_json(df.select(col("json")).first.getString(0))).as[String].first
schema: String = array<struct<key1:string,key2:string>>

FYI, without the star (*) expansion, the intermediate results look like this:

// TO PARSE THE ARRAY OF JSON's
scala> val parsedJson1 = df.selectExpr("from_json(json, 'array<struct<key1:string,key2:string>>') as parsed_json")
parsedJson1: org.apache.spark.sql.DataFrame = [parsed_json: array<struct<key1:string,key2:string>>]

scala> parsedJson1.show()
+--------------------+
|         parsed_json|
+--------------------+
|[[value1, value2]...|
+--------------------+

scala> val data = parsedJson1.selectExpr("explode(parsed_json) as json").select("json.*")
data: org.apache.spark.sql.DataFrame = [key1: string, key2: string]

scala> data.show()
+------+------+
|  key1|  key2|
+------+------+
|value1|value2|
|value3|value4|
+------+------+

Answer 2 (score: 0)

  1. You can add ArrayType to your schema, and from_json will then convert the data to a JSON array.
var schema = ArrayType(StructType(
                  Array(
                        StructField("key1", DataTypes.StringType),
                        StructField("key2", DataTypes.StringType)
             )))
  2. Explode it to get each element of the JSON array in its own row.
val explodedDf = df.withColumn("jsonData", explode(from_json(col("value"), schema)))
  .select($"jsonData")
explodedDf.show
+----------------+
|        jsonData|
+----------------+
|[value1, value2]|
|[value3, value4]|
+----------------+
  3. Select the JSON keys:
explodedDf.select("jsonData.*").show
+------+------+
|  key1|  key2|
+------+------+
|value1|value2|
|value3|value4|
+------+------+