Creating the proper schema for a spark-streaming RDD

Asked: 2018-05-11 23:34:46

Tags: json scala apache-spark-sql spark-dataframe spark-streaming

I'm consuming a DStream from Kafka whose records look roughly like the one sketched below. I've been struggling to get the schema right for the nested JSON fields; what I'm missing is the ability to get the actual values rather than an array or RDD type. Any help is appreciated.
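
A record matching the schema looks roughly like this (all field values here are illustrative, not real data):

    {
      "Source": "10.0.0.1:57500",
      "Telemetry": {
        "collection_end_time": 1526023906000,
        "collection_id": 1011,
        "collection_start_time": 1526023905000,
        "encoding_path": "some/encoding/path",
        "msg_timestamp": 1526023906000,
        "node_id_str": "router-1",
        "subscription_id_str": "sub-1"
      },
      "Rows": [
        {
          "Timestamp": 1526023905999,
          "Keys": { "interface-name": "Bundle-Ether56" },
          "Content": {
            "bandwidth": 100000000,
            "input-data-rate": 10,
            "input-load": 0,
            "input-packet-rate": 1,
            "load-interval": 30,
            "output-data-rate": 12,
            "output-load": 0,
            "output-packet-rate": 2,
            "peak-input-data-rate": 25,
            "peak-input-packet-rate": 3,
            "peak-output-data-rate": 30,
            "peak-output-packet-rate": 4,
            "reliability": 255
          }
        }
      ]
    }

Here is the schema I set up: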

    import org.apache.spark.sql.types._

    val schema_array = StructType(Array(
      StructField("Source", StringType),
      StructField("Telemetry", StructType(Array(
        StructField("collection_end_time", LongType),
        StructField("collection_id", LongType),
        StructField("collection_start_time", LongType),
        StructField("encoding_path", StringType),
        StructField("msg_timestamp", LongType),
        StructField("node_id_str", StringType),
        StructField("subscription_id_str", StringType)
      ))),
      // Rows is an array of structs, one entry per interface sample
      StructField("Rows", ArrayType(StructType(Array(
        StructField("Timestamp", LongType),
        StructField("Keys", StructType(Array(
          StructField("interface-name", StringType)
        ))),
        StructField("Content", StructType(Array(
          StructField("bandwidth", LongType),
          StructField("input-data-rate", LongType),
          StructField("input-load", LongType),
          StructField("input-packet-rate", LongType),
          StructField("load-interval", LongType),
          StructField("output-data-rate", LongType),
          StructField("output-load", LongType),
          StructField("output-packet-rate", LongType),
          StructField("peak-input-data-rate", LongType),
          StructField("peak-input-packet-rate", LongType),
          StructField("peak-output-data-rate", LongType),
          StructField("peak-output-packet-rate", LongType),
          StructField("reliability", LongType)
        )))
      ))))
    ))

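For completeness, the stream is assumed to be a direct Kafka stream created along these lines (spark-streaming-kafka-0-10 API; the broker address, group id, and topic name below are placeholders):

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    // Placeholder connection settings -- adjust for your cluster
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "telemetry-consumer"
    )

    // ssc is an existing StreamingContext; "telemetry" is a placeholder topic
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](Seq("telemetry"), kafkaParams)
    )
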
    stream.foreachRDD { (rdd, time) =>
      // Each Kafka record's value is a raw JSON string
      val data = rdd.map(record => record.value)
      // Parse with the schema (reading JSON straight from an RDD is
      // deprecated in Spark 2.2+, but still works)
      val jsonData = spark.read.schema(schema_array).json(data)

      // Rows is an ArrayType, so this selects an *array* of interface names
      val result = jsonData.select("Rows.Keys.interface-name")
      result.show()
    }

What this prints is the following (the value comes back wrapped in an array, because Rows is an ArrayType):

+----------------+
|  interface-name|
+----------------+
|[Bundle-Ether56]|
+----------------+

The result I expect is:

+----------------+
|  interface-name|
+----------------+
| Bundle-Ether56 |
+----------------+


1 Answer:

Answer 0 (score: 1)

After digging for a while, I found that the explode method does what I want. Because I'm working inside foreachRDD and each record arrives on its own, I believe I can safely flatten my records this way.

    import org.apache.spark.sql.functions.explode
    import spark.implicits._ // for the $"..." column syntax

    // explode flattens the array so each interface name becomes its own row
    val result = jsonData.select(explode($"Rows.Keys.interface-name"))
    result.show()

Result (explode names its output column "col" by default):

+--------------+
|           col|
+--------------+ 
|Bundle-Ether56|
+--------------+
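
As a side note (an alternative, not from the original answer): since each record here carries a single Rows entry, indexing into the array directly would work too; a sketch, assuming the same jsonData as above:

    import org.apache.spark.sql.functions.col

    // getItem(0) takes the first (and assumed only) element of the array;
    // alias restores a readable name instead of explode's default "col"
    val result = jsonData.select(
      col("Rows.Keys.interface-name").getItem(0).alias("interface-name"))
    result.show()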