I am consuming a DStream from Kafka whose records look like the one below. I have been struggling to get the schema set up correctly for the nested JSON fields. Below is a sample of what I'm doing. What I'm missing is the ability to get the actual value rather than an array or RDD type. Any help is appreciated.
import org.apache.spark.sql.types._

val schema_array = StructType(Array(
  StructField("Source", StringType),
  StructField("Telemetry", StructType(Array(
    StructField("collection_end_time", LongType),
    StructField("collection_id", LongType),
    StructField("collection_start_time", LongType),
    StructField("encoding_path", StringType),
    StructField("msg_timestamp", LongType),
    StructField("node_id_str", StringType),
    StructField("subscription_id_str", StringType)
  ))),
  StructField("Rows", ArrayType(StructType(Array(
    StructField("Timestamp", LongType),
    StructField("Keys", StructType(Array(
      StructField("interface-name", StringType)
    ))),
    StructField("Content", StructType(Array(
      StructField("bandwidth", LongType),
      StructField("input-data-rate", LongType),
      StructField("input-load", LongType),
      StructField("input-packet-rate", LongType),
      StructField("load-interval", LongType),
      StructField("output-data-rate", LongType),
      StructField("output-load", LongType),
      StructField("output-packet-rate", LongType),
      StructField("peak-input-data-rate", LongType),
      StructField("peak-input-packet-rate", LongType),
      StructField("peak-output-data-rate", LongType),
      StructField("peak-output-packet-rate", LongType),
      StructField("reliability", LongType)
    )))
  ))))
))
stream.foreachRDD { (rdd, time) =>
  // Pull the JSON payload out of each Kafka record
  val data = rdd.map(record => record.value)
  val jsonData = spark.read.schema(schema_array).json(data)
  val result = jsonData.select("Rows.Keys.interface-name")
  result.show()
}
My result is:
+----------------+
| interface-name|
+----------------+
|[Bundle-Ether56]|
+----------------+
The expected result is:
+----------------+
| interface-name|
+----------------+
| Bundle-Ether56 |
+----------------+
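As far as I can tell, the array shows up because Rows is declared as an ArrayType: selecting Rows.Keys.interface-name projects the field across every element of the array, so the column comes back as array&lt;string&gt; rather than string. A quick sanity check (a sketch against the jsonData frame above):

// Because Rows is an ArrayType, the projected column should be
// reported as array<string>, one entry per element of Rows.
jsonData.select("Rows.Keys.interface-name").printSchema()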
Answer 0 (score: 1)
After some digging, I found that the explode method seems to do what I want. I believe that since I'm doing a foreachRDD and only getting one record at a time, I can safely flatten my record.
import org.apache.spark.sql.functions.explode
import spark.implicits._ // needed for the $"..." column syntax

val result = jsonData.select(explode($"Rows.Keys.interface-name"))
result.show()
Result:
+--------------+
| col|
+--------------+
|Bundle-Ether56|
+--------------+
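If you need several fields per row, a variation on the same idea (a sketch, not from the original answer; it assumes the jsonData frame and schema above, plus import spark.implicits._ for the $ syntax) is to explode the Rows array once and then select multiple nested fields side by side. The .as(...) aliases also replace the auto-generated "col" column name seen above:

import org.apache.spark.sql.functions.explode

// Explode Rows once so each output row is one element of the array,
// then pull several nested fields out of the exploded struct.
val rows = jsonData.select($"Source", explode($"Rows").as("row"))
val flat = rows.select(
  $"Source",
  $"row.Timestamp".as("timestamp"),
  $"row.Keys.interface-name".as("interface-name"),
  $"row.Content.input-data-rate".as("input-data-rate")
)
flat.show()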