This is the schema of the incoming data stream. I am processing the data with Spark Streaming on Spark 2.3.2.
    import org.apache.spark.sql.types._

    val schema = StructType(Seq(
      StructField("status", StringType),
      StructField("data", StructType(Seq(
        StructField("resultType", StringType),
        StructField("result", ArrayType(StructType(Array(
          StructField("metric", StructType(Seq(
            StructField("application", StringType),
            StructField("component", StringType),
            StructField("instance", StringType)))),
          StructField("value", ArrayType(StringType))
        ))))
      )))
    ))
This is how I apply the schema to each RDD of the DStream.
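    import org.apache.spark.sql.functions.from_json
    import spark.implicits._  // for toDS() and $"..."; assumes a SparkSession named spark

    val df = rdd.toDS()
      .selectExpr("cast (value as string) as myData")      // raw payload as a string column
      .select(from_json($"myData", schema).as("myData"))   // parse the JSON with the schema
      .select($"myData.data.*")                            // expand the fields of data
      .select("result")                                    // keep only the result array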
The code above produces the following output:
{"result":[{"metric":{"application":"A","component":"S","instance":"tp01.net:9072"},"value":["1.542972576979E9","237006995456"]},
{"metric":{"application":"A","component":"S","instance":"tp02.net:9072"},"value":["1.542972576979E9","237006995456"]},
{"metric":{"application":"A","component":"S","instance":"tp03.net:9072"},"value":["1.542972576979E9","237006995456"]},
{"metric":{"application":"B","component":"S","instance":"bp03.net:9072"},"value":["1.542972576979E9","270860144640"]},
{"metric":{"application":"B","component":"S","instance":"bp04.net:9072"},"value":["1.542972576979E9","270860144640"]},
{"metric":{"application":"B","component":"S","instance":"ps01.net:9072"},"value":["1.542972576979E9","135177400320"]}
]}
However, in order to extract features, I need to convert the above into the following DataFrame:
application  component  instance       value1            value2
A            S          tp01.net:9072  1.542972576979E9  237006995456
A            S          tp02.net:9072  1.542972576979E9  237006995456
A            S          tp03.net:9072  1.542972576979E9  237006995456
B            S          bp03.net:9072  1.542972576979E9  270860144640
B            S          bp04.net:9072  1.542972576979E9  270860144640
B            S          ps01.net:9072  1.542972576979E9  135177400320
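As you can see, each row is already an exploded row, that is, one element of the result array. Any ideas on how to select the array values and the struct fields into a single DataFrame?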
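One direction that might work, sketched here under assumptions and not verified against the real stream: call `explode` on the `result` array so that each element becomes its own row, then select the fields of `metric` and index into `value` with `getItem`. The `FlattenMetrics` object, the local `SparkSession`, and the `"status"` and `"resultType"` values in the sample JSON are hypothetical and exist only for illustration. The schema and the two sample entries are taken from the data shown above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{explode, from_json}
import org.apache.spark.sql.types._

// Hypothetical standalone driver; in the real job this logic would sit where df is built.
object FlattenMetrics {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumption, not from the original job).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("flatten-metrics")
      .getOrCreate()
    import spark.implicits._

    // Same schema as in the question, split into named parts for readability.
    val metricType = StructType(Seq(
      StructField("application", StringType),
      StructField("component", StringType),
      StructField("instance", StringType)))
    val resultType = ArrayType(StructType(Seq(
      StructField("metric", metricType),
      StructField("value", ArrayType(StringType)))))
    val schema = StructType(Seq(
      StructField("status", StringType),
      StructField("data", StructType(Seq(
        StructField("resultType", StringType),
        StructField("result", resultType))))))

    // One JSON message; the "status" and "resultType" values are made up.
    val json =
      """{"status":"success","data":{"resultType":"vector","result":[
        |{"metric":{"application":"A","component":"S","instance":"tp01.net:9072"},"value":["1.542972576979E9","237006995456"]},
        |{"metric":{"application":"B","component":"S","instance":"ps01.net:9072"},"value":["1.542972576979E9","135177400320"]}
        |]}}""".stripMargin.replaceAll("\n", "")

    // Seq[String].toDS() yields a single string column named "value".
    val flat = Seq(json).toDS()
      .select(from_json($"value", schema).as("myData"))
      // One output row per element of the result array.
      .select(explode($"myData.data.result").as("r"))
      // Pull the struct fields up and split the two-element array into columns.
      .select(
        $"r.metric.application",
        $"r.metric.component",
        $"r.metric.instance",
        $"r.value".getItem(0).as("value1"),
        $"r.value".getItem(1).as("value2"))

    flat.show(false)
    spark.stop()
  }
}
```

If this is on the right track, the same `explode` and `select` steps could presumably be appended to `df` in place of the final `.select("result")`. Both value columns would still be strings, so a `cast` would be needed before using them as numeric features.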
Thanks.