I am fetching data from a blob location, as shown below.
| NUM_ID| Event|
+-------+-----------------------------------------------------------------------------------------------------------------------------------+
|XXXXX01|[{"SN":"SIG1","E":1571599398000,"V":19.79},{"SN":"SIG1","E":1571599406000,"V":19.80},{"SN":"SIG2","E":1571599406000,"V":25.30},{...|
|XXXXX02|[{"SN":"SIG1","E":1571599414000,"V":19.79},{"SN":"SIG2","E":1571599414000,"V":19.80},{"SN":"SIG2","E":1571599424000,"V":25.30},{...|
Taking just one row, it looks like this:
|XXXXX01|[{"SN":"SIG1","E":1571599398000,"V":19.79},{"SN":"SIG1","E":1571599406000,"V":19.80},{"SN":"SIG1","E":1571599414000,"V":19.20},{"SN":"SIG2","E":1571599424000,"V":25.30},{"SN":"SIG2","E":1571599432000,"V":19.10},{"SN":"SIG3","E":1571599440000,"V":19.10},{"SN":"SIG3","E":1571599448000,"V":19.10},{"SN":"SIG3","E":1571599456000,"V":19.10},{"SN":"SIG3","E":1571599396000,"V":19.79},{"SN":"SIG3","E":1571599404000,"V":19.79}]
The Event column holds multiple signals, each with an E,V pair. The schema of this dataframe is as below.
scala> df.printSchema
root
|-- NUM_ID: string (nullable = true)
|-- Event: string (nullable = true)
I want to extract certain signals (say I need only SIG1 and SIG3), with their E,V pairs as new columns, like this:
+-------+--------+--------------+------+
| NUM_ID| Event| E| V|
+-------+--------+--------------+------+
|XXXXX01| SIG1| 1571599398000| 19.79|
|XXXXX01| SIG1| 1571599406000| 19.80|
|XXXXX01| SIG1| 1571599414000| 19.20|
|XXXXX01| SIG3| 1571599440000| 19.10|
|XXXXX01| SIG3| 1571599448000| 19.10|
|XXXXX01| SIG3| 1571599406000| 19.10|
|XXXXX01| SIG3| 1571599396000| 19.70|
|XXXXX01| SIG3| 1571599404000| 19.70|
+-------+--------+--------------+------+
and the final output for each NUM_ID should look like this:
+-------+--------------+------+------+
| NUM_ID| E|SIG1 V|SIG3 V|
+-------+--------------+------+------+
|XXXXX01| 1571599398000| 19.79| null|
|XXXXX01| 1571599406000| 19.80| 19.70|
|XXXXX01| 1571599414000| 19.20| null|
|XXXXX01| 1571599440000| null| 19.10|
|XXXXX01| 1571599448000| null| 19.10|
|XXXXX01| 1571599448000| null| 19.10|
|XXXXX01| 1571599406000| 19.80| 19.10|
|XXXXX01| 1571599396000| null| 19.70|
|XXXXX01| 1571599404000| null| 19.70|
+-------+--------------+------+------+
Any leads would be appreciated. Thanks in advance!
Answer 0 (score: 1)
The Event column above contains multiple records per row, so the data must be flattened before it can be processed further. Flattening can be done with a flatMap transformation on the DataFrame.
The approach is to build one flattened JSON string per signal record, carrying all the required keys and values, and finally convert those JSON strings into a DataFrame via Spark's read json API.
import com.fasterxml.jackson.databind.ObjectMapper
import spark.implicits._

val mapper = new ObjectMapper()

// Flatten each row: parse the Event JSON array, stamp NUM_ID onto every
// element, and emit one JSON string per signal record.
val flatDF = df.flatMap(row => {
  val numId = row.getAs[String]("NUM_ID")
  val event = row.getAs[String]("Event")
  val data = mapper.readValue(event, classOf[Array[java.util.Map[String, String]]])
  data.map(jsonMap => {
    jsonMap.put("NUM_ID", numId)
    mapper.writeValueAsString(jsonMap)
  })
})

// Let Spark infer the schema from the flattened JSON strings
val finalDF = spark.read.json(flatDF)
// finalDF output
+-------------+-------+----+-----+
| E| NUM_ID| SN| V|
+-------------+-------+----+-----+
|1571599398000|XXXXX01|SIG1|19.79|
|1571599406000|XXXXX01|SIG1| 19.8|
|1571599406000|XXXXX01|SIG2| 25.3|
|1571599414000|XXXXX02|SIG1|19.79|
|1571599414000|XXXXX02|SIG2| 19.8|
|1571599424000|XXXXX02|SIG2| 25.3|
+-------------+-------+----+-----+
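finalDF above is still in long format. To reach the wide layout the question asks for (one SIG1 V / SIG3 V column per timestamp), a pivot along these lines could follow; this is a sketch, and using first to collapse duplicate (NUM_ID, E) pairs is an assumption:

import org.apache.spark.sql.functions.first

val wideDF = finalDF
  .filter($"SN".isin("SIG1", "SIG3"))   // keep only the requested signals
  .groupBy("NUM_ID", "E")
  .pivot("SN", Seq("SIG1", "SIG3"))     // one value column per signal
  .agg(first("V"))
  .withColumnRenamed("SIG1", "SIG1 V")
  .withColumnRenamed("SIG3", "SIG3 V")

Timestamps where a signal has no reading come out as null, matching the expected output.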
Answer 1 (score: 0)
This can be achieved by parsing the JSON with a schema and exploding the resulting column, as below.
import org.apache.spark.sql.functions.{explode, from_json}
import org.apache.spark.sql.types._
import spark.implicits._

// Schema of one element of the JSON array held in the Event column
val schema = ArrayType(StructType(Seq(
  StructField("SN", StringType),
  StructField("E", StringType),
  StructField("V", StringType))))

// Parse the Event string into an array of structs, then explode so
// each signal record becomes its own row
val structDF = fromBlobDF.withColumn("sig_array", from_json($"Event", schema))
val signalsDF = structDF
  .withColumn("sig_array", explode($"sig_array"))
  .withColumn("SIGNAL", $"sig_array.SN")
  .withColumn("E", $"sig_array.E")
  .withColumn("V", $"sig_array.V")
  .select("NUM_ID", "E", "SIGNAL", "V")
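signalsDF matches the intermediate table in the question. The remaining step to the per-NUM_ID wide output can be sketched as a filter plus pivot (pivotDF is a hypothetical name, and first is an arbitrary choice for collapsing duplicate E values):

import org.apache.spark.sql.functions.first

val pivotDF = signalsDF
  .filter($"SIGNAL".isin("SIG1", "SIG3"))
  .groupBy("NUM_ID", "E")
  .pivot("SIGNAL", Seq("SIG1", "SIG3"))
  .agg(first("V"))
  .withColumnRenamed("SIG1", "SIG1 V")
  .withColumnRenamed("SIG3", "SIG3 V")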