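I have data in a DataFrame that was obtained from Azure Event Hubs. I then convert this data to JSON objects and store the fields I need in a Dataset, as shown below.
Code used to read the data from Event Hubs and store it in a DataFrame: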
import java.time.{Duration, Instant}
import org.apache.spark.eventhubs.{ConnectionStringBuilder, EventHubsConf, EventPosition}
import org.apache.spark.sql.functions.{col, get_json_object}

// Build the connection string for the Event Hub.
val connectionString = ConnectionStringBuilder(<ENDPOINT URL>)
  .setEventHubName(<EVENTHUB NAME>)
  .build
// Read only the events enqueued during the last 30 minutes.
val currTime = Instant.now
val ehConf = EventHubsConf(connectionString)
  .setConsumerGroup("<CONSUMER GRP>")
  .setStartingPosition(EventPosition.fromEnqueuedTime(currTime.minus(Duration.ofMinutes(30))))
  .setEndingPosition(EventPosition.fromEnqueuedTime(currTime))
val reader = spark.read.format("eventhubs").options(ehConf.toMap).load()
// Extract the fields of interest from the JSON body; each SIGn column remains a JSON string.
val SIGNALS = reader.select(
  get_json_object($"body".cast("string"), "$.NUM").alias("NUM"),
  get_json_object($"body".cast("string"), "$.SIG1").alias("SIG1"),
  get_json_object($"body".cast("string"), "$.SIG2").alias("SIG2"),
  get_json_object($"body".cast("string"), "$.SIG3").alias("SIG3"),
  get_json_object($"body".cast("string"), "$.SIG4").alias("SIG4")
)
// Keep only the rows in which all four signals are present.
val SIGNALSFiltered = SIGNALS.filter(col("SIG1").isNotNull &&
  col("SIG2").isNotNull && col("SIG3").isNotNull && col("SIG4").isNotNull)
The data in SIGNALSFiltered looks like this:
+-----------------+--------------------+--------------------+--------------------+--------------------+
| NUM| SIG1| SIG2| SIG3| SIG4|
+-----------------+--------------------+--------------------+--------------------+--------------------+
|XXXXX01|[{"TIME":15695605310...|[{"TIME":15695605310...|[{"TIME":15695605310...|[{"TIME":15695605310...|
|XXXXX02|[{"TIME":15695604780...|[{"TIME":15695604780...|[{"TIME":15695604780...|[{"TIME":15695604780...|
|XXXXX03|[{"TIME":15695605310...|[{"TIME":15695605310...|[{"TIME":15695605310...|[{"TIME":15695605310...|
|XXXXX04|[{"TIME":15695605310...|[{"TIME":15695605310...|[{"TIME":15695605310...|[{"TIME":15695605310...|
|XXXXX05|[{"TIME":15695605310...|[{"TIME":15695605310...|[{"TIME":15695605310...|[{"TIME":15695605310...|
|XXXXX06|[{"TIME":15695605340...|[{"TIME":15695605340...|[{"TIME":15695605340...|[{"TIME":15695605340...|
|XXXXX07|[{"TIME":15695605310...|[{"TIME":15695605310...|[{"TIME":15695605310...|[{"TIME":15695605310...|
|XXXXX08|[{"TIME":15695605310...|[{"TIME":15695605310...|[{"TIME":15695605310...|[{"TIME":15695605310...|
The full content of a single row looks like this:
|XXXXX01|[{"TIME":1569560531000,"VALUE":3.7825},{"TIME":1569560475000,"VALUE":3.7812},{"TIME":1569560483000,"VALUE":1.7812},{"TIME":1569560491000,"VALUE":7.7875}]|
[{"TIME":1569560537000,"VALUE":3.7825},{"TIME":1569560481000,"VALUE":9.7825},{"TIME":1569560489000,"VALUE":5.7825},{"TIME":1569560497000,"VALUE":34.7825}]|
[{"TIME":1569560505000,"VALUE":34.7825},{"TIME":1569560513000,"VALUE":9.7825},{"TIME":1569560521000,"VALUE":34.7825},{"TIME":1569560527000,"VALUE":4.7825}]|
[{"TIME":1569560535000,"VALUE":7.7825},{"TIME":1569560479000,"VALUE":35.7825},{"TIME":1569560487000,"VALUE":3.7825}]
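I want to turn each TIME/VALUE pair in every signal column into a new row.
Is there a way to transform the base Dataset as shown below? Each element in a column should become a new row.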
+-------+-------------+----------+-------------+----------+-------------+----------+-------------+----------+
|    NUM|    SIG1 TIME|SIG1 VALUE|    SIG2 TIME|SIG2 VALUE|    SIG3 TIME|SIG3 VALUE|    SIG4 TIME|SIG4 VALUE|
+-------+-------------+----------+-------------+----------+-------------+----------+-------------+----------+
|XXXXX01|1569560531000|    3.7825|1569560531000|    4.7825|1569560531000|    8.7825|1569560531000|    2.7825|
|XXXXX01|1569560531000|    1.7825|1569560531000|    1.7825|         null|      null|1569560531000|    2.7825|
|XXXXX01|1569560531000|    3.7825|1569560531000|    4.7825|1569560531000|    8.7825|1569560531000|    7.7825|
|XXXXX02|1569560531000|    7.7825|1569560531000|    4.7825|1569560531000|    8.7825|1569560531000|    2.7825|
|XXXXX02|         null|      null|1569560531000|    5.7825|1569560531000|    7.7825|1569560531000|    5.7825|
|XXXXX02|1569560531000|    3.7825|1569560531000|    4.7825|1569560531000|    8.7825|1569560531000|    2.7825|
|XXXXX02|1569560531000|    5.7825|1569560531000|    7.7825|1569560531000|    9.7825|1569560531000|    2.7825|
+-------+-------------+----------+-------------+----------+-------------+----------+-------------+----------+
Any leads or help would be appreciated. Thanks in advance!
Answer 0 (score: 1)
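You can do this with the explode function. It generates a new row for each element of the array, after which you can use dot syntax (which accesses the fields of a struct) to read the time and value fields. Here is a simple example for the first column: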
data
.withColumn("sig1_obj", explode($"SIG1"))
.withColumn("sig1_time", $"sig1_obj.time")
.withColumn("sig1_value", $"sig1_obj.value")
.show()
+--------------------+--------------------+-------------+----------+
| SIG1| sig1_obj| sig1_time|sig1_value|
+--------------------+--------------------+-------------+----------+
|[[1569560531000, ...|[1569560531000, 3...|1569560531000| 3.7825|
|[[1569560531000, ...|[1569560475000, 3...|1569560475000| 3.7812|
|[[1569560531000, ...|[1569560483000, 1...|1569560483000| 1.7812|
|[[1569560531000, ...|[1569560491000, 7...|1569560491000| 7.7875|
+--------------------+--------------------+-------------+----------+
You can process the other columns in the same way.
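Also note that this technique multiplies the data: after exploding the second column you will have n*m rows, where n is the number of elements in the sig1 array and m is the number of elements in the sig2 array, and so on. If you do not want that, you can explode each column in a separate DataFrame and then full outer join those DataFrames on some fields (for example, compute a row_number over the rows of each NUM and join on the NUM column and the row_number).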
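The separate-explode-then-join idea can be sketched as follows. This is only an illustration under stated assumptions: the helper explodeSignal, the sample rows, the rn column, and the choice to order the row_number by time are all hypothetical, and the signal columns are assumed to be arrays of (time, value) pairs already rather than JSON strings.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, explode, row_number}

// Local session for illustration only; spark-shell and notebooks already provide `spark`.
val spark = SparkSession.builder().master("local[*]").appName("explode-join-sketch").getOrCreate()
import spark.implicits._

// Hypothetical input: SIG1 has three (time, value) pairs, SIG2 has two.
val data = Seq(
  ("XXXXX01",
    Seq((11L, 3.7825), (12L, 3.7812), (13L, 1.7812)),
    Seq((21L, 9.7825), (22L, 5.7825)))
).toDF("NUM", "SIG1", "SIG2")

// Explode one signal column and number its elements within each NUM (assumed ordering: by time).
def explodeSignal(df: DataFrame, sig: String): DataFrame = {
  val byNum = Window.partitionBy("NUM").orderBy("time")
  df.select(col("NUM"), explode(col(sig)).alias("obj"))
    // Tuples become structs with fields _1 (time) and _2 (value).
    .select(col("NUM"), col("obj._1").alias("time"), col("obj._2").alias("value"))
    .withColumn("rn", row_number().over(byNum))
    .select(col("NUM"), col("rn"),
      col("time").alias(s"${sig}_TIME"), col("value").alias(s"${sig}_VALUE"))
}

// Full outer join on (NUM, rn): the shorter signal is padded with nulls instead of multiplying rows.
val joined = explodeSignal(data, "SIG1")
  .join(explodeSignal(data, "SIG2"), Seq("NUM", "rn"), "full_outer")
  .orderBy("NUM", "rn")

joined.show(false)
```

With this input the result has max(n, m) = 3 rows rather than n*m = 6, and the SIG2 columns are null in the third row. Further signal columns could be added by chaining more joins on the same two keys.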
Edit:
Because the sig columns are of StringType, you can first use the from_json function to convert each string field into an array of structs, and then explode it as shown above.
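A minimal sketch of that conversion, assuming each element has a long TIME and a double VALUE as in the question's sample row. The names signals, sigSchema, and sig1Rows, the output column names, and the sample data are illustrative assumptions, not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode, from_json}
import org.apache.spark.sql.types.{ArrayType, DoubleType, LongType, StructField, StructType}

// Local session for illustration only; spark-shell and notebooks already provide `spark`.
val spark = SparkSession.builder().master("local[*]").appName("from-json-sketch").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for SIGNALSFiltered: SIG1 holds a JSON array as a plain string.
val signals = Seq(
  ("XXXXX01",
    """[{"TIME":1569560531000,"VALUE":3.7825},{"TIME":1569560475000,"VALUE":3.7812}]""")
).toDF("NUM", "SIG1")

// Assumed schema of one signal column: an array of {TIME, VALUE} structs.
val sigSchema = ArrayType(StructType(Seq(
  StructField("TIME", LongType),
  StructField("VALUE", DoubleType)
)))

// Parse the string into an array of structs, then explode it into one row per element.
val sig1Rows = signals
  .withColumn("sig1_obj", explode(from_json(col("SIG1"), sigSchema)))
  .select(
    col("NUM"),
    col("sig1_obj.TIME").alias("SIG1_TIME"),
    col("sig1_obj.VALUE").alias("SIG1_VALUE"))

sig1Rows.show(false)
```

The same schema can be reused for SIG2 through SIG4, since all four columns share the TIME/VALUE layout in the sample data.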
Answer 1 (score: 1)
scala> SIGNALSFiltered.show(false)
+-------+--------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------+
|NUM |SIG1 |SIG2 |SIG3 |SIG4 |
+-------+--------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------+
|XXXXX01|[{"TIME":11,"VALUE":3.7825},{"TIME":12,"VALUE":3.7812},{"TIME":13,"VALUE":3.7812},{"TIME":14,"VALUE":34.7875}]|[{"TIME":21,"VALUE":3.7825},{"TIME":22,"VALUE":34.7825},{"TIME":23,"VALUE":34.7825},{"TIME":24,"VALUE":34.7825}]|[{"TIME":31,"VALUE":34.7825},{"TIME":32,"VALUE":34.7825},{"TIME":33,"VALUE":34.7825},{"TIME":34,"VALUE":34.7825}]|[{"TIME":41,"VALUE":34.7825},{"TIME":42,"VALUE":34.7825},{"TIME":43,"VALUE":34.7825}]|
+-------+--------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------+
scala> import scala.collection.mutable.ListBuffer
scala> import org.apache.spark.sql.expressions.UserDefinedFunction
scala> import org.apache.spark.sql.functions.{arrays_zip, col, explode, udf}
scala> import scala.util.parsing.json._
scala> def flatTime:UserDefinedFunction = udf((json:String) => {
| val pars = JSON.parseFull(json)
| var outputList = new ListBuffer[String]()
| pars.foreach{ x =>
| val y = x.asInstanceOf[List[Any]]
| y.foreach{ zz =>
| val z = zz.asInstanceOf[Map[String,Double]]
| val tempStr = """[{"TIME" : """ + z("TIME").toString + """ ,"VALUE": """ + z("VALUE").toString + """}]"""
| outputList += tempStr
| }
| }
| outputList.toList
| })
scala> SIGNALSFiltered.withColumn("var", explode(arrays_zip(flatTime(col("SIG1")),flatTime(col("SIG2")),flatTime(col("SIG3")),flatTime(col("SIG4"))))).select(col("NUM"), col("var.0").alias("SIG1"),col("var.1").alias("SIG2"),col("var.2").alias("SIG3"),col("var.3").alias("SIG4")).show(false)
+-------+-----------------------------------+-----------------------------------+-----------------------------------+-----------------------------------+
|NUM |SIG1 |SIG2 |SIG3 |SIG4 |
+-------+-----------------------------------+-----------------------------------+-----------------------------------+-----------------------------------+
|XXXXX01|[{"TIME" : 11.0 ,"VALUE": 3.7825}] |[{"TIME" : 21.0 ,"VALUE": 3.7825}] |[{"TIME" : 31.0 ,"VALUE": 34.7825}]|[{"TIME" : 41.0 ,"VALUE": 34.7825}]|
|XXXXX01|[{"TIME" : 12.0 ,"VALUE": 3.7812}] |[{"TIME" : 22.0 ,"VALUE": 34.7825}]|[{"TIME" : 32.0 ,"VALUE": 34.7825}]|[{"TIME" : 42.0 ,"VALUE": 34.7825}]|
|XXXXX01|[{"TIME" : 13.0 ,"VALUE": 3.7812}] |[{"TIME" : 23.0 ,"VALUE": 34.7825}]|[{"TIME" : 33.0 ,"VALUE": 34.7825}]|[{"TIME" : 43.0 ,"VALUE": 34.7825}]|
|XXXXX01|[{"TIME" : 14.0 ,"VALUE": 34.7875}]|[{"TIME" : 24.0 ,"VALUE": 34.7825}]|[{"TIME" : 34.0 ,"VALUE": 34.7825}]|null |
+-------+-----------------------------------+-----------------------------------+-----------------------------------+-----------------------------------+