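I have a DataFrame like the one shown below. I want to convert the last column, TRANDATA, from a string to a MapType. The output should look similar to what I show in the second table.
I have written a UDF, but it takes a String and converts it to a MapType, so I am struggling to get the same kind of output when the input is a sql.Row. :(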
// Parses "k1=v1,k2=v2" into an immutable Map.
// Entries without an '=' are skipped instead of throwing.
def stringToMap(value: String): Map[String, String] =
  value.split(",")
    .map(_.split("=", 2))                    // split each entry on the first '=' only
    .collect { case Array(k, v) => k -> v }  // keep well-formed key/value pairs
    .toMap
+---------+--------+-----------------------------------------------------------------------+
|MESSAGEID|CATEGORY|TRANDATA                                                               |
+---------+--------+-----------------------------------------------------------------------+
|03010    |A       |threadID=123sada,ProcType=InfraLogging,TxnID=4mjx8wfogf                |
|03011    |A       |threadID=xmjxe2j0jz,ProcType=InfraLogging,TxnID=4mjxe2j0tf             |
|09941    |D       |compTxnID=xmawdew0tf,to=ABCD,threadID=4mjxe2j0jz,ProcType=InfraLogging |
|00994    |D       |compTxnID=xmjxe2j0tf,to=XYZA,threadID=34jxasde0jz,ProcType=InfraLogging|
+---------+--------+-----------------------------------------------------------------------+
Table 2: expected output, with the third column as a MapType
+---------+--------+---------------------------------------------------------------------+
|MESSAGEID|CATEGORY|TRANDATA                                                             |
+---------+--------+---------------------------------------------------------------------+
|03010    |A       |Map(threadID -> 123sada,ProcType -> InfraLogging,TxnID -> 4mjx8wfogf)|
+---------+--------+---------------------------------------------------------------------+
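Answer 0 (score: 1)
For Spark 2.4+, you can use split to break the string into key=value entries, then use transform to separate the keys and the values into two array columns, and finally use map_from_arrays to build the map.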
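import org.apache.spark.sql.functions.{col, expr, map_from_arrays, split}

df.withColumn("entry", split(col("TRANDATA"), ","))                      // array of "key=value" strings
  .withColumn("key", expr("transform(entry, x -> split(x, '=')[0])"))   // array of keys
  .withColumn("value", expr("transform(entry, x -> split(x, '=')[1])")) // array of values
  .withColumn("map", map_from_arrays(col("key"), col("value")))         // zip keys and values into a map
  .drop("entry", "key", "value", "TRANDATA")
  .show(false)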
Output:
+---------+--------+----------------------------------------------------------------------------------------+
|MESSAGEID|CATEGORY|map |
+---------+--------+----------------------------------------------------------------------------------------+
|03010 |A |[threadID -> 123sada, ProcType -> InfraLogging, TxnID -> 4mjx8wfogf] |
|03011 |A |[threadID -> xmjxe2j0jz, ProcType -> InfraLogging, TxnID -> 4mjxe2j0tf] |
|09941 |D |[compTxnID -> xmawdew0tf, to -> ABCD, threadID -> 4mjxe2j0jz, ProcType -> InfraLogging] |
|00994 |D |[compTxnID -> xmjxe2j0tf, to -> XYZA, threadID -> 34jxasde0jz, ProcType -> InfraLogging]|
+---------+--------+----------------------------------------------------------------------------------------+
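
To make the three steps in the answer easier to follow, here is a minimal plain-Scala sketch of what happens to a single TRANDATA value, with no Spark involved. The object name TrandataSteps and the helper toMap are hypothetical names chosen for illustration, and the sketch assumes every entry contains an '=' (the Spark version would produce a null value in that case, while this one would throw).

```scala
// Plain-Scala mirror of split -> transform -> map_from_arrays for one row value.
// TrandataSteps and toMap are hypothetical names, not part of Spark.
object TrandataSteps {
  def toMap(trandata: String): Map[String, String] = {
    // Step 1: split the string into "key=value" entries.
    val entries = trandata.split(",")
    // Step 2: derive a keys array and a values array from the entries.
    // Assumption: every entry contains '='; otherwise index 1 does not exist.
    val keys   = entries.map(_.split("=")(0))
    val values = entries.map(_.split("=")(1))
    // Step 3: pair keys with values positionally and build the map.
    keys.zip(values).toMap
  }

  def main(args: Array[String]): Unit = {
    val row = "threadID=123sada,ProcType=InfraLogging,TxnID=4mjx8wfogf"
    println(toMap(row))
  }
}
```

Spark SQL also has a built-in str_to_map function (callable through expr) that takes the pair delimiter and the key/value delimiter as arguments. It may cover the same case in a single step, though it is not part of the answer above and is worth checking against your Spark version.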