Converting a string-typed column to MapType in a Spark DataFrame

Time: 2020-10-04 04:46:41

Tags: apache-spark apache-spark-sql

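I have a DataFrame like the first table below. I want to convert the last column, TRANDATA, from string type to MapType. The output should look like the second table.

I have written a UDF, but it takes a String and converts it to a Map, so I am struggling to get the same result when the input is an sql.Row.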

// Parse "k1=v1,k2=v2" into an immutable Map; pairs without '=' are skipped.
def stringToMap(value: String): Map[String, String] =
  value.split(",")
    .map(_.split("=", 2))                    // split each pair at the first '='
    .collect { case Array(k, v) => k -> v }  // keep only well-formed key=value pairs
    .toMap


+---------+--------+-----------------------------------------------------------------------+
|MESSAGEID|CATEGORY|TRANDATA                                                               |
+---------+--------+-----------------------------------------------------------------------+
|03010    |A       |threadID=123sada,ProcType=InfraLogging,TxnID=4mjx8wfogf                |
|03011    |A       |threadID=xmjxe2j0jz,ProcType=InfraLogging,TxnID=4mjxe2j0tf             |
|09941    |D       |compTxnID=xmawdew0tf,to=ABCD,threadID=4mjxe2j0jz,ProcType=InfraLogging |
|00994    |D       |compTxnID=xmjxe2j0tf,to=XYZA,threadID=34jxasde0jz,ProcType=InfraLogging|
+---------+--------+-----------------------------------------------------------------------+

Table 2: desired output, with the third column as MapType

+---------+--------+---------------------------------------------------------------------+
|MESSAGEID|CATEGORY|TRANDATA                                                             |
+---------+--------+---------------------------------------------------------------------+
|03010    |A       |Map(threadID -> 123sada,ProcType -> InfraLogging,TxnID -> 4mjx8wfogf)|
+---------+--------+---------------------------------------------------------------------+
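
One way to apply a String => Map function such as stringToMap to the TRANDATA column is to wrap it with Spark's udf helper, so that Spark passes it the column value of each row rather than the Row itself. The following is only a minimal sketch: it assumes spark-sql is on the classpath, and the object name, the local SparkSession, the null guard, and the single sample row (copied from the first table) are illustrative choices, not part of the original question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical wrapper object, used only to make the sketch self-contained.
object StringToMapUdfSketch {

  // Parse "k1=v1,k2=v2" into a Map. Null input yields an empty map and
  // pairs without '=' are skipped (both are assumptions of this sketch).
  def stringToMap(value: String): Map[String, String] =
    if (value == null) Map.empty
    else
      value.split(",")
        .map(_.split("=", 2))
        .collect { case Array(k, v) => k -> v }
        .toMap

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("string-to-map-udf-sketch")
      .getOrCreate()
    import spark.implicits._

    // One sample row taken from the input table.
    val df = Seq(
      ("03010", "A", "threadID=123sada,ProcType=InfraLogging,TxnID=4mjx8wfogf")
    ).toDF("MESSAGEID", "CATEGORY", "TRANDATA")

    // Wrap the plain Scala function as a UDF and overwrite TRANDATA in place.
    val stringToMapUdf = udf(stringToMap _)
    val result = df.withColumn("TRANDATA", stringToMapUdf(col("TRANDATA")))

    result.printSchema() // TRANDATA should now be reported as a map type
    result.show(false)

    spark.stop()
  }
}
```

Because the UDF is applied through withColumn, the function itself never needs to accept an sql.Row; it keeps its String parameter.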

1 answer:

Answer 0: (score: 1)

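For Spark 2.4+, you can use split to break the string into key-value pairs, then use transform to separate the keys and the values into two array columns, and finally use map_from_arrays to build the map: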

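// Assumes a SparkSession named `spark` is in scope (as in spark-shell).
import org.apache.spark.sql.functions.{expr, map_from_arrays, split}
import spark.implicits._ // enables the 'TRANDATA symbol-to-Column syntax

df.withColumn("entry", split('TRANDATA, ","))                         // array of "key=value" strings
  .withColumn("key", expr("transform(entry, x -> split(x, '=')[0])"))   // array of keys
  .withColumn("value", expr("transform(entry, x -> split(x, '=')[1])")) // array of values
  .withColumn("map", map_from_arrays('key, 'value))
  .drop("entry", "key", "value", "TRANDATA")
  .show(false)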

Output:

+---------+--------+----------------------------------------------------------------------------------------+
|MESSAGEID|CATEGORY|map                                                                                     |
+---------+--------+----------------------------------------------------------------------------------------+
|03010    |A       |[threadID -> 123sada, ProcType -> InfraLogging, TxnID -> 4mjx8wfogf]                    |
|03011    |A       |[threadID -> xmjxe2j0jz, ProcType -> InfraLogging, TxnID -> 4mjxe2j0tf]                 |
|09941    |D       |[compTxnID -> xmawdew0tf, to -> ABCD, threadID -> 4mjxe2j0jz, ProcType -> InfraLogging] |
|00994    |D       |[compTxnID -> xmjxe2j0tf, to -> XYZA, threadID -> 34jxasde0jz, ProcType -> InfraLogging]|
+---------+--------+----------------------------------------------------------------------------------------+
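
As a possible shorter alternative to the three-step approach above, Spark SQL also has a built-in str_to_map function that takes the text, a pair delimiter, and a key-value delimiter. The sketch below is not part of the original answer: it assumes spark-sql is on the classpath, and the object name, the helper name withMapColumn, and the local SparkSession are made up for illustration. The sample rows are copied from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.expr

// Hypothetical wrapper object, used only to make the sketch self-contained.
object StrToMapSketch {

  // Replace the string column TRANDATA with a map column of the same name.
  // ',' separates the pairs and '=' separates each key from its value.
  def withMapColumn(df: DataFrame): DataFrame =
    df.withColumn("TRANDATA", expr("str_to_map(TRANDATA, ',', '=')"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("str-to-map-sketch")
      .getOrCreate()
    import spark.implicits._

    // Two sample rows taken from the question's input table.
    val df = Seq(
      ("03010", "A", "threadID=123sada,ProcType=InfraLogging,TxnID=4mjx8wfogf"),
      ("09941", "D", "compTxnID=xmawdew0tf,to=ABCD,threadID=4mjxe2j0jz,ProcType=InfraLogging")
    ).toDF("MESSAGEID", "CATEGORY", "TRANDATA")

    val result = withMapColumn(df)
    result.printSchema() // TRANDATA should now be reported as a map type
    result.show(false)

    spark.stop()
  }
}
```

This variant keeps the column name TRANDATA, as in the desired output of the question, instead of adding a separate map column and dropping the intermediates.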