How to build map entries from comma-separated string lines?

Time: 2018-12-31 08:32:28

Tags: scala apache-spark apache-spark-sql

    var clearedLine = ""
    val dict = collection.mutable.Map[String, String]()  // intended result map (not yet populated)

    val rdd = BufferedSource.map { line =>
      // make sure the line ends with ", " so that split(",") yields two fields
      if (!line.endsWith(", "))
        clearedLine = line + ", "
      else
        clearedLine = line.trim
      clearedLine.split(",")(0).trim -> clearedLine.split(",")(1).trim
    }

    for ((k, v) <- rdd) printf("key: %s, value: %s\n", k, v)

Output:

key: EQU EB.AR.DESCRIPT TO 1, value: EB.AR.ASSET.CLASS TO 2
key: EB.AR.CURRENCY TO 3, value: EB.AR.ORIGINAL.VALUE TO 4

I want to additionally split on 'TO' and flatten each pair into individual dict key -> value entries. Please help. Desired output:

   key: 1,  value: EQU EB.AR.DESCRIPT
   key: 2,  value: EB.AR.ASSET.CLASS
   key: 3,  value: EB.AR.CURRENCY
   key: 4,  value: EB.AR.ORIGINAL.VALUE

1 Answer:

Answer 0 (score: 2):

Assuming your input looks like this:

EQU EB.AR.DESCRIPT TO 1,EB.AR.ASSET.CLASS TO 2
EB.AR.CURRENCY TO 3, EB.AR.ORIGINAL.VALUE TO 4

Try this Scala DataFrame solution. (The snippets below run in spark-shell, where spark.implicits._ and org.apache.spark.sql.functions._ are already in scope; in a standalone application you would import them yourself.)

scala> val df = Seq(("EQU EB.AR.DESCRIPT TO 1,EB.AR.ASSET.CLASS TO 2"),("EB.AR.CURRENCY TO 3, EB.AR.ORIGINAL.VALUE TO 4")).toDF("a")
df: org.apache.spark.sql.DataFrame = [a: string]

scala> df.show(false)
+----------------------------------------------+
|a                                             |
+----------------------------------------------+
|EQU EB.AR.DESCRIPT TO 1,EB.AR.ASSET.CLASS TO 2|
|EB.AR.CURRENCY TO 3, EB.AR.ORIGINAL.VALUE TO 4|
+----------------------------------------------+


scala> val df2 = df.select(split($"a",",").getItem(0).as("a1"),split($"a",",").getItem(1).as("a2"))
df2: org.apache.spark.sql.DataFrame = [a1: string, a2: string]

scala> df2.show(false)
+-----------------------+--------------------------+
|a1                     |a2                        |
+-----------------------+--------------------------+
|EQU EB.AR.DESCRIPT TO 1|EB.AR.ASSET.CLASS TO 2    |
|EB.AR.CURRENCY TO 3    | EB.AR.ORIGINAL.VALUE TO 4|
+-----------------------+--------------------------+
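
As a side note, split($"a", ",") is evaluated twice above. An equivalent variant (same result) splits once into an intermediate array column; df2b is just an illustrative name:

// split once, then pick both items from the array column
val df2b = df
  .withColumn("arr", split($"a", ","))
  .select($"arr".getItem(0).as("a1"), $"arr".getItem(1).as("a2"))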


scala> val df3 = df2.flatMap( r => { (0 until r.size).map( i=> r.getString(i) ) })
df3: org.apache.spark.sql.Dataset[String] = [value: string]

scala> df3.show(false)
+--------------------------+
|value                     |
+--------------------------+
|EQU EB.AR.DESCRIPT TO 1   |
|EB.AR.ASSET.CLASS TO 2    |
|EB.AR.CURRENCY TO 3       |
| EB.AR.ORIGINAL.VALUE TO 4|
+--------------------------+


scala> df3.select(regexp_extract($"value",""" TO (\d+)\s*$""",1).as("key"),regexp_replace($"value",""" TO (\d+)\s*$""","").as("value")).show(false)
+---+---------------------+
|key|value                |
+---+---------------------+
|1  |EQU EB.AR.DESCRIPT   |
|2  |EB.AR.ASSET.CLASS    |
|3  |EB.AR.CURRENCY       |
|4  | EB.AR.ORIGINAL.VALUE|
+---+---------------------+
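
Note the leftover leading space in the last row (it comes from the ", " separator in the input). If you want it removed, one option is to wrap the value expression in trim:

// same extraction, with surrounding whitespace stripped from the value
df3.select(
  regexp_extract($"value", """ TO (\d+)\s*$""", 1).as("key"),
  trim(regexp_replace($"value", """ TO (\d+)\s*$""", "")).as("value")
).show(false)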

If you want them as a "map" column:

scala> val df4 = df3.select(regexp_extract($"value",""" TO (\d+)\s*$""",1).as("key"),regexp_replace($"value",""" TO (\d+)\s*$""","").as("value")).select(map($"key",$"value").as("kv"))
df4: org.apache.spark.sql.DataFrame = [kv: map<string,string>]

scala> df4.show(false)
+----------------------------+
|kv                          |
+----------------------------+
|[1 -> EQU EB.AR.DESCRIPT]   |
|[2 -> EB.AR.ASSET.CLASS]    |
|[3 -> EB.AR.CURRENCY]       |
|[4 ->  EB.AR.ORIGINAL.VALUE]|
+----------------------------+


scala> df4.printSchema
root
 |-- kv: map (nullable = false)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
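
If you ultimately need one Scala Map on the driver rather than a map column, a minimal sketch (assuming the result is small enough to collect into driver memory):

// explode each single-entry map into (key, value) rows,
// then collect everything into one immutable Map on the driver
val dictFromDf: Map[String, String] = df4
  .select(explode($"kv"))
  .as[(String, String)]
  .collect()
  .toMap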


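Finally, if you do not need Spark at all for this, here is a minimal plain-Scala sketch of the same transformation (the file path is a placeholder, and it assumes every field contains exactly one " TO " separator):

import scala.io.Source

val lines = Source.fromFile("input.txt").getLines()     // placeholder path

val dict: Map[String, String] = lines
  .flatMap(_.split(",").map(_.trim).filter(_.nonEmpty)) // one "<name> TO <n>" field at a time
  .map { field =>
    val Array(value, key) = field.split(" TO ")         // assumes exactly one " TO " per field
    key.trim -> value.trim
  }
  .toMap

for ((k, v) <- dict) printf("key: %s, value: %s\n", k, v)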