How to extract values from a key-value map in a Spark DataFrame

Asked: 2018-11-23 18:59:32

Tags: apache-spark-sql

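I have a column containing a map whose keys and values change from row to row. I am trying to extract the values and create a new column. Input: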

+---------------+
|symbols        |
+---------------+
|[3pea -> 3PEA] |
|[barello -> BA]|
|[]             |
|[]             |
+---------------+

Expected output:

+---------------+
|symbols        |
+---------------+
|3PEA           |
|BA             |
|               |
|               |
+---------------+

This is what I have tried so far, using a UDF:

def map_value=udf((inputMap:Map[String,String])=> {inputMap.map(x=>x._2) 
      }) 

But this gives me:

java.lang.UnsupportedOperationException: Schema for type scala.collection.immutable.Iterable[String] is not supported
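
The exception appears to come from the UDF's return type: calling map on a Map with a function that does not return a key-value pair yields an immutable.Iterable[String], for which Spark has no schema. Below is a minimal sketch of one possible fix, returning Option[String] instead (a Seq[String] via .values.toSeq would also be a supported type). The object name MapValueUdfSketch, the helper firstValue, and the local SparkSession settings are assumptions made for illustration, not part of the original question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical names throughout; only the "symbols" column comes from the question.
object MapValueUdfSketch {
  // Returns the first value of the map, or None for a null or empty map.
  // Option[String] is a return type Spark can derive a schema for;
  // None is written to the DataFrame as null.
  def firstValue(m: Map[String, String]): Option[String] =
    Option(m).flatMap(_.values.headOption)

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is acceptable for trying this out.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("map-value-udf-sketch")
      .getOrCreate()
    import spark.implicits._

    // Sample data mirroring the question: two single-entry maps, two empty maps.
    // Wrapping in Tuple1 avoids relying on an implicit encoder for a bare Map.
    val df = Seq(
      Map("3pea" -> "3PEA"),
      Map("barello" -> "BA"),
      Map.empty[String, String],
      Map.empty[String, String]
    ).map(Tuple1(_)).toDF("symbols")

    val firstMapValue = udf(firstValue _)

    // Replace the map column with its first value.
    df.withColumn("symbols", firstMapValue(col("symbols"))).show(false)

    spark.stop()
  }
}
```

This sketch assumes each map holds at most one entry, as in the sample input; if a map can hold several entries, returning m.values.toSeq keeps all of them as an array column.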

2 answers:

Answer 0 (score: 0)

import org.apache.spark.sql.functions._
import spark.implicits._
val m = Seq(Array("A -> abc"), Array("B -> 0.11856755943424617"), Array("C -> kqcams"))

val df = m.toDF("map_data")
df.show
// This simulates your data, I think. Note that map_data here is array<string>, not a true map column.

// Join the array into one string, split on "-> ", and keep the part after the arrow.
val df2 = df.select(split(concat_ws("", $"map_data"), "-> ").getItem(1).as("map_val"))
df2.show(false)

Result:

+--------------------+
|            map_data|
+--------------------+
|          [A -> abc]|
|[B -> 0.118567559...|
|       [C -> kqcams]|
+--------------------+

+-------------------+
|map_val            |
+-------------------+
|abc                |
|0.11856755943424617|
|kqcams             |
+-------------------+

Answer 1 (score: 0)

Since the Spark Scala v2.3 API, the SQL v2.3 API, and the PySpark v2.4 API, you can use the Spark SQL function map_values.

The following is in PySpark; Scala will be very similar.
Setup (assuming a working SparkSession named spark):

from pyspark.sql import functions as F

df = (
    spark.read.json(spark.sparkContext.parallelize(["""[
        {"key": ["3pea"],    "value": ["3PEA"] },
        {"key": ["barello"], "value": ["BA"]   }
    ]"""]))
    .select(F.map_from_arrays(F.col("key"), F.col("value")).alias("symbols") )
)

df.printSchema()
df.show()
root
 |-- symbols: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)

+---------------+
|        symbols|
+---------------+
| [3pea -> 3PEA]|
|[barello -> BA]|
+---------------+
df.select((F.map_values(F.col("symbols"))[0]).alias("map_vals")).show()
+--------+
|map_vals|
+--------+
|    3PEA|
|      BA|
+--------+
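
Since the question is in Scala, here is a rough Scala counterpart of the PySpark snippet above, as a sketch only. The object name MapValuesSketch, the helper firstMapValue, and the local SparkSession settings are illustrative assumptions; map_values and getItem are the functions from org.apache.spark.sql.functions and Column.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, map_values}

// Hypothetical names throughout; only the "symbols" column comes from the question.
object MapValuesSketch {
  // map_values turns the map column into an array of its values;
  // getItem(0) then picks the first element of that array.
  def firstMapValue(df: DataFrame): DataFrame =
    df.select(map_values(col("symbols")).getItem(0).as("map_vals"))

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is acceptable for trying this out.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("map-values-sketch")
      .getOrCreate()
    import spark.implicits._

    // Sample data mirroring the question, including an empty map.
    val df = Seq(
      Map("3pea" -> "3PEA"),
      Map("barello" -> "BA"),
      Map.empty[String, String]
    ).map(Tuple1(_)).toDF("symbols")

    firstMapValue(df).show(false)

    spark.stop()
  }
}
```

For the empty maps in the question, index 0 of an empty array should come back as null under non-ANSI settings; with ANSI mode enabled, an out-of-range index may raise an error instead, so guarding with when(size(...) > 0, ...) is worth considering there.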