I have a column containing a map whose keys and values vary. I am trying to extract the values and create a new column. Input:
+---------------+
|symbols |
+---------------+
|[3pea -> 3PEA] |
|[barello -> BA]|
|[] |
|[] |
+---------------+
Expected output:
+---------------+
|symbols |
+---------------+
|3PEA |
|BA |
| |
| |
+---------------+
Here is what I have tried so far using a UDF:
def map_value = udf((inputMap: Map[String, String]) => {
  inputMap.map(x => x._2)
})
但这给了我
java.lang.UnsupportedOperationException: Schema for type scala.collection.immutable.Iterable[String] is not supported
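The exception arises because Map#map returns a scala.collection.Iterable, which Spark's UDF encoders do not support. A minimal sketch of a fix, assuming at most one entry per map, is to return a supported type such as a plain String, with an empty-string fallback for the empty maps:

import org.apache.spark.sql.functions.udf

// Return a String (a supported type) instead of Iterable[String];
// headOption/getOrElse covers the empty-map rows in the input.
val map_value = udf((inputMap: Map[String, String]) =>
  inputMap.values.headOption.getOrElse("")
)

// Usage: df.withColumn("symbols", map_value($"symbols"))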
Answer 0 (score: 0)
import org.apache.spark.sql.functions._
import spark.implicits._

// Simulate the data: each row holds a single "key -> value" string in an array.
val m = Seq(Array("A -> abc"), Array("B -> 0.11856755943424617"), Array("C -> kqcams"))
val df = m.toDF("map_data")
df.show

// Flatten the array to a string, split on "-> ", and keep the value part.
val df2 = df
  .withColumn("xxx", split(concat_ws("", $"map_data"), "-> "))
  .select($"xxx".getItem(1).as("map_val"))
df2.show(false)
Results in:
+--------------------+
| map_data|
+--------------------+
| [A -> abc]|
|[B -> 0.118567559...|
| [C -> kqcams]|
+--------------------+
+-------------------+
|map_val |
+-------------------+
|abc |
|0.11856755943424617|
|kqcams |
+-------------------+
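Note that this approach assumes exactly one "key -> value" string per array. If an array could carry several entries, a hypothetical sketch using explode (the names df3 and entry are illustrative) would give one value per row:

import org.apache.spark.sql.functions._

// Explode the array so each "key -> value" string becomes its own row,
// then split and keep the value part.
val df3 = df
  .select(explode($"map_data").as("entry"))
  .select(split($"entry", " -> ").getItem(1).as("map_val"))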
Answer 1 (score: 0)
As of the Spark Scala API v2.3, the SQL API v2.3, or the PySpark API v2.4, you can use the Spark SQL function map_values.
The following is in PySpark; Scala would be very similar.
Setup (assuming a SparkSession is available as spark):
from pyspark.sql import functions as F

# Build a MapType column from parallel key/value arrays.
df = (
    spark.read.json(sc.parallelize(["""[
        {"key": ["3pea"], "value": ["3PEA"] },
        {"key": ["barello"], "value": ["BA"] }
    ]"""]))
    .select(F.map_from_arrays(F.col("key"), F.col("value")).alias("symbols"))
)
df.printSchema()
df.show()
root
|-- symbols: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
+---------------+
| symbols|
+---------------+
| [3pea -> 3PEA]|
|[barello -> BA]|
+---------------+
df.select((F.map_values(F.col("symbols"))[0]).alias("map_vals")).show()
+--------+
|map_vals|
+--------+
| 3PEA|
| BA|
+--------+
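Since the question was in Scala, here is a sketch of the equivalent there (assuming the same symbols MapType column; map_values needs Spark 2.3+):

import org.apache.spark.sql.functions.{col, map_values}

// map_values returns the map's values as an array; getItem(0) takes the
// first element and yields null for rows whose map is empty (the [] rows
// in the question's input).
df.select(map_values(col("symbols")).getItem(0).as("symbols")).show()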