Using Spark 1.6, I have a Spark DataFrame column (named, let's say, col1) whose values are A, B, C, DS, DNS, E, F, G and H. I want to create a new column (say col2) containing the values mapped through the dict below. How do I do that mapping? (So, for instance, 'A' needs to be mapped to 'S', and so on.)
dict = {'A': 'S', 'B': 'S', 'C': 'S', 'DS': 'S', 'DNS': 'S', 'E': 'NS', 'F': 'NS', 'G': 'NS', 'H': 'NS'}
Answer 0 (score: 25):
An inefficient solution using a UDF (version independent):
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf
def translate(mapping):
    # Look up each value in the plain Python dict;
    # keys missing from the mapping become null
    def translate_(col):
        return mapping.get(col)
    return udf(translate_, StringType())
df = sc.parallelize([('DS', ), ('G', ), ('INVALID', )]).toDF(['key'])

mapping = {
    'A': 'S', 'B': 'S', 'C': 'S', 'DS': 'S', 'DNS': 'S',
    'E': 'NS', 'F': 'NS', 'G': 'NS', 'H': 'NS'}

df.withColumn("value", translate(mapping)("key"))
Result:
+-------+-----+
| key|value|
+-------+-----+
| DS| S|
| G| NS|
|INVALID| null|
+-------+-----+
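If null for unmapped keys is not what you want, a small variation of the same UDF (a sketch; the translate_with_default name and the 'UNKNOWN' fallback are just illustrative) passes a default to dict.get:

from pyspark.sql.types import StringType
from pyspark.sql.functions import udf

def translate_with_default(mapping, default=None):
    # Same closure pattern as above, but unmapped keys fall back
    # to `default` instead of null
    def translate_(col):
        return mapping.get(col, default)
    return udf(translate_, StringType())

df.withColumn("value", translate_with_default(mapping, "UNKNOWN")("key"))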
More efficient (Spark 2.0+ only) is to create a MapType literal:
from pyspark.sql.functions import col, create_map, lit
from itertools import chain

# chain(*mapping.items()) flattens the dict into k1, v1, k2, v2, ...
# which create_map turns into a map literal column
mapping_expr = create_map([lit(x) for x in chain(*mapping.items())])

df.withColumn("value", mapping_expr.getItem(col("key")))
with the same result:
+-------+-----+
| key|value|
+-------+-----+
| DS| S|
| G| NS|
|INVALID| null|
+-------+-----+
but with a more efficient physical plan:
== Physical Plan ==
*Project [key#15, keys: [B,DNS,DS,F,E,H,C,G,A], values: [S,S,S,NS,NS,NS,S,NS,S][key#15] AS value#53]
+- Scan ExistingRDD[key#15]
compared to the UDF version:
== Physical Plan ==
*Project [key#15, pythonUDF0#61 AS value#57]
+- BatchEvalPython [translate_(key#15)], [key#15, pythonUDF0#61]
+- Scan ExistingRDD[key#15]
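The map literal likewise yields null for missing keys; if a fallback is needed there too, one option (a sketch reusing the mapping_expr and df from above; the 'UNKNOWN' literal is illustrative) is to wrap the lookup in coalesce:

from pyspark.sql.functions import coalesce, col, lit

# null from the map lookup is replaced by the literal fallback
df.withColumn("value",
              coalesce(mapping_expr.getItem(col("key")), lit("UNKNOWN")))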
Answer 1 (score: 1):
It sounds like the simplest solution is to use the replace function: http://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.replace
mapping = {
    'A': '1',
    'B': '2'
}

df2 = df.replace(to_replace=mapping, subset=['yourColName'])
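Note that replace rewrites values in place rather than adding a new column, so for the original question you would copy the column first, roughly like this (a sketch using the question's col1/col2 names and its mapping dict):

# Duplicate col1 as col2, then replace only inside col2
df2 = (df.withColumn('col2', df['col1'])
         .replace(to_replace=mapping, subset=['col2']))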
Answer 2 (score: 0):
If you want to create a map column from a nested dictionary, you can use this:
import pyspark.sql.functions as F

def create_map(d):
    # Recursively build a map literal: leaves become lit() columns,
    # nested dicts become nested maps
    if type(d) != dict:
        return F.lit(d)
    level_map = []
    for k in d:
        level_map.append(F.lit(k))
        level_map.append(create_map(d[k]))
    return F.create_map(level_map)
d = {'a': 1, 'b': {'c': 2, 'd': 'blah'}}
print(create_map(d)) # <- Column<b'map(a, 1, b, map(c, 2, d, blah))'>
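A usage sketch (assuming a SparkSession named spark; note that Spark map values must share a single type, so this variant uses a nested dict whose leaf values are all strings):

d = {'a': {'x': '1'}, 'b': {'c': '2', 'd': 'blah'}}
df = spark.range(1).withColumn('m', create_map(d))
df.select(df['m']['b']['d']).show()  # single row containing 'blah'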