I need to map a dataframe that looks like this:
+-------+-----------+
| key| value |
+-------+-----------+
| A| ['x', 'y']|
| B| ['y', 'z']|
| C| ['z']|
+-------+-----------+
into something like this:
+-------+------------------------+
| key| value |
+-------+------------------------+
| A|{'x': 1, 'y': 1, 'z': 0}|
| B|{'x': 0, 'y': 1, 'z': 1}|
| C|{'x': 0, 'y': 0, 'z': 1}|
+-------+------------------------+
The value column in the first dataframe holds the actual values for each key. The goal is to map each row to a dict over all the distinct elements that appear anywhere in the column (value_name: 1 if the element is present in that row, 0 otherwise).
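In plain Python terms (a minimal sketch with hard-coded data, not Spark code), the target transformation is:

```python
# Rows keyed the same way as the example dataframe
rows = {"A": ["x", "y"], "B": ["y", "z"], "C": ["z"]}

# Union of all distinct elements across the whole column
all_values = sorted({v for vals in rows.values() for v in vals})

# Map every row's list to a {element: count} dict over that union
result = {
    key: {v: vals.count(v) for v in all_values}
    for key, vals in rows.items()
}

print(result["A"])  # {'x': 1, 'y': 1, 'z': 0}
```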
Answer 0: (score: 2)
You can explode and pivot:
from pyspark.sql.functions import col, create_map, explode, lit
from itertools import chain

df = sc.parallelize([
    ("A", ["x", "y"]), ("B", ["y", "z"]), ("C", ["z"])
]).toDF(["key", "value"])

# Explode the lists, pivot on the values, count occurrences per key,
# and fill the missing (key, value) combinations with 0
cnts = df.withColumn("value", explode("value")).groupBy("key").pivot("value").count().na.fill(0)

# Build a map column from alternating (literal name, count column) pairs
value = create_map(*chain.from_iterable((lit(c), col(c)) for c in cnts.columns if c != "key"))

cnts.select("key", value.alias("value")).show(truncate=False)
# +---+---------------------------+
# |key|value |
# +---+---------------------------+
# |B |Map(x -> 0, y -> 1, z -> 1)|
# |C |Map(x -> 0, y -> 0, z -> 1)|
# |A |Map(x -> 1, y -> 1, z -> 0)|
# +---+---------------------------+
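What each stage produces can be mimicked without Spark (a plain-Python sketch; `exploded`, `columns`, and `table` here stand in for the exploded rows, the pivot columns, and the filled count table):

```python
from collections import Counter

data = [("A", ["x", "y"]), ("B", ["y", "z"]), ("C", ["z"])]

# "explode": one (key, value) pair per list element
exploded = [(k, v) for k, vals in data for v in vals]

# "pivot" + count: per-key occurrence counts, columns = union of all values
columns = sorted({v for _, v in exploded})
counts = {k: Counter(v for kk, v in exploded if kk == k) for k, _ in data}

# "na.fill(0)": combinations that never occurred become 0
table = {k: {c: counts[k].get(c, 0) for c in columns} for k in counts}
print(table["B"])  # {'x': 0, 'y': 1, 'z': 1}
```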
Or explode, collect the distinct values, and apply a udf:
from pyspark.sql.functions import udf

# Collect every distinct element that occurs anywhere in the value column
keys = [k for k in chain.from_iterable(df.select(explode("value")).distinct().collect())]

def f(keys):
    @udf("map<string,long>")
    def _(values):
        # Start every key at 0, then count the values present in this row
        d = dict.fromkeys(keys, 0)
        for v in values:
            d[v] += 1
        return d
    return _

df.select("key", f(keys)("value").alias("value")).show(truncate=False)
# +---+---------------------------+
# |key|value |
# +---+---------------------------+
# |A |Map(x -> 1, y -> 1, z -> 0)|
# |B |Map(x -> 0, y -> 1, z -> 1)|
# |C |Map(x -> 0, y -> 0, z -> 1)|
# +---+---------------------------+
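The closure pattern used by f works the same way outside Spark; this sketch (with a hypothetical `make_counter` name) isolates the counting logic the udf wraps:

```python
def make_counter(keys):
    # Returns a function that counts occurrences of each key in a list,
    # defaulting absent keys to 0 (mirrors the udf body above)
    def count_values(values):
        d = dict.fromkeys(keys, 0)
        for v in values:
            d[v] += 1
        return d
    return count_values

count = make_counter(["x", "y", "z"])
print(count(["y", "z"]))  # {'x': 0, 'y': 1, 'z': 1}
```

Binding keys once in the outer function means the per-row function only touches the row's own list, which is exactly why the Spark version builds the udf through a factory.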