PySpark: map DataFrame row values to the distinct elements of the whole column

Date: 2018-04-11 05:00:29

Tags: python apache-spark pyspark apache-spark-sql

I need to map a DataFrame that looks like this:

+-------+-----------+
|    key|     value |
+-------+-----------+
|      A| ['x', 'y']|
|      B| ['y', 'z']|
|      C|      ['z']|
+-------+-----------+

to something like this:

+-------+------------------------+
|    key|                  value |
+-------+------------------------+
|      A|{'x': 1, 'y': 1, 'z': 0}|
|      B|{'x': 0, 'y': 1, 'z': 1}|
|      C|{'x': 0, 'y': 0, 'z': 1}|
+-------+------------------------+

In the first DataFrame, the value column holds the actual values for each key. The goal is to map each row against the distinct elements of the entire column: for each distinct element, emit value_name: 1 if it is present in that row's list, and 0 otherwise.
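In plain Python terms (outside Spark), the intended mapping can be sketched like this; `rows`, `all_values` and `result` are illustrative names, not part of the question:

```python
# Illustrative input: each key maps to its list of values
rows = {"A": ["x", "y"], "B": ["y", "z"], "C": ["z"]}

# Collect the distinct values that appear anywhere in the column
all_values = sorted({v for vals in rows.values() for v in vals})

# For each key, emit 1 if the value appears in that row's list, else 0
result = {k: {v: int(v in vals) for v in all_values} for k, vals in rows.items()}
# result["A"] == {"x": 1, "y": 1, "z": 0}
```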

1 Answer:

Answer 0 (score: 2)

You can explode and pivot:

from itertools import chain
from pyspark.sql.functions import col, create_map, explode, lit

df = sc.parallelize([
    ("A", ["x", "y"]), ("B", ["y", "z"]), ("C", ["z"])
]).toDF(["key", "value"])

cnts = (df
    .withColumn("value", explode("value"))  # one row per (key, value) pair
    .groupBy("key").pivot("value").count()  # one column per distinct value
    .na.fill(0))                            # absent values become 0
value = create_map(*chain.from_iterable((lit(c), col(c)) for c in cnts.columns if c != "key"))

cnts.select("key", value.alias("value")).show(truncate=False)
# +---+---------------------------+
# |key|value                      |
# +---+---------------------------+
# |B  |Map(x -> 0, y -> 1, z -> 1)|
# |C  |Map(x -> 0, y -> 0, z -> 1)|
# |A  |Map(x -> 1, y -> 1, z -> 0)|
# +---+---------------------------+

Or explode, collect the distinct values, and apply a udf:

from pyspark.sql.functions import udf

keys = [k for k in chain.from_iterable(df.select(explode("value")).distinct().collect())]

def f(keys):
    @udf("map<string,long>")
    def _(values):
        d = dict.fromkeys(keys, 0)
        for v in values:
            d[v] += 1
        return d
    return _

df.select("key", f(keys)("value").alias("value")).show(truncate=False)
# +---+---------------------------+
# |key|value                      |
# +---+---------------------------+
# |A  |Map(x -> 1, y -> 1, z -> 0)|
# |B  |Map(x -> 0, y -> 1, z -> 1)|
# |C  |Map(x -> 0, y -> 0, z -> 1)|
# +---+---------------------------+