Question

我们有一个如下所示的spark数据框：

   id | value 
------+--------
    0 | A,B
    1 | A,C
    2 | B

我们希望将其转换为：

   id | A     | B     | C
------+-------+-------+-------+
    0 | True  | True  | False | 
    1 | True  | False | True  | 
    2 | False | True  | False |

进行这种转换的最佳方法是什么？

Answer 1

我们假设它是您的输入数据框：

df = spark.createDataFrame([(0,["A","B"]),(1,["A","C"]),(2, ["B"])],["id","value"])

然后使用explode和pivot获得具有整数和空值的表。

df2 = df.withColumn("x",explode(df.value)).drop("value").groupBy("id").pivot("x").count()
df2.show()

+---+----+----+----+
| id|   A|   B|   C|
+---+----+----+----+
|  0|   1|   1|null|
|  1|   1|null|   1|
|  2|null|   1|null|
+---+----+----+----+

最后，您只需要将值转换为布尔值即可，例如：

for col_name in df2.columns[1:]:
    df2 = df2.withColumn(col_name, col(col_name).isNotNull())

df2.show()

+---+-----+-----+-----+
| id|    A|    B|    C|
+---+-----+-----+-----+
|  0| true| true|false|
|  1| true|false| true|
|  2|false| true|false|
+---+-----+-----+-----+

Answer 2

这是Scala的一种方法：

val df = Seq(
(0,"A,B"),
(1,"A,C"),
(2,"B"))
.toDF("id","value")

    //store array from split
    val withArrayDF = df.withColumn("array", split($"value", ",")).drop("value")

    //get sorted unique values for the whole dataset
    val distinctValues = withArrayDF.select(explode($"array")).distinct.collect.map{_.getString(0)}.sorted.toList

    //foreach A,B,C create new column called ncol. When ncol is present in array(i) true otherwise false 
    distinctValues.map{ncol =>
      withArrayDF.withColumn(ncol, array_contains($"array", ncol)).drop("array")
    }.reduce(_.join(_,"id"))//join all A, B, C 
     .select("id", distinctValues:_*)
     .show

输出：

+---+-----+-----+-----+ 
| id|    A|    B|    C| 
+---+-----+-----+-----+ 
|  0| true| true|false| 
|  1| true|false| true| 
|  2|false| true|false| 
+---+-----+-----+-----+

以及python版本：

from pyspark.sql.functions import array_contains, split, when, col, explode
from functools import reduce

df = spark.createDataFrame(
[(0,"A,B"),
(1,"A,C"),
(2,"B")], ["id","value"])

# store array from split
withArrayDF = df.withColumn("array", split(df["value"], ",")).drop("value")

# get sorted unique values for the whole dataset
distinctValues = sorted(
                    list(
                      map(lambda row: row[0], withArrayDF.select(explode("array")).distinct().collect())))


# foreach A,B,C create new column called ncol. When ncol is present in array(i) true otherwise false 
mappedDFs = list(
              map(lambda ncol: 
                  withArrayDF
                          .withColumn(ncol, array_contains(col("array"), ncol))
                          .drop("array"), 
                  distinctValues
                 ))

finalDF = reduce(lambda x,y: x.join(y, "id"), mappedDFs)
finalDF.show()

输出：

+---+-----+-----+-----+ 
| id|    A|    B|    C| 
+---+-----+-----+-----+ 
|  0| true| true|false| 
|  1| true|false| true| 
|  2|false| true|false| 
+---+-----+-----+-----+

将字符串列的Spark数据帧拆分为多个布尔列

2 个答案: