We have a Spark dataframe that looks like this:
  id  | value
------+--------
  0   | A,B
  1   | A,C
  2   | B
We want to transform it into:
  id  |   A   |   B   |   C
------+-------+-------+-------
  0   | True  | True  | False
  1   | True  | False | True
  2   | False | True  | False
What is the best way to perform this transformation?
Answer 0 (score: 2)
Let's assume this is your input dataframe:
from pyspark.sql.functions import explode, col

df = spark.createDataFrame([(0, ["A", "B"]), (1, ["A", "C"]), (2, ["B"])], ["id", "value"])
Then use explode and pivot to get a table with integers and nulls:
df2 = df.withColumn("x",explode(df.value)).drop("value").groupBy("id").pivot("x").count()
df2.show()
+---+----+----+----+
| id| A| B| C|
+---+----+----+----+
| 0| 1| 1|null|
| 1| 1|null| 1|
| 2|null| 1|null|
+---+----+----+----+
Finally, you just need to convert the values to booleans, for example:
for col_name in df2.columns[1:]:
    df2 = df2.withColumn(col_name, col(col_name).isNotNull())
df2.show()
+---+-----+-----+-----+
| id| A| B| C|
+---+-----+-----+-----+
| 0| true| true|false|
| 1| true|false| true|
| 2|false| true|false|
+---+-----+-----+-----+
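Equivalently, the column-by-column loop can be replaced by a single select; a minimal sketch, assuming the same df2 produced above:
# keep id, and turn every pivoted count column into a boolean in one pass
df2 = df2.select("id", *[col(c).isNotNull().alias(c) for c in df2.columns[1:]])
df2.show()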
Answer 1 (score: 1)
Here is one way to do it in Scala:
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  (0, "A,B"),
  (1, "A,C"),
  (2, "B"))
  .toDF("id", "value")

// store the array obtained from split
val withArrayDF = df.withColumn("array", split($"value", ",")).drop("value")

// get the sorted unique values for the whole dataset
val distinctValues = withArrayDF.select(explode($"array")).distinct.collect.map{_.getString(0)}.sorted.toList

// for each of A, B, C create a new column named ncol: true when ncol is present in the array, otherwise false
distinctValues.map{ ncol =>
  withArrayDF.withColumn(ncol, array_contains($"array", ncol)).drop("array")
}.reduce(_.join(_, "id")) // join all of A, B, C
  .select("id", distinctValues: _*)
  .show
Output:
+---+-----+-----+-----+
| id| A| B| C|
+---+-----+-----+-----+
| 0| true| true|false|
| 1| true|false| true|
| 2|false| true|false|
+---+-----+-----+-----+
And the Python version:
from pyspark.sql.functions import array_contains, split, when, col, explode
from functools import reduce
df = spark.createDataFrame(
    [(0, "A,B"),
     (1, "A,C"),
     (2, "B")], ["id", "value"])
# store array from split
withArrayDF = df.withColumn("array", split(df["value"], ",")).drop("value")
# get sorted unique values for the whole dataset
distinctValues = sorted(
    list(map(lambda row: row[0],
             withArrayDF.select(explode("array")).distinct().collect())))
# foreach A,B,C create new column called ncol. When ncol is present in array(i) true otherwise false
mappedDFs = list(
    map(lambda ncol: withArrayDF
            .withColumn(ncol, array_contains(col("array"), ncol))
            .drop("array"),
        distinctValues))
finalDF = reduce(lambda x,y: x.join(y, "id"), mappedDFs)
finalDF.show()
Output:
+---+-----+-----+-----+
| id| A| B| C|
+---+-----+-----+-----+
| 0| true| true|false|
| 1| true|false| true|
| 2|false| true|false|
+---+-----+-----+-----+
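As a variation, the chain of joins can be avoided by folding the withColumn calls over the distinct values directly; a minimal sketch under the same assumptions as above (withArrayDF and distinctValues already computed):
from functools import reduce
from pyspark.sql.functions import array_contains, col

# add one boolean column per distinct value, then drop the helper array column
finalDF = reduce(
    lambda acc, ncol: acc.withColumn(ncol, array_contains(col("array"), ncol)),
    distinctValues,
    withArrayDF).drop("array")
finalDF.show()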