Question

Currently, I have a DataFrame with a column like this:

 color
 -----
 green
 blue
 green
 red
 yellow
 red
 orange

and so on (30 different colors).

Starting from this column, I would like to build a DataFrame like this:

green blue red yellow orange purple ... more colors
  1     0   0     0     0       0
  0     1   0     0     0       0
  1     0   0     0     0       0
  0     0   1     0     0       0
  0     0   0     1     0       0
  0     0   1     0     0       0
  0     0   0     0     1       0

That is, a DataFrame in which every column is 0, except for the column matching the color at the same index in the original column.

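So far I have tried several functions and approaches, but none of them worked (and the code ended up very messy). Is there a simple way to do this, or should I use another library such as Pandas (I am using Python)? If you know R, what I want is the equivalent of the table function.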

Thanks

Answer 1

Something like this should do the trick:

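from pyspark.sql.functions import when, lit, col

# Collect the distinct colors to the driver. On Spark 2.0 and later a
# DataFrame no longer exposes map directly, so go through the underlying RDD.
colors = df.select("color").distinct().rdd.map(lambda x: x[0]).collect()

# Build one 0/1 indicator column per distinct color.
cols = [
    when(col("color") == lit(color), 1).otherwise(0).alias(color)
    for color in colors
]

df.select(*cols)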
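To make the logic of the expression above concrete, here is a minimal plain-Python sketch of what each generated column computes, with no Spark involved. The helper name one_hot and the sample list are illustrative assumptions, not part of the answer. Note that this sketch keeps the levels in first-seen order, whereas distinct() in Spark gives no ordering guarantee.

```python
def one_hot(values):
    """Return (levels, rows): distinct values plus one 0/1 row per input value."""
    # dict.fromkeys keeps the first occurrence of each value, in order.
    levels = list(dict.fromkeys(values))
    # For every input value, emit 1 in the matching level's position and 0 elsewhere,
    # which mirrors when(col == level, 1).otherwise(0) for each level.
    rows = [[1 if value == level else 0 for level in levels] for value in values]
    return levels, rows


colors = ["green", "blue", "green", "red", "yellow", "red", "orange"]
levels, rows = one_hot(colors)

print(levels)
for row in rows:
    print(row)
```

Every row contains exactly one 1, in the position of its own color.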

If you are looking for other solutions similar to R's table, you may want to look at crosstab and cube.
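As a rough illustration of the aggregation-based route, the sketch below uses groupBy().pivot(), a close relative of crosstab, rather than crosstab itself, because crosstab works on a pair of columns and the question has only one. The helper name one_hot_by_pivot and the temporary _row_id column are assumptions made for this example, and df is assumed to be a PySpark DataFrame with a color column.

```python
def one_hot_by_pivot(df, column="color"):
    """Return one 0/1 column per distinct value of `column`, one row per input row.

    `df` is assumed to be a pyspark.sql.DataFrame. The helper name and the
    temporary `_row_id` column are illustrative only.
    """
    # Imported here so the helper can be defined even where Spark is not installed.
    from pyspark.sql import functions as F

    # Tag every row with a unique id so that each group holds exactly one row.
    with_id = df.withColumn("_row_id", F.monotonically_increasing_id())

    return (
        with_id.groupBy("_row_id")
        .pivot(column)         # one output column per distinct value
        .count()               # 1 where the row has that value, null elsewhere
        .na.fill(0)            # turn the nulls into 0
        .orderBy("_row_id")    # ids increase with the original row order
        .drop("_row_id")
    )
```

Calling one_hot_by_pivot(df) would then play the role of df.select(*cols) in the answer. Because pivot has to discover the distinct values first, this runs an extra job, so for a few dozen colors the when/otherwise version is likely the simpler choice.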

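Note

When the number of levels is large, creating a dense DataFrame becomes rather inefficient. In that case you should consider using sparse vectors: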

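from pyspark.sql import Row
from pyspark.mllib.linalg import Vectors
from pyspark.ml.feature import StringIndexer

def toVector(n):
    # Returns a function mapping an indexed row to a sparse one-hot vector of size n.
    def _toVector(row):
        # colorIdx is stored as a double, so cast it to int before using it as an index.
        return Row("vec")(Vectors.sparse(n, {int(row[0]): 1.0}))
    return _toVector

# Map each color to a numeric index (0.0, 1.0, ...).
indexer = StringIndexer(inputCol="color", outputCol="colorIdx")
indexed = indexer.fit(df).transform(df)

# Number of distinct levels, which is the size of each vector.
n = indexed.select("colorIdx").distinct().count()

# On Spark 2.0 and later, map lives on the underlying RDD rather than on the DataFrame.
vectorized = indexed.select("colorIdx").rdd.map(toVector(n)).toDF()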
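To show why the sparse form is cheaper, here is a small plain-Python sketch contrasting the two representations. It mirrors the shape of Vectors.sparse(n, {i: 1.0}) as a size plus a dictionary of non-zero entries, but the helpers to_sparse and to_dense are hypothetical names used only for this illustration and are not Spark APIs.

```python
def to_sparse(index, size):
    """Sparse one-hot: keep only the vector size and the single non-zero entry."""
    return size, {index: 1.0}


def to_dense(sparse):
    """Expand a (size, {index: value}) pair into a full list of floats."""
    size, entries = sparse
    dense = [0.0] * size
    for i, value in entries.items():
        dense[i] = value
    return dense


# With 30 levels, the sparse form stores one entry, the dense form stores 30.
sparse = to_sparse(2, 30)
dense = to_dense(sparse)

print(len(sparse[1]), "stored entry in the sparse form")
print(len(dense), "stored entries in the dense form")
```

The gap grows linearly with the number of levels, which is the inefficiency the note refers to.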
