Convert a DataFrame column into one-hot-encoded columns

Asked: 2019-07-24 10:36:56

Tags: scala apache-spark

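I am trying to find a way to convert a specific column into one-hot-encoded columns, similar to what a OneHotEncoder produces. For example: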

-------------
Content|type|
-------------
alpha  | A  |
beta   | B  |
gamma  | C  |
theta  | A  |
zeta   | C  |
neta   | B  |
-------------

What I would like to get is:

-----------------------------
Content|type_A|type_B|type_C|
----------------------------
alpha  |  1   |  0   |  0   |
beta   |  0   |  1   |  0   |
gamma  |  0   |  0   |  1   |
theta  |  1   |  0   |  0   |
zeta   |  0   |  0   |  1   |
neta   |  0   |  1   |  0   |
-----------------------------

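1 answer:

Answer 0 (score: 1)

I think pivot is what you are looking for: group by Content, pivot on type, count the occurrences, and fill the remaining cells with 0.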

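import org.apache.spark.sql.functions.count
// Needed for toDF; assumes a SparkSession named `spark` (already in scope in spark-shell)
import spark.implicits._

val df = Seq(
  ("alpha", "A"),
  ("beta", "B"),
  ("gamma", "C"),
  ("theta", "A"),
  ("zeta", "C"),
  ("neta", "B")
).toDF("Content", "type")

// One row per Content value, one column per distinct value of type
val result = df.groupBy("Content")
  .pivot("type")
  .agg(count("type"))
  .na.fill(0) // replace nulls for absent (Content, type) combinations with 0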
Output:

+-------+---+---+---+
|Content|A  |B  |C  |
+-------+---+---+---+
|neta   |0  |1  |0  |
|beta   |0  |1  |0  |
|gamma  |0  |0  |1  |
|theta  |1  |0  |0  |
|zeta   |0  |0  |1  |
|alpha  |1  |0  |0  |
+-------+---+---+---+
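
The pivoted columns above are named A, B and C, while the question asks for type_A, type_B and type_C. A minimal sketch of one way to add the prefix is shown below. The object name `OneHotPivot` and the helper `oneHot` are hypothetical names made up for this illustration, and the local SparkSession settings are assumptions rather than part of the original answer. Note that the cells hold counts, so they are 0 or 1 only when each (Content, type) pair appears at most once, as in the sample data.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.count

object OneHotPivot {
  // Hypothetical helper: pivot `catCol` into one column per distinct value,
  // then prefix each new column with the original column name (e.g. type_A).
  def oneHot(df: DataFrame, keyCol: String, catCol: String): DataFrame = {
    val pivoted = df.groupBy(keyCol)
      .pivot(catCol)
      .agg(count(catCol))
      .na.fill(0) // absent combinations become 0 instead of null

    // Rename every pivoted column, leaving the key column untouched
    pivoted.columns
      .filter(_ != keyCol)
      .foldLeft(pivoted) { (acc, c) => acc.withColumnRenamed(c, s"${catCol}_$c") }
  }

  def main(args: Array[String]): Unit = {
    // Assumed local session for trying the sketch outside spark-shell
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("one-hot-pivot")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      ("alpha", "A"), ("beta", "B"), ("gamma", "C"),
      ("theta", "A"), ("zeta", "C"), ("neta", "B")
    ).toDF("Content", "type")

    // Row order after groupBy is not guaranteed, so sort for readability
    oneHot(df, "Content", "type").orderBy("Content").show(false)

    spark.stop()
  }
}
```

If the set of categories is known in advance, passing it explicitly, as in `pivot("type", Seq("A", "B", "C"))`, spares Spark the extra job that computes the distinct values.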