Creating new columns in the form of a one-hot vector

Time: 2018-05-21 08:21:43

Tags: apache-spark dataframe

I have a DataFrame:

customer | Department
----------------------
A        |   Food
B        |   Home
A        |   Office
C        |   Home
A        |   Home
B        |   Office

Both the customer and Department columns are of string type.

How can I turn the distinct departments into new columns, like a one-hot vector, so that I get a new DataFrame like the following:

customer | Food | Home | Office
-------------------------------
A        |  1   |  1   |  1
B        |  0   |  1   |  1
C        |  0   |  1   |  0

Here the Food, Home, and Office columns are of integer type, and customer is of String type.

1 Answer:

Answer 0: (score: 2)

You just need to group by customer, pivot on department, and aggregate with a count:

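// In spark-shell, spark.implicits._ (needed for toDF) is already imported.
import org.apache.spark.sql.functions.count

val df = Seq(
  ("A", "Food"),
  ("B", "Home"),
  ("A", "Office"),
  ("C", "Home"),
  ("A", "Home"),
  ("B", "Office")
).toDF("customer", "department")

// One row per customer, one column per department; missing pairs become 0.
df.groupBy("customer").pivot("department").agg(count("department"))
  .na.fill(0)
  .show(false)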

Output:

+--------+----+----+------+
|customer|Food|Home|Office|
+--------+----+----+------+
|B       |0   |1   |1     |
|C       |0   |1   |0     |
|A       |1   |1   |1     |
+--------+----+----+------+
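
Two details of the count-based pivot may matter for the question as asked: count returns the number of matching rows, so a repeated (customer, department) pair would give a value above 1, and count columns are of Long type rather than integer. The following is a minimal sketch of one way to handle both, assuming duplicate rows can occur. The extra ("A", "Food") row and the names counted, deptCols, and oneHot are illustrative and not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Local session for illustration; in spark-shell, getOrCreate() returns the existing session.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("one-hot-pivot-sketch")
  .getOrCreate()
import spark.implicits._

// Assumed input: the question's data plus one duplicated row ("A", "Food").
val df = Seq(
  ("A", "Food"), ("A", "Food"), ("B", "Home"),
  ("A", "Office"), ("C", "Home"), ("A", "Home"), ("B", "Office")
).toDF("customer", "department")

// distinct() drops repeated (customer, department) pairs, so every count is at most 1.
val counted = df.distinct()
  .groupBy("customer")
  .pivot("department")
  .count()
  .na.fill(0) // pairs that never occur come out of the pivot as null

// count produces Long columns; cast each pivoted column to Int.
val deptCols = counted.columns.filter(_ != "customer")
val oneHot = deptCols.foldLeft(counted) { (acc, c) =>
  acc.withColumn(c, col(c).cast("int"))
}

oneHot.printSchema()
oneHot.orderBy("customer").show(false)
```

With distinct() in place, customer A gets Food = 1 even though the pair appears twice in the input. If the department values are known in advance, passing them explicitly, as in pivot("department", Seq("Food", "Home", "Office")), should let Spark skip the extra pass it otherwise needs to discover the distinct values.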