I have a DataFrame in Spark that looks like this:
[name,target] this is the header
[ABCD,1]
[XYZA,1]
[GFFD,1]
[NAAS,1]
[ABCD,2]
[XYZA,2]
[NAAS,2]
[VDDE,2]
I want to convert it into a DataFrame like this:
[name, count(target=1), count(target=2)]
[ABCD, 1,1]
[XYZA, 1,1]
[GFFD, 1,0]
and so on.
Is there a way to do this?
Answer 0 (score: 1)
Here are two possible solutions.
Sample input data:
import spark.implicits._
val df = Seq(
  ("ABCD", 1),
  ("XYZA", 1),
  ("GFFD", 1),
  ("NAAS", 1),
  ("ABCD", 2),
  ("XYZA", 2),
  ("NAAS", 2),
  ("VDDE", 2),
  ("EXAMPLE", 20)
).toDF("name", "target")
df.show()
+-------+------+
|   name|target|
+-------+------+
|   ABCD|     1|
|   XYZA|     1|
|   GFFD|     1|
|   NAAS|     1|
|   ABCD|     2|
|   XYZA|     2|
|   NAAS|     2|
|   VDDE|     2|
|EXAMPLE|    20|
+-------+------+
1 - Using a map
This returns only the non-zero occurrences.
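// Typed view of each row so that groupByKey can use field access
case class DataItem(name: String, target: Int)
df.as[DataItem]
  .groupByKey(_.name)
  .mapGroups {
    case (nameKey, targetIter) =>
      // Collect this name's targets and count how often each value occurs
      val targetList = targetIter.map(_.target).toList
      val occMap = targetList.groupBy(identity).map { case (t, hits) => t -> hits.size }
      (nameKey, occMap)
  }
  .toDF("name", "target_count").show()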
+-------+----------------+
|   name|    target_count|
+-------+----------------+
|   VDDE|        [2 -> 1]|
|   NAAS|[2 -> 1, 1 -> 1]|
|EXAMPLE|       [20 -> 1]|
|   GFFD|        [1 -> 1]|
|   XYZA|[2 -> 1, 1 -> 1]|
|   ABCD|[2 -> 1, 1 -> 1]|
+-------+----------------+
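2 - Using a list that shows the occurrence counts (including 0), where the 1-based position in the list equals target_value. Each list is only as long as the largest target value seen for that name.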
case class DataItem(name: String, target: Int)
df.as[DataItem]
  .groupByKey(_.name)
  .mapGroups {
    case (nameKey, targetIter) =>
      val targetList = targetIter.map(_.target).toList
      val occMap = targetList.groupBy(identity).map { case (t, hits) => t -> hits.size }
      // The list must reach the largest target value, not the most frequent one
      val maxTarget = occMap.keys.max
      val occList = for (i <- 1 to maxTarget) yield occMap.getOrElse(i, 0)
      (nameKey, occList)
  }
  .toDF("name", "target_count").show(20, false)
+-------+------------------------------------------------------------+
|name   |target_count                                                |
+-------+------------------------------------------------------------+
|VDDE   |[0, 1]                                                      |
|NAAS   |[1, 1]                                                      |
|EXAMPLE|[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]|
|GFFD   |[1]                                                         |
|XYZA   |[1, 1]                                                      |
|ABCD   |[1, 1]                                                      |
+-------+------------------------------------------------------------+
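The per-name counting used in both variants can be tried without a Spark session. The following is a minimal sketch on plain Scala collections; the object `TargetCounts` and the helpers `occurrences` and `denseCounts` are hypothetical names introduced here for illustration, not part of the answer. It assumes target values are positive integers and sizes the dense list by the largest target value present.

```scala
object TargetCounts {
  // Hypothetical helper: how often each target value occurs.
  def occurrences(targets: Seq[Int]): Map[Int, Int] =
    targets.groupBy(identity).map { case (value, hits) => value -> hits.size }

  // Hypothetical helper: dense counts where position i (1-based)
  // holds the count of target value i, with 0 for missing values.
  // Assumes all target values are >= 1.
  def denseCounts(targets: Seq[Int]): Seq[Int] = {
    val occ = occurrences(targets)
    if (occ.isEmpty) Seq.empty
    else (1 to occ.keys.max).map(i => occ.getOrElse(i, 0))
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      "ABCD" -> 1, "XYZA" -> 1, "GFFD" -> 1, "NAAS" -> 1,
      "ABCD" -> 2, "XYZA" -> 2, "NAAS" -> 2, "VDDE" -> 2,
      "EXAMPLE" -> 20
    )
    // Group the (name, target) pairs by name, keeping only the targets.
    val byName = rows.groupBy(_._1).map { case (name, pairs) => name -> pairs.map(_._2) }
    byName.toSeq.sortBy(_._1).foreach { case (name, targets) =>
      println(s"$name ${occurrences(targets)} ${denseCounts(targets)}")
    }
  }
}
```

Sizing the list by the largest key rather than by the most frequent value matters as soon as a name has repeated targets: for targets 1, 1, 3 the dense list needs three positions even though value 1 is the most common.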
Answer 1 (score: 1)
The DataFrame can be transformed with a pivot:
df
  .groupBy("name")
  .pivot("target")
  .count()
  // replace nulls with 0
  .na.fill(0)
With the data provided by Cesar A. Mostacero, the result is:
+-------+---+---+---+
|name   |1  |2  |20 |
+-------+---+---+---+
|EXAMPLE|0  |0  |1  |
|XYZA   |1  |1  |0  |
|GFFD   |1  |0  |0  |
|VDDE   |0  |1  |0  |
|ABCD   |1  |1  |0  |
|NAAS   |1  |1  |0  |
+-------+---+---+---+
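If only the two columns from the question are wanted, the pivot values can be passed explicitly, which also saves Spark from first scanning for the distinct values. This is a sketch rather than a definitive implementation: the object `PivotSketch`, the helper `countTargets`, and the `count_target_N` column names are assumptions made here for illustration, and it assumes Spark SQL is on the classpath and a local master is acceptable.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object PivotSketch {
  // Hypothetical helper: one count column per requested target value.
  def countTargets(df: DataFrame, values: Seq[Int]): DataFrame = {
    val pivoted = df
      .groupBy("name")
      .pivot("target", values) // only these values become columns
      .count()
      .na.fill(0)              // names without a given target get 0
    // Rename the pivot columns "1", "2", ... to descriptive names.
    values.foldLeft(pivoted) { (acc, v) =>
      acc.withColumnRenamed(v.toString, s"count_target_$v")
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("pivot-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      ("ABCD", 1), ("XYZA", 1), ("GFFD", 1), ("NAAS", 1),
      ("ABCD", 2), ("XYZA", 2), ("NAAS", 2), ("VDDE", 2),
      ("EXAMPLE", 20)
    ).toDF("name", "target")

    countTargets(df, Seq(1, 2)).show(false)
    spark.stop()
  }
}
```

Target values outside the requested list, such as 20 in the sample data, get no column of their own.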