How to perform 2 different groupBy operations on the same DataFrame in Scala?

Asked: 2016-10-07 00:50:16

Tags: scala apache-spark dataframe group-by spark-dataframe

I have a DataFrame, and I need two different groupBys on the same DataFrame.

+----+------+-------+-------+----------------------+
| id | type | item  | value | timestamp            |
+----+------+-------+-------+----------------------+
| 1  | rent | dvd   | 12    | 2016-09-19T00:00:00Z |
| 1  | rent | dvd   | 12    | 2016-09-19T00:00:00Z |
| 1  | buy  | tv    | 12    | 2016-09-20T00:00:00Z |
| 1  | rent | movie | 12    | 2016-09-20T00:00:00Z |
| 1  | buy  | movie | 12    | 2016-09-18T00:00:00Z |
| 1  | buy  | movie | 12    | 2016-09-18T00:00:00Z |
+----+------+-------+-------+----------------------+
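For reference, a minimal sketch that builds this sample data, assuming a Spark 2.x SparkSession named spark (on 1.6, a SQLContext and its implicits work the same way):

import org.apache.spark.sql.functions._
import spark.implicits._   // assumption: SparkSession in scope as spark

val df = Seq(
  (1, "rent", "dvd",   12, "2016-09-19T00:00:00Z"),
  (1, "rent", "dvd",   12, "2016-09-19T00:00:00Z"),
  (1, "buy",  "tv",    12, "2016-09-20T00:00:00Z"),
  (1, "rent", "movie", 12, "2016-09-20T00:00:00Z"),
  (1, "buy",  "movie", 12, "2016-09-18T00:00:00Z"),
  (1, "buy",  "movie", 12, "2016-09-18T00:00:00Z")
).toDF("id", "type", "item", "value", "timestamp")
 // cast so that dayofmonth() works; older versions may need
 // unix_timestamp with an explicit format instead of a plain cast
 .withColumn("timestamp", $"timestamp".cast("timestamp"))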

The result I want is:

id : 1
totalValue : 72                                  --- group by id
typeCount : {"rent" : 3, "buy" : 3}              --- group by id
itemCount : {"dvd" : 2, "tv" : 1, "movie" : 3}   --- group by id
typeForDay : {"rent" : 2, "buy" : 2}             --- group by id and dayofmonth(col("timestamp")), at most 1 type per day

I have tried:

// Turns a collected array of strings into a Map(value -> occurrence count).
val count_by_value = udf { (listValues: scala.collection.mutable.WrappedArray[String]) =>
  if (listValues == null) null
  else listValues.groupBy(identity).mapValues(_.size)
}


val group1 = df.groupBy("id").agg(collect_list("type"), sum("value") as "totalValue", collect_list("item"))

val group1Result =  group1.withColumn("typeCount", count_by_value($"collect_list(type)"))
                          .drop("collect_list(type)")
                          .withColumn("itemCount", count_by_value($"collect_list(item)"))
                          .drop("collect_list(item)")


val group2 = df.groupBy("id", dayofmonth(col("timestamp"))).agg(collect_set("type")) 

val group2Result =  group2.withColumn("typeForDay", count_by_value($"collect_set(type)"))
                          .drop("collect_set(type)")


val groupedResult = group1Result.join(group2Result, "id").show()
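(Side note: the same aggregation reads more cleanly if the collected columns are aliased up front, so the UDF calls do not depend on generated names like collect_list(type). A sketch, assuming Spark 2.x where drop() accepts several columns:)

val group1Result = df.groupBy("id")
  .agg(
    sum("value").as("totalValue"),
    collect_list("type").as("types"),
    collect_list("item").as("items"))
  .withColumn("typeCount", count_by_value($"types"))
  .withColumn("itemCount", count_by_value($"items"))
  .drop("types", "items")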

But this takes time. Is there a more efficient way to do this?

1 Answer:

Answer 0 (score: 0)

A better approach is to add each grouping field to the key and reduce, rather than using groupBy(). You can use these:

df1.map(rec => (rec(0), rec(3).toString().toInt)).
     reduceByKey(_+_).take(5).foreach(println)

=> (1,72)
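(Note: DataFrame.map returning an RDD you can reduceByKey on is Spark 1.x behavior. On Spark 2.x, Dataset.map does not expose reduceByKey, so drop to the underlying Row RDD first; a sketch of the same total under that assumption:)

df1.rdd.
    map(rec => (rec(0), rec(3).toString().toInt)).
    reduceByKey(_ + _).take(5).foreach(println)   // => (1,72), as above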

df1.map(rec => ((rec(0), rec(1)), 1)).
    reduceByKey(_+_).
    map(x => (x._1._1, x._1._2, x._2)).take(5).foreach(println)

=> (1,rent,3)

(1,buy,3)

df1.map(rec => ((rec(0), rec(2)), 1)).
    reduceByKey(_+_).
    map(x => (x._1._1, x._1._2, x._2)).take(5).foreach(println)

=> (1,dvd,2)

(1,tv,1)

(1,movie,3)

df1.map(rec => ((rec(0), rec(1), rec(4).toString().substring(8,10)), 1)).
    reduceByKey(_+_).map(x => (x._1._1, x._1._2,x._1._3,x._2)).
    take(5).foreach(println)

=> (1,rent,19,2)

(1,buy,20,1)

(1,buy,18,2)

(1,rent,20,1)
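To get from these per-day counts to the typeForDay shape in the question (each type counted at most once per day), one more reduce over the distinct (id, type, day) keys does it; a sketch under the same column-order assumptions:

df1.map(rec => ((rec(0), rec(1), rec(4).toString().substring(8, 10)), 1)).
    reduceByKey((a, b) => a).                        // keep each (id, type, day) once
    map { case ((id, typ, _), _) => ((id, typ), 1) }.
    reduceByKey(_ + _).take(5).foreach(println)

=> ((1,rent),2)

((1,buy),2)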