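I have a DataFrame with the values below. For each group of rows with the same id and count, I need to filter out the row with the minimum date (keeping the latest one), and the summary for that group should change from equal to more.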
id secid count date summary
1 2 9 20170608 equal
1 3 9 20160608 equal
2 3 8 20170608 less
3 3 9 20160608 equal
The result I need to show is:
id secid count date summary
1 2 9 20170608 more
2 3 8 20170608 less
3 3 9 20160608 equal
Answer 0 (score: 2)
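You can use groupBy to group the rows by id and count, then use when and otherwise to change the summary field to more whenever a group with the same id and count contains more than one date.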
// Assumes a SparkSession named spark is in scope, as in spark-shell
import org.apache.spark.sql.functions._
import spark.implicits._

// Create your original DF (the two "random" rows are extra test data)
val df = Seq((1, 2, 9, 20170608, "equal"),
  (1, 3, 9, 20160608, "equal"),
  (2, 3, 8, 20170608, "less"),
  (3, 3, 9, 20160608, "equal"),
  (1, 2, 8, 20170608, "random"),
  (1, 2, 8, 20170608, "random"))
  .toDF("id", "secid", "count", "date", "summary")

// UDF: true when the group collected more than one date and its summary is "equal"
val isMoreThanOne = udf((lst: Seq[Int], summary: String) => lst.size > 1 && summary.equals("equal"))

// Group by id and count, keep the latest date, then relabel the summary
df.groupBy("id", "count")
  .agg(collect_list("date").as("datelist"),
    max("date").as("date"),
    first("secid").as("secid"),
    first("summary").as("summary"))
  .withColumn("summary",
    when(isMoreThanOne($"datelist", $"summary"), "more").otherwise($"summary"))
  .drop("datelist")
  .show()
// output
// +---+-----+--------+-----+-------+
// | id|count|    date|secid|summary|
// +---+-----+--------+-----+-------+
// |  1|    9|20170608|    2|   more|
// |  1|    8|20170608|    2| random|
// |  3|    9|20160608|    3|  equal|
// |  2|    8|20170608|    3|   less|
// +---+-----+--------+-----+-------+
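
To make the per-group rule easier to check, here is a minimal sketch of the same logic on plain Scala collections, with no Spark involved. The names Record and latestPerGroup are hypothetical and exist only for this illustration. One assumption differs from the code above: the sketch takes secid and summary from the row with the latest date, whereas first in a Spark aggregation returns whichever row happens to come first in the group, which is not guaranteed to be the latest one.

```scala
// Plain-Scala model of the per-group rule; paste into a Scala REPL or run as a script.
// Record and latestPerGroup are hypothetical names used only for this sketch.
final case class Record(id: Int, secid: Int, count: Int, date: Int, summary: String)

def latestPerGroup(rows: Seq[Record]): Seq[Record] =
  rows
    .groupBy(r => (r.id, r.count))        // same grouping key as groupBy("id", "count")
    .values
    .map { group =>
      val latest = group.maxBy(_.date)    // assumption: keep the whole latest-date row
      // Relabel only when the group held several rows and the kept row says "equal"
      if (group.size > 1 && latest.summary == "equal") latest.copy(summary = "more")
      else latest
    }
    .toSeq
    .sortBy(r => (r.id, r.count))         // fixed order, purely for readable output

// The four rows from the question
val input = Seq(
  Record(1, 2, 9, 20170608, "equal"),
  Record(1, 3, 9, 20160608, "equal"),
  Record(2, 3, 8, 20170608, "less"),
  Record(3, 3, 9, 20160608, "equal")
)

val result = latestPerGroup(input)
result.foreach(println)
// Record(1,2,9,20170608,more)
// Record(2,3,8,20170608,less)
// Record(3,3,9,20160608,equal)
```

On the question's four rows this reproduces the requested table: only the group with id 1 and count 9 has two dates, so only its summary becomes more.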