Scala: how to remove rows from a DataFrame based on column values

Date: 2018-04-12 08:08:20

Tags: scala apache-spark-sql spark-dataframe

I have a DataFrame with the values below. I need to filter out the rows with the minimum date within each (id, count) group, and the summary should be changed from equal to more when a group contains more than one date:

id  secid  count  date      summary
1   2      9      20170608  equal
1   3      9      20160608  equal
2   3      8      20170608  less
3   3      9      20160608  equal

I need the output to be:

id  secid  count  date      summary
1   2      9      20170608  more
2   3      8      20170608  less
3   3      9      20160608  equal

1 Answer:

Answer 0 (score: 2)

You can use groupBy to group by id and count, then use when/otherwise to change the summary field to more when the same (id, count) group has more than one date:

import org.apache.spark.sql.functions._
import spark.implicits._  //assumes a SparkSession named spark is in scope, as in spark-shell

//create your original DF
val df = Seq((1, 2, 9, 20170608, "equal"),
  (1, 3, 9, 20160608, "equal"),
  (2, 3, 8, 20170608, "less"),
  (3, 3, 9, 20160608, "equal"),
  (1, 2, 8, 20170608, "random"),
  (1, 2, 8, 20170608, "random"))
  .toDF("id", "secid", "count", "date", "summary")

//UDF that flags groups that collected more than one date and whose summary is "equal"
val isMoreThanOne = udf((lst: Seq[Int], summary: String) => lst.size > 1 && summary.equals("equal"))

//group by id and count, keep the latest date per group, and flip summary to more where flagged
df.groupBy("id", "count")
  .agg(collect_list("date").as("datelist"),
    max("date").as("date"),
    first("secid").as("secid"),
    first("summary").as("summary"))
  .withColumn("summary",
    when(isMoreThanOne($"datelist", $"summary"), "more").otherwise($"summary"))
  .drop("datelist")
  .show()

//    output
//    +---+-----+--------+-----+-------+
//    | id|count|    date|secid|summary|
//    +---+-----+--------+-----+-------+
//    |  1|    9|20170608|    2|   more|
//    |  1|    8|20170608|    2| random|
//    |  3|    9|20160608|    3|  equal|
//    |  2|    8|20170608|    3|   less|
//    +---+-----+--------+-----+-------+
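
As a side note, the same result can also be computed with a window function instead of a UDF and collect_list. The sketch below is my own assumption, not part of the original answer; it partitions by id and count, counts the rows per group, and keeps only the latest-date row via row_number:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

//hypothetical alternative: window over (id, count), ordered by date descending
val byGroup = Window.partitionBy("id", "count")
val latestFirst = byGroup.orderBy(desc("date"))

df.withColumn("groupSize", count(lit(1)).over(byGroup)) //rows per (id, count) group
  .withColumn("rn", row_number().over(latestFirst))     //1 = latest date in the group
  .filter($"rn" === 1)                                  //keep only the latest-date row
  .withColumn("summary",
    when($"groupSize" > 1 && $"summary" === "equal", "more").otherwise($"summary"))
  .drop("groupSize", "rn")
  .show()

Unlike the groupBy version, which takes first("secid") and first("summary") from an arbitrary row, this keeps the secid and summary that actually belong to the latest-date row.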