Scala: how to remove rows from a DataFrame based on column values

Date: 2018-04-12 08:08:20

Tags: scala apache-spark-sql spark-dataframe

I have a DataFrame with the values below. I need to filter out the rows with the minimum date within each (id, count) group, and the summary should be changed from equal to more when a group contains more than one date:

id  secid  count  date      summary
1   2      9      20170608  equal
1   3      9      20160608  equal
2   3      8      20170608  less
3   3      9      20160608  equal

I need the output to be:

id  secid  count  date      summary
1   2      9      20170608  more
2   3      8      20170608  less
3   3      9      20160608  equal

1 Answer:

Answer 0 (score: 2)

You can use groupBy to group by id and count, then use when/otherwise to change the summary field to more when the same (id, count) group has more than one date:

import org.apache.spark.sql.functions._
import spark.implicits._  //assumes a SparkSession named spark is in scope, as in spark-shell

//create your original DF
val df = Seq((1, 2, 9, 20170608, "equal"),
  (1, 3, 9, 20160608, "equal"),
  (2, 3, 8, 20170608, "less"),
  (3, 3, 9, 20160608, "equal"),
  (1, 2, 8, 20170608, "random"),
  (1, 2, 8, 20170608, "random"))
  .toDF("id", "secid", "count", "date", "summary")

//UDF that flags groups that collected more than one date and whose summary is "equal"
val isMoreThanOne = udf((lst: Seq[Int], summary: String) => lst.size > 1 && summary.equals("equal"))

//group by id and count, keep the latest date per group, and flip summary to more where flagged
df.groupBy("id", "count")
  .agg(collect_list("date").as("datelist"),
    max("date").as("date"),
    first("secid").as("secid"),
    first("summary").as("summary"))
  .withColumn("summary",
    when(isMoreThanOne($"datelist", $"summary"), "more").otherwise($"summary"))
  .drop("datelist")
  .show()

//    output
//    +---+-----+--------+-----+-------+
//    | id|count|    date|secid|summary|
//    +---+-----+--------+-----+-------+
//    |  1|    9|20170608|    2|   more|
//    |  1|    8|20170608|    2| random|
//    |  3|    9|20160608|    3|  equal|
//    |  2|    8|20170608|    3|   less|
//    +---+-----+--------+-----+-------+
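
As a side note, the same result can also be computed with a window function instead of a UDF and collect_list. The sketch below is my own assumption, not part of the original answer; it partitions by id and count, counts the rows per group, and keeps only the latest-date row via row_number:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

//hypothetical alternative: window over (id, count), ordered by date descending
val byGroup = Window.partitionBy("id", "count")
val latestFirst = byGroup.orderBy(desc("date"))

df.withColumn("groupSize", count(lit(1)).over(byGroup)) //rows per (id, count) group
  .withColumn("rn", row_number().over(latestFirst))     //1 = latest date in the group
  .filter($"rn" === 1)                                  //keep only the latest-date row
  .withColumn("summary",
    when($"groupSize" > 1 && $"summary" === "equal", "more").otherwise($"summary"))
  .drop("groupSize", "rn")
  .show()

Unlike the groupBy version, which takes first("secid") and first("summary") from an arbitrary row, this keeps the secid and summary that actually belong to the latest-date row.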