Select rows from a DataFrame by checking the value of another column

Time: 2019-04-04 15:34:01

Tags: scala apache-spark dataframe

I have a DataFrame like this:

+--------+----------------+----+----------+
|role_num|   email_address|role|counters  |
+--------+----------------+----+----------+
|     110| EMAIL2@TEST.COM|null|         2|
|     110| EMAIL2@TEST.COM| P  |         2|
|     114|EMAIL10@TEST.COM| A  |         2|
|     114|EMAIL10@TEST.COM|null|         2|
+--------+----------------+----+----------+

From this DataFrame, my output should look like this:

+--------+----------------+----+----------+
|role_num|   email_address|role|counters  |
+--------+----------------+----+----------+
|     110| EMAIL2@TEST.COM| P  |         2|
|     114|EMAIL10@TEST.COM| A  |         2|
+--------+----------------+----+----------+

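The condition is: whenever the duplicate count (counters) is 2, I should select the row with role "P"; if no row with that role exists, I need to select the row with role "A".

I tried the following, but it does not seem to work.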

import org.apache.spark.sql.functions.col
import sc.implicits._

val targetDF = Seq(
      ("110", "EMAIL2@TEST.COM", "", "2"),
      ("110", "EMAIL2@TEST.COM", "PAH", "2"),
      ("114", "EMAIL10@TEST.COM", "AAH", "2"),
      ("114", "EMAIL10@TEST.COM", "", "2")
      )
      .toDF(
        "role_num",
        "email_address",
        "role",
        "counters")

targetDF.where(
        (col("counters") > 1 )
           || ?)

Can you help?

1 Answer:

Answer 0: (score: 1)

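This solution should work with your current roles. Within each group, sorting role in descending order with nulls last puts "P" ahead of "A" and ahead of any null, so the row ranked 1 is the one to keep: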

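import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, desc_nulls_last, rank}

targetDF
      // Partition by role_num: targetDF has no acct_num column
      .withColumn("priority", rank().over(Window.partitionBy("role_num").orderBy(desc_nulls_last("role"))))
      .where(col("priority") === 1)
      .drop("priority")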
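
For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same window approach. Several things in it are assumptions rather than part of the answer above: the object name PickPreferredRole and the helper preferredRows are made up for illustration, the session is a local one, the groups are partitioned by both role_num and email_address, and row_number() is swapped in for rank() so that two rows with the same role in one group still yield a single row. It also relies on "P" sorting after "A" alphabetically; a different set of roles would likely need an explicit priority column instead.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, desc_nulls_last, row_number}

// Hypothetical wrapper object, named here only for illustration.
object PickPreferredRole {

  // Keeps one row per (role_num, email_address) group: the row whose role
  // comes first in descending order ("P" before "A"), with nulls last.
  def preferredRows(df: DataFrame): DataFrame = {
    val byGroup = Window
      .partitionBy("role_num", "email_address")
      .orderBy(desc_nulls_last("role"))

    df.withColumn("priority", row_number().over(byGroup))
      .where(col("priority") === 1)
      .drop("priority")
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for experimentation only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("pick-preferred-role")
      .getOrCreate()
    import spark.implicits._

    // Option[String] models the null roles shown in the question's table.
    val targetDF = Seq[(String, String, Option[String], String)](
      ("110", "EMAIL2@TEST.COM", None, "2"),
      ("110", "EMAIL2@TEST.COM", Some("P"), "2"),
      ("114", "EMAIL10@TEST.COM", Some("A"), "2"),
      ("114", "EMAIL10@TEST.COM", None, "2")
    ).toDF("role_num", "email_address", "role", "counters")

    preferredRows(targetDF).show()

    spark.stop()
  }
}
```

With rank(), rows that tie on role within a group would all get priority 1 and all be kept; row_number() breaks such ties arbitrarily, which may or may not be what you want.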