How do I replace values in a DataFrame column based on the value of another column in the same DataFrame?

Time: 2019-05-11 06:59:45

Tags: scala apache-spark

After reading the table below with Spark, I get a DataFrame:

// Each setting goes through .option(key, value); the dbtable query is wrapped as an aliased subquery
val orgDF = spark.read.format("jdbc")
  .option("url", url)
  .option("dbtable", "(select id, org_name, delete_state, 0 as soft_del from schema.table) as orgTable")
  .option("user", username)
  .option("password", pwd)
  .load()

The DataFrame contains the following data:

-----------------------------------------
id | org_name  | delete_state | soft_del
-----------------------------------------
1  | Net       | delete       |   0
2  | Vert      | delete       |   0
3  | Bio       | insert       |   0
4  | Card      | delete       |   0
7  | stock     | update       |   0
-----------------------------------------

Before saving the DataFrame to HDFS, I am trying to set the soft_del column to 1 wherever the delete_state column is delete, so that the final DataFrame looks like this:

-----------------------------------------
id | org_name  | delete_state | soft_del
-----------------------------------------
1  | Net       | delete       |   1
2  | Vert      | delete       |   1
3  | Bio       | insert       |   0
4  | Card      | delete       |   1
7  | stock     | update       |   0
-----------------------------------------

I know of one way to attempt this, through SQL:

orgDF.createOrReplaceTempView("orgData")
spark.sql("update orgData set soft_del = 1 where delete_state = 'delete'")

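I would also like to understand how to do this with DataFrame functions, but I could not find the right material. Can anyone show me how to do it that way?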

1 answer:

Answer 0 (score: 2)

You can try something like this:

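import org.apache.spark.sql.functions.when  // the $"..." syntax also needs: import spark.implicits._
orgDF.withColumn("soft_del", when($"delete_state" === "delete", 1).otherwise(0))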
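To make the snippet above easier to try in isolation, here is a minimal self-contained sketch. It assumes a local SparkSession and rebuilds the sample rows from the question's table in memory instead of reading over JDBC; the object name `SoftDeleteExample` and the helper `markSoftDeleted` are hypothetical names used only for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, when}

object SoftDeleteExample {
  // Hypothetical helper: soft_del becomes 1 where delete_state is "delete", 0 otherwise.
  // withColumn with an existing column name replaces that column.
  def markSoftDeleted(df: DataFrame): DataFrame =
    df.withColumn("soft_del", when(col("delete_state") === "delete", 1).otherwise(0))

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumption, not from the question)
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("soft-del-sketch")
      .getOrCreate()
    import spark.implicits._ // needed for toDF on a Seq of tuples

    // Sample rows copied from the table in the question
    val orgDF = Seq(
      (1, "Net", "delete", 0),
      (2, "Vert", "delete", 0),
      (3, "Bio", "insert", 0),
      (4, "Card", "delete", 0),
      (7, "stock", "update", 0)
    ).toDF("id", "org_name", "delete_state", "soft_del")

    markSoftDeleted(orgDF).show()
    spark.stop()
  }
}
```

Note that `otherwise(0)` overwrites every non-matching row with 0. If soft_del could already hold other values that should be kept, `otherwise(col("soft_del"))` would preserve them instead.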

If needed, you can also chain multiple when calls, for example:

orgDF.withColumn("soft_del", 
  when($"delete_state" === "delete", 1)
  .when($"delete_state" === "update", 2)
  .otherwise(0)
)
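The chained `when` calls correspond to a SQL CASE expression. Since the question attempted the change with SQL, it may help to see that form too: as far as I know, Spark SQL does not run UPDATE against a temporary view, so the sketch below derives soft_del in a SELECT instead. The view name `orgData` comes from the question; the object name `SoftDeleteSql` and the helper `markWithSql` are hypothetical, and the local SparkSession and in-memory rows are assumptions for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object SoftDeleteSql {
  // Hypothetical helper: registers df as the temp view "orgData" and computes
  // soft_del with CASE WHEN rather than UPDATE.
  def markWithSql(spark: SparkSession, df: DataFrame): DataFrame = {
    df.createOrReplaceTempView("orgData")
    spark.sql(
      """SELECT id, org_name, delete_state,
        |       CASE WHEN delete_state = 'delete' THEN 1
        |            WHEN delete_state = 'update' THEN 2
        |            ELSE 0 END AS soft_del
        |FROM orgData""".stripMargin)
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumption, not from the question)
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("soft-del-sql-sketch")
      .getOrCreate()
    import spark.implicits._ // needed for toDF on a Seq of tuples

    // Sample rows copied from the table in the question
    val orgDF = Seq(
      (1, "Net", "delete", 0),
      (3, "Bio", "insert", 0),
      (7, "stock", "update", 0)
    ).toDF("id", "org_name", "delete_state", "soft_del")

    markWithSql(spark, orgDF).show()
    spark.stop()
  }
}
```

Both forms return a new DataFrame and leave orgDF unchanged, so the result has to be assigned to a value before it is written to HDFS.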
