Question

After reading the table below with Spark, I got a DataFrame:

val orgDF = spark.read.format("jdbc").option("url", url).option("dbtable", "(select id, org_name, delete_state, 0 as soft_del from schema.table) as orgTable").option("user", username).option("password", pwd).load()

The output of the DataFrame looks like this:

-----------------------------------------
id | org_name  | delete_state | soft_del
-----------------------------------------
1  | Net       | delete       |   0
2  | Vert      | delete       |   0
3  | Bio       | insert       |   0
4  | Card      | delete       |   0
7  | stock     | update       |   0
-----------------------------------------

Before saving the DataFrame to HDFS, I want to set the value of the soft_del column to 1 wherever the delete_state column is 'delete', producing a final DataFrame like this:

-----------------------------------------
id | org_name  | delete_state | soft_del
-----------------------------------------
1  | Net       | delete       |   1
2  | Vert      | delete       |   1
3  | Bio       | insert       |   0
4  | Card      | delete       |   1
7  | stock     | update       |   0
-----------------------------------------

I know there is one way to do it:

orgDF.createOrReplaceTempView("orgData")
spark.sql("update orgData set soft_del = 1 where delete_state = 'delete'")

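I am also trying to understand how to do this with DataFrame functions, but I could not find the right material. Could anyone show me how to do it with DataFrame functions?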

Answer 1

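You can try something like this (it needs org.apache.spark.sql.functions.when imported, and spark.implicits._ for the $ syntax):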

orgDF.withColumn("soft_del", when($"delete_state" === "delete", 1).otherwise(0))
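As a self-contained illustration of the line above, here is a minimal sketch that builds the sample rows from the question in memory instead of reading them over JDBC. The object name SoftDeleteExample, the helper markSoftDeleted, and the local[*] session are assumptions made for the example. It also uses otherwise(col("soft_del")) rather than otherwise(0), so that rows which do not match keep whatever value they already had; with the question's data, where every row starts at 0, both variants should give the same result.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, when}

// Hypothetical wrapper object, used only to make the sketch runnable.
object SoftDeleteExample {

  // withColumn returns a new DataFrame (DataFrames are immutable), in which
  // soft_del is 1 for deleted rows and unchanged for all other rows.
  def markSoftDeleted(df: DataFrame): DataFrame =
    df.withColumn(
      "soft_del",
      when(col("delete_state") === "delete", 1).otherwise(col("soft_del"))
    )

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; the question reads from JDBC.
    val spark = SparkSession.builder()
      .appName("soft-del-example")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._ // provides toDF on local collections

    // Sample rows copied from the question's table.
    val orgDF = Seq(
      (1, "Net", "delete", 0),
      (2, "Vert", "delete", 0),
      (3, "Bio", "insert", 0),
      (4, "Card", "delete", 0),
      (7, "stock", "update", 0)
    ).toDF("id", "org_name", "delete_state", "soft_del")

    // Assign the result: orgDF itself is not changed by withColumn.
    val finalDF = markSoftDeleted(orgDF)
    finalDF.show()

    spark.stop()
  }
}
```

Note that the result has to be assigned to a new value (finalDF here) and that new value is what gets written to HDFS; calling withColumn without using its return value has no effect.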

If needed, you can also chain multiple when calls, for example:

orgDF.withColumn("soft_del", 
  when($"delete_state" === "delete", 1)
  .when($"delete_state" === "update", 2)
  .otherwise(0)
)
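For comparison with the SQL attempt in the question, the same logic can be sketched as a SELECT with a CASE WHEN expression over the temporary view. As far as I know, Spark SQL does not accept an UPDATE statement against a temporary view, so the sketch below computes a new DataFrame instead of modifying the view. The object name SoftDeleteSql and the helper markSoftDeletedSql are hypothetical; the view name orgData is taken from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical wrapper object, used only to make the sketch runnable.
object SoftDeleteSql {

  // Registers df as the temp view orgData and selects from it with a
  // CASE WHEN expression. The view is left untouched; a new DataFrame
  // is returned.
  def markSoftDeletedSql(spark: SparkSession, df: DataFrame): DataFrame = {
    df.createOrReplaceTempView("orgData")
    spark.sql(
      """SELECT id, org_name, delete_state,
        |       CASE WHEN delete_state = 'delete' THEN 1 ELSE soft_del END AS soft_del
        |FROM orgData""".stripMargin
    )
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .appName("soft-del-sql-example")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._ // provides toDF on local collections

    // A few sample rows from the question's table.
    val orgDF = Seq(
      (1, "Net", "delete", 0),
      (3, "Bio", "insert", 0),
      (7, "stock", "update", 0)
    ).toDF("id", "org_name", "delete_state", "soft_del")

    markSoftDeletedSql(spark, orgDF).show()
    spark.stop()
  }
}
```

Both forms describe the same conditional column; the when/otherwise version avoids registering a view and keeps the column names checked closer to the code that uses them.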

Reference

Scaladoc for the when function.
