After reading the table below with Spark, I got a DataFrame:
val orgDF = spark.read.format("jdbc")
  .option("url", url)
  .option("dbtable", "(select id, org_name, delete_state, 0 as soft_del from schema.table) as orgTable")
  .option("user", username)
  .option("password", pwd)
  .load()
I can see the output data from the DataFrame, as shown below:
-----------------------------------------
id | org_name | delete_state | soft_del
-----------------------------------------
1 | Net | delete | 0
2 | Vert | delete | 0
3 | Bio | insert | 0
4 | Card | delete | 0
7 | stock | update | 0
-----------------------------------------
Before saving the DataFrame to HDFS, I am trying to set the value of the soft_del column to 1 wherever the delete_state column has the value delete, to produce a final DataFrame like this:
-----------------------------------------
id | org_name | delete_state | soft_del
-----------------------------------------
1 | Net | delete | 1
2 | Vert | delete | 1
3 | Bio | insert | 0
4 | Card | delete | 1
7 | stock | update | 0
-----------------------------------------
I know one way to do it:
orgDF.createOrReplaceTempView("orgData")
spark.sql("update orgData set soft_del = 1 where delete_state = 'delete'")
I am also trying to understand how to do this with DataFrame functions, but could not find the right material. Could anyone show me how to achieve this using DataFrame functions?
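As a side note, Spark SQL does not support UPDATE statements against a temporary view, so the spark.sql("update ...") call above will not run. A SQL route that does work is a SELECT with CASE WHEN, which produces a new DataFrame rather than mutating the view. A minimal sketch, assuming orgDF has already been registered as the temp view "orgData" as shown:

```scala
// UPDATE is not supported on temp views; SELECT ... CASE WHEN
// builds a new DataFrame with the corrected soft_del values.
val fixedDF = spark.sql(
  """SELECT id, org_name, delete_state,
    |       CASE WHEN delete_state = 'delete' THEN 1 ELSE soft_del END AS soft_del
    |FROM orgData""".stripMargin)
```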
Answer 0 (score: 2)
You can try something like this:
orgDF.withColumn("soft_del", when($"delete_state" === "delete", 1).otherwise(0))
You can also chain multiple when calls if needed, for example:
orgDF.withColumn("soft_del",
when($"delete_state" === "delete", 1)
.when($"delete_state" === "update", 2)
.otherwise(0)
)
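To tie it together, here is a minimal end-to-end sketch. It uses a hardcoded sample DataFrame standing in for the JDBC read in the question, so the example is self-contained and runnable with a local Spark session:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.when

object SoftDeleteExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("soft-del-example")
      .getOrCreate()
    import spark.implicits._

    // Sample rows mirroring the table in the question,
    // in place of the JDBC source.
    val orgDF = Seq(
      (1, "Net",   "delete", 0),
      (3, "Bio",   "insert", 0),
      (7, "stock", "update", 0)
    ).toDF("id", "org_name", "delete_state", "soft_del")

    // Overwrite soft_del: 1 where delete_state is 'delete', else 0.
    val result = orgDF.withColumn("soft_del",
      when($"delete_state" === "delete", 1).otherwise(0))

    result.show()
    spark.stop()
  }
}
```

Since withColumn returns a new DataFrame, remember to assign the result (or write it out directly) before saving to HDFS.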
Reference: the Scaladoc for the when function.