I have a DataFrame with a user_tag column, and I want to replace it with a new random UUID value for each row. How can I do that?
--------------------------------------
| user_tag | pref_code | name |
--------------------------------------
| abc123 | Reg | Richard |
| abc123 | Reg | Mort |
| abc123 | Disc | Jack |
I want to generate a randomUUID for user_tag in Spark, so that the result looks like this:
-------------------------------------------------------------------
| user_tag | pref_code | name |
-------------------------------------------------------------------
| af3fb8b8-7ceb-4cec-ac27-2a034bb44bb9 | Reg | Richard |
| snc22fls-2cgb-sas2-hc26-43d35ggg4522 | Reg | Mort |
| afgdw8b8-4fss-ycec-ycd7-haj3jbbj4bj9 | Disc | Jack |
I tried the following, but every row gets the same UUID:
val withUUID = dataFrame.withColumn("user_tag",
when(col("user_tag") === "abc123", randomUUID.toString).otherwise(col("user_tag")))
Answer 0 (score: 0)
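You can create a udf and call it inside the when/otherwise (case when-then) expression. In your attempt, randomUUID.toString is evaluated once on the driver and the resulting string is embedded in the plan as a constant, which is why every row gets the same value.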
Example:
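import org.apache.spark.sql.functions.{udf, when}
import spark.implicits._ // already in scope in spark-shell; needed for toDF and 'user_tag

// udf that generates a fresh random UUID each time it is invoked
val rand_UUID = udf(() => java.util.UUID.randomUUID().toString)
val df = Seq(("abc123", "Reg", "Richard"), ("abc123", "Reg", "Mort"))
  .toDF("user_tag", "pref_code", "name")

// replace user_tag with a random UUID only where it equals "abc123"
df.withColumn("user_tag",
    when('user_tag === "abc123", rand_UUID()).otherwise('user_tag))
  .show(false)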
Result:
+------------------------------------+---------+-------+
|user_tag                            |pref_code|name   |
+------------------------------------+---------+-------+
|e0b3c917-dcc5-4c42-bfe3-32af18b1cfec|Reg      |Richard|
|90098d7d-8dc7-42df-a89b-5bd7f2c5cd99|Reg      |Mort   |
+------------------------------------+---------+-------+
In short, the udf is invoked once for every row that matches the condition, so each matching row gets its own randomUUID.
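Two variations may be worth considering; the following is only a sketch, assuming Spark 2.3 or later. Spark treats a udf as deterministic by default, so the optimizer is allowed to reorder or duplicate the call; marking it with asNondeterministic() tells Spark not to rely on that. Alternatively, the built-in SQL function uuid() can be used through expr, which avoids the udf entirely. The object name RandomUuidSketch, the local[*] master, and the third sample row are assumptions added for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, expr, udf, when}

// Hypothetical standalone wrapper; in spark-shell, `spark` already exists
// and only the body of main is needed.
object RandomUuidSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local run, for illustration only
      .appName("uuid-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      ("abc123", "Reg", "Richard"),
      ("abc123", "Reg", "Mort"),
      ("abc123", "Disc", "Jack")
    ).toDF("user_tag", "pref_code", "name")

    // Option 1: the same udf as in the answer, flagged as nondeterministic
    // so the optimizer does not assume repeated calls return the same value.
    val randUuid = udf(() => java.util.UUID.randomUUID().toString)
      .asNondeterministic()

    val viaUdf = df.withColumn("user_tag",
      when(col("user_tag") === "abc123", randUuid())
        .otherwise(col("user_tag")))

    // Option 2: the built-in SQL function uuid(), called through expr.
    val viaBuiltin = df.withColumn("user_tag",
      when(col("user_tag") === "abc123", expr("uuid()"))
        .otherwise(col("user_tag")))

    viaUdf.show(false)
    viaBuiltin.show(false)

    spark.stop()
  }
}
```

With either option the UUIDs are produced when the plan is executed, so running a second action on the same DataFrame will generally yield different values. If the generated tags need to stay stable, it is probably safest to write the result out (or otherwise materialize it) once and read it back.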