Question

I have a UDF:

val convert = udf((input:Int) =>  {input + 1})

The original UDF performs a more complex computation, but I think this simplified version is enough as an example.

Then I apply it to my DataFrame:

.withColumn("id",convert(monotonically_increasing_id))

Then I tried:

spark.sql("select * from mytable where id>400 and id < 500").show(1000)

Somehow I see multiple rows with the same ID. The id seems to wrap around, so I get each number between 400 and 500 four times.

Any idea why this happens?

Answer 1

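One possibility is integer overflow: monotonically_increasing_id returns a Long, and the UDF takes an Int, so the value has to be narrowed before the UDF sees it. If that is the cause, switching the UDF to the following should fix the problem: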

val convert = udf((input: Long) => input + 1)
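
To illustrate why narrowing to Int could produce exactly this kind of duplication, here is a minimal sketch in plain Scala (no Spark required). It assumes the layout described in the Spark documentation for monotonically_increasing_id (partition ID in the upper 31 bits, record number within the partition in the lower 33 bits), and it assumes the Long-to-Int conversion keeps only the low 32 bits, as Scala's toInt does. The helper fakeMonotonicId and the choice of four partitions are hypothetical, chosen only to mirror the "four times" symptom in the question.

```scala
object IdTruncationSketch {
  // Hypothetical helper that mimics the documented bit layout:
  // partition ID in the upper 31 bits, record number in the lower 33 bits.
  def fakeMonotonicId(partition: Int, record: Long): Long =
    (partition.toLong << 33) | record

  // The same record number (450) on four assumed partitions.
  val ids: Seq[Long] = (0 until 4).map(p => fakeMonotonicId(p, 450L))

  // Narrowing to Int keeps only the low 32 bits, so the partition bits are lost.
  val viaInt: Seq[Int] = ids.map(id => id.toInt + 1)

  // Staying in Long preserves the partition bits, so the values remain distinct.
  val viaLong: Seq[Long] = ids.map(id => id + 1L)

  def main(args: Array[String]): Unit = {
    println(viaInt)
    println(viaLong)
  }
}
```

Under these assumptions every entry of viaInt is 451, while viaLong holds four distinct values. That would be consistent with seeing each id once per partition if the DataFrame has four partitions.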