I need to add a new column to the DF below based on other columns. Here is the DF schema:
scala> a.printSchema()
root
|-- ID: decimal(22,0) (nullable = true)
|-- NAME: string (nullable = true)
|-- AMOUNT: double (nullable = true)
|-- CODE: integer (nullable = true)
|-- NAME1: string (nullable = true)
|-- code1: array (nullable = true)
| |-- element: integer (containsNull = true)
|-- revised_code: string (nullable = true)
Now I want to add a flag column based on the following conditions:
1 => if code == revised_code, then the flag is P
2 => if code != revised_code, then I
3 => if both code and revised_code are null, then no flag.
Here is the UDF I am trying, but it returns I for cases 1 and 3.
import org.apache.spark.sql.functions.{col, udf}

def tagsUdf =
  udf((code: String, revised_code: String) =>
    if (code == null && revised_code == null) ""
    else if (code == revised_code) "P" else "I")

tagsUdf(col("CODE"), col("revised_code"))
Can anyone point out what I am doing wrong?
I/P DF
+-------------+-------+------------+
|NAME         |   CODE|revised_code|
+-------------+-------+------------+
|amz          |   null|        null|
|Watch        |   null|        5812|
|Watch        |   null|        5812|
|Watch        |   5812|        5812|
|amz          |   null|        null|
|amz          |   9999|        4352|
+-------------+-------+------------+
Schema:
root
|-- MERCHANT_NAME: string (nullable = true)
|-- CODE: integer (nullable = true)
|-- revised_mcc: string (nullable = true)
O/P DF
+-------------+-------+------------+----+
|NAME         |   CODE|revised_code|flag|
+-------------+-------+------------+----+
|amz          |   null|        null|null|
|Watch        |   null|        5812|   I|
|Watch        |   null|        5812|   I|
|Watch        |   5812|        5812|   P|
|amz          |   null|        null|null|
|amz          |   9999|        4352|   I|
+-------------+-------+------------+----+
Answer (score: 1)
You don't need a udf function here. A simple nested when built-in function should do the trick.
import org.apache.spark.sql.functions._

df.withColumn("CODE", col("CODE").cast("string"))
  .withColumn("flag",
    when((isnull(col("CODE")) || col("CODE") === "null") &&
         (isnull(col("revised_code")) || col("revised_code") === "null"), "")
      .otherwise(when(col("CODE") === col("revised_code"), "P").otherwise("I")))
  .show(false)
Here the CODE column is cast to StringType before the when logic is applied, so that CODE and revised_code have matching data types when they are compared.
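As a side note, Spark also exposes a null-safe equality operator (<=>) on Column. A minimal alternative sketch using it (same df as above; this variant is not part of the original answer) could look like this:

import org.apache.spark.sql.functions._

df.withColumn("flag",
    // case 3: both columns are null -> no flag
    when(col("CODE").isNull && col("revised_code").isNull, "")
      // case 1: values match; cast CODE so the integer and string types compare cleanly
      .when(col("CODE").cast("string") <=> col("revised_code"), "P")
      // case 2: anything else is a mismatch
      .otherwise("I"))
  .show(false)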
Note: since the CODE column is an IntegerType, it can never hold the string "null" in any case.
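For completeness, here is a self-contained sketch that builds a small DataFrame modelled on the question's I/P data and applies the same nested-when logic; the sample values, Option types, and local SparkSession setup are assumptions made here purely for reproducibility:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").appName("flag-demo").getOrCreate()
import spark.implicits._

// Sample rows modelled on the question's input; Option keeps the nulls.
val df = Seq(
  ("amz",   None: Option[Int], None: Option[String]),
  ("Watch", None: Option[Int], Some("5812")),
  ("Watch", Some(5812),        Some("5812")),
  ("amz",   Some(9999),        Some("4352"))
).toDF("NAME", "CODE", "revised_code")

df.withColumn("CODE", col("CODE").cast("string"))
  .withColumn("flag",
    when(col("CODE").isNull && col("revised_code").isNull, "")
      .when(col("CODE") === col("revised_code"), "P")
      .otherwise("I"))
  .show(false)
// Expected flags for the four rows above: "", "I", "P", "I".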