Updating an existing column in pyspark without changing the old values

Asked: 2019-12-26 20:53:39

Tags: pyspark pyspark-dataframes

I am trying to update an existing column in PySpark with the code below, but even though the second condition does not match the earlier rows, the old values in that column also end up being updated (the FLAG for the MY rows comes out null). This is the output I get:

+-----+-----+-----+-----+-----+----+
|cntry|cde_1|cde_2|rsn_1|rsn_2|FLAG|
+-----+-----+-----+-----+-----+----+
|   MY|    A|     |    1|    2|null|
|   MY|    G|     |    1|    2|null|
|   MY|     |    G|    1|    2|null|
|   TH|    A|     |   16|    2|null|
|   TH|    B|     |    1|   16|   1|
|   TH|     |    W|   16|    2|   1|
+-----+-----+-----+-----+-----+----+
from pyspark.sql import functions as F

df = sc.parallelize([
    ["MY", "A", "", "1", "2"],
    ["MY", "G", "", "1", "2"],
    ["MY", "", "G", "1", "2"],
    ["TH", "A", "", "16", "2"],
    ["TH", "B", "", "1", "16"],
    ["TH", "", "W", "16", "2"]
]).toDF(("cntry", "cde_1", "cde_2", "rsn_1", "rsn_2"))

df = df.withColumn('FLAG', F.when((df.cntry == "MY") & ((df.cde_1.isin("G")) | (df.cde_2.isin("G"))) & ((df.rsn_1 == "1") | (df.rsn_2 == "1")), 1))

df = df.withColumn('FLAG', F.when((df.cntry == "TH") & ((df.cde_1.isin("B", "W")) | (df.cde_2.isin("B", "W"))) & ((df.rsn_1 == "16") | (df.rsn_2 == "16")), 1))

1 Answer:

Answer 0 (score: 1)

You need to combine your conditions with a boolean OR, like this:

df = sc.parallelize([
    ["MY", "A", "", "1", "2"],
    ["MY", "G", "", "1", "2"],
    ["MY", "", "G", "1", "2"],
    ["TH", "A", "", "16", "2"],
    ["TH", "B", "", "1", "16"],
    ["TH", "", "W", "16", "2"]
]).toDF(("cntry", "cde_1", "cde_2", "rsn_1", "rsn_2"))

cond1 = (df.cntry == "MY") & ((df.cde_1.isin("G")) | (df.cde_2.isin("G"))) & ((df.rsn_1 == "1") | (df.rsn_2 == "1"))
cond2 = (df.cntry == "TH") & ((df.cde_1.isin("B", "W")) | (df.cde_2.isin("B", "W"))) & ((df.rsn_1 == "16") | (df.rsn_2 == "16"))

df.withColumn("FLAG", F.when(cond1 | cond2, 1)).show()

In your second withColumn call you overwrite the FLAG column, because you never reference its previous state. That is why the values set by the first call are not preserved.
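If you do want to keep the two separate withColumn calls, you can carry the previous value through the otherwise() branch instead. A minimal sketch, assuming cond1 and cond2 as defined above:

# Second call falls back to the existing FLAG value instead of
# replacing it with null when cond2 does not match.
df = df.withColumn("FLAG", F.when(cond1, 1))
df = df.withColumn("FLAG", F.when(cond2, 1).otherwise(F.col("FLAG")))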

Instead of combining the expressions, you could also use when(cond1, 1).otherwise(when(cond2, 1)), as spelled out below. This is a stylistic choice.
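
Spelled out, the nested variant looks like this (same cond1 and cond2 as above):

# Nested when: cond1 is checked first; otherwise cond2 is evaluated.
# Rows matching neither condition still get null, as before.
df.withColumn("FLAG", F.when(cond1, 1).otherwise(F.when(cond2, 1))).show()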