I am trying to update an existing column in PySpark, but it seems that the old values in the column are being overwritten even though there is no otherwise condition.
+-----+-----+-----+-----+-----+----+
|cntry|cde_1|cde_2|rsn_1|rsn_2|FLAG|
+-----+-----+-----+-----+-----+----+
|   MY|    A|     |    1|    2|null|
|   MY|    G|     |    1|    2|null|
|   MY|     |    G|    1|    2|null|
|   TH|    A|     |   16|    2|null|
|   TH|    B|     |    1|   16|   1|
|   TH|     |    W|   16|    2|   1|
+-----+-----+-----+-----+-----+----+
df = sc.parallelize([
    ["MY", "A", "", "1", "2"],
    ["MY", "G", "", "1", "2"],
    ["MY", "", "G", "1", "2"],
    ["TH", "A", "", "16", "2"],
    ["TH", "B", "", "1", "16"],
    ["TH", "", "W", "16", "2"]
]).toDF(("cntry", "cde_1", "cde_2", "rsn_1", "rsn_2"))

df = df.withColumn('FLAG', F.when((df.cntry == "MY") & (df.cde_1.isin("G") | df.cde_2.isin("G")) & ((df.rsn_1 == "1") | (df.rsn_2 == "1")), 1))

df = df.withColumn('FLAG', F.when((df.cntry == "TH") & (df.cde_1.isin("B", "W") | df.cde_2.isin("B", "W")) & ((df.rsn_1 == "16") | (df.rsn_2 == "16")), 1))
Answer 0 (score: 1)
You need to combine your conditions with a boolean OR, like this:
df = sc.parallelize([
    ["MY", "A", "", "1", "2"],
    ["MY", "G", "", "1", "2"],
    ["MY", "", "G", "1", "2"],
    ["TH", "A", "", "16", "2"],
    ["TH", "B", "", "1", "16"],
    ["TH", "", "W", "16", "2"]
]).toDF(("cntry", "cde_1", "cde_2", "rsn_1", "rsn_2"))

cond1 = (df.cntry == "MY") & (df.cde_1.isin("G") | df.cde_2.isin("G")) & ((df.rsn_1 == "1") | (df.rsn_2 == "1"))
cond2 = (df.cntry == "TH") & (df.cde_1.isin("B", "W") | df.cde_2.isin("B", "W")) & ((df.rsn_1 == "16") | (df.rsn_2 == "16"))

df.withColumn("FLAG", F.when(cond1 | cond2, 1)).show()
In your last line you overwrite the FLAG column, because you never reference its previous state. That is why the previously set values are not preserved.
Instead of combining the expressions, you could also use when(cond1, 1).otherwise(when(cond2, 1)). That is a matter of style.