I have two DataFrames in Spark (PySpark):

DF_A

col1 col2 col3
a    1    100
b    2    300
c    3    500
d    4    700

DF_B

col1 col3
a    150
b    350
c    0
d    650
I want to update DF_A.col3 with the values from DF_B.col3. Currently I am doing:

df_new = df_a.join(df_b, df_a.col1 == df_b.col1, 'inner')

but this gives me col1 twice and col3 twice in df_new, and I then have to drop the redundant columns and fill in 0 where there is no match. What is a better way to do this, without using UDFs?
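For reference, the duplicated key column comes from joining on an expression; joining on the column name collapses the key into a single column (a minimal sketch, assuming df_a and df_b are the DataFrames above):

df_a.join(df_b, df_a.col1 == df_b.col1, 'inner').columns
# ['col1', 'col2', 'col3', 'col1', 'col3']  -- col1 appears twice

df_a.join(df_b, on='col1', how='inner').columns
# ['col1', 'col2', 'col3', 'col3']  -- single col1; col3 still needs renaming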
Answer (score: 1)
If I understand your question correctly, you are trying to execute the equivalent of

UPDATE table_a A, table_b B SET A.col3 = B.col3 WHERE A.col1 = B.col1;

on DataFrames, with 0 when the key does not exist in B (see comments).
a = [("a", 1, 100), ("b", 2, 300), ("c", 3, 500), ("d", 4, 700)]
b = [("a", 150), ("b", 350), ("d", 650)]
df_a = spark.createDataFrame(a, ["col1", "col2", "col3"])
df_b = spark.createDataFrame(b, ["col1", "col3"])
df_a.show()
# +----+----+----+
# |col1|col2|col3|
# +----+----+----+
# | a| 1| 100|
# | b| 2| 300|
# | c| 3| 500|
# | d| 4| 700|
# +----+----+----+
df_b.show() # I have removed an entry for the purpose of the demo.
# +----+----+
# |col1|col3|
# +----+----+
# | a| 150|
# | b| 350|
# | d| 650|
# +----+----+
You need to perform an outer join followed by a coalesce:
from pyspark.sql import functions as F

(df_a.withColumnRenamed('col3', 'col3_a')  # rename both col3 columns to avoid ambiguity
     .join(df_b.withColumnRenamed('col3', 'col3_b'), on='col1', how='outer')
     .withColumn('col3', F.coalesce('col3_b', F.lit(0)))  # B's value where present, else 0
     .drop('col3_a', 'col3_b')
     .show())
# +----+----+----+
# |col1|col2|col3|
# +----+----+----+
# | d| 4| 650|
# | c| 3| 0|
# | b| 2| 350|
# | a| 1| 150|
# +----+----+----+
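Note that the outer join also brings in keys that exist only in df_b (with col2 ending up null). If you want to keep exactly the rows of df_a, a left join is sufficient. The same logic can also be expressed in Spark SQL through temporary views; a minimal sketch, assuming a SparkSession named spark:

df_a.createOrReplaceTempView("table_a")
df_b.createOrReplaceTempView("table_b")
spark.sql("""
    SELECT A.col1, A.col2, COALESCE(B.col3, 0) AS col3
    FROM table_a A
    LEFT JOIN table_b B ON A.col1 = B.col1
""").show()
# same rows as above, restricted to the keys of table_a (row order may differ)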