I have two DataFrames, df1 and df2:
df1.show()
+---+--------+-----+----+--------+
|cA | cB | cC | cD | cE |
+---+--------+-----+----+--------+
| A| abc | 0.1 | 0.0| 0 |
| B| def | 0.15| 0.5| 0 |
| C| ghi | 0.2 | 0.2| 1 |
| D| jkl | 1.1 | 0.1| 0 |
| E| mno | 0.1 | 0.1| 0 |
+---+--------+-----+----+--------+
df2.show()
+---+--------+-----+----+--------+
|cA | cB | cH | cI | cJ |
+---+--------+-----+----+--------+
| A| abc | a | b | ? |
| C| ghi | a | c | ? |
+---+--------+-----+----+--------+
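I want to update the cE column in df1 and set it to 1 if the row is referenced in df2. Each record is identified by the cA and cB columns.
Below is the desired output; note that the cE value of the first record has been updated to 1: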
+---+--------+-----+----+--------+
|cA | cB | cC | cD | cE |
+---+--------+-----+----+--------+
| A| abc | 0.1 | 0.0| 1 |
| B| def | 0.15| 0.5| 0 |
| C| ghi | 0.2 | 0.2| 1 |
| D| jkl | 1.1 | 0.1| 0 |
| E| mno | 0.1 | 0.1| 0 |
+---+--------+-----+----+--------+
Answer 0 (score: 1)
Here is my answer.
It is Scala code, sorry, I don't have Python installed. Hope it helps.
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
val ss = SparkSession.builder().master("local").getOrCreate()
import ss.implicits._
val seq1 = Seq(
("A", "abc", 0.1, 0.0, 0),
("B", "def", 0.15, 0.5, 0),
("C", "ghi", 0.2, 0.2, 1),
("D", "jkl", 1.1, 0.1, 0),
("E", "mno", 0.1, 0.1, 0)
)
val seq2 = Seq(
("A", "abc", "a", "b", "?"),
("C", "ghi", "a", "c", "?")
)
val df1 = ss.sparkContext.makeRDD(seq1).toDF("cA", "cB", "cC", "cD", "cE")
val df2 = ss.sparkContext.makeRDD(seq2).toDF("cA", "cB", "cH", "cI", "cJ")
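// The left join keeps every row of df1; the df2 columns are null where no match exists.
val joined = df1.join(df2, (df1("cA") === df2("cA")).and(df1("cB") === df2("cB")), "left")
// cE stays 0 only when there is no match in df2 and the old value was 0; otherwise it becomes 1.
val res = joined.withColumn("newCe",
  when(df2("cA").isNull.and(joined("cE") === lit(0)), lit(0)).otherwise(lit(1)))
res.select(df1("cA"), df1("cB"), df1("cC"), df1("cD"), res("newCe"))
  .withColumnRenamed("newCe", "cE")
  .show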
For me, the output is:
+---+---+----+---+---+
| cA| cB| cC| cD| cE|
+---+---+----+---+---+
| E|mno| 0.1|0.1| 0|
| B|def|0.15|0.5| 0|
| C|ghi| 0.2|0.2| 1|
| A|abc| 0.1|0.0| 1|
| D|jkl| 1.1|0.1| 0|
+---+---+----+---+---+
Answer 1 (score: 1)
When a column value has to be updated based on another column, the when clause comes in handy. See the documentation for when and otherwise.
import pyspark.sql.functions as F
# A left join keeps every df1 row; matched rows get cE = 1, the rest keep their old cE.
cond = (df1.cA == df2.cA) & (df1.cB == df2.cB)
df3 = (df1.join(df2, cond, "left")
       .withColumn("cE", F.when(cond, 1).otherwise(df1.cE))
       .select(df1.cA, df1.cB, df1.cC, df1.cD, "cE"))
df3.show()
+---+---+----+---+---+
| cA| cB| cC| cD| cE|
+---+---+----+---+---+
| E|mno| 0.1|0.1| 0|
| B|def|0.15|0.5| 0|
| C|ghi| 0.2|0.2| 1|
| A|abc| 0.1|0.0| 1|
| D|jkl| 1.1|0.1| 0|
+---+---+----+---+---+
Answer 2 (score: 0)
With a join, you can do what you want:
import pandas as pd
df1 = pd.DataFrame({'cA': ['A', 'B', 'C', 'D', 'E'], 'cB': ['abc', 'def', 'ghi', 'jkl', 'mno'], 'cE': [0, 0, 1, 0, 0]})
df2 = pd.DataFrame({'cA': ['A', 'C'], 'cB': ['abc', 'ghi'], 'cE': ['?', '?']})
# Left join on the key columns; the suffixes keep the two cE columns apart
df = df1.join(df2.set_index(['cA', 'cB']), lsuffix='_df1', rsuffix='_df2', on=['cA', 'cB'])
# NaN in cE_df2 marks rows of df1 that have no match in df2
df.loc[~df['cE_df2'].isna(), 'cE_df2'] = 1
df.loc[df['cE_df2'].isna(), 'cE_df2'] = 0
df1['cE'] = df['cE_df2']
Output:
cA cB cE
0 A abc 1
1 B def 0
2 C ghi 1
3 D jkl 0
4 E mno 0
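A variant of the same join idea, as a minimal sketch: pandas' merge with indicator=True adds a _merge column that records whether each df1 row found a partner in df2, so no suffixes or NaN checks are needed. Unlike the snippet above, this version leaves an existing cE value alone when there is no match. The helper name flag_referenced and the cJ column in the sample df2 are assumptions made for illustration.

```python
import pandas as pd


def flag_referenced(df1, df2, keys=("cA", "cB"), target="cE"):
    """Return a copy of df1 with `target` set to 1 where the key pair occurs in df2.

    Hypothetical helper for illustration; it is not part of pandas.
    """
    keys = list(keys)
    # Keep only the key columns of df2 and drop duplicates so that the
    # left join cannot multiply rows of df1.
    lookup = df2[keys].drop_duplicates()
    merged = df1.merge(lookup, on=keys, how="left", indicator=True)
    out = df1.copy()
    # "_merge" equals "both" for rows that found a partner in df2.
    matched = (merged["_merge"] == "both").to_numpy()
    out.loc[matched, target] = 1
    return out


df1 = pd.DataFrame({
    "cA": ["A", "B", "C", "D", "E"],
    "cB": ["abc", "def", "ghi", "jkl", "mno"],
    "cE": [0, 0, 1, 0, 0],
})
df2 = pd.DataFrame({"cA": ["A", "C"], "cB": ["abc", "ghi"], "cJ": ["?", "?"]})

result = flag_referenced(df1, df2)
print(result)
```

For the sample data, rows A and C end up with cE equal to 1, and df1 itself is not modified because the helper works on a copy.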
Answer 3 (score: 0)
Try:
# Each row of df2.values starts with the key columns cA and cB
for i in df2.values:
    df1.loc[(df1.cA == i[0]) & (df1.cB == i[1]), ['cE']] = 1
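The loop issues one boolean filter per row of df2, which can get slow when df2 is large. A possible vectorized alternative, sketched under the assumption that both frames are pandas DataFrames with the key columns cA and cB, builds a MultiIndex of key pairs for each frame and tests membership in a single pass. The sample frames and the variable name is_referenced are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    "cA": ["A", "B", "C", "D", "E"],
    "cB": ["abc", "def", "ghi", "jkl", "mno"],
    "cE": [0, 0, 1, 0, 0],
})
df2 = pd.DataFrame({"cA": ["A", "C"], "cB": ["abc", "ghi"], "cJ": ["?", "?"]})

keys = ["cA", "cB"]
# One (cA, cB) tuple per row in each frame; isin returns a boolean array
# that is True where the df1 key pair also appears in df2.
is_referenced = pd.MultiIndex.from_frame(df1[keys]).isin(
    pd.MultiIndex.from_frame(df2[keys])
)
# Only the matched rows are touched; every other cE value is left as it was.
df1.loc[is_referenced, "cE"] = 1
print(df1)
```

As with the loop, this modifies df1 in place.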