df1
USERID DATE
1 1/1/2018
1 1/2/2018
1 1/3/2018
2 1/2/2018
2 1/3/2018
3 1/3/2018
df2
USERID DATE
1 1/1/2018
2 1/2/2018
3 1/3/2018
I want to compare the DATE in df2 with the DATE in df1 belonging to the same USERID, to determine whether each row of df1 exists in df2. I want to do the equivalent of {{1}}, but it currently returns an error. Desired result:
USERID DATE Exists
1 1/1/2018 True
1 1/2/2018 False
1 1/3/2018 False
2 1/2/2018 True
2 1/3/2018 False
3 1/3/2018 True
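For reproducibility, here is a minimal sketch that constructs the two sample frames shown above (this setup is my assumption, not part of the original question; DATE is kept as a plain string, matching the display):

import pandas as pd

# Sample data copied from the frames shown above; DATE stays a string,
# but pd.to_datetime could be applied if real date logic is needed.
df1 = pd.DataFrame({
    'USERID': [1, 1, 1, 2, 2, 3],
    'DATE': ['1/1/2018', '1/2/2018', '1/3/2018', '1/2/2018', '1/3/2018', '1/3/2018'],
})
df2 = pd.DataFrame({
    'USERID': [1, 2, 3],
    'DATE': ['1/1/2018', '1/2/2018', '1/3/2018'],
})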
Answer 0 (score: 2)
You can perform a merge:
import pandas as pd

# Flag every row of df2, then merge onto df1 on both key columns;
# df1 rows with no match in df2 get NaN for 'Exists', which fillna
# turns into False.
df2['Exists'] = True
df3 = pd.merge(df1, df2, on=['USERID', 'DATE'], how='outer').fillna(False)
USERID DATE Exists
0 1 1/1/2018 True
1 1 1/2/2018 False
2 1 1/3/2018 False
3 2 1/2/2018 True
4 2 1/3/2018 False
5 3 1/3/2018 True
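An equivalent variant (not from the original answer, but standard pandas) uses merge's indicator parameter instead of a helper column:

# indicator='Exists' adds a column that is 'both' for matched rows and
# 'left_only' otherwise; comparing against 'both' gives the boolean flag.
df3 = pd.merge(df1, df2, on=['USERID', 'DATE'], how='left', indicator='Exists')
df3['Exists'] = df3['Exists'].eq('both')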
Answer 1 (score: 0)
It looks like you are trying to do a left join and then show a new column indicating whether the df2 side of the join is null.
Here is an example adapted from this SO answer and this post:
from pyspark.sql import functions as F

# Alias the dataframes here, to prevent column-name collisions after the join
df1_alias = df1.alias("first")
df2_alias = df2.alias("second")

# Left join on df1.USERID = df2.USERID and df1.DATE = df2.DATE
result = df1_alias.join(
    df2_alias,
    (df1_alias.USERID == df2_alias.USERID) & (df1_alias.DATE == df2_alias.DATE),
    how='left',
)

# Create a column called 'Exists', true when the df2 side of the join matched
result = result.withColumn('Exists', F.col("second.USERID").isNotNull())

# Display just the df1 columns and the Exists flag
result.select([F.col("first.USERID"), F.col("first.DATE"), F.col("Exists")]).show()
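As a design note: when the join keys share names in both frames, the aliasing can be avoided by joining on a list of column names, which keeps a single USERID/DATE pair in the output. A hedged sketch of that variant (my assumption, not from the linked posts):

# Flag every df2 row, left-join on the shared key names, then turn the
# nulls from unmatched rows into False.
flagged = df2.withColumn('Exists', F.lit(True))
result = df1.join(flagged, on=['USERID', 'DATE'], how='left') \
            .fillna(False, subset=['Exists'])
result.show()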