Question

df1
  USERID    DATE
     1       1/1/2018
     1       1/2/2018
     1       1/3/2018
     2       1/2/2018
     2       1/3/2018
     3       1/3/2018

df2
  USERID    DATE
     1       1/1/2018        
     2       1/2/2018         
     3       1/3/2018

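I want to compare the DATE values in df2 with the DATE values in df1 that belong to the same USERID, to determine whether each row of df1 exists in df2.

My attempt at this currently returns an error. The desired result:

Result:
  USERID    DATE        Exists
     1       1/1/2018    True
     1       1/2/2018    False
     1       1/3/2018    False
     2       1/2/2018    True
     2       1/3/2018    False
     3       1/3/2018    True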

Answer 1

You can do a merge:

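import pandas as pd

# Flag every row of df2, so matched rows carry Exists=True after the merge
df2['Exists'] = True

# A left join keeps all rows of df1; rows with no match in df2 get NaN in Exists
df3 = pd.merge(df1, df2, on=['USERID', 'DATE'], how='left')
df3['Exists'] = df3['Exists'].fillna(False).astype(bool)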

  USERID    DATE    Exists
0   1   1/1/2018    True
1   1   1/2/2018    False
2   1   1/3/2018    False
3   2   1/2/2018    True
4   2   1/3/2018    False
5   3   1/3/2018    True
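
A variant of the same merge that leaves df2 unmodified is sketched below. It is only an illustration built on the question's sample data, and it assumes the DATE strings are formatted identically in both frames. It relies on the `indicator=True` option of pandas `merge`, which adds a `_merge` column saying whether each row was found in both frames; the names `merged` and `df3` are chosen here for illustration.

```python
import pandas as pd

# Sample data copied from the question
df1 = pd.DataFrame({
    "USERID": [1, 1, 1, 2, 2, 3],
    "DATE": ["1/1/2018", "1/2/2018", "1/3/2018",
             "1/2/2018", "1/3/2018", "1/3/2018"],
})
df2 = pd.DataFrame({
    "USERID": [1, 2, 3],
    "DATE": ["1/1/2018", "1/2/2018", "1/3/2018"],
})

# Left join keeps every df1 row; indicator=True adds a "_merge" column
# whose value is "both" when the (USERID, DATE) pair also appears in df2.
# drop_duplicates() guards against df1 rows being repeated if df2 has duplicates.
merged = df1.merge(
    df2.drop_duplicates(),
    on=["USERID", "DATE"],
    how="left",
    indicator=True,
)

merged["Exists"] = merged["_merge"] == "both"
df3 = merged.drop(columns="_merge")
print(df3)
```

If the dates may be formatted differently in the two frames, converting both DATE columns with `pd.to_datetime` before merging would likely be needed for the keys to match.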

Answer 2

It looks like you are trying to do a left join and then add a new column that shows where the df2 side is null.

Here is an example adapted from this SO answer and this post:

from pyspark.sql import functions as F

# Alias the columns here, to prevent column name collision
df1_alias = df1.alias("first")
df2_alias = df2.alias("second")

# Left join on df1.USERID = df2.USERID and df1.DATE = df2.DATE
join_cond = (df1_alias.USERID == df2_alias.USERID) & (df1_alias.DATE == df2_alias.DATE)
result = df1_alias.join(df2_alias, join_cond, how='left')

# Create a column called 'exists' and set it to true if the df2 side has a value
result = result.withColumn('exists', F.col("second.USERID").isNotNull())

# Display just the df1 values and the exists column
result.select([F.col("first.USERID"), F.col("first.DATE"), F.col("exists")]).show()
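
For readers without a Spark session, the same left-join-then-null-check logic can be tried on the question's data in pandas. This is only a sketch mirroring the steps above, not part of the original answer: the renamed columns `USERID_2` and `DATE_2` are hypothetical names that stand in for the "second" alias, and `second`, `joined`, and `result` are likewise illustrative.

```python
import pandas as pd

# Sample data copied from the question
df1 = pd.DataFrame({
    "USERID": [1, 1, 1, 2, 2, 3],
    "DATE": ["1/1/2018", "1/2/2018", "1/3/2018",
             "1/2/2018", "1/3/2018", "1/3/2018"],
})
df2 = pd.DataFrame({
    "USERID": [1, 2, 3],
    "DATE": ["1/1/2018", "1/2/2018", "1/3/2018"],
})

# Rename df2's columns (hypothetical names) so both sides survive the join,
# playing the role of the "second" alias in the Spark version.
second = df2.rename(columns={"USERID": "USERID_2", "DATE": "DATE_2"})

# Left join: every df1 row is kept; unmatched rows get NaN on the df2 side.
joined = df1.merge(
    second,
    left_on=["USERID", "DATE"],
    right_on=["USERID_2", "DATE_2"],
    how="left",
)

# A row "exists" in df2 when the df2-side key is not null.
joined["exists"] = joined["USERID_2"].notna()

# Keep just the df1 values and the exists column.
result = joined[["USERID", "DATE", "exists"]]
print(result)
```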

How to compare two dates in two DFs that have no matching index?

2 answers: