Question

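I have 2 CSV files that were created on different dates, and I want to compare them and show what stayed the same and what changed. I don't know where or how to start, because when I tried different merges and joins I ran into the problem that the DataFrames are not the same size.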

df1 :
I    ID            Status         
0   123            Active   
1   124            Active  
2   125            Inactive   
3   126            Active  
4   128            Inactive

df2: 
I    ID            Status         
0   123            Active   
1   124            Inactive 
2   125            Inactive   
3   126            Active  
4   128            Active
5   129            Active  
6   130            Active   
7   131            Active
8   132            Inactive

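The goal is to highlight the status changes from df1 to df2, and to keep the values that stayed constant from df1 to df2. Using the example above, perhaps I would create two separate DataFrames that look like this: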

df3: (containing all new changes)
I    ID              Status           
1    124             Inactive  
4    128             Active 
5    129             Active  
6    130             Active   
7    131             Active
8    132             Inactive

df4: (containing all other 'Active' ones that remained constant)
I    ID             Status         
0   123             Active     
3   126             Active

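To explain the logic behind each row and why it is included in df3, I'll go through them row by row, since I don't know whether my example is clear enough.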

df3:
Index 1 - active to inactive
Index 4 - inactive to active
Index 5 - new active row
Index 6 - new active row
Index 7 - new active row
Index 8 - new inactive row

df4:
Index 0 - remained constant
Index 2 - remained constant
Index 3 - remained constant

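I don't know how to approach this, because when merging and concatenating I get an error saying the DataFrames must be the same size. Basically, what I want to do is find what changed from df1 to df2 and what stayed the same. I have 2 sample datasets that I'm working with; they have more statuses, but the idea is the same. Here is a Google Sheet containing both CSV files: updated_values is df2 and original_values is df1.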

Answer 1

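You need to perform a full outer join to get all entries from both datasets. For IDs that appear only in df2, the columns coming from df1 will be filled with NaN.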

import pandas as pd
df3 = pd.merge(left=df1, right=df2, on='ID', how='outer', indicator=True)

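This new DataFrame will contain a column 'Status_x' (with the values from df1) and a column 'Status_y' (with the values from df2). You can then simply create a new column named 'change' to store the changes. You can use boolean indexing to check which rows have changed: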

new_rows = df3['_merge'] == 'right_only'  # True if the ID was not in df1
constant = df3['Status_x'] == df3['Status_y']  # True if the status is the same in both DataFrames

df3['change'] = df3['Status_x'] + ' to ' + df3['Status_y']  # String concatenation to show the status change, e.g. 'Active to Inactive'
df3.loc[new_rows, 'change'] = 'New ' + df3['Status_y'].str.lower() + ' row'  # 'New active row' or 'New inactive row' for IDs only in df2
df3.loc[constant, 'change'] = 'Remained constant'  # Sets the value for rows whose status did not change
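
To tie this back to the two DataFrames the question asks for, here is a minimal end-to-end sketch using the sample data from the question. The names `merged`, `changed`, and `unchanged` are my own, chosen for illustration; `changed` corresponds to the asker's df3 and `unchanged` to their df4. It also assumes that ID is unique within each file.

```python
import pandas as pd

# Sample data copied from the question (the "I" column is just the index).
df1 = pd.DataFrame({
    "ID": [123, 124, 125, 126, 128],
    "Status": ["Active", "Active", "Inactive", "Active", "Inactive"],
})
df2 = pd.DataFrame({
    "ID": [123, 124, 125, 126, 128, 129, 130, 131, 132],
    "Status": ["Active", "Inactive", "Inactive", "Active", "Active",
               "Active", "Active", "Active", "Inactive"],
})

# Full outer join on ID; '_merge' records which frame each row came from.
merged = pd.merge(df1, df2, on="ID", how="outer", indicator=True)

is_new = merged["_merge"] == "right_only"                # ID only in df2
is_constant = merged["Status_x"] == merged["Status_y"]   # NaN never compares equal

# Label every row, then overwrite the labels for new and unchanged rows.
merged["change"] = merged["Status_x"] + " to " + merged["Status_y"]
merged.loc[is_new, "change"] = "New " + merged["Status_y"].str.lower() + " row"
merged.loc[is_constant, "change"] = "Remained constant"

# Split into the two result frames (hypothetical names, see the note above).
changed = merged.loc[~is_constant, ["ID", "Status_y", "change"]]
unchanged = merged.loc[is_constant, ["ID", "Status_y"]]

print(changed)
print(unchanged)
```

If only the rows that stayed 'Active' are wanted, as in the asker's df4, `unchanged[unchanged["Status_y"] == "Active"]` should narrow it down further. IDs that exist only in df1 (none in this sample) would end up in `changed` with a NaN label, so they may need their own case.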

Find the differences between 2 DataFrames of different sizes with pandas
