Question

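I need to find the differences in column values between two Pandas DataFrames.

I combined my DataFrames using the technique described here: compare two pandas data frame

With this code I can get the rows that were added and the rows that were deleted between the old and the new dataset, where df1 is the old dataset and df2 is the newer one. Both have the same schema.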

m = df1.merge(df2, on=['ID', 'Name'], how='outer', suffixes=['', '_'])
adds = m.loc[m.GPA_.notnull() & m.GPA.isnull()]
deletes = m.loc[m.GPA_.isnull() & m.GPA.notnull()]

What I want to do is filter the adds and deletes out of the merged DataFrame and then compare the column values:

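# compare only the non-key columns; the key columns have no suffixed counterpart
for field in df1.columns.difference(['ID', 'Name']):
    m["diff_%s" % field] = m[field] != m["%s_" % field]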

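This should add one boolean column per compared column, flagging whether the value changed. So my question is: how do I filter out the added and deleted rows before applying this column logic?

Additional information: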

import pandas as pd

_data_orig = [
    [1, "Bob", 3.0],
    [2, "Sam", 2.0],
    [3, "Jane", 4.0],
]
_data_new = [
    [1, "Bob", 3.2],
    [3, "Jane", 3.9],
    [4, "John", 1.2],
    [5, "Lisa", 2.2],
]
_columns = ["ID", "Name", "GPA"]

df1 = pd.DataFrame(data=_data_orig, columns=_columns)
df2 = pd.DataFrame(data=_data_new, columns=_columns)

# outer merge on the key columns; df2's value columns get the suffix "_"
m = df1.merge(df2, on=['ID', 'Name'], how='outer', suffixes=['', '_'])
adds = m.loc[m.GPA_.notnull() & m.GPA.isnull()]
deletes = m.loc[m.GPA_.isnull() & m.GPA.notnull()]

# TODO: add code to remove adds/deletes here
# array should now be: [[1, "Bob", 3.2],
#        [3, "Jane", 3.9]]
# compare only the non-key columns; the key columns have no suffixed counterpart
for field in [c for c in _columns if c not in ("ID", "Name")]:
    m["diff_%s" % field] = m[field] != m["%s_" % field]
# results in:
# array with columns ['ID', 'Name', 'GPA', 'GPA_', 'diff_GPA']
# ... DO other stuff
# write to csv

Answer 1

You can use Index.union to combine both indexes, then drop the rows labelled by idx:

idx = adds.index.union(deletes.index)
print (idx)
Int64Index([1, 3, 4], dtype='int64')

print (m.drop(idx))
   ID  Name  GPA  GPA_
0   1   Bob  3.0   3.2
2   3  Jane  4.0   3.9
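
Once the added and deleted rows are dropped, the column comparison from the question can be applied to what remains. The following is a minimal sketch of that full pipeline, not part of the original answer. It assumes ID and Name are the key columns and that every other column should be compared; the names key_cols, value_cols and common are made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame(
    [[1, "Bob", 3.0], [2, "Sam", 2.0], [3, "Jane", 4.0]],
    columns=["ID", "Name", "GPA"],
)
df2 = pd.DataFrame(
    [[1, "Bob", 3.2], [3, "Jane", 3.9], [4, "John", 1.2], [5, "Lisa", 2.2]],
    columns=["ID", "Name", "GPA"],
)

# Assumption: these are the key columns; everything else is a value column.
key_cols = ["ID", "Name"]
value_cols = [c for c in df1.columns if c not in key_cols]

m = df1.merge(df2, on=key_cols, how="outer", suffixes=("", "_"))
adds = m.loc[m.GPA_.notnull() & m.GPA.isnull()]
deletes = m.loc[m.GPA_.isnull() & m.GPA.notnull()]

# Drop added and deleted rows, keeping only rows present in both frames.
common = m.drop(adds.index.union(deletes.index)).copy()

# One boolean column per value column, True where the value changed.
for field in value_cols:
    common["diff_%s" % field] = common[field] != common["%s_" % field]

print(common)
```

The .copy() after drop is there so that adding the diff_ columns modifies an independent frame rather than something derived from m.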

Another solution uses boolean indexing:

mask = ~((m.GPA_.notnull() & m.GPA.isnull()) | ( m.GPA_.isnull() & m.GPA.notnull()))
print (mask)
0     True
1    False
2     True
3    False
4    False
dtype: bool

print (m[mask])
   ID  Name  GPA  GPA_
0   1   Bob  3.0   3.2
2   3  Jane  4.0   3.9
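
Both solutions above infer adds and deletes from nulls in GPA, which would misclassify a row whose GPA is genuinely missing in one of the datasets. As a possible alternative that was not part of the original answer, merge accepts indicator=True, which adds a _merge column recording where each row came from. A sketch, assuming the same sample data as in the question:

```python
import pandas as pd

df1 = pd.DataFrame(
    [[1, "Bob", 3.0], [2, "Sam", 2.0], [3, "Jane", 4.0]],
    columns=["ID", "Name", "GPA"],
)
df2 = pd.DataFrame(
    [[1, "Bob", 3.2], [3, "Jane", 3.9], [4, "John", 1.2], [5, "Lisa", 2.2]],
    columns=["ID", "Name", "GPA"],
)

# indicator=True adds a "_merge" column with the values
# "left_only", "right_only" or "both".
m = df1.merge(df2, on=["ID", "Name"], how="outer",
              suffixes=("", "_"), indicator=True)

deletes = m[m["_merge"] == "left_only"]   # only in the old dataset
adds = m[m["_merge"] == "right_only"]     # only in the new dataset
common = m[m["_merge"] == "both"].drop(columns="_merge")

print(common)
```

Here common holds the rows for Bob and Jane with the columns ID, Name, GPA and GPA_, ready for the column comparison.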
