我有两个数据帧。例子:
df1:
Date Fruit Num Color
2013-11-24 Banana 22.1 Yellow
2013-11-24 Orange 8.6 Orange
2013-11-24 Apple 7.6 Green
2013-11-24 Celery 10.2 Green
df2:
Date Fruit Num Color
2013-11-24 Banana 22.1 Yellow
2013-11-24 Orange 8.6 Orange
2013-11-24 Apple 7.6 Green
2013-11-24 Celery 10.2 Green
2013-11-25 Apple 22.1 Red
2013-11-25 Orange 8.6 Orange
每个数据框都以日期作为索引。两个数据帧都具有相同的结构。
我想要做的是比较这两个数据帧并找出df2中哪些行不在df1中。我想比较日期(索引)和第一列(Banana,APple等),看看它们是否存在于df2 vs df1中。
我尝试了以下内容:
对于第一种方法,我收到此错误: “异常:只能比较标记相同的DataFrame对象” 。我已经尝试删除日期作为索引,但得到相同的错误。
在third approach上,我得到断言返回False,但无法弄清楚如何实际看到不同的行。
欢迎提出任何指示
答案 0 :(得分:61)
此方法df1 != df2
仅适用于具有相同行和列的数据框。实际上,所有数据帧轴都与_indexed_same
方法进行比较,如果发现差异,则会引发异常,即使在列/索引顺序中也是如此。
如果我找到了你,你想要找不到变化,而是对称差异。为此,一种方法可能是连接数据帧:
>>> df = pd.concat([df1, df2])
>>> df = df.reset_index(drop=True)
分组
>>> df_gpby = df.groupby(list(df.columns))
获取唯一记录的索引
>>> idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]
过滤
>>> df.reindex(idx)
Date Fruit Num Color
9 2013-11-25 Orange 8.6 Orange
8 2013-11-25 Apple 22.1 Red
答案 1 :(得分:14)
将数据帧传递到字典中的concat,从而生成一个多索引数据框,您可以从中轻松删除重复项,从而产生具有数据帧之间差异的多索引数据框:
import sys
if sys.version_info[0] < 3:
from StringIO import StringIO
else:
from io import StringIO
import pandas as pd
DF1 = StringIO("""Date Fruit Num Color
2013-11-24 Banana 22.1 Yellow
2013-11-24 Orange 8.6 Orange
2013-11-24 Apple 7.6 Green
2013-11-24 Celery 10.2 Green
""")
DF2 = StringIO("""Date Fruit Num Color
2013-11-24 Banana 22.1 Yellow
2013-11-24 Orange 8.6 Orange
2013-11-24 Apple 7.6 Green
2013-11-24 Celery 10.2 Green
2013-11-25 Apple 22.1 Red
2013-11-25 Orange 8.6 Orange""")
df1 = pd.read_table(DF1, sep='\s+')
df2 = pd.read_table(DF2, sep='\s+')
#%%
dfs_dictionary = {'DF1':df1,'DF2':df2}
df=pd.concat(dfs_dictionary)
df.drop_duplicates(keep=False)
结果:
Date Fruit Num Color
DF2 4 2013-11-25 Apple 22.1 Red
5 2013-11-25 Orange 8.6 Orange
答案 2 :(得分:6)
将 ling 对上面 jur 的回复的评论更新和放置到其他人更容易找到的地方。
df_diff = pd.concat([df1,df2]).drop_duplicates(keep=False)
使用这些数据框进行测试:
df1=pd.DataFrame({
'Date':['2013-11-24','2013-11-24','2013-11-24','2013-11-24'],
'Fruit':['Banana','Orange','Apple','Celery'],
'Num':[22.1,8.6,7.6,10.2],
'Color':['Yellow','Orange','Green','Green'],
})
df2=pd.DataFrame({
'Date':['2013-11-24','2013-11-24','2013-11-24','2013-11-24','2013-11-25','2013-11-25'],
'Fruit':['Banana','Orange','Apple','Celery','Apple','Orange'],
'Num':[22.1,8.6,7.6,10.2,22.1,8.6],
'Color':['Yellow','Orange','Green','Green','Red','Orange'],
})
答案 3 :(得分:4)
基于alko的答案几乎对我有用,除了过滤步骤(我得到的地方:ValueError: cannot reindex from a duplicate axis
),这是我使用的最终解决方案:
# join the dataframes
united_data = pd.concat([data1, data2, data3, ...])
# group the data by the whole row to find duplicates
united_data_grouped = united_data.groupby(list(united_data.columns))
# detect the row indices of unique rows
uniq_data_idx = [x[0] for x in united_data_grouped.indices.values() if len(x) == 1]
# extract those unique values
uniq_data = united_data.iloc[uniq_data_idx]
答案 4 :(得分:2)
有一个更简单的解决方案,更快更好, 如果数字不同甚至可以给你数量差异:
df1_i = df1.set_index(['Date','Fruit','Color'])
df2_i = df2.set_index(['Date','Fruit','Color'])
df_diff = df1_i.join(df2_i,how='outer',rsuffix='_').fillna(0)
df_diff = (df_diff['Num'] - df_diff['Num_'])
这里df_diff是差异的概要。您甚至可以使用它来查找数量的差异。在您的示例中:
说明: 与比较两个列表类似,为了有效地做到这一点,我们应该首先对它们进行排序然后比较它们(将列表转换为集合/散列也会很快;两者都是对简单的O(N ^ 2)双重比较循环的一个令人难以置信的改进
注意:以下代码生成表:
df1=pd.DataFrame({
'Date':['2013-11-24','2013-11-24','2013-11-24','2013-11-24'],
'Fruit':['Banana','Orange','Apple','Celery'],
'Num':[22.1,8.6,7.6,10.2],
'Color':['Yellow','Orange','Green','Green'],
})
df2=pd.DataFrame({
'Date':['2013-11-24','2013-11-24','2013-11-24','2013-11-24','2013-11-25','2013-11-25'],
'Fruit':['Banana','Orange','Apple','Celery','Apple','Orange'],
'Num':[22.1,8.6,7.6,10.2,22.1,8.6],
'Color':['Yellow','Orange','Green','Green','Red','Orange'],
})
答案 5 :(得分:2)
# given
df1=pd.DataFrame({'Date':['2013-11-24','2013-11-24','2013-11-24','2013-11-24'],
'Fruit':['Banana','Orange','Apple','Celery'],
'Num':[22.1,8.6,7.6,10.2],
'Color':['Yellow','Orange','Green','Green']})
df2=pd.DataFrame({'Date':['2013-11-24','2013-11-24','2013-11-24','2013-11-24','2013-11-25','2013-11-25'],
'Fruit':['Banana','Orange','Apple','Celery','Apple','Orange'],
'Num':[22.1,8.6,7.6,1000,22.1,8.6],
'Color':['Yellow','Orange','Green','Green','Red','Orange']})
# find which rows are in df2 that aren't in df1 by Date and Fruit
df_2notin1 = df2[~(df2['Date'].isin(df1['Date']) & df2['Fruit'].isin(df1['Fruit']) )].dropna().reset_index(drop=True)
# output
print('df_2notin1\n', df_2notin1)
# Color Date Fruit Num
# 0 Red 2013-11-25 Apple 22.1
# 1 Orange 2013-11-25 Orange 8.6
答案 6 :(得分:2)
自pandas >= 1.1.0
起,我们拥有DataFrame.compare
和Series.compare
。
注意:该方法只能比较标记相同的DataFrame对象, 这意味着具有相同行和列标签的数据框。
df1 = pd.DataFrame({'A': [1, 2, 3],
'B': [4, 5, 6],
'C': [7, np.NaN, 9]})
df2 = pd.DataFrame({'A': [1, 99, 3],
'B': [4, 5, 81],
'C': [7, 8, 9]})
A B C
0 1 4 7.0
1 2 5 NaN
2 3 6 9.0
A B C
0 1 4 7
1 99 5 8
2 3 81 9
df1.compare(df2)
A B C
self other self other self other
1 2.0 99.0 NaN NaN NaN 8.0
2 NaN NaN 6.0 81.0 NaN NaN
答案 7 :(得分:1)
我得到了这个解决方案。这对你有帮助吗?
text = """df1:
2013-11-24 Banana 22.1 Yellow
2013-11-24 Orange 8.6 Orange
2013-11-24 Apple 7.6 Green
2013-11-24 Celery 10.2 Green
df2:
2013-11-24 Banana 22.1 Yellow
2013-11-24 Orange 8.6 Orange
2013-11-24 Apple 7.6 Green
2013-11-24 Celery 10.2 Green
2013-11-25 Apple 22.1 Red
2013-11-25 Orange 8.6 Orange
argetz45
2013-11-24 Banana 22.1 Yellow
2013-11-24 Orange 118.6 Orange
2013-11-24 Apple 74.6 Green
2013-11-24 Celery 10.2 Green
2013-11-25 Nuts 45.8 Brown
2013-11-25 Apple 22.1 Red
2013-11-25 Orange 8.6 Orange
2013-11-26 Pear 102.54 Pale"""
from collections import OrderedDict
import re
r = re.compile('([a-zA-Z\d]+).*\n'
'(20\d\d-[01]\d-[0123]\d.+\n?'
'(.+\n?)*)'
'(?=[ \n]*\Z'
'|'
'\n+[a-zA-Z\d]+.*\n'
'20\d\d-[01]\d-[0123]\d)')
r2 = re.compile('((20\d\d-[01]\d-[0123]\d) +([^\d.]+)(?<! )[^\n]+)')
d = OrderedDict()
bef = []
for m in r.finditer(text):
li = []
for x in r2.findall(m.group(2)):
if not any(x[1:3]==elbef for elbef in bef):
bef.append(x[1:3])
li.append(x[0])
d[m.group(1)] = li
for name,lu in d.iteritems():
print '%s\n%s\n' % (name,'\n'.join(lu))
结果
df1
2013-11-24 Banana 22.1 Yellow
2013-11-24 Orange 8.6 Orange
2013-11-24 Apple 7.6 Green
2013-11-24 Celery 10.2 Green
df2
2013-11-25 Apple 22.1 Red
2013-11-25 Orange 8.6 Orange
argetz45
2013-11-25 Nuts 45.8 Brown
2013-11-26 Pear 102.54 Pale
答案 8 :(得分:0)
需要注意的一个重要细节是,您的数据具有重复的索引值,因此要执行任何直接比较,我们需要将所有内容与df.reset_index()
一起变为唯一,因此我们可以执行基于选择的选择在条件上。一旦你的情况下定义了索引,我假设你想保留de index所以有一个单行解决方案:
[~df2.reset_index().isin(df1.reset_index())].dropna().set_index('Date')
一旦从pythonic的角度来看,目标是提高可读性,我们可以稍微打破一下:
# keep the index name, if it does not have a name it uses the default name
index_name = df.index.name if df.index.name else 'index'
# setting the index to become unique
df1 = df1.reset_index()
df2 = df2.reset_index()
# getting the differences to a Dataframe
df_diff = df2[~df2.isin(df1)].dropna().set_index(index_name)
答案 9 :(得分:0)
答案 10 :(得分:0)
希望这对您有用。 ^ o ^
df1 = pd.DataFrame({'date': ['0207', '0207'], 'col1': [1, 2]})
df2 = pd.DataFrame({'date': ['0207', '0207', '0208', '0208'], 'col1': [1, 2, 3, 4]})
print(f"df1(Before):\n{df1}\ndf2:\n{df2}")
"""
df1(Before):
date col1
0 0207 1
1 0207 2
df2:
date col1
0 0207 1
1 0207 2
2 0208 3
3 0208 4
"""
old_set = set(df1.index.values)
new_set = set(df2.index.values)
new_data_index = new_set - old_set
new_data_list = []
for idx in new_data_index:
new_data_list.append(df2.loc[idx])
if len(new_data_list) > 0:
df1 = df1.append(new_data_list)
print(f"df1(After):\n{df1}")
"""
df1(After):
date col1
0 0207 1
1 0207 2
2 0208 3
3 0208 4
"""
答案 11 :(得分:0)
我尝试了这种方法,并且有效。我希望它也能提供帮助:
"""Identify differences between two pandas DataFrames"""
df1.sort_index(inplace=True)
df2.sort_index(inplace=True)
df_all = pd.concat([df1, df12], axis='columns', keys=['First', 'Second'])
df_final = df_all.swaplevel(axis='columns')[df1.columns[1:]]
df_final[df_final['change this to one of the columns'] != df_final['change this to one of the columns']]
答案 12 :(得分:0)
# THIS WORK FOR ME
# Get all diferent values
df3 = pd.merge(df1, df2, how='outer', indicator='Exist')
df3 = df3.loc[df3['Exist'] != 'both']
# If you like to filter by a common ID
df3 = pd.merge(df1, df2, on="Fruit", how='outer', indicator='Exist')
df3 = df3.loc[df3['Exist'] != 'both']
答案 13 :(得分:0)
将现有数据从 df2
获取到 df1
:
dfe = df2[df2["Fruit"].isin(df1["Fruit"])]
从df2
获取不存在的数据到df1
:
dfn = df2[~ df2["Fruit"].isin(df1["Fruit"])]
您可以使用多个比较。
答案 14 :(得分:0)
您可以找到 DataFrame 行数之间的差异:
df2.value_counts().sub(df1.value_counts(), fill_value=0)
输出:
Date Fruit Num Color
2013-11-24 Apple 7.6 Green 0.0
Banana 22.1 Yellow 0.0
Celery 10.2 Green -1.0
1000.0 Green 1.0
Orange 8.6 Orange 0.0
2013-11-25 Apple 22.1 Red 1.0
Orange 8.6 Orange 1.0
dtype: float6