我试图突出显示两个数据帧之间的确切变化。
假设我有两个Python Pandas数据帧:
"StudentRoster Jan-1":
id Name score isEnrolled Comment
111 Jack 2.17 True He was late to class
112 Nick 1.11 False Graduated
113 Zoe 4.12 True
"StudentRoster Jan-2":
id Name score isEnrolled Comment
111 Jack 2.17 True He was late to class
112 Nick 1.21 False Graduated
113 Zoe 4.12 False On vacation
我的目标是输出一个HTML表:
输出具有相同,OLD和NEW值的行(理想情况下输入到HTML表格中),以便消费者可以清楚地看到两个数据帧之间发生了哪些变化:
"StudentRoster Difference Jan-1 - Jan-2":
id Name score isEnrolled Comment
112 Nick was 1.11| now 1.21 False Graduated
113 Zoe 4.12 was True | now False was "" | now "On vacation"
我想我可以逐行和逐列比较,但有更简单的方法吗?
答案 0 :(得分:126)
第一部分类似于Constantine,你可以得到哪些行为空的布尔值*:
In [21]: ne = (df1 != df2).any(1)
In [22]: ne
Out[22]:
0 False
1 True
2 True
dtype: bool
然后我们可以看到哪些条目已经改变:
In [23]: ne_stacked = (df1 != df2).stack()
In [24]: changed = ne_stacked[ne_stacked]
In [25]: changed.index.names = ['id', 'col']
In [26]: changed
Out[26]:
id col
1 score True
2 isEnrolled True
Comment True
dtype: bool
此处第一个条目是索引,第二个条目是已更改的列。
In [27]: difference_locations = np.where(df1 != df2)
In [28]: changed_from = df1.values[difference_locations]
In [29]: changed_to = df2.values[difference_locations]
In [30]: pd.DataFrame({'from': changed_from, 'to': changed_to}, index=changed.index)
Out[30]:
from to
id col
1 score 1.11 1.21
2 isEnrolled True False
Comment None On vacation
*注意:df1
和df2
在此处共享相同的索引非常重要。为了克服这种歧义,您可以确保只使用df1.index & df2.index
查看共享标签,但我想我会将其作为练习。
答案 1 :(得分:58)
可以使用DataFrame样式属性突出显示存在差异的单元格的背景颜色。
使用原始问题中的示例数据
第一步是使用concat
函数水平连接DataFrames,并使用keys
参数区分每个帧:
df_all = pd.concat([df.set_index('id'), df2.set_index('id')],
axis='columns', keys=['First', 'Second'])
df_all
交换列级别并将相同的列名称放在一起可能更容易:
df_final = df_all.swaplevel(axis='columns')[df.columns[1:]]
df_final
现在,更容易发现帧中的差异。但是,我们可以进一步使用style
属性来突出显示不同的单元格。我们定义了一个自定义函数来执行此操作,您可以在this part of the documentation中看到。
def highlight_diff(data, color='yellow'):
attr = 'background-color: {}'.format(color)
other = data.xs('First', axis='columns', level=-1)
return pd.DataFrame(np.where(data.ne(other, level=0), attr, ''),
index=data.index, columns=data.columns)
df_final.style.apply(highlight_diff, axis=None)
这将突出显示两个都缺少值的单元格。您可以填写它们或提供额外的逻辑,以便它们不会突出显示。
答案 2 :(得分:32)
这个答案简单地扩展了@Andy Hayden,使其在数字字段nan
时具有弹性,并将其包装到函数中。
import pandas as pd
import numpy as np
def diff_pd(df1, df2):
"""Identify differences between two pandas DataFrames"""
assert (df1.columns == df2.columns).all(), \
"DataFrame column names are different"
if any(df1.dtypes != df2.dtypes):
"Data Types are different, trying to convert"
df2 = df2.astype(df1.dtypes)
if df1.equals(df2):
return None
else:
# need to account for np.nan != np.nan returning True
diff_mask = (df1 != df2) & ~(df1.isnull() & df2.isnull())
ne_stacked = diff_mask.stack()
changed = ne_stacked[ne_stacked]
changed.index.names = ['id', 'col']
difference_locations = np.where(diff_mask)
changed_from = df1.values[difference_locations]
changed_to = df2.values[difference_locations]
return pd.DataFrame({'from': changed_from, 'to': changed_to},
index=changed.index)
因此,使用您的数据(稍加编辑以在分数列中包含NaN):
import sys
if sys.version_info[0] < 3:
from StringIO import StringIO
else:
from io import StringIO
DF1 = StringIO("""id Name score isEnrolled Comment
111 Jack 2.17 True "He was late to class"
112 Nick 1.11 False "Graduated"
113 Zoe NaN True " "
""")
DF2 = StringIO("""id Name score isEnrolled Comment
111 Jack 2.17 True "He was late to class"
112 Nick 1.21 False "Graduated"
113 Zoe NaN False "On vacation" """)
df1 = pd.read_table(DF1, sep='\s+', index_col='id')
df2 = pd.read_table(DF2, sep='\s+', index_col='id')
diff_pd(df1, df2)
输出:
from to
id col
112 score 1.11 1.21
113 isEnrolled True False
Comment On vacation
答案 3 :(得分:18)
我遇到过这个问题,但在找到这篇文章之前找到了答案:
根据unutbu的答案,加载您的数据......
import pandas as pd
import io
texts = ['''\
id Name score isEnrolled Date
111 Jack True 2013-05-01 12:00:00
112 Nick 1.11 False 2013-05-12 15:05:23
Zoe 4.12 True ''',
'''\
id Name score isEnrolled Date
111 Jack 2.17 True 2013-05-01 12:00:00
112 Nick 1.21 False
Zoe 4.12 False 2013-05-01 12:00:00''']
df1 = pd.read_fwf(io.BytesIO(texts[0]), widths=[5,7,25,17,20], parse_dates=[4])
df2 = pd.read_fwf(io.BytesIO(texts[1]), widths=[5,7,25,17,20], parse_dates=[4])
...定义你的 diff 函数......
def report_diff(x):
return x[0] if x[0] == x[1] else '{} | {}'.format(*x)
然后你可以简单地使用Panel得出结论:
my_panel = pd.Panel(dict(df1=df1,df2=df2))
print my_panel.apply(report_diff, axis=0)
# id Name score isEnrolled Date
#0 111 Jack nan | 2.17 True 2013-05-01 12:00:00
#1 112 Nick 1.11 | 1.21 False 2013-05-12 15:05:23 | NaT
#2 nan | nan Zoe 4.12 True | False NaT | 2013-05-01 12:00:00
顺便说一句,如果你在IPython Notebook中,你可能想使用彩色的 diff 函数 根据细胞是否不同,相等或左/右为空来给出颜色:
from IPython.display import HTML
pd.options.display.max_colwidth = 500 # You need this, otherwise pandas
# will limit your HTML strings to 50 characters
def report_diff(x):
if x[0]==x[1]:
return unicode(x[0].__str__())
elif pd.isnull(x[0]) and pd.isnull(x[1]):
return u'<table style="background-color:#00ff00;font-weight:bold;">'+\
'<tr><td>%s</td></tr><tr><td>%s</td></tr></table>' % ('nan', 'nan')
elif pd.isnull(x[0]) and ~pd.isnull(x[1]):
return u'<table style="background-color:#ffff00;font-weight:bold;">'+\
'<tr><td>%s</td></tr><tr><td>%s</td></tr></table>' % ('nan', x[1])
elif ~pd.isnull(x[0]) and pd.isnull(x[1]):
return u'<table style="background-color:#0000ff;font-weight:bold;">'+\
'<tr><td>%s</td></tr><tr><td>%s</td></tr></table>' % (x[0],'nan')
else:
return u'<table style="background-color:#ff0000;font-weight:bold;">'+\
'<tr><td>%s</td></tr><tr><td>%s</td></tr></table>' % (x[0], x[1])
HTML(my_panel.apply(report_diff, axis=0).to_html(escape=False))
答案 4 :(得分:16)
import pandas as pd
import io
texts = ['''\
id Name score isEnrolled Comment
111 Jack 2.17 True He was late to class
112 Nick 1.11 False Graduated
113 Zoe 4.12 True ''',
'''\
id Name score isEnrolled Comment
111 Jack 2.17 True He was late to class
112 Nick 1.21 False Graduated
113 Zoe 4.12 False On vacation''']
df1 = pd.read_fwf(io.BytesIO(texts[0]), widths=[5,7,25,21,20])
df2 = pd.read_fwf(io.BytesIO(texts[1]), widths=[5,7,25,21,20])
df = pd.concat([df1,df2])
print(df)
# id Name score isEnrolled Comment
# 0 111 Jack 2.17 True He was late to class
# 1 112 Nick 1.11 False Graduated
# 2 113 Zoe 4.12 True NaN
# 0 111 Jack 2.17 True He was late to class
# 1 112 Nick 1.21 False Graduated
# 2 113 Zoe 4.12 False On vacation
df.set_index(['id', 'Name'], inplace=True)
print(df)
# score isEnrolled Comment
# id Name
# 111 Jack 2.17 True He was late to class
# 112 Nick 1.11 False Graduated
# 113 Zoe 4.12 True NaN
# 111 Jack 2.17 True He was late to class
# 112 Nick 1.21 False Graduated
# 113 Zoe 4.12 False On vacation
def report_diff(x):
return x[0] if x[0] == x[1] else '{} | {}'.format(*x)
changes = df.groupby(level=['id', 'Name']).agg(report_diff)
print(changes)
打印
score isEnrolled Comment
id Name
111 Jack 2.17 True He was late to class
112 Nick 1.11 | 1.21 False Graduated
113 Zoe 4.12 True | False nan | On vacation
答案 5 :(得分:8)
如果您的两个数据帧中包含相同的ID,那么找出更改的内容实际上非常简单。只需执行frame1 != frame2
将为您提供一个布尔数据框架,其中每个True
都是已更改的数据。从那里,您可以通过changedids = frame1.index[np.any(frame1 != frame2,axis=1)]
轻松获取每个更改行的索引。
答案 6 :(得分:5)
使用concat和drop_duplicates的另一种方法:
import sys
if sys.version_info[0] < 3:
from StringIO import StringIO
else:
from io import StringIO
import pandas as pd
DF1 = StringIO("""id Name score isEnrolled Comment
111 Jack 2.17 True "He was late to class"
112 Nick 1.11 False "Graduated"
113 Zoe NaN True " "
""")
DF2 = StringIO("""id Name score isEnrolled Comment
111 Jack 2.17 True "He was late to class"
112 Nick 1.21 False "Graduated"
113 Zoe NaN False "On vacation" """)
df1 = pd.read_table(DF1, sep='\s+', index_col='id')
df2 = pd.read_table(DF2, sep='\s+', index_col='id')
#%%
dictionary = {1:df1,2:df2}
df=pd.concat(dictionary)
df.drop_duplicates(keep=False)
输出:
Name score isEnrolled Comment
id
1 112 Nick 1.11 False Graduated
113 Zoe NaN True
2 112 Nick 1.21 False Graduated
113 Zoe NaN False On vacation
答案 7 :(得分:5)
DataFrame.compare
使用pandas 1.1,您基本上可以通过单个函数调用来复制Ted Petrou的输出。来自文档的示例:
pd.__version__
# '1.1.0'
df1.compare(df2)
score isEnrolled Comment
self other self other self other
1 1.11 1.21 NaN NaN NaN NaN
2 NaN NaN 1.0 0.0 NaN On vacation
在此,“自身”是指LHS数据帧,而“其他”是指RHS数据帧。默认情况下,相等值将替换为NaN,因此您可以仅关注差异。如果要显示相等的值,请使用
df1.compare(df2, keep_equal=True, keep_shape=True)
score isEnrolled Comment
self other self other self other
1 1.11 1.21 False False Graduated Graduated
2 4.12 4.12 True False NaN On vacation
您还可以使用align_axis
更改比较轴:
df1.compare(df2, align_axis='index')
score isEnrolled Comment
1 self 1.11 NaN NaN
other 1.21 NaN NaN
2 self NaN 1.0 NaN
other NaN 0.0 On vacation
这将按行而不是按列比较值。
答案 8 :(得分:3)
扩展@cge的答案,这对结果的可读性来说非常酷:
a[a != b][np.any(a != b, axis=1)].join(DataFrame('a<->b', index=a.index, columns=['a<=>b'])).join(
b[a != b][np.any(a != b, axis=1)]
,rsuffix='_b', how='outer'
).fillna('')
完整演示示例:
a = DataFrame(np.random.randn(7,3), columns=list('ABC'))
b = a.copy()
b.iloc[0,2] = np.nan
b.iloc[1,0] = 7
b.iloc[3,1] = 77
b.iloc[4,2] = 777
a[a != b][np.any(a != b, axis=1)].join(DataFrame('a<->b', index=a.index, columns=['a<=>b'])).join(
b[a != b][np.any(a != b, axis=1)]
,rsuffix='_b', how='outer'
).fillna('')
答案 9 :(得分:3)
在摆弄@journois的答案之后,由于Panel's deprication,我能够使用MultiIndex而不是Panel来使用它。
首先,创建一些虚拟数据:
df1 = pd.DataFrame({
'id': ['111', '222', '333', '444', '555'],
'let': ['a', 'b', 'c', 'd', 'e'],
'num': ['1', '2', '3', '4', '5']
})
df2 = pd.DataFrame({
'id': ['111', '222', '333', '444', '666'],
'let': ['a', 'b', 'c', 'D', 'f'],
'num': ['1', '2', 'Three', '4', '6'],
})
然后,定义你的 diff 函数,在这种情况下,我将使用他的答案report_diff
中的那个保持不变:
def report_diff(x):
return x[0] if x[0] == x[1] else '{} | {}'.format(*x)
然后,我将数据连接到MultiIndex数据帧:
df_all = pd.concat(
[df1.set_index('id'), df2.set_index('id')],
axis='columns',
keys=['df1', 'df2'],
join='outer'
)
df_all = df_all.swaplevel(axis='columns')[df1.columns[1:]]
最后,我要在每个列组中应用report_diff
:
df_final.groupby(level=0, axis=1).apply(lambda frame: frame.apply(report_diff, axis=1))
输出:
let num
111 a 1
222 b 2
333 c 3 | Three
444 d | D 4
555 e | nan 5 | nan
666 nan | f nan | 6
这就是全部!
答案 10 :(得分:1)
以下是使用select和merge的另一种方法:
In [6]: # first lets create some dummy dataframes with some column(s) different
...: df1 = pd.DataFrame({'a': range(-5,0), 'b': range(10,15), 'c': range(20,25)})
...: df2 = pd.DataFrame({'a': range(-5,0), 'b': range(10,15), 'c': [20] + list(range(101,105))})
In [7]: df1
Out[7]:
a b c
0 -5 10 20
1 -4 11 21
2 -3 12 22
3 -2 13 23
4 -1 14 24
In [8]: df2
Out[8]:
a b c
0 -5 10 20
1 -4 11 101
2 -3 12 102
3 -2 13 103
4 -1 14 104
In [10]: # make condition over the columns you want to comapre
...: condition = df1['c'] != df2['c']
...:
...: # select rows from each dataframe where the condition holds
...: diff1 = df1[condition]
...: diff2 = df2[condition]
In [11]: # merge the selected rows (dataframes) with some suffixes (optional)
...: diff1.merge(diff2, on=['a','b'], suffixes=('_before', '_after'))
Out[11]:
a b c_before c_after
0 -4 11 21 101
1 -3 12 22 102
2 -2 13 23 103
3 -1 14 24 104
以下是Jupyter屏幕截图中的相同内容:
答案 11 :(得分:0)
以下实现查找两个数据帧之间不对称差异的函数: (基于set difference for pandas) GIST:https://gist.github.com/oneryalcin/68cf25f536a25e65f0b3c84f9c118e03
def diff_df(df1, df2, how="left"):
"""
Find Difference of rows for given two dataframes
this function is not symmetric, means
diff(x, y) != diff(y, x)
however
diff(x, y, how='left') == diff(y, x, how='right')
Ref: https://stackoverflow.com/questions/18180763/set-difference-for-pandas/40209800#40209800
"""
if (df1.columns != df2.columns).any():
raise ValueError("Two dataframe columns must match")
if df1.equals(df2):
return None
elif how == 'right':
return pd.concat([df2, df1, df1]).drop_duplicates(keep=False)
elif how == 'left':
return pd.concat([df1, df2, df2]).drop_duplicates(keep=False)
else:
raise ValueError('how parameter supports only "left" or "right keywords"')
示例:
df1 = pd.DataFrame(d1)
Out[1]:
Comment Name isEnrolled score
0 He was late to class Jack True 2.17
1 Graduated Nick False 1.11
2 Zoe True 4.12
df2 = pd.DataFrame(d2)
Out[2]:
Comment Name isEnrolled score
0 He was late to class Jack True 2.17
1 On vacation Zoe True 4.12
diff_df(df1, df2)
Out[3]:
Comment Name isEnrolled score
1 Graduated Nick False 1.11
2 Zoe True 4.12
diff_df(df2, df1)
Out[4]:
Comment Name isEnrolled score
1 On vacation Zoe True 4.12
# This gives the same result as above
diff_df(df1, df2, how='right')
Out[22]:
Comment Name isEnrolled score
1 On vacation Zoe True 4.12
答案 12 :(得分:0)
如果您发现该线程试图比较测试中的数据帧,请查看assert_frame_equal
方法:https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.testing.assert_frame_equal.html
答案 13 :(得分:-1)
import pandas as pd
import numpy as np
df = pd.read_excel('D:\\HARISH\\DATA SCIENCE\\1 MY Training\\SAMPLE DATA & projs\\CRICKET DATA\\IPL PLAYER LIST\\IPL PLAYER LIST _ harish.xlsx')
df1= srh = df[df['TEAM'].str.contains("SRH")]
df2 = csk = df[df['TEAM'].str.contains("CSK")]
srh = srh.iloc[:,0:2]
csk = csk.iloc[:,0:2]
csk = csk.reset_index(drop=True)
csk
srh = srh.reset_index(drop=True)
srh
new = pd.concat([srh, csk], axis=1)
new.head()
** PLAYER TYPE PLAYER TYPE 0 David Warner Batsman ... MS Dhoni Captain 1 Bhuvaneshwar Kumar Bowler ... Ravindra Jadeja All-Rounder 2 Manish Pandey Batsman ... Suresh Raina All-Rounder 3 Rashid Khan Arman Bowler ... Kedar Jadhav All-Rounder 4 Shikhar Dhawan Batsman .... Dwayne Bravo All-Rounder