Pandas equality check too slow to use

Asked: 2017-10-12 19:56:35

Tags: python performance pandas dataframe

I need to check which records changed from one DataFrame to another. A record must match on all columns.

One is an Excel file (new_df), the other is a SQL query (sql_df). The shape is roughly 20,000 rows by 39 columns. I thought this would be a good use for df.equals(other_df).

Currently I am using the following:

import pandas as pd
import numpy as np
new_df = pd.DataFrame({'ID': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
                       'B': [1, 0, 3, 5, 0, 0, np.nan, 9, 0, 0],
                       'C': [10, 0, 30, 50, 0, 0, 4, 10, 1, 3],
                       'D': [1, 0, 3, 4, 0, 0, 7, 8, 0, 1],
                       'E': ['Universtiy of New York', 'New Hampshire University', 'JMU', 'Oklahoma State', 'Penn State',
                             'New Mexico Univ', 'Rutgers', 'Indiana State', 'JMU', 'University of South Carolina']})

sql_df = pd.DataFrame({'ID': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
                       'B': [1, 0, 3, 5, 0, 0, np.nan, 9, 0, 0],
                       'C': [10, 0, 30, 50, 0, 0, 4, 10, 1, 0],
                       'D': [5, 0, 3, 4, 0, 0, 7, 8, 0, 1],
                       'E': ['Universtiy of New York', 'New Hampshire University', 'NYU', 'Oklahoma State', 'Penn State',
                             'New Mexico Univ', 'Rutgers', 'Indiana State', 'NYU', 'University of South Carolina']})

# creates an empty list to append to
differences = []
# for all the IDs in the dataframe that should not change, check if the record is the same in the database;
# must use reset_index() so equals() works as I expect it to;
# if it is not the same, append the Aspn ID that is failing, along with the columns that changed
for unique_id in new_df['ID'].tolist():
    # get the id from the list, and filter both the sql and new dfs down to this record
    new_rec = new_df.loc[new_df['ID'] == unique_id].reset_index(drop=True)
    sql_rec = sql_df.loc[sql_df['ID'] == unique_id].reset_index(drop=True)
    if not new_rec.equals(sql_rec):
        # if the record differs, check which columns differ using the same logic
        bad_columns = [column for column in new_df.columns
                       if not new_rec[column].equals(sql_rec[column])]
        differences.append([unique_id, bad_columns])

I later use differences and bad_columns to perform other tasks.

I want to avoid using so many loops, since that is probably the cause of my performance problem. It currently takes over 5 minutes for 20,000 records (varies with hardware), which is terrible performance. I was thinking of concatenating all columns into one long string per row and comparing those, but that seems like another inefficient approach. What is a better way to solve this / how can I avoid this messy append-to-an-empty-list solution?

2 answers:

Answer 0 (score: 4):

Compare the frames element-wise with the vectorized DataFrame.ne():

In [26]: new_df.ne(sql_df)
Out[26]:
       B      C      D      E     ID
0  False  False   True  False  False
1  False  False  False  False  False
2  False  False  False   True  False
3  False  False  False  False  False
4  False  False  False  False  False
5  False  False  False  False  False
6   True  False  False  False  False
7  False  False  False  False  False
8  False  False  False   True  False
9  False   True  False  False  False

Show the columns that differ:

In [27]: new_df.ne(sql_df).any(axis=0)
Out[27]:
B      True
C      True
D      True
E      True
ID    False
dtype: bool

Show the rows that differ:

In [28]: new_df.ne(sql_df).any(axis=1)
Out[28]:
0     True
1    False
2     True
3    False
4    False
5    False
6     True
7    False
8     True
9     True
dtype: bool
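The row and column masks above can be combined to rebuild the [unique_id, bad_columns] list from the question without the nested equals() loops. A sketch on a hypothetical miniature pair of frames (new_small / sql_small are stand-ins for new_df / sql_df):

```python
import pandas as pd

# hypothetical miniature versions of new_df / sql_df
new_small = pd.DataFrame({'ID': [0, 1, 2], 'C': [10, 0, 30], 'E': ['JMU', 'NYU', 'JMU']})
sql_small = pd.DataFrame({'ID': [0, 1, 2], 'C': [10, 0, 0], 'E': ['NYU', 'NYU', 'JMU']})

mask = new_small.ne(sql_small)  # element-wise "not equal" boolean frame
differences = [
    [new_small.loc[i, 'ID'], mask.columns[row.to_numpy()].tolist()]
    for i, row in mask.iterrows()
    if row.any()                # keep only rows with at least one difference
]
# differences == [[0, ['E']], [2, ['C']]]
```

The expensive part (the comparison) is a single vectorized call; only the final bookkeeping iterates, and only over the rows that actually differ.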

Update:

Show the differing cells:

In [86]: x = new_df.ne(sql_df)

In [87]: new_df[x].loc[x.any(1)]
Out[87]:
    B    C    D    E  ID
0 NaN  NaN  1.0  NaN NaN
2 NaN  NaN  NaN  JMU NaN
6 NaN  NaN  NaN  NaN NaN
8 NaN  NaN  NaN  JMU NaN
9 NaN  3.0  NaN  NaN NaN

In [88]: sql_df[x].loc[x.any(1)]
Out[88]:
    B    C    D    E  ID
0 NaN  NaN  5.0  NaN NaN
2 NaN  NaN  NaN  NYU NaN
6 NaN  NaN  NaN  NaN NaN
8 NaN  NaN  NaN  NYU NaN
9 NaN  0.0  NaN  NaN NaN
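One caveat the answer does not mention: ne() treats NaN as unequal to NaN, which is why row 6 is flagged in the row mask above even though both frames hold NaN in column B. A NaN-aware mask can be built like this (a sketch, not part of the original answer):

```python
import numpy as np
import pandas as pd

a = pd.Series([1.0, np.nan, 3.0])
b = pd.Series([1.0, np.nan, 4.0])

naive = a.ne(b)                          # NaN vs NaN counts as "different"
safe = a.ne(b) & ~(a.isna() & b.isna())  # treat a pair of NaNs as equal

# naive.tolist() == [False, True, True]
# safe.tolist()  == [False, False, True]
```

This matters here because df.equals() considers NaNs in the same location equal, so the loop-based solution and the ne()-based one disagree on rows like row 6.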

Answer 1 (score: 2):

Get a filtered DataFrame showing only the rows that have differences:

result_df = new_df[new_df != sql_df].dropna(how='all')

>>> result_df
Out[]:
    B    C    D    E  ID
0 NaN  NaN  1.0  NaN NaN
2 NaN  NaN  NaN  JMU NaN
8 NaN  NaN  NaN  JMU NaN
9 NaN  3.0  NaN  NaN NaN
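Note that row 6 does not appear in this output: in the masked frame, equal cells become NaN, and a NaN-vs-NaN "difference" is also NaN, so dropna(how='all') removes that row. A minimal sketch of the masking step on hypothetical one-column frames:

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({'v': [1.0, np.nan, 3.0]})
b = pd.DataFrame({'v': [1.0, np.nan, 4.0]})

masked = a[a != b]                 # unequal cells keep a's value, equal cells become NaN
result = masked.dropna(how='all')  # rows with no surviving values vanish, incl. NaN-vs-NaN rows
# result is a single row (index 2) with v == 3.0
```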

Get tuples of the ID and the column names where differences exist, which is the output you are trying to generate. This should work even if you have differences in multiple columns for the same ID:

result_df = result_df.set_axis(labels=new_df.ID[result_df.index], axis=0)

>>> result_df.apply(lambda x: (x.name, result_df.columns[x.notnull()]), axis=1)
Out[]:
ID
0    (0, [D])
2    (2, [E])
8    (8, [E])
9    (9, [C])
dtype: object

Note that apply is close to a for loop under the hood, so this second part will probably take more time than the first.
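As a possible vectorized alternative to the apply step (a sketch, not from the answer): stacking the boolean mask yields the (row, column) pairs of differing cells directly, again shown on a hypothetical miniature pair of frames:

```python
import pandas as pd

# hypothetical miniature versions of new_df / sql_df
new_small = pd.DataFrame({'ID': [0, 1, 2], 'C': [10, 0, 30], 'E': ['JMU', 'NYU', 'JMU']})
sql_small = pd.DataFrame({'ID': [0, 1, 2], 'C': [10, 0, 0], 'E': ['NYU', 'NYU', 'JMU']})

stacked = new_small.ne(sql_small).stack()  # MultiIndex (row, column) -> bool
pairs = stacked[stacked].index.tolist()    # keep only the True entries
# pairs == [(0, 'E'), (2, 'C')]
```

From these pairs, the per-ID column lists can be grouped with itertools.groupby or a plain dict, with no per-row apply.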