Efficiently comparing data across rows in a Pandas DataFrame

Time: 2016-05-03 21:31:16

Tags: python python-2.7 pandas

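I have a CSV file of monthly phone bills, which I read into a Pandas DataFrame in no particular order. I want to add a column to each bill showing how much it differs from the previous bill for the same account. This CSV is just a subset of my data. My code works, but it is sloppy and very slow on a CSV file with close to a million rows.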

What should I do to make this more efficient?

CSV:

Account Number,Bill Month,Bill Amount
4543,3/1/2015,300
4543,1/1/2015,100
4543,2/1/2015,200
2322,1/1/2015,22
2322,3/1/2015,38
2322,2/1/2015,25

Python:

import numpy as np
import pandas as pd
data = pd.read_csv('data.csv', low_memory=False)

# sort my data and reset the index so I can use index and index - 1 in the loop
data = data.sort_values(by=['Account Number', 'Bill Month'])
data = data.reset_index(drop=True)

# add a blank column for the difference
data['Difference'] = np.nan

for index, row in data.iterrows():

    # special handling for the first row so I don't get negative indexes
    if index == 0:
        data.ix[index, 'Difference'] = "-"
    else:
        # if the account in the current row and the row before are the same, then compare Bill Amounts
        if data.ix[index, 'Account Number'] == data.ix[index - 1, 'Account Number']:
            data.ix[index, 'Difference'] = data.ix[index, 'Bill Amount'] - data.ix[index - 1, 'Bill Amount']
        else:
            data.ix[index, 'Difference'] = "-"

print data

Desired output:

   Account Number Bill Month  Bill Amount Difference
0            2322   1/1/2015           22          -
1            2322   2/1/2015           25          3
2            2322   3/1/2015           38         13
3            4543   1/1/2015          100          -
4            4543   2/1/2015          200        100
5            4543   3/1/2015          300        100

2 answers:

Answer 0 (score: 1)

Try this:

In [37]: df = df.sort_values(['Account Number','Bill Month'])

In [38]: df['Difference'] = (df.groupby(['Account Number'])['Bill Amount']
   ....:                       .diff()
   ....:                       .fillna('-')
   ....:                    )

In [39]: df
Out[39]:
   Account Number Bill Month  Bill Amount Difference
3            2322 2015-01-01           22          -
5            2322 2015-02-01           25          3
4            2322 2015-03-01           38         13
1            4543 2015-01-01          100          -
2            4543 2015-02-01          200        100
0            4543 2015-03-01          300        100
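
For reference, here is a self-contained version of the above as a minimal sketch. It assumes the sample CSV from the question is inlined through io.StringIO (the question reads 'data.csv' instead) and that Bill Month is parsed as a date via parse_dates, so the sort is chronological rather than lexical on strings such as '10/1/2015'. NaN is kept here instead of '-' so the column stays numeric.

```python
import io

import pandas as pd

# Sample data from the question, inlined so the sketch runs on its own.
csv_text = """Account Number,Bill Month,Bill Amount
4543,3/1/2015,300
4543,1/1/2015,100
4543,2/1/2015,200
2322,1/1/2015,22
2322,3/1/2015,38
2322,2/1/2015,25
"""

# parse_dates makes 'Bill Month' a datetime column, so sorting is chronological.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=['Bill Month'])

df = df.sort_values(['Account Number', 'Bill Month'])

# Per-account difference from the previous bill; the first bill of each
# account has no predecessor and gets NaN.
df['Difference'] = df.groupby('Account Number')['Bill Amount'].diff()

print(df)
```

Because the whole column is computed in one grouped operation, there is no Python-level loop over rows, which is where the original code spends most of its time.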

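Explanation:

diff() is applied to each group separately; it returns the difference between the current value and the previous value within the group, and NaN for the first row of each group: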

In [123]: df.groupby(['Account Number'])['Bill Amount'].diff()
Out[123]:
3      NaN
5      3.0
4     13.0
1      NaN
2    100.0
0    100.0
dtype: float64
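
One way to check what the grouped diff() computes is to write the same thing with shift(). The following is a small sketch on hypothetical, already-sorted values; the names via_diff and via_shift are illustrative only.

```python
import pandas as pd

# Hypothetical, already-sorted data for illustration.
df = pd.DataFrame({
    'Account Number': [2322, 2322, 2322, 4543, 4543, 4543],
    'Bill Amount': [22, 25, 38, 100, 200, 300],
})

grouped = df.groupby('Account Number')['Bill Amount']

# diff() within each group ...
via_diff = grouped.diff()

# ... is the current value minus the previous value in the same group.
via_shift = df['Bill Amount'] - grouped.shift()

# Show both results side by side for comparison.
print(pd.DataFrame({'via_diff': via_diff, 'via_shift': via_shift}))
```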

fillna('-') fills all NaN values with the specified value, here '-':

In [124]: df.groupby(['Account Number'])['Bill Amount'].diff().fillna('-')
Out[124]:
3      -
5      3
4     13
1      -
2    100
0    100
dtype: object
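
Note the dtype: object in the output above: once '-' is mixed in, the column holds strings alongside numbers and is no longer numeric. If further arithmetic on the differences is needed, one option (a sketch, not part of the original answer; the 'Difference Display' column name is hypothetical) is to keep the NaN values and build a separate text column only for display.

```python
import pandas as pd

# Hypothetical, already-sorted data for illustration.
df = pd.DataFrame({
    'Account Number': [2322, 2322, 2322, 4543, 4543, 4543],
    'Bill Amount': [22, 25, 38, 100, 200, 300],
})

# Numeric column: NaN marks the first bill of each account.
df['Difference'] = df.groupby('Account Number')['Bill Amount'].diff()

# Text column for display only: '-' where there is no previous bill.
df['Difference Display'] = [
    '-' if pd.isna(v) else str(int(v)) for v in df['Difference']
]

# The numeric column still supports aggregation (NaN is skipped by default).
print(df['Difference'].mean())
print(df)
```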

Answer 1 (score: 1)

import pandas as pd

df = pd.DataFrame({
    'Account Number': {0: 4543, 1: 4543, 2: 4543, 3: 2322, 4: 2322, 5: 2322},
    'Bill Amount': {0: 300.0, 1: 100.0, 2: 200.0, 3: 22.0, 4: 38.0, 5: 25.0},
    'Bill Month': {
        0: pd.Timestamp('2015-03-01 00:00:00'),
        1: pd.Timestamp('2015-01-01 00:00:00'),
        2: pd.Timestamp('2015-02-01 00:00:00'),
        3: pd.Timestamp('2015-01-01 00:00:00'),
        4: pd.Timestamp('2015-03-01 00:00:00'),
        5: pd.Timestamp('2015-02-01 00:00:00')}})

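You can group by account number and bill month (groupby sorts the keys by default), sum the bill amounts (or just take the first one if there is guaranteed to be only one bill per month), then group again on the first index level (account number) and use diff to get the differences.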

>>> (df.groupby(['Account Number', 'Bill Month'])['Bill Amount']
       .sum()
       .groupby(level=0)
       .diff())
Account Number  Bill Month
2322            2015-01-01    NaN
                2015-02-01      3
                2015-03-01     13
4543            2015-01-01    NaN
                2015-02-01    100
                2015-03-01    100
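
To see why the sum() step matters, and how to get back to a flat table, here is a minimal sketch. The data is hypothetical: account 2322 has two bills in February 2015, which the first groupby collapses into one monthly total before diff() runs. The names monthly, difference and result are illustrative only.

```python
import pandas as pd

# Hypothetical data: account 2322 has two separate bills in February.
df = pd.DataFrame({
    'Account Number': [4543, 4543, 2322, 2322, 2322, 2322],
    'Bill Month': pd.to_datetime([
        '2015-02-01', '2015-01-01',
        '2015-01-01', '2015-02-01', '2015-02-01', '2015-03-01',
    ]),
    'Bill Amount': [200.0, 100.0, 22.0, 10.0, 15.0, 38.0],
})

# One row per (account, month); groupby sorts the keys by default.
monthly = df.groupby(['Account Number', 'Bill Month'])['Bill Amount'].sum()

# Difference from the previous month within each account (index level 0).
difference = monthly.groupby(level=0).diff()

# Combine both series into a flat DataFrame.
result = pd.DataFrame({
    'Bill Amount': monthly,
    'Difference': difference,
}).reset_index()

print(result)
```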