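I have a CSV file of monthly phone bills, in no particular order, that I read into a Pandas DataFrame. I want to add a column to each bill showing how much it differs from the previous bill for the same account. This CSV is only a subset of my data. My code works, but it is sloppy and very slow on CSV files with close to a million rows.
What should I do to make it more efficient?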
CSV:
Account Number,Bill Month,Bill Amount
4543,3/1/2015,300
4543,1/1/2015,100
4543,2/1/2015,200
2322,1/1/2015,22
2322,3/1/2015,38
2322,2/1/2015,25
Python:
import numpy as np
import pandas as pd

data = pd.read_csv('data.csv', low_memory=False)

# sort my data and reset the index so I can use index and index - 1 in the loop
data = data.sort_values(by=['Account Number', 'Bill Month'])
data = data.reset_index(drop=True)

# add a blank column for the difference (object dtype, since it holds numbers and "-")
data['Difference'] = pd.Series(np.nan, index=data.index, dtype=object)

for index, row in data.iterrows():
    # special handling for the first row so I don't get negative indexes
    if index == 0:
        data.loc[index, 'Difference'] = "-"
    else:
        # if the account in the current row and the row before are the same, then compare Bill Amounts
        if data.loc[index, 'Account Number'] == data.loc[index - 1, 'Account Number']:
            data.loc[index, 'Difference'] = data.loc[index, 'Bill Amount'] - data.loc[index - 1, 'Bill Amount']
        else:
            data.loc[index, 'Difference'] = "-"

print(data)
Desired output:
Account Number Bill Month Bill Amount Difference
0 2322 1/1/2015 22 -
1 2322 2/1/2015 25 3
2 2322 3/1/2015 38 13
3 4543 1/1/2015 100 -
4 4543 2/1/2015 200 100
5 4543 3/1/2015 300 100
Answer 0 (score: 1)
Try this:
In [37]: df = df.sort_values(['Account Number','Bill Month'])
In [38]: df['Difference'] = (df.groupby(['Account Number'])['Bill Amount']
....: .diff()
....: .fillna('-')
....: )
In [39]: df
Out[39]:
Account Number Bill Month Bill Amount Difference
3 2322 2015-01-01 22 -
5 2322 2015-02-01 25 3
4 2322 2015-03-01 38 13
1 4543 2015-01-01 100 -
2 4543 2015-02-01 200 100
0 4543 2015-03-01 300 100
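Explanation:
diff() is applied to each group separately. For each row it returns the difference between the current value and the previous value in the same group, so the first row of each group is NaN: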
In [123]: df.groupby(['Account Number'])['Bill Amount'].diff()
Out[123]:
3 NaN
5 3.0
4 13.0
1 NaN
2 100.0
0 100.0
dtype: float64
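To make the per-group behaviour concrete, here is a minimal sketch showing that the grouped diff() matches subtracting a grouped shift(1) by hand. The small frame is rebuilt from the question's sample data, already sorted, and the names `grouped`, `previous` and `manual` are only illustrative.

```python
import pandas as pd

# Sample bills from the question, already sorted by account and month.
df = pd.DataFrame({
    'Account Number': [2322, 2322, 2322, 4543, 4543, 4543],
    'Bill Amount': [22, 25, 38, 100, 200, 300],
})

grouped = df.groupby('Account Number')['Bill Amount']

# shift(1) moves each account's values down one row,
# so every row lines up with the previous bill of the same account.
previous = grouped.shift(1)
manual = df['Bill Amount'] - previous

# Compare with the built-in grouped diff(); NaNs in the same positions count as equal.
print(manual.equals(grouped.diff()))
```

Because the shift happens inside each group, the first bill of an account never gets compared with the last bill of the previous account.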
fillna('-') fills all NaNs with the specified value, '-':
In [124]: df.groupby(['Account Number'])['Bill Amount'].diff().fillna('-')
Out[124]:
3 -
5 3
4 13
1 -
2 100
0 100
dtype: object
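Putting the steps together, a self-contained sketch might look like the following. It assumes the CSV layout from the question, reads it from an in-memory string instead of 'data.csv' so it can be run as is, and parses 'Bill Month' as dates so that sorting is chronological. The name `CSV_TEXT` is just for illustration.

```python
import io
import pandas as pd

# Sample data copied from the question; a real run would read 'data.csv' instead.
CSV_TEXT = """Account Number,Bill Month,Bill Amount
4543,3/1/2015,300
4543,1/1/2015,100
4543,2/1/2015,200
2322,1/1/2015,22
2322,3/1/2015,38
2322,2/1/2015,25
"""

# Parse 'Bill Month' as dates so the sort is by date rather than by string.
df = pd.read_csv(io.StringIO(CSV_TEXT), parse_dates=['Bill Month'])

# Sort so each account's bills are adjacent and in date order.
df = df.sort_values(['Account Number', 'Bill Month']).reset_index(drop=True)

# Difference from the previous bill of the same account;
# the first bill of each account is left as NaN.
df['Difference'] = df.groupby('Account Number')['Bill Amount'].diff()

print(df)
```

This version leaves the NaNs in place, which keeps the column numeric (float64). Calling fillna('-') gives the output the question asked for, but as the dtype line above shows, it turns the column into object dtype, so it is probably best applied only when displaying or exporting.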
Answer 1 (score: 1)
df = pd.DataFrame({
    'Account Number': {0: 4543, 1: 4543, 2: 4543, 3: 2322, 4: 2322, 5: 2322},
    'Bill Amount': {0: 300.0, 1: 100.0, 2: 200.0, 3: 22.0, 4: 38.0, 5: 25.0},
    'Bill Month': {
        0: pd.Timestamp('2015-03-01 00:00:00'),
        1: pd.Timestamp('2015-01-01 00:00:00'),
        2: pd.Timestamp('2015-02-01 00:00:00'),
        3: pd.Timestamp('2015-01-01 00:00:00'),
        4: pd.Timestamp('2015-03-01 00:00:00'),
        5: pd.Timestamp('2015-02-01 00:00:00')}})
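You can group on the account number and bill month (groupby sorts by default), sum the bill amounts (or just take the first if there is guaranteed to be only one bill per month), group again on the first level of the index (the account number), and use diff to get the differences.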
>>> (df.groupby(['Account Number', 'Bill Month'])['Bill Amount']
...    .sum()
...    .groupby(level=0)
...    .diff())
Account Number Bill Month
2322 2015-01-01 NaN
2015-02-01 3
2015-03-01 13
4543 2015-01-01 NaN
2015-02-01 100
2015-03-01 100
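The result above is a Series indexed by account and month rather than a column on the original frame. One possible way to get back to a flat table, sketched here under the assumption that one row per account and month is what is wanted, is to place the monthly totals and their differences side by side and reset the index. The names `monthly`, `diffs` and `result` are illustrative.

```python
import pandas as pd

# Same sample data as above, written with lists for brevity.
df = pd.DataFrame({
    'Account Number': [4543, 4543, 4543, 2322, 2322, 2322],
    'Bill Month': pd.to_datetime(['2015-03-01', '2015-01-01', '2015-02-01',
                                  '2015-01-01', '2015-03-01', '2015-02-01']),
    'Bill Amount': [300.0, 100.0, 200.0, 22.0, 38.0, 25.0],
})

# Total billed per account and month; groupby sorts the keys by default.
monthly = df.groupby(['Account Number', 'Bill Month'])['Bill Amount'].sum()

# Difference from the previous month within each account (index level 0).
diffs = monthly.groupby(level=0).diff().rename('Difference')

# Both Series share the same MultiIndex, so they align row by row;
# reset_index turns the index levels back into ordinary columns.
result = pd.concat([monthly, diffs], axis=1).reset_index()

print(result)
```

Note that this produces a new, aggregated frame; if an account has several bills in one month, they are collapsed into a single row.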