Question

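I'm writing a process that takes a semi-large file as input (about 4 million rows, 5 columns) and performs a few operations on it. The columns are:
- CARD_NO
- ID
- CREATED_DATE
- STATUS
- FLAG2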

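I need to create a file that contains one copy of each CARD_NO where STATUS = '1' and CREATED_DATE is the maximum of all CREATED_DATEs for that CARD_NO.
I got it working, but my solution is very slow (more than 3 hours at the moment).
Here is my code: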

import pandas as pd

file = 'input.csv'
input = pd.read_csv(file)

input = input.drop_duplicates()


card_groups = input.groupby('CARD_NO', as_index=False, sort=False).filter(lambda x: x['STATUS'] == 1)


def important(x):
    latest_date = x['CREATED_DATE'].values[x['CREATED_DATE'].values.argmax()]
    return x[x.CREATED_DATE == latest_date]

#where the major slowdown occurs
group_2 = card_groups.groupby('CARD_NO', as_index=False, sort=False).apply(important)

path = 'result.csv'
group_2.to_csv(path, sep=',', index=False)
# ~4 minutes for the 154k rows file
# 3+ hours for ~4m rows

Do you have any suggestions on how to improve the running time of this little process? Thank you, and have a nice day.

Answer 1

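Setup (FYI, make sure you pass parse_dates when reading the CSV; parse_dates=True only tries to parse the index, so name the date column explicitly, e.g. parse_dates=['CREATED_DATE']):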

# Assumes: import numpy as np; import pandas as pd; from pandas import DataFrame, date_range
In [6]: n_groups = 10000

In [7]: N = 4000000

In [8]: dates = date_range('20130101',periods=100)

In [9]: df = DataFrame(dict(id = np.random.randint(0,n_groups,size=N), status = np.random.randint(0,10,size=N), date=np.random.choice(dates,size=N,replace=True)))

In [10]: pd.set_option('max_rows',10)

In [13]: df = DataFrame(dict(card_no = np.random.randint(0,n_groups,size=N), status = np.random.randint(0,10,size=N), date=np.random.choice(dates,size=N,replace=True)))

In [14]: df
Out[14]: 
         card_no       date  status
0           5790 2013-02-11       6
1           6572 2013-03-17       6
2           7764 2013-02-06       3
3           4905 2013-04-01       3
4           3871 2013-04-08       1
...          ...        ...     ...
3999995     1891 2013-02-16       5
3999996     9048 2013-01-11       9
3999997     1443 2013-02-23       1
3999998     2845 2013-01-28       0
3999999     5645 2013-02-05       8

[4000000 rows x 3 columns]

In [15]: df.dtypes
Out[15]: 
card_no             int64
date       datetime64[ns]
status              int64
dtype: object
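
To get a datetime column like the one above out of a real CSV, the date column has to be parsed on read. A minimal sketch, assuming the question's column names and a made-up three-row sample (the sample data is hypothetical, not from the original file):

```python
import io

import pandas as pd

# Hypothetical sample using the question's column names.
csv_text = """CARD_NO,ID,CREATED_DATE,STATUS,FLAG2
1001,1,2013-02-11,1,0
1001,2,2013-03-17,1,0
1002,3,2013-02-06,3,1
"""

# Without parse_dates the date column is read as plain strings.
raw = pd.read_csv(io.StringIO(csv_text))

# Naming the column in parse_dates asks read_csv to convert it to datetimes.
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["CREATED_DATE"])

print(raw["CREATED_DATE"].dtype)
print(parsed["CREATED_DATE"].dtype)
```

With a proper datetime column, max() and comparisons work on dates rather than on strings.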

Select only the rows with status == 1, group by card_no, and return the maximum date for each group:

In [18]: df[df.status==1].groupby('card_no')['date'].max()
Out[18]: 
card_no
0         2013-04-06
1         2013-03-30
2         2013-04-09
...
9997      2013-04-07
9998      2013-04-07
9999      2013-04-09
Name: date, Length: 10000, dtype: datetime64[ns]

In [19]: %timeit df[df.status==1].groupby('card_no')['date'].max()
1 loops, best of 3: 934 ms per loop
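
If you need the whole row for each card rather than just the date, one option is idxmax, which returns the index label of the latest row per group. This is only a sketch on hypothetical sample data with the question's column names; it assumes "latest" means the latest date among the STATUS == 1 rows, and on ties it keeps the first matching row:

```python
import pandas as pd

# Hypothetical sample; column names follow the question.
df = pd.DataFrame({
    "CARD_NO": [1, 1, 1, 2, 2, 3],
    "ID": [10, 11, 12, 20, 21, 30],
    "CREATED_DATE": pd.to_datetime([
        "2013-01-05", "2013-03-01", "2013-04-01",
        "2013-02-01", "2013-01-15", "2013-01-20",
    ]),
    "STATUS": [1, 1, 2, 1, 1, 3],
    "FLAG2": [0, 0, 0, 0, 0, 0],
})

# Keep only the STATUS == 1 rows before grouping.
active = df[df["STATUS"] == 1]

# Index label of the row with the latest CREATED_DATE for each CARD_NO.
latest_idx = active.groupby("CARD_NO")["CREATED_DATE"].idxmax()

# Select those rows: one full row per card.
latest = active.loc[latest_idx]
print(latest)
```

Card 3 has no STATUS == 1 row, so it does not appear in the result. The frame can then be written out with to_csv as in the question.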

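If you need a transform of this (i.e. the same value repeated for every row of its group): note that with pandas < 0.14.1 (being released this week) you will need to use the solution here, otherwise this will be quite slow.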

In [20]: df[df.status==1].groupby('card_no')['date'].transform('max')
Out[20]: 
4    2013-04-10
13   2013-04-10
25   2013-04-10
...
3999973   2013-04-10
3999979   2013-04-10
3999997   2013-04-09
Name: date, Length: 399724, dtype: datetime64[ns]

In [21]: %timeit df[df.status==1].groupby('card_no')['date'].transform('max')
1 loops, best of 3: 1.8 s per loop
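
The transformed Series has the same index as the filtered frame, so it can be compared directly against the date column. That reproduces what the question's important() function does (keep every row whose date equals the group maximum, ties included) without a Python-level apply. A sketch on hypothetical sample data with the question's column names:

```python
import pandas as pd

# Hypothetical sample; card 1 has two rows tied on the latest date.
df = pd.DataFrame({
    "CARD_NO": [1, 1, 1, 2, 2],
    "ID": [10, 11, 12, 20, 21],
    "CREATED_DATE": pd.to_datetime([
        "2013-03-01", "2013-03-01", "2013-01-05",
        "2013-02-01", "2013-04-01",
    ]),
    "STATUS": [1, 1, 1, 1, 0],
})

active = df[df["STATUS"] == 1]

# Per-row copy of each card's latest date, aligned with active's index.
group_max = active.groupby("CARD_NO")["CREATED_DATE"].transform("max")

# Rows whose date equals their group's maximum (ties are all kept).
latest_rows = active[active["CREATED_DATE"] == group_max]

# If exactly one row per card is wanted, drop the extra tied rows.
one_per_card = latest_rows.drop_duplicates(subset="CARD_NO")
print(one_per_card)
```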

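I suspect you want to merge the final transform back into the original frame (here res holds the transform result from In [20]):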

In [24]: df.join(res.to_frame('max_date'))
Out[24]: 
         card_no       date  status   max_date
0           5790 2013-02-11       6        NaT
1           6572 2013-03-17       6        NaT
2           7764 2013-02-06       3        NaT
3           4905 2013-04-01       3        NaT
4           3871 2013-04-08       1 2013-04-10
...          ...        ...     ...        ...
3999995     1891 2013-02-16       5        NaT
3999996     9048 2013-01-11       9        NaT
3999997     1443 2013-02-23       1 2013-04-09
3999998     2845 2013-01-28       0        NaT
3999999     5645 2013-02-05       8        NaT

[4000000 rows x 4 columns]

In [25]: %timeit df.join(res.to_frame('max_date'))
10 loops, best of 3: 58.8 ms per loop
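
The NaT values above come from index alignment: join matches rows by index label, and res only has labels for the status == 1 rows. A small sketch on hypothetical data that shows the effect:

```python
import pandas as pd

# Hypothetical four-row frame using the answer's column names.
df = pd.DataFrame({
    "card_no": [1, 1, 2, 2],
    "date": pd.to_datetime(
        ["2013-01-01", "2013-02-01", "2013-03-01", "2013-01-10"]
    ),
    "status": [1, 0, 1, 1],
})

# Per-row group maximum, computed on the status == 1 rows only.
res = df[df.status == 1].groupby("card_no")["date"].transform("max")

# join aligns on the index; row 1 (status 0) has no match and gets NaT.
joined = df.join(res.to_frame("max_date"))
print(joined)
```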

Relative to this, writing the CSV actually takes a fair amount of time. I use HDF5 for things like this; it is much faster.

Optimizing Pandas groupby / apply
