Let's start with a pandas DataFrame:
>>> df
Date
0 2006-01-30
1 2006-02-02
2 2006-02-03
3 2006-02-04
4 2006-02-21
5 2006-02-23
6 2006-03-07
7 2006-03-11
8 2006-04-24
9 2006-04-25
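I want to add a new column containing the number of dates that fall within the month before the date in that row. (Does that make sense?)
For example, for the date '2006-02-23' I want the number of dates between '2006-01-23' and '2006-02-22', inclusive: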
>>> df
Date Past_Month
0 2006-01-30 0
1 2006-02-02 1
2 2006-02-03 2
3 2006-02-04 3
4 2006-02-21 4
5 2006-02-23 5
6 2006-03-07 2
7 2006-03-11 3
8 2006-04-24 0
9 2006-04-25 1
Right now I can do this with the code below, but it runs very slowly on data of my size. What would be a more efficient way?
import pandas as pd

for i in range(len(df)):
    # Dates in the half-open window [date - 1 month, date)
    days = ((df['Date'] >= df['Date'][i] + pd.DateOffset(months=-1))
            & (df['Date'] < df['Date'][i]))
    df.loc[i, 'Past_Month'] = days.sum()
Answer 0 (score: 1)
You can try NumPy broadcasting:
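import numpy as np

# Start of each row's window: one month before the row's date
offset = df.Date + pd.DateOffset(months=-1)
# Row i counts the dates d with offset[i] <= d < Date[i] (>= matches the question's loop)
df['Past_Month'] = np.sum((df.Date.values >= offset.values[:, None])
                          & (df.Date.values < df.Date.values[:, None]),
                          axis=1)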
Output:
Date Past_Month
-- ------------------- ------------
0 2006-01-30 00:00:00 0
1 2006-02-02 00:00:00 1
2 2006-02-03 00:00:00 2
3 2006-02-04 00:00:00 3
4 2006-02-21 00:00:00 4
5 2006-02-23 00:00:00 5
6 2006-03-07 00:00:00 2
7 2006-03-11 00:00:00 3
8 2006-04-24 00:00:00 0
9 2006-04-25 00:00:00 1
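To make the result above easy to reproduce, here is a minimal self-contained sketch that rebuilds the sample frame from the question and applies the same broadcast comparison. The variable names `dates`, `start` and `in_window` are only illustrative, and the sketch uses `>=` for the start of the window, as the question's loop does.

```python
import numpy as np
import pandas as pd

# Sample dates copied from the question
df = pd.DataFrame({"Date": pd.to_datetime([
    "2006-01-30", "2006-02-02", "2006-02-03", "2006-02-04", "2006-02-21",
    "2006-02-23", "2006-03-07", "2006-03-11", "2006-04-24", "2006-04-25",
])})

dates = df["Date"].values                               # shape (n,)
start = (df["Date"] + pd.DateOffset(months=-1)).values  # window starts, shape (n,)

# in_window[i, j] is True when start[i] <= dates[j] < dates[i]
in_window = (dates >= start[:, None]) & (dates < dates[:, None])
df["Past_Month"] = in_window.sum(axis=1)

print(df["Past_Month"].tolist())  # [0, 1, 2, 3, 4, 5, 2, 3, 0, 1]
```

The printed counts match the Past_Month column the question asks for.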
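This does essentially the same thing as your code, but vectorized with NumPy. Broadcasting takes an array and virtually duplicates it along another dimension, without allocating memory for the copies (the comparison result itself is still a full n-by-n boolean array). Example: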
>>> import numpy as np
>>> ar = np.array([0,1,2,3])
>>> ar
array([0, 1, 2, 3])
>>> ar[:,None]
array([[0],
       [1],
       [2],
       [3]])
>>> # this compares every member of one array to every member of the other
>>> ar < ar[:,None]
array([[False, False, False, False],
       [ True, False, False, False],
       [ True,  True, False, False],
       [ True,  True,  True, False]])
>>> # once you have that, take the row sums, as in the code above
>>> np.sum(ar < ar[:,None], axis=1)
array([0, 1, 2, 3])
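One caveat for large inputs: the broadcast comparison materializes an n-by-n boolean array, so memory grows quadratically with the number of rows. If that becomes a problem, a different technique (not the broadcasting approach above) is to sort the dates once and use binary search with `np.searchsorted`. The sketch below assumes the Date column is already datetime64; the helper name `past_month_counts` is hypothetical.

```python
import numpy as np
import pandas as pd

def past_month_counts(dates: pd.Series) -> np.ndarray:
    """For each date, count the dates d with (date - 1 month) <= d < date."""
    values = dates.to_numpy()
    start = (dates + pd.DateOffset(months=-1)).to_numpy()
    ordered = np.sort(values)
    # Number of dates strictly before each row's own date
    upper = np.searchsorted(ordered, values, side="left")
    # Number of dates strictly before the start of each row's window
    lower = np.searchsorted(ordered, start, side="left")
    return upper - lower

# Sample dates copied from the question
df = pd.DataFrame({"Date": pd.to_datetime([
    "2006-01-30", "2006-02-02", "2006-02-03", "2006-02-04", "2006-02-21",
    "2006-02-23", "2006-03-07", "2006-03-11", "2006-04-24", "2006-04-25",
])})
df["Past_Month"] = past_month_counts(df["Date"])
print(df["Past_Month"].tolist())  # [0, 1, 2, 3, 4, 5, 2, 3, 0, 1]
```

This needs only O(n) extra memory and O(n log n) time, and because it sorts a copy of the values it does not require the column to be sorted beforehand.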
Hope that helps.