Question

Let's start with a pandas DataFrame:

>>> df
     Date
0    2006-01-30
1    2006-02-02
2    2006-02-03
3    2006-02-04
4    2006-02-21
5    2006-02-23
6    2006-03-07
7    2006-03-11
8    2006-04-24
9    2006-04-25
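For anyone who wants to reproduce this, here is a minimal sketch of how the frame might be built. It assumes the `Date` column is parsed to `datetime64`; the variable name `dates` is only for illustration.

```python
import pandas as pd

# Sample dates copied from the frame shown above
dates = ["2006-01-30", "2006-02-02", "2006-02-03", "2006-02-04", "2006-02-21",
         "2006-02-23", "2006-03-07", "2006-03-11", "2006-04-24", "2006-04-25"]

# Parse the strings so that date arithmetic and comparisons work
df = pd.DataFrame({"Date": pd.to_datetime(dates)})
```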

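I want to add a new column holding the number of dates that fall within the month preceding the date in that row. For example (does this make sense?):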

For the date "2006-02-23", I want the number of dates between "2006-01-23" and "2006-02-22" (inclusive).

>>> df
     Date          Past_Month
0    2006-01-30    0
1    2006-02-02    1
2    2006-02-03    2
3    2006-02-04    3
4    2006-02-21    4
5    2006-02-23    5
6    2006-03-07    2
7    2006-03-11    3
8    2006-04-24    0
9    2006-04-25    1

I can currently do this with the code below, but it is slow for the size of my data. What would be a more efficient approach?

for i in range(len(df)):
    # Rows whose date lies in [current date - 1 month, current date)
    days = ((df['Date'] >= df['Date'][i] + pd.DateOffset(months=-1))
            & (df['Date'] < df['Date'][i]))
    df.loc[i, 'Past_Month'] = days.sum()

Answer 1

You can try NumPy broadcasting:

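# Start of each row's window: one month before its date
offset = df.Date + pd.DateOffset(months=-1)
# Compare every date (columns) against every row's window (rows), then count per row;
# >= keeps the window start inclusive, as in the loop from the question
df['Past_Month'] = np.sum((df.Date.values >= offset.values[:, None])
                          & (df.Date.values < df.Date.values[:, None]),
                          axis=1)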

Output:

    Date                   Past_Month
--  -------------------  ------------
 0  2006-01-30 00:00:00             0
 1  2006-02-02 00:00:00             1
 2  2006-02-03 00:00:00             2
 3  2006-02-04 00:00:00             3
 4  2006-02-21 00:00:00             4
 5  2006-02-23 00:00:00             5
 6  2006-03-07 00:00:00             2
 7  2006-03-11 00:00:00             3
 8  2006-04-24 00:00:00             0
 9  2006-04-25 00:00:00             1
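To check the output above, here is a self-contained sketch of the broadcasting approach on the sample data. The names `d`, `start` and `in_window` are illustrative, and the window start is treated as inclusive to match the loop in the question.

```python
import numpy as np
import pandas as pd

dates = ["2006-01-30", "2006-02-02", "2006-02-03", "2006-02-04", "2006-02-21",
         "2006-02-23", "2006-03-07", "2006-03-11", "2006-04-24", "2006-04-25"]
df = pd.DataFrame({"Date": pd.to_datetime(dates)})

d = df["Date"].to_numpy()                                   # shape (n,)
start = (df["Date"] - pd.DateOffset(months=1)).to_numpy()   # window start per row

# Row i, column j is True when date j lies in [start[i], d[i])
in_window = (d >= start[:, None]) & (d < d[:, None])

# Count the True entries in each row
df["Past_Month"] = in_window.sum(axis=1)
```

On the sample data this should give the same `Past_Month` column as the loop in the question.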

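This does essentially the same thing as your code, but vectorized with NumPy. Broadcasting takes an array and virtually repeats it along another dimension without copying the underlying data. Example: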

ar = np.array([0,1,2,3])
ar

array([0, 1, 2, 3])

ar[:,None]

array([[0],
       [1],
       [2],
       [3]])

# then this compares every member of one array to every member of the other
ar < ar[:,None]

array([[False, False, False, False],
       [ True, False, False, False],
       [ True,  True, False, False],
       [ True,  True,  True, False]])

# now you have that, then you do the sum in your code
np.sum(ar < ar[:,None], axis=1)
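One caveat: the comparison itself still produces an n × n boolean array, so for very large frames memory may become the limit. A possible alternative, which is not part of the approach above, is to sort the dates once and use `np.searchsorted` to count how many fall before each window boundary. The function name `past_month_sorted` is hypothetical, and this is only a sketch under the same window definition, [date - 1 month, date).

```python
import numpy as np
import pandas as pd

def past_month_sorted(dates):
    """Count, for each date, the dates in [date - 1 month, date) via binary search."""
    dates = pd.Series(pd.to_datetime(dates)).reset_index(drop=True)
    values = dates.to_numpy()
    sorted_dates = np.sort(values)
    start = (dates - pd.DateOffset(months=1)).to_numpy()

    # Number of dates strictly before each window start
    lo = np.searchsorted(sorted_dates, start, side="left")
    # Number of dates strictly before each row's own date
    hi = np.searchsorted(sorted_dates, values, side="left")
    return hi - lo

dates = ["2006-01-30", "2006-02-02", "2006-02-03", "2006-02-04", "2006-02-21",
         "2006-02-23", "2006-03-07", "2006-03-11", "2006-04-24", "2006-04-25"]
counts = past_month_sorted(dates)
```

This needs O(n log n) time and O(n) memory, and it does not require the input to be sorted because the result is returned in the original row order.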

Hope that helps.
