Question

I want to sum all the values in one column based on a date range defined by two other columns:

Start_Date  Value_to_sum  End_date
2017-12-13    2          2017-12-13
2017-12-13    3          2017-12-16 
2017-12-14    4          2017-12-15
2017-12-15    2          2017-12-15

A simple groupby won't do this, because it only adds up the values for one specific date.

We could use a nested for loop, but it takes forever to run:

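import pandas as pd
from tqdm import tqdm

# `data` is the DataFrame shown above
unique_date = data.Start_Date.unique()
carry = pd.DataFrame({'Date': unique_date})
carry['total'] = 0
for n in tqdm(range(len(carry))):
    # keep rows whose range has started on or before this date
    tr = data.loc[data['Start_Date'] <= carry['Date'][n]]
    for i in tr.index:
        # ...and whose range has not ended before this date
        if carry['Date'][n] <= tr['End_date'][i]:
            carry.loc[n, 'total'] += tr['Value_to_sum'][i]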

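Something like this would work, but as I said, it takes forever.

The expected output is the unique dates, with the total for each day.

That would be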

2017-12-13 = 5, 2017-12-14 = 7, 2017-12-15 = 9.

How can I compute the sum based on the date range?

Answer 1

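Unfortunately, I don't believe this can be done without at least one loop. You are checking whether a date falls between the start date and the end date, and if it does, you want to sum the Value_to_sum column. We can make your loop more efficient, though.

You can create a mask for each unique date that finds all the rows matching the condition, then apply that mask and take the sum of the matching rows. This should be much faster than iterating over every row individually and deciding which date counter to increment.

unique_date = df.Start_Date.unique()
for d in unique_date:
    # create a mask which will give us all the rows
    # that we want to sum over
    m = (df.Start_Date <= d) & (df.End_date >= d)
    # then apply the mask and take the sum of the Value_to_sum column
    print(d, df[m].Value_to_sum.sum())

This gives you the desired output:

2017-12-13 5
2017-12-14 7
2017-12-15 9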

Someone else might come up with a clever way to vectorize the whole thing, but I haven't found one.
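For what it's worth, one possible way to vectorize the mask is NumPy broadcasting: compare every unique date against every row at once, then sum with a matrix product. This is only a sketch, not part of the original answer. The sample frame is rebuilt from the table in the question with the columns converted to datetimes, and the names covers and totals are illustrative. It builds an array of shape (number of dates, number of rows), so memory use grows quickly on large inputs.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question's table (assumed to be datetimes).
df = pd.DataFrame({
    "Start_Date": pd.to_datetime(["2017-12-13", "2017-12-13", "2017-12-14", "2017-12-15"]),
    "Value_to_sum": [2, 3, 4, 2],
    "End_date": pd.to_datetime(["2017-12-13", "2017-12-16", "2017-12-15", "2017-12-15"]),
})

start = df["Start_Date"].to_numpy()
end = df["End_date"].to_numpy()
values = df["Value_to_sum"].to_numpy()

# np.unique returns the distinct start dates in sorted order.
dates = np.unique(start)

# covers[i, j] is True when the range of row j contains dates[i].
covers = (start[None, :] <= dates[:, None]) & (end[None, :] >= dates[:, None])

# Each row of the boolean matrix selects the values to add for one date.
totals = pd.Series(covers @ values, index=pd.DatetimeIndex(dates), name="total")
print(totals)
```

The boolean matrix plays the role of the per-date mask m from the loop above, just computed for all dates in a single step.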

Answer 2

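If you want the sum to be part of the original dataframe, you can use apply to iterate over each row (though this is probably not the most optimized code, since the sum is recomputed for every row):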

carry['total'] = carry.apply(lambda current_row: carry.loc[(carry['Start_Date'] <= current_row.Start_Date) & (carry['End_date'] >= current_row.Start_Date)].Value_to_sum.sum(),axis=1)

which results in

>>> print(carry)
     End_date  Start_Date  Value_to_sum  total
0  2017-12-13  2017-12-13             2      5
1  2017-12-16  2017-12-13             3      5
2  2017-12-15  2017-12-14             4      7
3  2017-12-15  2017-12-15             2      9
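Since rows sharing a Start_Date get the same total, one way to avoid the repeated work might be to compute the sum once per unique start date and map it back onto the rows. The following is a sketch under that assumption, not the answerer's code. The frame is rebuilt from the question's table, and the dictionary name totals is illustrative.

```python
import pandas as pd

# Sample data rebuilt from the question's table (assumed to be datetimes).
carry = pd.DataFrame({
    "Start_Date": pd.to_datetime(["2017-12-13", "2017-12-13", "2017-12-14", "2017-12-15"]),
    "Value_to_sum": [2, 3, 4, 2],
    "End_date": pd.to_datetime(["2017-12-13", "2017-12-16", "2017-12-15", "2017-12-15"]),
})

# One masked sum per distinct start date instead of one per row.
totals = {
    d: carry.loc[
        (carry["Start_Date"] <= d) & (carry["End_date"] >= d), "Value_to_sum"
    ].sum()
    for d in carry["Start_Date"].drop_duplicates()
}

# Broadcast the per-date totals back to every row.
carry["total"] = carry["Start_Date"].map(totals)
print(carry)
```

The saving only matters when many rows share the same start date; with all-distinct dates it does the same amount of work as the apply version.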

Answer 3

First, group by ["Start_Date", "End_date"] to save some operations.

from collections import Counter
import pandas as pd

c = Counter()
# collapse rows that share the same (Start_Date, End_date) pair
df_g = df.groupby(["Start_Date", "End_date"]).sum().reset_index()

def my_counter(row):
    s, v, e = row.Start_Date, row.Value_to_sum, row.End_date
    if s == e:
        # single-day range (the freq= argument of pd.Timestamp was removed in pandas 2.0)
        c[pd.Timestamp(s)] += v
    else:
        # add v to every day in the inclusive range [s, e]
        c.update({date: v for date in pd.date_range(s, e)})

df_g.apply(my_counter, axis=1)
print(c)
# Output on pandas >= 2.0 (older versions also show freq='D'):
"""
Counter({Timestamp('2017-12-15 00:00:00'): 9,
     Timestamp('2017-12-14 00:00:00'): 7,
     Timestamp('2017-12-13 00:00:00'): 5,
     Timestamp('2017-12-16 00:00:00'): 3})
"""

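Tools used:

Counter.update([iterable-or-mapping]): Elements are counted from an iterable or added-in from another mapping (or counter). Like dict.update() but add counts instead of replacing them. Also, the iterable is expected to be a sequence of elements, not a sequence of (key, value) pairs. - quoted from the Python 3 Documentation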

pandas.date_range
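
As a small illustration of why these two tools fit together, the sketch below shows Counter.update adding to existing counts (where dict.update would overwrite them) and pandas.date_range expanding an inclusive range into individual days. The keys and values are made up for the example.

```python
from collections import Counter
import pandas as pd

c = Counter({"a": 1})
# A mapping passed to update() is added to the existing counts.
c.update({"a": 2, "b": 5})
print(c["a"], c["b"])

d = {"a": 1}
# dict.update() replaces the stored value instead.
d.update({"a": 2, "b": 5})
print(d["a"], d["b"])

# date_range includes both endpoints, one entry per day by default.
days = pd.date_range("2017-12-13", "2017-12-16")
print(len(days))
```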

Summing based on date ranges in two separate columns
