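Pandas: calculate the number of days in each month between given start and end dates

Time: 2020-03-27 05:04:14

Tags: python pandas datetime

I have a pandas DataFrame with some start dates and end dates.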

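Given these start and end dates, I need to calculate how many days fall in each month between the start date and the end date. I can't figure out a good way to approach this. The result should add one column per month holding that row's day count; the sample data looks like this: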

ActualStartDate ActualEndDate
0   2019-06-30  2019-08-15
1   2019-09-01  2020-01-01
2   2019-08-28  2019-11-13

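Note that the actual DataFrame has about 1,500 rows with varying start and end dates. I'm open to a different output layout; the above is only meant to give an idea of what I need to accomplish. Thanks in advance for any help!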

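3 answers:

Answer 0 (score: 1)

The idea is to create monthly periods with DatetimeIndex.to_period from date_range and count them with Index.value_counts, then build a DataFrame with concat, replace missing values with DataFrame.fillna, and finally join the result to the original: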

L = {r.Index: pd.date_range(r.ActualStartDate, r.ActualEndDate).to_period('M').value_counts()
     for r in df.itertuples()}
df = df.join(pd.concat(L, axis=1).fillna(0).astype(int).T)
print(df)
  ActualStartDate ActualEndDate  2019-06  2019-07  2019-08  2019-09  2019-10  \
0      2019-06-30    2019-08-15        1       31       15        0        0   
1      2019-09-01    2020-01-01        0        0        0       30       31   
2      2019-08-28    2019-11-13        0        0        4       30       31   

   2019-11  2019-12  2020-01  
0        0        0        0  
1       30       31        1  
2       13        0        0  
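To make the core step easier to follow, here is a minimal sketch of what happens for a single row, assuming the first start/end pair from the question. The names `days`, `per_month`, and `counts` are illustrative only and do not come from the answer.

```python
import pandas as pd

# One row of the sample data: 2019-06-30 through 2019-08-15, both inclusive.
days = pd.date_range("2019-06-30", "2019-08-15", freq="D")

# Map every day to its calendar month, then count the days per month.
per_month = days.to_period("M").value_counts().sort_index()

# Convert to a plain dict keyed by "YYYY-MM" strings for easy inspection.
counts = {str(period): int(n) for period, n in per_month.items()}
print(counts)
```

The answer's dict comprehension repeats this for every row and lets concat line up the months across rows.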

Performance

df = pd.concat([df] * 1000, ignore_index=True)

In [44]: %%timeit
    ...: L = {r.Index: pd.date_range(r.ActualStartDate, r.ActualEndDate).to_period('M').value_counts()
    ...:      for r in df.itertuples()}
    ...: df.join(pd.concat(L, axis=1).fillna(0).astype(int).T)
    ...: 
689 ms ± 5.63 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [45]: %%timeit
    ...: df.join(
    ...:     df.apply(lambda v: pd.Series(pd.date_range(v['ActualStartDate'], v['ActualEndDate'], freq='D').to_period('M')), axis=1)
    ...:     .apply(pd.value_counts, axis=1)
    ...:     .fillna(0)
    ...:     .astype(int))
    ...:     
994 ms ± 5.17 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
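Both timed variants expand every range into individual days. A different technique, not taken from either answer and not benchmarked here, is to loop over the months instead of the rows and clip each interval to the month. The sketch below assumes the question's sample data and midnight timestamps; `months`, `counts`, and `result` are hypothetical names.

```python
import pandas as pd

df = pd.DataFrame({
    "ActualStartDate": pd.to_datetime(["2019-06-30", "2019-09-01", "2019-08-28"]),
    "ActualEndDate": pd.to_datetime(["2019-08-15", "2020-01-01", "2019-11-13"]),
})

# Every calendar month touched by at least one row.
months = pd.period_range(df["ActualStartDate"].min(), df["ActualEndDate"].max(), freq="M")

counts = {}
for month in months:
    # Clip each row's interval to this month (last day taken at midnight).
    first = df["ActualStartDate"].clip(lower=month.start_time)
    last = df["ActualEndDate"].clip(upper=month.end_time.normalize())
    # Inclusive day count; rows that do not overlap the month go negative, so floor at 0.
    counts[str(month)] = ((last - first).dt.days + 1).clip(lower=0)

result = df.join(pd.DataFrame(counts))
print(result)
```

Whether this is actually faster depends on the number of rows relative to the number of months spanned, so it is worth timing on the real data.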

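Answer 1 (score: 1)

This may not be the most efficient approach, but it shouldn't be too bad for around 1,500 rows. Expand the date range, convert it to monthly periods, take the counts of those, and join back to the original DataFrame, for example: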

res = df.join(
    df.apply(lambda v: pd.Series(pd.date_range(v['ActualStartDate'], v['ActualEndDate'], freq='D').to_period('M')), axis=1)
    .apply(pd.value_counts, axis=1)
    .fillna(0)
    .astype(int)
)

Which gives you:

  ActualStartDate ActualEndDate  2019-06  2019-07  2019-08  2019-09  2019-10  2019-11  2019-12  2020-01  2020-02  2020-03  2020-04  2020-05  2020-06  2020-07  2020-08  2020-09  2020-10  2020-11
0      2019-06-30    2020-08-15        1       31       31       30       31       30       31       31       29       31       30       31       30       31       15        0        0        0
1      2019-09-01    2020-01-01        0        0        0       30       31       30       31        1        0        0        0        0        0        0        0        0        0        0
2      2019-08-28    2020-11-13        0        0        4       30       31       30       31       31       29       31       30       31       30       31       31       30       31       13
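In recent pandas releases the top-level `pd.value_counts` function is deprecated, so the snippet above may warn or fail there. A sketch of the same pipeline using the `value_counts` method instead is shown below; it assumes the question's sample data, and the helper name `month_counts` is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "ActualStartDate": pd.to_datetime(["2019-06-30", "2019-09-01", "2019-08-28"]),
    "ActualEndDate": pd.to_datetime(["2019-08-15", "2020-01-01", "2019-11-13"]),
})

def month_counts(row):
    # Expand the row's range into days and count them per calendar month.
    days = pd.date_range(row["ActualStartDate"], row["ActualEndDate"], freq="D")
    counts = days.to_period("M").value_counts()
    # Use "YYYY-MM" strings as labels so they become plain column names.
    return counts.rename(index=str)

per_row = df.apply(month_counts, axis=1).fillna(0).astype(int).sort_index(axis=1)
res = df.join(per_row)
print(res)
```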

Answer 2 (score: 0)

import pandas as pd
import calendar

date_info = pd.DataFrame({
    'ActualStartDate': [
        pd.Timestamp('2019-06-30'),
        pd.Timestamp('2019-09-01'),
        pd.Timestamp('2019-08-28'),
    ],
    'ActualEndDate': [
        pd.Timestamp('2019-08-15'),
        pd.Timestamp('2020-01-01'),
        pd.Timestamp('2019-11-13'),
    ]
})

# ============================================================

result = {}  # collect per-row counts in a dict first, since there may be many month columns
for index, timepair in date_info.iterrows():
    start = timepair['ActualStartDate']
    end = timepair['ActualEndDate']

    current = start
    result[index] = {}  # days per month for this start/end pair
    while current <= end:
        # last day to count in the current month: the month end or the end date,
        # whichever comes first
        _, days_in_month = calendar.monthrange(current.year, current.month)
        last = min(current.replace(day=days_in_month), end)
        # both current and last are counted, hence the + 1
        delta = (last - current).days + 1

        # zero-pad the month so the keys sort chronologically
        result[index]['%d-%02d' % (current.year, current.month)] = delta
        current += pd.Timedelta(delta, unit='d')

# optionally, store the result back in the DataFrame
columns = set()
for value in result.values():
    columns.update(value.keys())

for col in sorted(columns):  # sorted so the month columns appear in a stable order
    date_info[col] = 0

for index, delta in result.items():
    for date, days in delta.items():
        date_info.loc[index, date] = days

print(date_info)