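I have two date columns, "StartDate" and "EndDate". I want to find the number of days in each month between the two dates, from December 2019 onwards, ignoring any earlier months in the calculation. Each row's StartDate and EndDate can span two years with overlapping months, and the date columns can also be empty.
Sample data: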
import pandas as pd
df = {'Id': ['1', '2', '3', '4', '5', '6', '7', '8'],
      'Item': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],
      'StartDate': ['2019-12-10', '2019-12-01', '2019-10-01', '2020-01-01', '2019-03-01', '2019-03-01', '2019-10-01', ''],
      'EndDate': ['2020-02-21', '2020-01-01', '2020-08-31', '2020-01-30', '2019-12-31', '2019-12-31', '2020-08-31', '']
      }
df = pd.DataFrame(df, columns=['Id', 'Item', 'StartDate', 'EndDate'])
Expected output:
The following solution partially works.
df['StartDate'] = pd.to_datetime(df['StartDate'])
df['EndDate'] = pd.to_datetime(df['EndDate'])
def days_of_month(x):
    s = pd.date_range(*x, freq='D').to_series()
    return s.resample('M').count().rename(lambda x: x.month)
df1 = df[['StartDate', 'EndDate']].apply(days_of_month, axis=1).fillna(0)
df_final = df[['StartDate', 'EndDate']].join([df['StartDate'].dt.year.rename('Year'), df1])
Answer 0 (score: 2)
Try this:
df.join(df.dropna(axis=0, how='any')
          .apply(lambda x: pd.date_range(x['StartDate'], x['EndDate'], freq='D')
                             .to_frame().resample('M').count().loc['2019-12-01':].unstack(), axis=1)[0].fillna(0))
Output:
Id Item StartDate EndDate 2019-12-31 00:00:00 2020-01-31 00:00:00 2020-02-29 00:00:00 2020-03-31 00:00:00 2020-04-30 00:00:00 2020-05-31 00:00:00 2020-06-30 00:00:00 2020-07-31 00:00:00 2020-08-31 00:00:00
0 1 A 2019-12-10 2020-02-21 22.0 31.0 21.0 0.0 0.0 0.0 0.0 0.0 0.0
1 2 B 2019-12-01 2020-01-01 31.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 3 C 2019-10-01 2020-08-31 31.0 31.0 29.0 31.0 30.0 31.0 30.0 31.0 31.0
3 4 D 2020-01-01 2020-01-30 0.0 30.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 5 E 2019-03-01 2019-12-31 31.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 6 F 2019-03-01 2019-12-31 31.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
6 7 G 2019-10-01 2020-08-31 31.0 31.0 29.0 31.0 30.0 31.0 30.0 31.0 31.0
7 8 H NaT NaT NaN NaN NaN NaN NaN NaN NaN NaN NaN
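To make the chained expression easier to follow, here is a minimal sketch of what happens for a single date range. The dates are made up for illustration, and it groups by monthly periods instead of calling resample('M'), which is a swapped-in technique that should give the same per-month counts.

```python
import pandas as pd

# Hypothetical single range, chosen so that it starts before December 2019.
days = pd.date_range('2019-10-15', '2020-01-05', freq='D').to_series()

# Count the days that fall in each calendar month.
per_month = days.groupby(days.dt.to_period('M')).count()

# Keep only the months from December 2019 onwards.
cutoff = pd.Period('2019-12', freq='M')
per_month = per_month[per_month.index >= cutoff]

print(per_month)
```

October and November are counted first and then dropped by the cutoff, which mirrors the .loc['2019-12-01':] slice in the answer.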
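Answer 1 (score: 2)
We'll create two large DataFrames, one holding the start of every month and the other the end of every month. We then clip them against your start and end dates, which reduces the problem to a simple subtraction. Because you want the end date to be counted, we add 1 day, and then we clip any negative durations to 0.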
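import pandas as pd
# Month-start dates, one column per month, for every row of df
df_s = pd.DataFrame([pd.date_range('2019-12-01', '2020-12-01', freq='MS').to_numpy()],
                    index=df.index)
# Matching month-end dates
df_e = df_s + pd.offsets.MonthEnd(1)
# Month starts cannot precede StartDate; month ends cannot exceed EndDate
df_s = df_s.clip(lower=pd.to_datetime(df.StartDate), axis=0)
df_e = df_e.clip(upper=pd.to_datetime(df.EndDate), axis=0)
# Add 1 day so the end date is counted, then floor negative spans at 0
res = ((df_e - df_s) + pd.to_timedelta(1, 'd')).clip(lower=pd.to_timedelta(0, 'd'))
res.columns = pd.period_range(start='2019-12', end='2020-12', freq='M')
# Convert the timedeltas to day counts (int, or float when NaN is present)
for col in res.columns:
    res[col] = res[col].dt.days
df = pd.concat([df, res], axis=1)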
Id Item StartDate EndDate 2019-12 2020-01 2020-02 2020-03 2020-04 2020-05 2020-06 2020-07 2020-08 2020-09 2020-10 2020-11 2020-12
0 1 A 2019-12-10 2020-02-21 22.0 31.0 21.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 2 B 2019-12-01 2020-01-31 31.0 31.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 3 C 2019-10-01 2020-08-31 31.0 31.0 29.0 31.0 30.0 31.0 30.0 31.0 31.0 0.0 0.0 0.0 0.0
3 4 D 2020-01-01 2020-01-30 0.0 30.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 5 E 2019-03-01 2019-12-31 31.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 6 F 2019-03-01 2019-12-31 31.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
6 7 G 2019-10-01 2020-08-31 31.0 31.0 29.0 31.0 30.0 31.0 30.0 31.0 31.0 0.0 0.0 0.0 0.0
7 8 H NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
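The clip-and-subtract arithmetic can be checked by hand on one row. The sketch below is not the answer's code: days_in_month is a hypothetical helper that applies the same idea to scalar timestamps, using the dates of the first sample row.

```python
import pandas as pd

def days_in_month(start, end, month):
    """Days of the inclusive range [start, end] that fall in the monthly Period `month`."""
    lo = max(start, month.start_time.normalize())  # clip the month start up to `start`
    hi = min(end, month.end_time.normalize())      # clip the month end down to `end`
    return max((hi - lo).days + 1, 0)              # +1 counts the end date; no negatives

start, end = pd.Timestamp('2019-12-10'), pd.Timestamp('2020-02-21')
months = pd.period_range('2019-12', '2020-03', freq='M')
counts = {str(m): days_in_month(start, end, m) for m in months}
print(counts)
```

For March 2020 the clipped end (2020-02-21) lies before the clipped start (2020-03-01), so the raw difference is negative and the final max turns it into 0, which is the role of the last clip in the answer.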
Answer 2 (score: 2)
Here is another approach: build the full list of days and count the overlaps using broadcasting:
dates = pd.date_range('2019-12-01', '2020-12-31', freq='D').values
(pd.DataFrame((df.StartDate.values <= dates[:, None])
              & (df.EndDate.values >= dates[:, None]),
              index=dates)
   .resample('M')
   .sum()
   .T
)
Output:
2019-12-31 00:00:00 2020-01-31 00:00:00 2020-02-29 00:00:00 2020-03-31 00:00:00 2020-04-30 00:00:00 2020-05-31 00:00:00 2020-06-30 00:00:00 2020-07-31 00:00:00 2020-08-31 00:00:00 2020-09-30 00:00:00 2020-10-31 00:00:00 2020-11-30 00:00:00 2020-12-31 00:00:00
-- --------------------- --------------------- --------------------- --------------------- --------------------- --------------------- --------------------- --------------------- --------------------- --------------------- --------------------- --------------------- ---------------------
0 22 31 21 0 0 0 0 0 0 0 0 0 0
1 31 1 0 0 0 0 0 0 0 0 0 0 0
2 31 31 29 31 30 31 30 31 31 0 0 0 0
3 0 30 0 0 0 0 0 0 0 0 0 0 0
4 31 0 0 0 0 0 0 0 0 0 0 0 0
5 31 0 0 0 0 0 0 0 0 0 0 0 0
6 31 31 29 31 30 31 30 31 31 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0 0 0 0
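A smaller version of the same broadcasting idea may help show the shapes involved. The three rows below are made-up data (the last one has missing dates), and the monthly sum uses groupby on periods rather than resample('M'); both choices are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical start and end dates; None becomes NaT.
starts = pd.to_datetime(['2019-12-10', '2019-10-01', None]).to_numpy()
ends = pd.to_datetime(['2020-01-05', '2019-12-03', None]).to_numpy()

# Every day in the window of interest.
days = pd.date_range('2019-12-01', '2020-01-31', freq='D')
day_col = days.to_numpy()[:, None]  # shape (n_days, 1)

# Broadcasting (n_days, 1) against (n_rows,) gives an (n_days, n_rows) boolean mask:
# True where that day lies inside that row's inclusive range.
mask = (starts <= day_col) & (ends >= day_col)

# Sum the True values per month, then transpose so each input row is a row again.
counts = pd.DataFrame(mask, index=days).groupby(days.to_period('M')).sum().T
print(counts)
```

Comparisons against NaT evaluate to False, so a row with missing dates ends up with 0 in every month instead of NaN, as in row 7 of the output above.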
Answer 3 (score: 1)
Using the same code, pass errors='coerce' to to_datetime, add dropna, and change the rename part:
df['StartDate'] = pd.to_datetime(df['StartDate'], errors='coerce')
df['EndDate'] = pd.to_datetime(df['EndDate'], errors='coerce')
def days_of_month(x):
    s = pd.date_range(*x, freq='D').to_series()
    return s.resample('M').count().rename(lambda x: x.to_period(freq='M'))
df1 = (df[['StartDate', 'EndDate']].dropna().apply(days_of_month, axis=1)
       .reindex(df.index).fillna(0))
df_final = df.join(df1)
Out[1205]:
Id Item StartDate EndDate 2019-03 2019-04 2019-05 2019-06 2019-07 \
0 1 A 2019-12-10 2020-02-21 0.0 0.0 0.0 0.0 0.0
1 2 B 2019-12-01 2020-01-01 0.0 0.0 0.0 0.0 0.0
2 3 C 2019-10-01 2020-08-31 0.0 0.0 0.0 0.0 0.0
3 4 D 2020-01-01 2020-01-30 0.0 0.0 0.0 0.0 0.0
4 5 E 2019-03-01 2019-12-31 31.0 30.0 31.0 30.0 31.0
5 6 F 2019-03-01 2019-12-31 31.0 30.0 31.0 30.0 31.0
6 7 G 2019-10-01 2020-08-31 0.0 0.0 0.0 0.0 0.0
7 8 H NaT NaT 0.0 0.0 0.0 0.0 0.0
2019-08 2019-09 2019-10 2019-11 2019-12 2020-01 2020-02 2020-03 \
0 0.0 0.0 0.0 0.0 22.0 31.0 21.0 0.0
1 0.0 0.0 0.0 0.0 31.0 1.0 0.0 0.0
2 0.0 0.0 31.0 30.0 31.0 31.0 29.0 31.0
3 0.0 0.0 0.0 0.0 0.0 30.0 0.0 0.0
4 31.0 30.0 31.0 30.0 31.0 0.0 0.0 0.0
5 31.0 30.0 31.0 30.0 31.0 0.0 0.0 0.0
6 0.0 0.0 31.0 30.0 31.0 31.0 29.0 31.0
7 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2020-04 2020-05 2020-06 2020-07 2020-08
0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0
2 30.0 31.0 30.0 31.0 31.0
3 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0 0.0
6 30.0 31.0 30.0 31.0 31.0
7 0.0 0.0 0.0 0.0 0.0
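The output above still contains the months before December 2019, which the question asks to ignore. One possible way to drop them is sketched below, assuming the monthly columns are labelled with pandas Period objects as in df1. The small counts frame is a hypothetical stand-in for df1, and the filter would be applied before joining back to df.

```python
import pandas as pd

# Hypothetical stand-in for df1: one column per month, labelled with Periods.
cols = pd.period_range('2019-10', '2020-01', freq='M')
counts = pd.DataFrame([[31, 30, 31, 31],
                       [0, 0, 22, 5]], columns=cols)

# Keep only the columns from December 2019 onwards.
cutoff = pd.Period('2019-12', freq='M')
trimmed = counts.loc[:, counts.columns >= cutoff]
print(trimmed)
```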