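I have a pandas DataFrame with some start dates and end dates. Given these start and end dates, I need to count how many days fall in each month between the start date and the end date. I can't find a good way to do this. The DataFrame looks like this: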
  ActualStartDate ActualEndDate
0      2019-06-30    2019-08-15
1      2019-09-01    2020-01-01
2      2019-08-28    2019-11-13
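Note that the actual DataFrame has about 1,500 rows with varying start and end dates. I'm open to a different output layout; the sample above should give you an idea of what I need to accomplish. Thanks in advance for your help!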
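For anyone who wants to reproduce the problem, the sample frame above can be rebuilt roughly as follows. This is only a sketch: the question does not show how the frame was created, so parsing the strings with `pd.to_datetime` (and therefore the datetime dtype) is an assumption.

```python
import pandas as pd

# Sample data copied from the question. Parsing the strings with
# pd.to_datetime is an assumption about how the real frame is typed.
df = pd.DataFrame({
    "ActualStartDate": pd.to_datetime(["2019-06-30", "2019-09-01", "2019-08-28"]),
    "ActualEndDate": pd.to_datetime(["2019-08-15", "2020-01-01", "2019-11-13"]),
})

print(df)
```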
Answer 0 (score: 1)
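The idea is to create monthly periods with DatetimeIndex.to_period from date_range and count them with Index.value_counts, then build a DataFrame with concat, replace missing values with DataFrame.fillna, and finally join the result to the original DataFrame: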
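# For each row, expand the range into daily timestamps, map each day to its
# monthly period, and count how many days fall in each month.
L = {r.Index: pd.date_range(r.ActualStartDate, r.ActualEndDate).to_period('M').value_counts()
     for r in df.itertuples()}
# concat puts one column per row; fill months a row never touches with 0,
# transpose so months become columns, then join back to the original frame.
df = df.join(pd.concat(L, axis=1).fillna(0).astype(int).T)
print(df)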
  ActualStartDate ActualEndDate  2019-06  2019-07  2019-08  2019-09  2019-10  \
0      2019-06-30    2019-08-15        1       31       15        0        0
1      2019-09-01    2020-01-01        0        0        0       30       31
2      2019-08-28    2019-11-13        0        0        4       30       31

   2019-11  2019-12  2020-01
0        0        0        0
1       30       31        1
2       13        0        0
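To see what the dictionary comprehension computes for a single row, here is a minimal sketch using the first row of the sample data (the variable names `days`, `months`, and `counts` are only for illustration). Both endpoints are included because `date_range` is inclusive by default.

```python
import pandas as pd

# One row from the question: 2019-06-30 through 2019-08-15, both ends inclusive.
days = pd.date_range("2019-06-30", "2019-08-15")   # one timestamp per day
months = days.to_period("M")                        # each day mapped to its month
counts = months.value_counts().sort_index()         # days per month, in calendar order

print(counts)
```

The counts for this row are 1, 31, and 15 for June, July, and August, which matches row 0 of the output above.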
Performance:
df = pd.concat([df] * 1000, ignore_index=True)
In [44]: %%timeit
    ...: L = {r.Index: pd.date_range(r.ActualStartDate, r.ActualEndDate).to_period('M').value_counts()
    ...:      for r in df.itertuples()}
    ...: df.join(pd.concat(L, axis=1).fillna(0).astype(int).T)
    ...:
689 ms ± 5.63 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [45]: %%timeit
    ...: df.join(
    ...:     df.apply(lambda v: pd.Series(pd.date_range(v['ActualStartDate'], v['ActualEndDate'], freq='D').to_period('M')), axis=1)
    ...:       .apply(pd.value_counts, axis=1)
    ...:       .fillna(0)
    ...:       .astype(int))
    ...:
994 ms ± 5.17 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Answer 1 (score: 1)
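Probably not the most efficient, but it shouldn't be too bad for around 1,500 rows... Expand each date range, convert it to monthly periods, take the counts of those, and join them back to the original DataFrame, e.g.: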
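res = df.join(
    # one row per input row, holding the monthly period of every day in its range
    df.apply(lambda v: pd.Series(pd.date_range(v['ActualStartDate'], v['ActualEndDate'], freq='D').to_period('M')), axis=1)
      # count days per month in each row; Series.value_counts is used because
      # the top-level pd.value_counts is deprecated in recent pandas versions
      .apply(lambda s: s.value_counts(), axis=1)
      .fillna(0)
      .astype(int)
)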
This gives you:
  ActualStartDate ActualEndDate  2019-06  2019-07  2019-08  2019-09  2019-10  2019-11  2019-12  2020-01  2020-02  2020-03  2020-04  2020-05  2020-06  2020-07  2020-08  2020-09  2020-10  2020-11
0      2019-06-30    2020-08-15        1       31       31       30       31       30       31       31       29       31       30       31       30       31       15        0        0        0
1      2019-09-01    2020-01-01        0        0        0       30       31       30       31        1        0        0        0        0        0        0        0        0        0        0
2      2019-08-28    2020-11-13        0        0        4       30       31       30       31       31       29       31       30       31       30       31       31       30       31       13
Answer 2 (score: 0)
import pandas as pd
import calendar

date_info = pd.DataFrame({
    'ActualStartDate': [
        pd.Timestamp('2019-06-30'),
        pd.Timestamp('2019-09-01'),
        pd.Timestamp('2019-08-28'),
    ],
    'ActualEndDate': [
        pd.Timestamp('2019-08-15'),
        pd.Timestamp('2020-01-01'),
        pd.Timestamp('2019-11-13'),
    ]
})

# ============================================================
result = {}  # collect the counts in a dict first, in case there are many columns
for index, timepair in date_info.iterrows():
    start = timepair['ActualStartDate']
    end = timepair['ActualEndDate']
    current = start
    result[index] = {}  # days per month for this start/end pair
    # walk month by month; "<=" so that an end date on the 1st is still counted
    while current <= end:
        # days left in the current month, counting the current day itself
        _, days = calendar.monthrange(current.year, current.month)
        remaining = days - current.day + 1
        # never count past the end date (the end date itself is included)
        delta = min(remaining, (end - current).days + 1)
        # zero-padded month so that the column names sort chronologically
        result[index]['%d-%02d' % (current.year, current.month)] = delta
        current += pd.Timedelta(delta, unit='d')

# you can save the result in the DataFrame, if you insist
columns = set()
for value in result.values():
    columns.update(value.keys())
for col in sorted(columns):
    date_info[col] = 0
for index, delta in result.items():
    for date, days in delta.items():
        date_info.loc[index, date] = days
print(date_info)
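The month-by-month walk in this answer can also be isolated into a small standalone function, which makes edge cases easier to check (a range that ends on the first day of a month, or one that starts and ends inside the same month). The following is only a sketch under a few assumptions: the helper name `days_per_month` is hypothetical, it takes `datetime.date` objects rather than pandas timestamps, and both endpoints are treated as inclusive.

```python
import calendar
from datetime import date, timedelta


def days_per_month(start, end):
    """Count days in each calendar month of [start, end], both ends inclusive.

    Hypothetical helper for illustration; keys are 'YYYY-MM' strings.
    """
    counts = {}
    current = start
    while current <= end:
        # Last day of the current month, capped at the end date.
        last_day = calendar.monthrange(current.year, current.month)[1]
        month_end = min(date(current.year, current.month, last_day), end)
        counts[current.strftime("%Y-%m")] = (month_end - current).days + 1
        # Jump to the first day of the next month (or past the end date).
        current = month_end + timedelta(days=1)
    return counts


print(days_per_month(date(2019, 9, 1), date(2020, 1, 1)))
print(days_per_month(date(2019, 6, 10), date(2019, 6, 20)))
```

For a pandas `Timestamp`, calling `.date()` first yields a value this helper accepts.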