I have a pandas DataFrame df in which each row holds the start_date (which is also the index) and the duration (in days) of a subscription.
import pandas as pd
df = pd.DataFrame({'start_date':['2018-01-01','2018-01-05']})
df['start_date'] = df['start_date'].astype('datetime64[ns]')
# start_date is also the index, as described above
df = df.set_index('start_date', drop=False)
df['duration'] = pd.to_timedelta([10,8], unit='D')
df['end_date'] = df['start_date'] + df['duration']
I would like to plot the number of subscribers over time.
My idea was to create another DataFrame, active_subscribers:
active_subscribers = pd.DataFrame({
    'Date': pd.date_range(start=df.index.min(), end=df['end_date'].max()),
    'Number': 0,
})
active_subscribers.set_index('Date', inplace=True)
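Date covers the whole period during which at least one subscriber is active. I was then thinking of creating a date range for each subscription and adding it to the Number column, like this: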
for index, row in df.iterrows():
    for this_date in pd.date_range(start=index, end=row['end_date']):
        active_subscribers[this_date]['Number'] += 1
But this returns the following error:
KeyError: Timestamp('2018-01-01 00:00:00', freq='D')
What I would like to get is a Number column like this:
Date Number
2018-01-01 1
2018-01-02 1
2018-01-03 1
2018-01-04 1
2018-01-05 2
2018-01-06 2
2018-01-07 2
2018-01-08 2
2018-01-09 2
2018-01-10 2
2018-01-11 1
2018-01-12 1
2018-01-13 1
where the Number column contains the number of active subscribers on that day.
If you have any suggestions, please let me know.
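Answer 0: (score: 0)
You can use a list comprehension with itertuples to build one Series per subscription, combine them with concat into a new DataFrame, and then get the new column with groupby: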
df = pd.DataFrame(index=pd.to_datetime(['2018-01-01','2018-01-05']))
df['duration'] = pd.to_timedelta([10,8], unit='D')
df['end_date'] = df.index + df['duration']
print (df)
duration end_date
2018-01-01 10 days 2018-01-11
2018-01-05 8 days 2018-01-13
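# Move the start dates from the index into a 'start_date' column.
df = df.rename_axis('start_date').reset_index()
# One Series per subscription, indexed by every day it is active (the values are only placeholders).
com = [pd.Series(r.Index, pd.date_range(r.start_date, r.end_date)) for r in df.itertuples()]
df1 = pd.concat(com).reset_index()
df1.columns = ['Date', 'Number']
# Count how many subscriptions cover each date.
df1 = df1.groupby('Date')['Number'].size().reset_index()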
print (df1)
Date Number
0 2018-01-01 1
1 2018-01-02 1
2 2018-01-03 1
3 2018-01-04 1
4 2018-01-05 2
5 2018-01-06 2
6 2018-01-07 2
7 2018-01-08 2
8 2018-01-09 2
9 2018-01-10 2
10 2018-01-11 2
11 2018-01-12 1
12 2018-01-13 1
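Note that 2018-01-11 shows 2 here, while the table in the question shows 1. pd.date_range includes both endpoints, so a 10-day subscription starting on 2018-01-01 is counted on 11 dates. If a subscription should not count on its end_date, one option is to stop each range a day earlier. The following is only a sketch under that assumption, and the helper name count_active_exclusive is made up for illustration. It also drops 2018-01-13, so it does not reproduce the question's table exactly either; which convention is right depends on how duration is defined.

```python
import pandas as pd

def count_active_exclusive(df):
    """Count active subscriptions per day, treating end_date as exclusive.

    Assumes df has datetime columns 'start_date' and 'end_date'.
    """
    one_day = pd.Timedelta(days=1)
    # One Series of ones per subscription, indexed by the days it is active.
    parts = [
        pd.Series(1, index=pd.date_range(r.start_date, r.end_date - one_day))
        for r in df.itertuples()
    ]
    # Summing the ones per date gives the number of active subscriptions.
    out = pd.concat(parts).groupby(level=0).sum()
    return out.rename_axis('Date').reset_index(name='Number')

df = pd.DataFrame({'start_date': pd.to_datetime(['2018-01-01', '2018-01-05'])})
df['duration'] = pd.to_timedelta([10, 8], unit='D')
df['end_date'] = df['start_date'] + df['duration']
result = count_active_exclusive(df)
print(result)
```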
This is faster than the iterrows solution:
In [288]: %timeit (iterrows_sol(df))
10 loops, best of 3: 51.1 ms per loop
In [289]: %timeit (itertupl_sol(df))
100 loops, best of 3: 10.2 ms per loop
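Both versions timed here still loop over the rows in Python. If that becomes a bottleneck, a different technique may be worth trying: treat each subscription as a +1 event on its start date and a -1 event on the day after its end date, then take a running sum. This is not one of the two timed functions and has not been benchmarked here. The sketch below assumes a frame indexed by start date with an inclusive end_date column, and the name cumsum_sol is hypothetical.

```python
import pandas as pd

def cumsum_sol(df):
    """Count active subscriptions per day via +1/-1 events and a running sum.

    Assumes df is indexed by start date and has a datetime 'end_date' column,
    with end_date itself counted as an active day.
    """
    starts = pd.Series(1, index=pd.DatetimeIndex(df.index))
    # A subscription stops counting on the day after its end date.
    stop_days = pd.DatetimeIndex(df['end_date']) + pd.Timedelta(days=1)
    stops = pd.Series(-1, index=stop_days)
    # Net change in active subscriptions on each date that has an event.
    changes = pd.concat([starts, stops]).groupby(level=0).sum()
    # Add the dates without events, then accumulate the changes.
    days = pd.date_range(df.index.min(), df['end_date'].max())
    counts = changes.reindex(days, fill_value=0).cumsum()
    return counts.rename_axis('Date').reset_index(name='Number')

df = pd.DataFrame(index=pd.to_datetime(['2018-01-01', '2018-01-05']))
df['duration'] = pd.to_timedelta([10, 8], unit='D')
df['end_date'] = df.index + df['duration']
result = cumsum_sol(df)
print(result)
```

The work per subscription is constant regardless of its duration, which is where any speedup would come from on long subscriptions.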
Sample data used for the timings:
df = pd.DataFrame(index=pd.to_datetime(['2018-01-01','2018-01-05'] * 10))
df['duration'] = pd.to_timedelta([10,8,2,3,7,2,1,9,1,20,7,18,9,0,3,20,10,8,3,15] , unit='D')
df['end_date'] = df.index + df['duration']
print (df)
duration end_date
2018-01-01 10 days 2018-01-11
2018-01-05 8 days 2018-01-13
2018-01-01 2 days 2018-01-03
2018-01-05 3 days 2018-01-08
2018-01-01 7 days 2018-01-08
2018-01-05 2 days 2018-01-07
2018-01-01 1 days 2018-01-02
2018-01-05 9 days 2018-01-14
2018-01-01 1 days 2018-01-02
2018-01-05 20 days 2018-01-25
2018-01-01 7 days 2018-01-08
2018-01-05 18 days 2018-01-23
2018-01-01 9 days 2018-01-10
2018-01-05 0 days 2018-01-05
2018-01-01 3 days 2018-01-04
2018-01-05 20 days 2018-01-25
2018-01-01 10 days 2018-01-11
2018-01-05 8 days 2018-01-13
2018-01-01 3 days 2018-01-04
2018-01-05 15 days 2018-01-20
Functions:
def iterrows_sol(df):
    active_subscribers = pd.DataFrame({
        'Date': pd.date_range(start=df.index.min(), end=df['end_date'].max()),
        'Number': 0,
    })
    active_subscribers.set_index('Date', inplace=True)
    for index, row in df.iterrows():
        for this_date in pd.date_range(start=index, end=row['end_date']):
            active_subscribers.loc[this_date, 'Number'] += 1
    return active_subscribers

def itertupl_sol(df):
    df = df.rename_axis('start_date').reset_index()
    com = [pd.Series(r.Index, pd.date_range(r.start_date, r.end_date)) for r in df.itertuples()]
    df1 = pd.concat(com).reset_index()
    df1.columns = ['Date', 'Number']
    df1 = df1.groupby('Date')['Number'].size().reset_index()
    return df1
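iterrows_sol is the loop from the question with one change: it writes through .loc[this_date, 'Number']. In the original loop, active_subscribers[this_date] is a column lookup, and no column is labelled with that timestamp, which is what raises the KeyError. A small illustration on a three-day frame (the variable names are chosen only for this example):

```python
import pandas as pd

active = pd.DataFrame(
    {'Number': 0},
    index=pd.date_range('2018-01-01', periods=3, name='Date'),
)
day = pd.Timestamp('2018-01-02')

# frame[key] selects a column, so a date key is not found.
try:
    active[day]
    raised = False
except KeyError:
    raised = True

# .loc[row_label, column_label] addresses a single cell by its index label.
active.loc[day, 'Number'] += 1
print(raised)
print(active)
```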