Python: Groupby by hour, including empty values

Time: 2016-05-02 16:37:24

Tags: python pandas group-by

Using the Citi Bike data: https://s3.amazonaws.com/tripdata/index.html

tripduration    starttime    stoptime    start_station_id    start_station_name    start_station_latitude    start_station_longitude    end_station_id    end_station_name    end_station_latitude    end_station_longitude    bikeid    usertype    birth_year    gender
461    2016-02-01 00:00:08    2016-02-01 00:07:49    480    W 53 St & 10 Ave    40.766697    -73.990617    524    W 43 St & 6 Ave    40.755273    -73.983169    23292    Subscriber    1966.0    1
297    2016-02-01 00:00:56    2016-02-01 00:05:53    463    9 Ave & W 16 St    40.742065    -74.004432    380    W 4 St & 7 Ave S    40.734011    -74.002939    15329    Subscriber    1977.0    1
280    2016-02-01 00:01:00    2016-02-01 00:05:40    3134    3 Ave & E 62 St    40.763126    -73.965269    3141    1 Ave & E 68 St    40.765005    -73.958185    22927    Subscriber    1987.0    1

I am grouping by hour with groupby, and I want hours with no data to be included with a count of zero.

I used the following code:

bikes_parked = df.groupby(['end_station_name',pd.Grouper(key='stoptime',freq='H')]).size().reset_index()
bikes_parked.rename(columns={0: 'bikes_parked'},inplace=True)

This counts the number of bikes parked per hour, but hours with no data are skipped.

Output:

    end_station_name    stoptime               bikes_parked
0   1 Ave & E 15 St     2016-02-01 00:00:00    1
1   1 Ave & E 15 St     2016-02-01 05:00:00    1
2   1 Ave & E 15 St     2016-02-01 06:00:00    3

I want to include the stoptime hours 01, 02, 03 and 04 as well, with bikes_parked set to 0.

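1 answer:

Answer 0: (score: 0)

As mentioned in the comments, the solution goes like this:

1) Create a DataFrame that covers the whole range of hours, with bikes_parked=0 everywhere.

2) Update this DataFrame (called df in the line below) with the corresponding counts from the grouped table. This assumes both frames are indexed the same way, for example by end_station_name and stoptime: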

df.loc[bikes_parked.index, 'bikes_parked'] = bikes_parked.bikes_parked
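The two steps can be sketched end to end as below. This is only an illustration under some assumptions: the trips frame is a made-up sample holding just the two columns that matter, the names trips, counts and full_index are hypothetical, and the fill step uses reindex with fill_value=0 instead of the .loc assignment, which should give the same zero-filled result in a single call.

```python
import pandas as pd

# Hypothetical sample standing in for the Citi Bike trips
# (only the two columns used by the grouping).
trips = pd.DataFrame({
    "end_station_name": ["1 Ave & E 15 St"] * 5,
    "stoptime": pd.to_datetime([
        "2016-02-01 00:07:49",
        "2016-02-01 05:10:00",
        "2016-02-01 06:01:00",
        "2016-02-01 06:20:00",
        "2016-02-01 06:45:00",
    ]),
})

# Count bikes parked per station and hour, keeping the
# (end_station_name, stoptime) MultiIndex instead of resetting it.
hourly = trips["stoptime"].dt.floor("h")
counts = trips.groupby(["end_station_name", hourly]).size()

# Step 1: every station combined with every hour in the observed range.
stations = trips["end_station_name"].unique()
hours = pd.date_range(hourly.min(), hourly.max(), freq="h")
full_index = pd.MultiIndex.from_product(
    [stations, hours], names=["end_station_name", "stoptime"]
)

# Step 2: align the counts to the full index; missing hours become 0.
bikes_parked = (
    counts.reindex(full_index, fill_value=0)
    .rename("bikes_parked")
    .reset_index()
)

print(bikes_parked)
```

With the sample above, the hours 01 through 04 appear with bikes_parked equal to 0, next to the counts for 00, 05 and 06. On the real data, the hour range is taken from the earliest and latest stoptime in the frame, so a different range would have to be passed to date_range if the month boundaries should be covered in full.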