My dataset looks like this:
time raccoons_bought x y
22443 1984-01-01 00:00:01 1 55.776462 37.593956
2143 1984-01-01 00:00:01 4 55.757121 37.378225
9664 1984-01-01 00:00:33 3 55.773702 37.599220
33092 1984-01-01 00:01:39 3 55.757121 37.378225
16697 1984-01-01 00:02:32 2 55.678549 37.583023
I need to count how many raccoons are bought each day.
What I did: set time as the index
df = df.set_index(['time'])
Sort (group) the dataset by it
df.groupby(df.index.date).count()
But before I sort, I need to drop the x and y columns, which hold the coordinates. If I don't drop them, the dataset looks like this:
raccoons_bought x y
1984-01-01 5497 5497 5497
1984-01-02 5443 5443 5443
1984-01-03 5488 5488 5488
1984-01-04 5453 5453 5453
1984-01-05 5536 5536 5536
1984-01-06 5634 5634 5634
1984-01-07 5468 5468 5468
If I drop them, the dataset looks fine:
raccoons_bought
1984-01-01 5497
1984-01-02 5443
1984-01-03 5488
1984-01-04 5453
1984-01-05 5536
1984-01-06 5634
1984-01-07 5468
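So my question is: how do I count raccoons_bought per day and keep the coordinates intact? I want to plot these coordinates on a map and find out who bought those raccoons.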
Answer 0: (score: 3)
You can do it like this:
In [82]: df
Out[82]:
time raccoons_bought x y
22443 1984-01-01 00:00:01 1 55.776462 37.593956
2143 1984-01-01 00:00:01 4 55.757121 37.378225
9664 1984-01-01 00:00:33 3 55.773702 37.599220
33092 1984-01-01 00:01:39 3 55.757121 37.378225
16697 1984-01-01 00:02:32 2 55.678549 37.583023
In [83]: df.groupby(pd.to_datetime(df.time).dt.date).agg(
...: {'raccoons_bought': 'sum', 'x':'first', 'y':'first'}).reset_index()
Out[83]:
time y x raccoons_bought
0 1984-01-01 37.593956 55.776462 13
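Note that I used sum as the aggregation function for raccoons_bought to get the total number bought; if you only need the number of rows, change it to count or size.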
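To make the difference between sum, count and size concrete, here is a minimal sketch on the five sample rows from the question. The output column names total and purchases are made up for illustration, and the named-aggregation form of agg assumes a reasonably recent pandas (0.25 or later).

```python
import pandas as pd

# The five sample rows from the question.
df = pd.DataFrame(
    {
        "time": pd.to_datetime([
            "1984-01-01 00:00:01",
            "1984-01-01 00:00:01",
            "1984-01-01 00:00:33",
            "1984-01-01 00:01:39",
            "1984-01-01 00:02:32",
        ]),
        "raccoons_bought": [1, 4, 3, 3, 2],
        "x": [55.776462, 55.757121, 55.773702, 55.757121, 55.678549],
        "y": [37.593956, 37.378225, 37.599220, 37.378225, 37.583023],
    },
    index=[22443, 2143, 9664, 33092, 16697],
)

day = df["time"].dt.date

# sum adds up the raccoons, count counts the non-null purchase rows.
# "total" and "purchases" are illustrative names, not from the question.
summary = df.groupby(day).agg(
    total=("raccoons_bought", "sum"),
    purchases=("raccoons_bought", "count"),
)

# size counts every row in the group, whether or not values are missing.
rows = df.groupby(day).size()

print(summary)
print(rows)
```

For this sample, sum gives 13 raccoons for 1984-01-01, while count and size both give 5 because no value is missing.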
Answer 1: (score: 1)
You can use:
#if necessary convert to datetime
df['time'] = pd.to_datetime(df['time'])
#thank you JoeCondron
# trim the timestamps to get the datetime object, faster
dates = df['time'].dt.floor('D')
#if you need Python date objects instead, slower
#dates = df['time'].dt.date
#use size to count all rows per group, including NaNs
#use count to count only non-NaN values
df1 = df.groupby(dates).size()
print (df1)
time
1984-01-01 5
dtype: int64
#if need sums
df11 = df.groupby(dates)['raccoons_bought'].sum().reset_index()
print (df11)
time raccoons_bought
0 1984-01-01 13
If you want to keep the original rows and columns unchanged, use transform with sum (or size or count):
a = df.groupby(dates)['raccoons_bought'].transform('sum')
print (a)
22443 13
2143 13
9664 13
33092 13
16697 13
Name: raccoons_bought, dtype: int64
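Since transform returns one value per original row, its result can be stored as a new column next to x and y, which keeps the coordinates available for plotting. A minimal sketch on the question's sample rows, where the column name daily_total is an assumption chosen for illustration:

```python
import pandas as pd

# The five sample rows from the question.
df = pd.DataFrame(
    {
        "time": pd.to_datetime([
            "1984-01-01 00:00:01",
            "1984-01-01 00:00:01",
            "1984-01-01 00:00:33",
            "1984-01-01 00:01:39",
            "1984-01-01 00:02:32",
        ]),
        "raccoons_bought": [1, 4, 3, 3, 2],
        "x": [55.776462, 55.757121, 55.773702, 55.757121, 55.678549],
        "y": [37.593956, 37.378225, 37.599220, 37.378225, 37.583023],
    },
    index=[22443, 2143, 9664, 33092, 16697],
)

# Truncate each timestamp to midnight so rows from the same day share a key.
dates = df["time"].dt.floor("D")

# Work on a copy so the original frame stays untouched.
out = df.copy()

# Broadcast the per-day total back to every row; "daily_total" is a
# hypothetical column name. x and y are left as they were.
out["daily_total"] = out.groupby(dates)["raccoons_bought"].transform("sum")

print(out)
```

Every row keeps its own coordinates and also carries the total for its day, so nothing has to be dropped before plotting.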
Then filter all rows by a condition:
mask = df.groupby(dates)['raccoons_bought'].transform('sum') > 4
df2 = df.loc[mask, 'raccoons_bought']
print (df2)
22443 1
2143 4
9664 3
33092 3
16697 2
Name: raccoons_bought, dtype: int64
If you need the unique values as a list:
df2 = df.loc[mask, 'raccoons_bought'].unique().tolist()
print (df2)
[1, 4, 3, 2]
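If the goal is one point per location per day on the map, another option is to group by the day together with the coordinates. This is only a sketch under the assumption that identical x and y values mean the same place; the name per_place is made up for illustration.

```python
import pandas as pd

# The five sample rows from the question.
df = pd.DataFrame(
    {
        "time": pd.to_datetime([
            "1984-01-01 00:00:01",
            "1984-01-01 00:00:01",
            "1984-01-01 00:00:33",
            "1984-01-01 00:01:39",
            "1984-01-01 00:02:32",
        ]),
        "raccoons_bought": [1, 4, 3, 3, 2],
        "x": [55.776462, 55.757121, 55.773702, 55.757121, 55.678549],
        "y": [37.593956, 37.378225, 37.599220, 37.378225, 37.583023],
    },
    index=[22443, 2143, 9664, 33092, 16697],
)

dates = df["time"].dt.floor("D")

# One row per (day, x, y); rows sharing exactly the same coordinates
# on the same day are summed together.
per_place = (
    df.groupby([dates, "x", "y"])["raccoons_bought"]
    .sum()
    .reset_index()
)

print(per_place)
```

In the sample, rows 2143 and 33092 share the same coordinates, so they collapse into a single row with 7 raccoons, and the result has four rows in total.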