I have the following table in a Pandas DataFrame:
q_string q_visits q_date
0 nucleus 1790 2012-10-02 00:00:00
1 neuron 364 2012-10-02 00:00:00
2 current 280 2012-10-02 00:00:00
3 molecular 259 2012-10-02 00:00:00
4 stem 201 2012-10-02 00:00:00
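This table contains query volumes from a server log, by day. I would like to do two things:
(1) Sum q_visits for each query term by month.
(2) Compute each term's share of the month, i.e. the term's monthly query volume divided by the total query volume for that month across all terms.
What is the best way to do this?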
Answer 0 (score: 31)
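If I understand you correctly:
For (1), do this:
First make some fake data by sampling from the values you provided, plus some random dates and visit counts: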
In [179]: string = Series(np.random.choice(df.string.values, size=100), name='string')
In [180]: visits = Series(poisson(1000, size=100), name='visits')
In [181]: date = Series(np.random.choice([df.date[0], now(), Timestamp('1/1/2001'), Timestamp('11/15/2001'), Timestamp('12/1/01'), Timestamp('5/1/01')], size=100), dtype='datetime64[ns]', name='date')
In [182]: df = DataFrame({'string': string, 'visits': visits, 'date': date})
In [183]: df.head()
Out[183]:
date string visits
0 2001-11-15 00:00:00 current 997
1 2001-11-15 00:00:00 current 974
2 2012-10-02 00:00:00 stem 982
3 2001-12-01 00:00:00 stem 984
4 2001-01-01 00:00:00 current 989
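The data-generation step above relies on names already imported into the session (Series, DataFrame, poisson, now). A minimal self-contained sketch of the same idea follows; the seed, the list of terms, and the set of dates are assumptions chosen for illustration, so the numbers will not match the output shown above.

```python
import numpy as np
import pandas as pd

# Fixed seed so the sketch is repeatable (an assumption, not in the original).
rng = np.random.default_rng(0)

# Query terms taken from the question's table; the dates are illustrative.
terms = ["nucleus", "neuron", "current", "molecular", "stem"]
dates = pd.to_datetime(
    ["2001-01-01", "2001-05-01", "2001-11-15", "2001-12-01", "2012-10-02"]
)

df = pd.DataFrame({
    # Pick a random term for each of the 100 rows.
    "string": rng.choice(terms, size=100),
    # Poisson-distributed visit counts with mean 1000.
    "visits": rng.poisson(1000, size=100),
    # Pick a random date for each row by sampling positions into `dates`.
    "date": dates[rng.integers(0, len(dates), size=100)],
})

print(df.head())
```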
In [186]: resamp = df.set_index('date').groupby('string').resample('M', how='sum')
In [187]: resamp.head()
Out[187]:
visits
string date
current 2001-01-31 2996
2001-02-28 NaN
2001-03-31 NaN
2001-04-30 NaN
2001-05-31 3016
The NaN values appear because that query string had no visits in those months.
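The how= keyword used in the resample call above is no longer accepted by recent pandas releases, where the aggregation is called as a method instead (for example .resample(...).sum()). As an alternative that does not depend on resample at all, the monthly sum can be sketched by grouping on a month period. This is a swapped-in technique, not the answer's original method: it uses toy data, and it only produces rows for months that actually have visits, so no NaN rows appear.

```python
import pandas as pd

# Hypothetical toy data using the answer's column names.
df = pd.DataFrame({
    "string": ["nucleus", "nucleus", "neuron", "neuron", "nucleus"],
    "visits": [100, 50, 30, 20, 10],
    "date": pd.to_datetime(
        ["2012-10-02", "2012-10-15", "2012-10-02", "2012-11-03", "2012-11-20"]
    ),
})

# Label each row with its calendar month.
month = df["date"].dt.to_period("M").rename("month")

# Sum visits for every (term, month) pair.
monthly = df.groupby(["string", month])["visits"].sum()

print(monthly)
```

The result is a Series indexed by (string, month), analogous to resamp above but without the empty months.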
For (2), group by date and then divide by the sum:
In [188]: g = resamp.groupby(level='date').apply(lambda x: x / x.sum())
In [189]: g.head()
Out[189]:
visits
string date
current 2001-01-31 0.177
2001-02-28 NaN
2001-03-31 NaN
2001-04-30 NaN
2001-05-31 0.188
Just to convince you that (2) does what you want:
In [176]: h = g.sortlevel('date').head()
In [177]: h
Out[177]:
visits
string date
current 2001-01-31 0.077
molecular 2001-01-31 0.228
neuron 2001-01-31 0.073
nucleus 2001-01-31 0.234
stem 2001-01-31 0.388
In [178]: h.sum()
Out[178]:
visits 1
dtype: float64
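The same ratio and sanity check can also be sketched with transform, which returns a result aligned to the original index and so does not depend on how apply reassembles its output (behaviour that, as far as I know, has varied between pandas versions). The frame below is hypothetical toy data shaped like resamp. The sketch uses sort_index(level=...) because sortlevel is no longer available in current pandas.

```python
import pandas as pd

# Hypothetical monthly totals, indexed by (string, date) like resamp above.
idx = pd.MultiIndex.from_tuples(
    [
        ("current", "2001-01-31"),
        ("current", "2001-05-31"),
        ("stem", "2001-01-31"),
        ("stem", "2001-05-31"),
    ],
    names=["string", "date"],
)
resamp = pd.DataFrame({"visits": [100.0, 50.0, 300.0, 150.0]}, index=idx)

# Divide each term's monthly total by that month's total across all terms.
share = resamp / resamp.groupby(level="date").transform("sum")

# Sort by month so the terms of one month sit next to each other.
by_date = share.sort_index(level="date")

# Within each month the shares should add up to 1.
check = share.groupby(level="date").sum()

print(by_date)
print(check)
```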
If you want to drop the NaN values from resamp and turn it into a flat DataFrame, do the following:
In [196]: resamp.dropna()
Out[196]:
visits
string date
current 2001-01-31 2996
2001-05-31 3016
2001-11-30 5959
2001-12-31 3998
2013-09-30 1077
molecular 2001-01-31 3984
2001-05-31 1911
2001-11-30 3054
2001-12-31 1020
2012-10-31 977
2013-09-30 1947
neuron 2001-01-31 3961
2001-05-31 2069
2001-11-30 5010
2001-12-31 2065
2012-10-31 6973
2013-09-30 994
nucleus 2001-01-31 3060
2001-05-31 3035
2001-11-30 2924
2001-12-31 4144
2012-10-31 2004
2013-09-30 7881
stem 2001-01-31 2911
2001-05-31 5994
2001-11-30 6072
2001-12-31 4916
2012-10-31 1991
2013-09-30 3977
In [197]: resamp.dropna().reset_index()
Out[197]:
string date visits
0 current 2001-01-31 00:00:00 2996
1 current 2001-05-31 00:00:00 3016
2 current 2001-11-30 00:00:00 5959
3 current 2001-12-31 00:00:00 3998
4 current 2013-09-30 00:00:00 1077
5 molecular 2001-01-31 00:00:00 3984
6 molecular 2001-05-31 00:00:00 1911
7 molecular 2001-11-30 00:00:00 3054
8 molecular 2001-12-31 00:00:00 1020
9 molecular 2012-10-31 00:00:00 977
10 molecular 2013-09-30 00:00:00 1947
11 neuron 2001-01-31 00:00:00 3961
12 neuron 2001-05-31 00:00:00 2069
13 neuron 2001-11-30 00:00:00 5010
14 neuron 2001-12-31 00:00:00 2065
15 neuron 2012-10-31 00:00:00 6973
16 neuron 2013-09-30 00:00:00 994
17 nucleus 2001-01-31 00:00:00 3060
18 nucleus 2001-05-31 00:00:00 3035
19 nucleus 2001-11-30 00:00:00 2924
20 nucleus 2001-12-31 00:00:00 4144
21 nucleus 2012-10-31 00:00:00 2004
22 nucleus 2013-09-30 00:00:00 7881
23 stem 2001-01-31 00:00:00 2911
24 stem 2001-05-31 00:00:00 5994
25 stem 2001-11-30 00:00:00 6072
26 stem 2001-12-31 00:00:00 4916
27 stem 2012-10-31 00:00:00 1991
28 stem 2013-09-30 00:00:00 3977
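As a small illustration of what the two calls above do, here is a sketch on a hypothetical four-row frame: dropna removes the rows whose visits value is NaN, and reset_index moves the (string, date) index levels back into ordinary columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a (string, date) MultiIndex and some empty months.
idx = pd.MultiIndex.from_product(
    [["current", "stem"], ["2001-01-31", "2001-02-28"]],
    names=["string", "date"],
)
resamp = pd.DataFrame({"visits": [10.0, np.nan, 20.0, np.nan]}, index=idx)

# Drop the empty months, then turn the index levels into regular columns.
flat = resamp.dropna().reset_index()

print(flat)
```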
You can of course do the same for g:
In [198]: g.dropna()
Out[198]:
visits
string date
current 2001-01-31 0.177
2001-05-31 0.188
2001-11-30 0.259
2001-12-31 0.248
2013-09-30 0.068
molecular 2001-01-31 0.236
2001-05-31 0.119
2001-11-30 0.133
2001-12-31 0.063
2012-10-31 0.082
2013-09-30 0.123
neuron 2001-01-31 0.234
2001-05-31 0.129
2001-11-30 0.218
2001-12-31 0.128
2012-10-31 0.584
2013-09-30 0.063
nucleus 2001-01-31 0.181
2001-05-31 0.189
2001-11-30 0.127
2001-12-31 0.257
2012-10-31 0.168
2013-09-30 0.496
stem 2001-01-31 0.172
2001-05-31 0.374
2001-11-30 0.264
2001-12-31 0.305
2012-10-31 0.167
2013-09-30 0.251
In [199]: g.dropna().reset_index()
Out[199]:
string date visits
0 current 2001-01-31 00:00:00 0.177
1 current 2001-05-31 00:00:00 0.188
2 current 2001-11-30 00:00:00 0.259
3 current 2001-12-31 00:00:00 0.248
4 current 2013-09-30 00:00:00 0.068
5 molecular 2001-01-31 00:00:00 0.236
6 molecular 2001-05-31 00:00:00 0.119
7 molecular 2001-11-30 00:00:00 0.133
8 molecular 2001-12-31 00:00:00 0.063
9 molecular 2012-10-31 00:00:00 0.082
10 molecular 2013-09-30 00:00:00 0.123
11 neuron 2001-01-31 00:00:00 0.234
12 neuron 2001-05-31 00:00:00 0.129
13 neuron 2001-11-30 00:00:00 0.218
14 neuron 2001-12-31 00:00:00 0.128
15 neuron 2012-10-31 00:00:00 0.584
16 neuron 2013-09-30 00:00:00 0.063
17 nucleus 2001-01-31 00:00:00 0.181
18 nucleus 2001-05-31 00:00:00 0.189
19 nucleus 2001-11-30 00:00:00 0.127
20 nucleus 2001-12-31 00:00:00 0.257
21 nucleus 2012-10-31 00:00:00 0.168
22 nucleus 2013-09-30 00:00:00 0.496
23 stem 2001-01-31 00:00:00 0.172
24 stem 2001-05-31 00:00:00 0.374
25 stem 2001-11-30 00:00:00 0.264
26 stem 2001-12-31 00:00:00 0.305
27 stem 2012-10-31 00:00:00 0.167
28 stem 2013-09-30 00:00:00 0.251
Finally, if you want the columns in a different order, use reindex:
In [210]: g.dropna().reset_index().reindex(columns=['visits', 'string', 'date'])
Out[210]:
visits string date
0 0.177 current 2001-01-31 00:00:00
1 0.188 current 2001-05-31 00:00:00
2 0.259 current 2001-11-30 00:00:00
3 0.248 current 2001-12-31 00:00:00
4 0.068 current 2013-09-30 00:00:00
5 0.236 molecular 2001-01-31 00:00:00
6 0.119 molecular 2001-05-31 00:00:00
7 0.133 molecular 2001-11-30 00:00:00
8 0.063 molecular 2001-12-31 00:00:00
9 0.082 molecular 2012-10-31 00:00:00
10 0.123 molecular 2013-09-30 00:00:00
11 0.234 neuron 2001-01-31 00:00:00
12 0.129 neuron 2001-05-31 00:00:00
13 0.218 neuron 2001-11-30 00:00:00
14 0.128 neuron 2001-12-31 00:00:00
15 0.584 neuron 2012-10-31 00:00:00
16 0.063 neuron 2013-09-30 00:00:00
17 0.181 nucleus 2001-01-31 00:00:00
18 0.189 nucleus 2001-05-31 00:00:00
19 0.127 nucleus 2001-11-30 00:00:00
20 0.257 nucleus 2001-12-31 00:00:00
21 0.168 nucleus 2012-10-31 00:00:00
22 0.496 nucleus 2013-09-30 00:00:00
23 0.172 stem 2001-01-31 00:00:00
24 0.374 stem 2001-05-31 00:00:00
25 0.264 stem 2001-11-30 00:00:00
26 0.305 stem 2001-12-31 00:00:00
27 0.167 stem 2012-10-31 00:00:00
28 0.251 stem 2013-09-30 00:00:00
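For reordering, selecting the columns with a list gives the same result as reindex when every name exists. The sketch below, on hypothetical toy data, shows both. One difference worth knowing: reindex creates a NaN-filled column for a name that does not exist, whereas list selection raises a KeyError.

```python
import pandas as pd

# Hypothetical flat frame shaped like g.dropna().reset_index().
flat = pd.DataFrame({
    "string": ["current", "stem"],
    "date": pd.to_datetime(["2001-01-31", "2001-01-31"]),
    "visits": [0.25, 0.75],
})

order = ["visits", "string", "date"]

# Two ways to put the columns in the requested order.
by_reindex = flat.reindex(columns=order)
by_selection = flat[order]

print(by_reindex)
print(by_reindex.equals(by_selection))
```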