I have a DataFrame that looks like this:
In [60]: df1
Out[60]:
DIFF UID
0 NaN 1
1 13.0 1
2 4.0 1
3 NaN 2
4 3.0 2
5 23.0 2
6 NaN 3
7 4.0 3
8 29.0 3
9 42.0 3
10 NaN 4
11 3.0 4
For each UID, I want to count the number of DIFF values that exceed a given threshold.
I tried something like this:
In [61]: threshold = 5
In [62]: df1[df1.DIFF > threshold].groupby('UID')['DIFF'].count().reset_index().rename(columns={'DIFF':'ATTR_NAME'})
Out[62]:
UID ATTR_NAME
0 1 1
1 2 1
2 3 2
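This works fine in that it returns the correct number of instances for each user. However, I would like to include the users with 0 instances, which are currently excluded by the df1[df1.DIFF > threshold] part.
The desired output is:
UID ATTR_NAME
0 1 1
1 2 1
2 3 2
3 4 0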
Any ideas?
Thanks
Answer 0 (score: 1)
Simple: use .reindex, passing fill_value=0 so that UIDs with no matches get 0 instead of NaN:
req = df1[df1.DIFF > threshold].groupby('UID')['DIFF'].count()
req = req.reindex(df1.UID.unique(), fill_value=0).reset_index().rename(columns={'DIFF':'ATTR_NAME'})
Or as a one-liner:
df1[df1.DIFF > threshold].groupby('UID')['DIFF'].count().reindex(df1.UID.unique(), fill_value=0).reset_index().rename(columns={'DIFF':'ATTR_NAME'})
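For reference, here is a self-contained sketch of the reindex approach that rebuilds the example frame from the question. The `fill_value=0` argument and the `rename_axis("UID")` call are additions made here so that unmatched UIDs show 0 and the index label is explicit; the names `counts` and `result` are only illustrative.

```python
import numpy as np
import pandas as pd

# Example frame from the question
df1 = pd.DataFrame({
    "DIFF": [np.nan, 13, 4, np.nan, 3, 23, np.nan, 4, 29, 42, np.nan, 3],
    "UID": [1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4],
})
threshold = 5

# Count the rows above the threshold; UIDs with no such rows drop out here
counts = df1[df1.DIFF > threshold].groupby("UID")["DIFF"].count()

# Bring back every UID, filling the missing ones with 0
result = (
    counts.reindex(df1.UID.unique(), fill_value=0)
    .rename_axis("UID")
    .reset_index(name="ATTR_NAME")
)
print(result)
```

With this frame, UID 4 should appear with an ATTR_NAME of 0 alongside the three UIDs that have matches.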
Answer 1 (score: 0)
Another approach is to use the apply() function:
In [101]: def count_instances(x, threshold):
   .....:     counter = 0
   .....:     for i in x:
   .....:         if i > threshold: counter += 1
   .....:     return counter
   .....:
In [102]: df1.groupby('UID')['DIFF'].apply(lambda x: count_instances(x, 5)).reset_index()
Out[102]:
UID DIFF
0 1 1
1 2 1
2 3 2
3 4 0
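It also appears to be marginally faster, although on a frame this small the difference (2.38 ms vs 2.39 ms) is negligible: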
In [103]: %timeit df1.groupby('UID')['DIFF'].apply(lambda x: count_instances(x, 5)).reset_index()
100 loops, best of 3: 2.38 ms per loop
In [104]: %timeit df1[df1.DIFF > 5].groupby('UID')['DIFF'].count().reset_index()
100 loops, best of 3: 2.39 ms per loop
Answer 3 (score: 0)
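Counting the values that match a condition, without filtering out the keys that have no matches, is equivalent to counting the number of True values per group. This can be done by taking the sum of the boolean values: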
(df1.DIFF > 5).groupby(df1.UID).sum().reset_index()
UID DIFF
0 1 1.0
1 2 1.0
2 3 2.0
3 4 0.0
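A self-contained sketch of the same idea follows, with two additions that are not in the answer above: the counts are cast to integers with `astype(int)` (the float output shown above may depend on the pandas version), and the result column is named ATTR_NAME as in the question. The names `exceeds` and `result` are illustrative.

```python
import numpy as np
import pandas as pd

# Example frame from the question
df1 = pd.DataFrame({
    "DIFF": [np.nan, 13, 4, np.nan, 3, 23, np.nan, 4, 29, 42, np.nan, 3],
    "UID": [1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4],
})
threshold = 5

# Boolean mask; comparisons against NaN evaluate to False
exceeds = df1["DIFF"] > threshold

# Summing booleans per group counts the True values, and groups
# with no True values are kept with a sum of 0
result = (
    exceeds.groupby(df1["UID"]).sum()
    .astype(int)
    .reset_index(name="ATTR_NAME")
)
print(result)
```

Because no rows are filtered out before grouping, every UID stays in the result without a separate reindex step.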