Question

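Maybe I'm thinking about this the wrong way, but I can't come up with a simple way to do it in pandas. I'm trying to get a DataFrame that is filtered by the relationship between the count of values above a set point and the count of values below it. It gets a little more complicated than that.

Example: suppose I have a dataset of people and their scores on several tests: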

Person | day | test score |
----------------------------
Bob      1     10
Bob      2     40
Bob      3     45
Mary     1     30
Mary     2     35
Mary     3     45

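I want to filter this DataFrame on the number of test scores >= 40, but as a fraction of each person's total number of scores. Say I set the threshold to 50%. Bob would then have 2/3 of his scores at or above 40 and be kept, while Mary would have 1/3 and be excluded.

My end goal is a groupby object that I can use to compute means and so on for the people who meet the threshold. In this case it would look like this: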

         test score
Person | above_count | total | score mean |
-------------------------------------------
Bob      2             3       31.67

I've tried the following, but I can't figure out what to do with my groupby objects.

df = pd.read_csv("all_data.csv")
gb  = df.groupby('Person')
df2 = df[df['test_score'] >= 40]
gb2 = df2.groupby('Person')

# This would get me the count for each person but how to compare it?
gb.size()

Answer 1

I think it makes sense to use groupby and aggregate to build each column as a pd.Series and then join them together.

import pandas as pd

df = pd.DataFrame([['Bob', 1, 10], ['Bob', 2, 40], ['Bob', 3, 45],
                   ['Mary', 1, 30], ['Mary', 2, 35], ['Mary', 3, 45]],
                  columns=['Person', 'day', 'test score'])
df_group = df.groupby('Person')
# Count of scores at or above 40 for each person
above_count = df_group['test score'].apply(lambda x: (x >= 40).sum())
above_count.name = 'test score above_count'
total_count = df_group['test score'].size()
total_count.name = 'total'
test_mean = df_group['test score'].mean()
test_mean.name = 'score mean'
# axis=1 places the three Series side by side as columns
results = pd.concat([above_count, total_count, test_mean], axis=1)

Answer 2

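You can use .agg() to compute sums and means on a groupby object, but the threshold function forces you to use a flexible apply.

Untested, but something like this should work: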

df.groupby('Person')['test score'].apply(
    lambda x: pd.Series({'above': (x >= 40).sum(), 'sum': x.sum(), 'mean': x.mean()})).unstack()

You can turn the lambda into a more elaborate regular function that implements all the criteria and calculations you want.
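As a rough sketch of that idea, the lambda can be replaced by a named function that returns one set of statistics per person. The function name `summarize`, its `threshold` parameter and the output column names are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'Person': ['Bob', 'Bob', 'Bob', 'Mary', 'Mary', 'Mary'],
    'day': [1, 2, 3, 1, 2, 3],
    'test score': [10, 40, 45, 30, 35, 45],
})

def summarize(scores, threshold=40):
    """Hypothetical helper: statistics for one person's scores."""
    return pd.Series({
        'above_count': (scores >= threshold).sum(),
        'total': scores.size,
        'score mean': scores.mean(),
    })

# apply() yields a Series indexed by (Person, statistic);
# unstack() pivots the statistic level into columns.
stats = df.groupby('Person')['test score'].apply(summarize).unstack()

# Keep only people with at least half of their scores at or above the threshold.
passed = stats[stats['above_count'] / stats['total'] >= 0.5]
print(passed)
```

Keeping the threshold as a function argument makes it easy to try other cutoffs without touching the groupby call.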

Answer 3

There is an easy way to do this...

Add two helper columns to make the calculation easier:

import pandas as pd
import numpy as np 

data = '''Bob      1     10
Bob      2     40
Bob      3     45
Mary     1     30
Mary     2     35
Mary     3     45'''

data = [d.split() for d in data.split('\n')]
data = pd.DataFrame(data, columns=['Name', 'day', 'score'])
data.score = data.score.astype(float)
data['pass']  = (data.score >=40)*1
data['total'] = 1

The data now looks like this...

   Name day  score  pass  total
0   Bob   1     10     0      1
1   Bob   2     40     1      1
2   Bob   3     45     1      1
3  Mary   1     30     0      1
4  Mary   2     35     0      1
5  Mary   3     45     1      1

Now summarize the data per person...

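# Sum only the numeric columns; 'day' still holds strings after the split above
summary = data.groupby('Name')[['score', 'pass', 'total']].sum().reset_index()
summary['mean score']  = summary['score']/summary['total']
summary['pass ratio'] = summary['pass']/summary['total']
print(summary)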

Now you can always filter out names based on the pass ratio...
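For instance, the 50% cutoff from the question could be applied with a boolean mask. This is a minimal sketch that rebuilds the summary so it runs on its own; the names `threshold` and `passed` are assumptions for illustration.

```python
import pandas as pd

data = pd.DataFrame({
    'Name': ['Bob', 'Bob', 'Bob', 'Mary', 'Mary', 'Mary'],
    'score': [10.0, 40.0, 45.0, 30.0, 35.0, 45.0],
})
data['pass'] = (data['score'] >= 40).astype(int)  # 1 if the score passes, else 0
data['total'] = 1                                 # each row counts as one test

summary = data.groupby('Name')[['score', 'pass', 'total']].sum().reset_index()
summary['pass ratio'] = summary['pass'] / summary['total']

threshold = 0.5  # assumed cutoff, taken from the 50% in the question
passed = summary[summary['pass ratio'] >= threshold]
print(passed['Name'].tolist())
```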

Answer 4

import pandas as pd

df = pd.DataFrame({'Person': ['Bob'] * 3 + ['Mary'] * 4, 
                   'day': [1, 2, 3, 1, 2, 3, 4], 
                   'test_score': [10, 40, 45, 30, 35, 45, 55]})

>>> df
  Person  day  test_score
0    Bob    1          10
1    Bob    2          40
2    Bob    3          45
3   Mary    1          30
4   Mary    2          35
5   Mary    3          45
6   Mary    4          55

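In a groupby operation, you can apply several different functions to the same column in one agg() call. Older pandas versions accepted a dictionary mapping output names to functions; that form was removed in pandas 1.0, so the output names are passed as keyword arguments here (named aggregation).

result = df.groupby('Person').test_score.agg(
             **{'test_score_above_mean': lambda s: s.ge(40).sum(),
                'total': 'count',
                'score mean': 'mean'})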
>>> result
        test_score_above_mean  total  score mean
Person                                          
Bob                         2      3   31.666667
Mary                        2      4   41.250000

>>> result[result.test_score_above_mean.gt(result.total * .5)]
        test_score_above_mean  total  score mean
Person                                          
Bob                         2      3   31.666667
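If the goal is to keep working with the original rows rather than the summary table, one option (a different technique from the aggregation above) is `GroupBy.filter`, which keeps or drops whole groups. The sketch below assumes the same strict "more than 50%" rule used in the comparison above; the variable name `kept` is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'Person': ['Bob'] * 3 + ['Mary'] * 4,
                   'day': [1, 2, 3, 1, 2, 3, 4],
                   'test_score': [10, 40, 45, 30, 35, 45, 55]})

# filter() keeps every row of a group when the function returns True for it.
# The mean of a boolean Series is the fraction of True values.
kept = df.groupby('Person').filter(lambda g: g['test_score'].ge(40).mean() > 0.5)

# The surviving rows can be grouped again for further statistics.
print(kept.groupby('Person')['test_score'].mean())
```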

How to compare group sizes in pandas

4 answers: