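Pandas groupby and making a judgement on a dataset

Time: 2019-07-12 13:46:30

Tags: python pandas

I have a dataframe in which rows are classified as "Pass" or "Fail". I am trying to make an overall judgement for each item based on its number of passes and fails.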

pandas version 0.23.4

Given the following df:

*Note: there are several other columns, but only these two matter for this purpose.

Name    Judgement
A        Pass
A        Fail
A        Fail
A        Pass
X        Pass
X        Pass
Z        Pass
Z        Pass
Z        Fail
F        Pass

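To make the overall judgement, we look at the number of times each item passed or failed. Items that occur more than twice are judged as an overall "Pass" only if (number of passes == number of fails). Items that occur once need no further judgement.

Example output: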

Name    Judgement
A        Pass
X        Pass
Z        Fail
F        Pass

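Notice that A passes because it has 2 passes and 2 fails, so 2/2 = 1 == Pass.

Z fails because it has 2 passes and 1 fail, so 2/1 = 2 == Fail.

My idea:

Group on df['Name'] together with Judgement, and simply count how often each judgement type occurs for each name. Is there a cleaner way to do this? The idea seems a bit cumbersome, but it is all I could come up with.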
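The counting idea described above can be sketched with `pd.crosstab`, which tabulates how often each Judgement value occurs per Name. This is only a minimal sketch built on the sample data from the question; the variable name `counts` is an illustrative choice, and it stops at the count table without applying the pass/fail rule.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "Name": list("AAAAXXZZZF"),
    "Judgement": ["Pass", "Fail", "Fail", "Pass", "Pass",
                  "Pass", "Pass", "Pass", "Fail", "Pass"],
})

# One row per Name, one column per Judgement value; cells hold occurrence counts
counts = pd.crosstab(df["Name"], df["Judgement"])
print(counts)
```

Combinations that never occur (for example X with "Fail") show up as 0 rather than as missing values, so the two columns can be compared directly.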

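5 answers:

Answer 0 (score: 2)

Is this what you need? A mean of 0.5 means passes and fails are equal, and a mean of 1 means every item passed; either of these two conditions yields Pass.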

s=df.Judgement.eq('Pass').groupby(df['Name']).agg(['mean','count'])
((s['mean'].eq(1)&s['count'].le(2))|s['mean'].eq(0.5)).map({True:'Pass',False:'Fail'})
Out[436]: 
Name
A    Pass
F    Pass
X    Pass
Z    Fail
dtype: object
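To see why comparing the mean against 0.5 and 1 works, it may help to look at the intermediate table on its own. The sketch below rebuilds the question's sample data and prints the per-name pass rate and row count; `is_pass` is an illustrative name that does not appear in the answer.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "Name": list("AAAAXXZZZF"),
    "Judgement": ["Pass", "Fail", "Fail", "Pass", "Pass",
                  "Pass", "Pass", "Pass", "Fail", "Pass"],
})

# True where the row is a pass; the mean of a boolean column is the pass rate
is_pass = df["Judgement"].eq("Pass")

# Per-name pass rate ("mean") and number of rows ("count")
s = is_pass.groupby(df["Name"]).agg(["mean", "count"])
print(s)
```

A has a pass rate of exactly 0.5, F and X have a pass rate of 1 with at most two rows, and Z has a pass rate of 2/3 over three rows, which is why only Z ends up as Fail.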

Answer 1 (score: 2)

Here is my approach:

# a list (rather than a set) of aggregations keeps the column order deterministic
new_df = df.Judgement.eq('Pass').groupby(df['Name']).agg(['size', 'mean', 'max'])

is_passed = ( # check those with more than two counts
             (new_df['mean'].eq(0.5) & new_df['size'].gt(2)) 

              # those with one or two counts pass if they have a pass
             | (new_df['size'].le(2) & new_df['max'])   
            )

This yields:

Name
A     True
F     True
X     True
Z    False
dtype: bool

Equivalently, we can do:

import numpy as np
is_passed = np.where(new_df['size'].le(2), new_df['max'] , new_df['mean'].eq(0.5))

and you can use np.where to turn the mask into pass/fail labels:

np.where(is_passed, 'pass', 'fail')
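Note that `np.where` returns a plain NumPy array, so the Name index is lost. One way to keep the labels attached to their names is to wrap the array in a Series that reuses the index of the aggregated frame. This is a sketch on the question's sample data; the names `stats` and `result` are illustrative, and the `astype(bool)` call is a defensive assumption in case the aggregated `max` column is not already boolean.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "Name": list("AAAAXXZZZF"),
    "Judgement": ["Pass", "Fail", "Fail", "Pass", "Pass",
                  "Pass", "Pass", "Pass", "Fail", "Pass"],
})

# Per-name row count, pass rate, and "has at least one pass" flag
stats = df["Judgement"].eq("Pass").groupby(df["Name"]).agg(["size", "mean", "max"])

# Groups of one or two rows pass if they contain a pass;
# larger groups pass only when passes and fails are equal
is_passed = np.where(stats["size"].le(2),
                     stats["max"].astype(bool),
                     stats["mean"].eq(0.5))

# Re-attach the Name index that np.where dropped
result = pd.Series(np.where(is_passed, "Pass", "Fail"),
                   index=stats.index, name="Judgement")
print(result)
```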

Answer 2 (score: 1)

With a custom apply function:

In [334]: def compare_pass_fail(x):
     ...:     v_counts = x['Judgement'].value_counts()
     ...:     return 'Pass' if ('Fail' not in v_counts or v_counts.get('Pass') == v_counts['Fail']) else 'Fail'
     ...: 
In [335]: df.groupby('Name').apply(compare_pass_fail)
Out[335]: 
Name
A    Pass
F    Pass
X    Pass
Z    Fail
dtype: object

Answer 3 (score: 1)

I used the pandas groupby apply function. The logic could be written differently, but it works for your case.

import pandas as pd
df = pd.DataFrame({"Name": ["A","A","A","A","X","X","Z","Z","Z","F"], "Judgement" : ["Pass","Fail","Fail","Pass","Pass","Pass","Pass","Pass","Fail","Pass"]})



  Name  Judgement
0   A   Pass
1   A   Fail
2   A   Fail
3   A   Pass
4   X   Pass
5   X   Pass
6   Z   Pass
7   Z   Pass
8   Z   Fail
9   F   Pass

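def func(x):
    # Count passes and fails within one Name group
    # (named n_pass/n_fail so they do not shadow the usual numpy alias np)
    n_pass = len(x[x["Judgement"] == "Pass"])
    n_fail = len(x[x["Judgement"] == "Fail"])
    if n_pass * n_fail == 0:
        # Only one kind of judgement is present, so return it as-is
        return x["Judgement"].unique()[0]
    # Mixed results pass only when the two counts are equal
    return "Pass" if n_pass == n_fail else "Fail"

df.groupby("Name").apply(func)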

Name
A    Pass
F    Pass
X    Pass
Z    Fail
dtype: object

Answer 4 (score: 0)

You can also first build a DataFrame of pass/fail counts and then process it:

df_count= df.groupby(['Name', 'Judgement']).apply(len).unstack(-1).fillna(0)

Then process its columns:

((df_count['Fail'] == df_count['Pass']) | ((df_count['Fail'] == 0) & (df_count['Pass'].le(2)))).map({True: 'Pass', False: 'Fail'})

The overall result is:

Name
A    Pass
F    Pass
X    Pass
Z    Fail
dtype: object

df_count can be used to inspect the result and looks like this:

Judgement  Fail  Pass
Name                 
A           2.0   2.0
F           0.0   1.0
X           0.0   2.0
Z           1.0   2.0
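The counts above appear as floats because the missing combinations are first created as NaN and only then filled with 0. If integer counts are preferred, one possible variant (a sketch, not part of the original answer) is to use `size()` and pass `fill_value=0` to `unstack`; the name `verdict` is illustrative.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "Name": list("AAAAXXZZZF"),
    "Judgement": ["Pass", "Fail", "Fail", "Pass", "Pass",
                  "Pass", "Pass", "Pass", "Fail", "Pass"],
})

# Count rows per (Name, Judgement); missing combinations are filled with 0 directly
df_count = df.groupby(["Name", "Judgement"]).size().unstack(fill_value=0)

# Same condition as in the answer: equal counts, or no fails and at most two passes
verdict = (
    (df_count["Fail"] == df_count["Pass"])
    | ((df_count["Fail"] == 0) & df_count["Pass"].le(2))
).map({True: "Pass", False: "Fail"})
print(df_count)
print(verdict)
```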