How do I use groupby on a single column and compare multiple columns in Pandas?

Asked: 2018-12-10 17:41:10

Tags: python pandas lambda apply pandas-groupby

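I have a DataFrame of users with two columns: whether each user signed up, and the model's prediction of whether they signed up. For each user I want to find: TP (they signed up and the model predicted they would), FP (they did not sign up but the model predicted they would), FN (they signed up but the model predicted they would not), and TN (they did not sign up and the model predicted they would not). Here 1 means they signed up and 0 means they did not. I'd like to group by user and then compare the other two columns. For example, I might have the following: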

Users    |    Signed_up    |     Prediction   |
User1         1                  0            
User2         0                  0
User1         1                  1
User3         1                  1
User2         0                  1
User2         0                  0
...

For TP, the resulting table might look something like:

Users    |    TP    |
User1         1
User2         0
User3         1

For TN, the resulting table might look something like:

Users    |    TN    |
User1         0
User2         1
User3         0

and so on for FP and FN.

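I assume I should group on the Users column and use a lambda function to compare the Signed_up and Prediction columns, but I'm not sure how to actually do this. Any help would be appreciated!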

3 answers:

Answer 0 (score: 4)

Do the comparisons first, then groupby + sum:

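# Flag each row as TP/TN/FN/FP, then sum the flags per user.
# Selecting several columns after groupby needs a list (double brackets).
(df.assign(TP = df.Signed_up & df.Prediction, 
           TN = (df.Signed_up == 0) & (df.Prediction == 0),
           FN = df.Signed_up & (df.Prediction == 0), 
           FP = (df.Signed_up == 0) & df.Prediction)
   .groupby('Users')[['TP', 'TN', 'FN', 'FP']].sum())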

       TP   TN   FN   FP
Users                   
User1   1  0.0  1.0  0.0
User2   0  2.0  0.0  1.0
User3   1  0.0  0.0  0.0
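
For reference, here is a minimal self-contained sketch of the same compare-then-sum idea. It rebuilds the six sample rows from the question and uses explicit boolean masks so that all four columns come out as integer counts. The names `signed`, `predicted`, `flags` and `confusion` are just illustrative, not part of the original answer.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "Users": ["User1", "User2", "User1", "User3", "User2", "User2"],
    "Signed_up": [1, 0, 1, 1, 0, 0],
    "Prediction": [0, 0, 1, 1, 1, 0],
})

# Boolean masks: True where the user signed up / the model predicted a sign-up.
signed = df["Signed_up"].eq(1)
predicted = df["Prediction"].eq(1)

# One boolean flag column per outcome, row by row.
flags = pd.DataFrame({
    "Users": df["Users"],
    "TP": signed & predicted,
    "TN": ~signed & ~predicted,
    "FN": signed & ~predicted,
    "FP": ~signed & predicted,
})

# Summing booleans counts the True values within each user group.
confusion = flags.groupby("Users")[["TP", "TN", "FN", "FP"]].sum()
print(confusion)
```

If a 0/1 indicator per user is wanted rather than a count (as in the tables in the question), something like `confusion.gt(0).astype(int)` should do it.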

Inspired by @BrianJoseph: with less typing, you can groupby all 3 columns, take the size, and unstack everything except Users:

df.groupby([*df]).size().unstack([1,2]).fillna(0)

Signed_up     1         0     
Prediction    0    1    0    1
Users                         
User1       1.0  1.0  0.0  0.0
User2       0.0  0.0  2.0  1.0
User3       0.0  1.0  0.0  0.0
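
The unstacked result has (Signed_up, Prediction) pairs as column labels. If TP/TN/FN/FP names are preferred, one possible sketch is to map each pair to a label afterwards. The `labels` dict and the final column order are my own choices here, and the sample data is rebuilt from the question.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "Users": ["User1", "User2", "User1", "User3", "User2", "User2"],
    "Signed_up": [1, 0, 1, 1, 0, 0],
    "Prediction": [0, 0, 1, 1, 1, 0],
})

# Count rows per (user, actual, predicted) combination, then move the
# actual/predicted levels into the columns; missing combinations become 0.
counts = (
    df.groupby(["Users", "Signed_up", "Prediction"])
      .size()
      .unstack([1, 2], fill_value=0)
)

# Map each (Signed_up, Prediction) pair to its confusion-matrix name.
labels = {(1, 1): "TP", (0, 0): "TN", (1, 0): "FN", (0, 1): "FP"}
counts.columns = [labels[col] for col in counts.columns]

# Fix the column order; a combination absent from the data is filled with 0.
counts = counts.reindex(columns=["TP", "TN", "FN", "FP"], fill_value=0)
print(counts)
```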

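Answer 1 (score: 3)

Remember that pandas can group by the result of a function. To separate the 4 kinds of outcome, you only need to know the relationship between Signed_up and Prediction. You can classify them like this: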

grps = df.groupby(lambda index: (df.loc[index, 'Signed_up'], df.loc[index, 'Prediction']))

This just gives you the groupby object, from which you can pull out each group by its key, for example:

tp_df = grps.get_group((1,1))
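
As a rough sketch of the same idea, the example below groups directly on the two columns instead of using a lambda (a swapped-in but, I believe, equivalent grouping), pulls out the true-positive rows, and counts them per user. The sample data is rebuilt from the question, and `tp_per_user` is an illustrative name.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "Users": ["User1", "User2", "User1", "User3", "User2", "User2"],
    "Signed_up": [1, 0, 1, 1, 0, 0],
    "Prediction": [0, 0, 1, 1, 1, 0],
})

# Group rows by their (Signed_up, Prediction) pair.
grps = df.groupby(["Signed_up", "Prediction"])

# (1, 1) = signed up and predicted to sign up, i.e. the true positives.
tp_df = grps.get_group((1, 1))

# Number of true-positive rows per user (users with none are simply absent).
tp_per_user = tp_df["Users"].value_counts()
print(tp_per_user)
```

The other groups follow the same pattern: `(0, 0)` for TN, `(1, 0)` for FN and `(0, 1)` for FP.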

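Answer 2 (score: 2)

If you want to create a separate df for each model-prediction outcome in your post, you can do it with boolean masks and the & bitwise operator. & means both conditions must be satisfied for a row to be returned, so: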

import pandas as pd

df = pd.read_csv('./Desktop/models.csv')

TP = df.loc[(df['Signed_up'] == 1) & (df['Prediction'] == 1)]

TN = df.loc[(df['Signed_up'] == 0) & (df['Prediction'] == 0)]

FN = df.loc[(df['Signed_up'] == 1) & (df['Prediction'] == 0)]

FP = df.loc[(df['Signed_up'] == 0) & (df['Prediction'] == 1)]

Output:

>>> TP
   Users  Signed_up  Prediction
2  User1          1           1
3  User3          1           1
>>> TN = df.loc[(df['Signed_up'] == 0) & (df['Prediction'] == 0)]
>>> TN
   Users  Signed_up  Prediction
1  User2          0           0
5  User2          0           0
>>> FN = df.loc[(df['Signed_up'] == 1) & (df['Prediction'] == 0)]
>>> FN
   Users  Signed_up  Prediction
0  User1          1           0
>>> FP = df.loc[(df['Signed_up'] == 0) & (df['Prediction'] == 1)]
>>> FP
   Users  Signed_up  Prediction
4  User2          0           1
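
The filtered frames above still hold one row per event rather than one row per user. A possible way to get per-user counts from them, sketched here for TP only, is to group the filtered frame by Users and reindex against the full user list so that users with no matching rows show 0. The data is rebuilt inline from the question instead of being read from the CSV, and `all_users` and `tp_per_user` are illustrative names.

```python
import pandas as pd

# Sample rows copied from the question (stands in for the CSV file).
df = pd.DataFrame({
    "Users": ["User1", "User2", "User1", "User3", "User2", "User2"],
    "Signed_up": [1, 0, 1, 1, 0, 0],
    "Prediction": [0, 0, 1, 1, 1, 0],
})

# Rows where the user signed up and the model predicted a sign-up.
TP = df.loc[(df["Signed_up"] == 1) & (df["Prediction"] == 1)]

# Every user in the original data, so users without TP rows are kept.
all_users = sorted(df["Users"].unique())

# Count TP rows per user; users missing from TP get a count of 0.
tp_per_user = TP.groupby("Users").size().reindex(all_users, fill_value=0)
print(tp_per_user)
```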