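Pandas DataFrame with certain conditions

Time: 2019-07-27 21:32:25

Tags: python pandas

I am trying to combine features in a DataFrame to derive new columns in a DataFrame.

I have this DataFrame: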

Id   Author   News_post  Label
1    Jessica  xxxxxxxxx  1
2    Adams    xxxxxxxxx  1
3    Adams    xxxxxxxxx  1
4    Mike     xxxxxxxxx  0
5    James    xxxxxxxxx  1
6    Mike     xxxxxxxxx  1
7    Mike     xxxxxxxxx  0
8    Paul     xxxxxxxxx  0
9    Jessica  xxxxxxxxx  0
10   Adams    xxxxxxxxx  0

Note: in the Label column, 1 = TRUE and 0 = FALSE.

The output I want to derive looks like this:

Id   Author   Num_Post  Num_True_Label  Num_False_Label   Mean
1    Adams    3         2               1                 x
2    James    1         1               0                 x
3    Jessica  2         1               1                 x
4    Mike     2         0               1                 x
5    Paul     1         0               0                 x

5 answers:

Answer 0 (score: 1)

This covers much of what you are trying to get from your question:

import pandas as pd

df = pd.read_clipboard()  # just copied your dataframe
df = df.groupby('Author').describe()  # summary statistics per author

Output:

           Id                                               Label                                               
        count      mean       std  min  25%  50%  75%   max count      mean       std  min   25%  50%   75%  max
Author                                                                                                          
Adams     3.0  5.000000  4.358899  2.0  2.5  3.0  6.5  10.0   3.0  0.666667  0.577350  0.0  0.50  1.0  1.00  1.0
James     1.0  5.000000       NaN  5.0  5.0  5.0  5.0   5.0   1.0  1.000000       NaN  1.0  1.00  1.0  1.00  1.0
Jessica   2.0  5.000000  5.656854  1.0  3.0  5.0  7.0   9.0   2.0  0.500000  0.707107  0.0  0.25  0.5  0.75  1.0
Mike      3.0  5.666667  1.527525  4.0  5.0  6.0  6.5   7.0   3.0  0.333333  0.577350  0.0  0.00  0.0  0.50  1.0
Paul      1.0  8.000000       NaN  8.0  8.0  8.0  8.0   8.0   1.0  0.000000       NaN  0.0  0.00  0.0  0.00  0.0
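
If only a few of these statistics are needed, the describe() output can be narrowed down to them. A minimal sketch, assuming the sample data from the question is rebuilt by hand; the DataFrame construction and the name `summary` are illustrative and not part of the original answer:

```python
import pandas as pd

# Sample data reconstructed from the question.
df = pd.DataFrame({
    'Id': range(1, 11),
    'Author': ['Jessica', 'Adams', 'Adams', 'Mike', 'James',
               'Mike', 'Mike', 'Paul', 'Jessica', 'Adams'],
    'News_post': ['xxxxxxxxx'] * 10,
    'Label': [1, 1, 1, 0, 1, 1, 0, 0, 0, 0],
})

# Describe only the Label column and keep the two statistics of interest.
stats = df.groupby('Author')['Label'].describe()[['count', 'mean']]

# Rename them to match the column names used in the question.
summary = stats.rename(columns={'count': 'Num_Post', 'mean': 'Mean'})

# describe() reports counts as floats; cast them back to integers.
summary['Num_Post'] = summary['Num_Post'].astype(int)
print(summary)
```

Because Label only holds 0 and 1, the per-author mean here is the share of posts labelled true.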

Answer 1 (score: 1)

The following will give you what you need:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'Author': ['Jessica', 'Adams', 'Adams', 'Mike', 'James', 'Mike', 'Mike', 'Paul', 'Jessica', 'Adams'],
   ...:                    'News_post': ['xxxxxxxxx', 'xxxxxxxxx', 'xxxxxxxxx', 'xxxxxxxxx', 'xxxxxxxxx', 'xxxxxxxxx', 'xxxxxxxxx', 'xxxxxxxxx', 'xxxxxxxxx', 'xxxxxxxxx'],
   ...:                    'Label': [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]})

In [3]: # Select only Label before summing so the text column is not summed as well
   ...: num_true_label_df = df.groupby(by=['Author'])[['Label']].sum().rename(columns={'Label': 'Num_True_Label'}).reset_index()

In [4]: num_post_df = df.groupby(by=['Author']).count().rename(columns={'News_post': 'Num_Post'})[['Num_Post']].reset_index()

In [5]: df = pd.merge(num_post_df, num_true_label_df, how='left', on='Author').reset_index().rename(columns={'index': 'Id'})

In [6]: df['Id'] = df['Id'] + 1

In [7]: df['Num_False_Label'] = df['Num_Post'] - df['Num_True_Label']

In [8]: df
Out[8]: 
   Id   Author  Num_Post  Num_True_Label  Num_False_Label
0   1    Adams         3               2                1
1   2    James         1               1                0
2   3  Jessica         2               1                1
3   4     Mike         3               1                2
4   5     Paul         1               0                1


Please specify what your Mean column should represent.

Some resources that may be useful: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html

Answer 2 (score: 1)

Use pandas 0.25 or later with aggregation relabeling (named aggregation):

df.groupby('Author')['Label'].agg(Num_Post = 'size',
                                  Num_True = 'sum',
                                  Num_False = lambda x: x.eq(0).sum(),
                                  Mean = 'mean')

Output:

         Num_Post  Num_True  Num_False      Mean
Author                                          
Adams           3         2          1  0.666667
James           1         1          0  1.000000
Jessica         2         1          1  0.500000
Mike            3         1          2  0.333333
Paul            1         0          1  0.000000
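
To get closer to the layout asked for in the question (Author as a regular column plus a running Id), the aggregated result can be reset and numbered. A sketch under the assumption that the sample data is rebuilt by hand; the name `result` and the column names Num_True_Label and Num_False_Label follow the question rather than this answer:

```python
import pandas as pd

# Sample data reconstructed from the question.
df = pd.DataFrame({
    'Id': range(1, 11),
    'Author': ['Jessica', 'Adams', 'Adams', 'Mike', 'James',
               'Mike', 'Mike', 'Paul', 'Jessica', 'Adams'],
    'News_post': ['xxxxxxxxx'] * 10,
    'Label': [1, 1, 1, 0, 1, 1, 0, 0, 0, 0],
})

# Named aggregation (pandas >= 0.25): keyword = output column, value = function.
result = (
    df.groupby('Author')['Label']
      .agg(Num_Post='size',
           Num_True_Label='sum',
           Num_False_Label=lambda x: x.eq(0).sum(),
           Mean='mean')
      .reset_index()  # turn the Author index back into a column
)

# Add a 1-based Id column at the front, as in the desired table.
result.insert(0, 'Id', list(range(1, len(result) + 1)))
print(result)
```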

Answer 3 (score: 0)

Use transform, then drop the duplicates:

df['Num_Post']= df.groupby(['Author'])['Label'].transform('count')
df['Num_True_Label']= df.groupby(['Author'])['Label'].transform('sum')
df['Num_False_Label']= df['Num_Post']-df['Num_True_Label']
df['Mean']= df['Num_Post']/df['Num_True_Label']

Finally, drop the duplicates and remove News_post:

df.drop(columns=['News_post'], inplace=True)
df.drop_duplicates(subset='Author', keep='first').sort_values(by=['Author'])

Result:

   Id   Author  Label  Num_Post  Num_True_Label  Num_False_Label      Mean
1   2    Adams      1         3               2                1  1.500000
4   5    James      1         1               1                0  1.000000
0   1  Jessica      1         2               1                1  2.000000
3   4     Mike      0         3               1                2  3.000000
7   8     Paul      0         1               0                1       inf

Note: adjust the Mean calculation to match your own definition.
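
One possible definition, assuming Mean is meant to be the share of an author's posts labelled true, divides the true count by the post count, which also avoids the inf for authors with no true labels. This is only a sketch: the sample data is rebuilt by hand, the name `out` is illustrative, and the leftover per-row Label and Id columns are dropped here as well since they only reflect each author's first row:

```python
import pandas as pd

# Sample data reconstructed from the question.
df = pd.DataFrame({
    'Id': range(1, 11),
    'Author': ['Jessica', 'Adams', 'Adams', 'Mike', 'James',
               'Mike', 'Mike', 'Paul', 'Jessica', 'Adams'],
    'News_post': ['xxxxxxxxx'] * 10,
    'Label': [1, 1, 1, 0, 1, 1, 0, 0, 0, 0],
})

# transform() broadcasts each group's statistic back onto every row.
df['Num_Post'] = df.groupby('Author')['Label'].transform('count')
df['Num_True_Label'] = df.groupby('Author')['Label'].transform('sum')
df['Num_False_Label'] = df['Num_Post'] - df['Num_True_Label']

# Assumed definition: fraction of true labels (Num_Post is never zero).
df['Mean'] = df['Num_True_Label'] / df['Num_Post']

# Keep one row per author and discard the per-row columns.
out = (
    df.drop(columns=['Id', 'News_post', 'Label'])
      .drop_duplicates(subset='Author', keep='first')
      .sort_values(by='Author')
      .reset_index(drop=True)
)
print(out)
```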

Answer 4 (score: 0)

You can try:

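# Passing a dict of {new_name: func} to SeriesGroupBy.agg was removed in
# pandas 1.0; keyword (named) aggregation gives the same result.
agg_df = df.groupby('Author')['Label'].agg(
    Num_post='count',
    Num_True_Label=lambda x: x.eq(1).sum(),
    Num_False_Label=lambda x: x.eq(0).sum(),
    Mean='mean',
).reset_index()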