I am trying to combine features in a dataframe to derive new columns
I have this dataframe:
Id Author News_post Label
1 Jessica xxxxxxxxx 1
2 Adams xxxxxxxxx 1
3 Adams xxxxxxxxx 1
4 Mike xxxxxxxxx 0
5 James xxxxxxxxx 1
6 Mike xxxxxxxxx 1
7 Mike xxxxxxxxx 0
8 Paul xxxxxxxxx 0
9 Jessica xxxxxxxxx 0
10 Adams xxxxxxxxx 0
Note: in the Label column, 1 = TRUE and 0 = FALSE.
I want to derive a dataframe like this:
Id Author Num_Post Num_True_Label Num_False_Label Mean
1 Adams 3 2 1 x
2 James 1 1 0 x
3 Jessica 2 1 1 x
4 Mike 2 0 1 x
5 Paul 1 0 0 x
Answer 0 (score: 1)
This covers many of the things you are trying to get from your question:
df = pd.read_clipboard() # just copied your dataframe
df = df.groupby('Author').describe()
Output:
Id Label
count mean std min 25% 50% 75% max count mean std min 25% 50% 75% max
Author
Adams 3.0 5.000000 4.358899 2.0 2.5 3.0 6.5 10.0 3.0 0.666667 0.577350 0.0 0.50 1.0 1.00 1.0
James 1.0 5.000000 NaN 5.0 5.0 5.0 5.0 5.0 1.0 1.000000 NaN 1.0 1.00 1.0 1.00 1.0
Jessica 2.0 5.000000 5.656854 1.0 3.0 5.0 7.0 9.0 2.0 0.500000 0.707107 0.0 0.25 0.5 0.75 1.0
Mike 3.0 5.666667 1.527525 4.0 5.0 6.0 6.5 7.0 3.0 0.333333 0.577350 0.0 0.00 0.0 0.50 1.0
Paul 1.0 8.000000 NaN 8.0 8.0 8.0 8.0 8.0 1.0 0.000000 NaN 0.0 0.00 0.0 0.00 0.0
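If only the post count and the mean of Label are needed, the relevant columns can be picked out of the describe() result. A minimal sketch, assuming the dataframe is rebuilt by hand instead of being read from the clipboard; the name `summary` is only illustrative:

```python
import pandas as pd

# Rebuild the question's dataframe (News_post is omitted because it is not used here).
df = pd.DataFrame({
    'Id': range(1, 11),
    'Author': ['Jessica', 'Adams', 'Adams', 'Mike', 'James',
               'Mike', 'Mike', 'Paul', 'Jessica', 'Adams'],
    'Label': [1, 1, 1, 0, 1, 1, 0, 0, 0, 0],
})

# describe() on a single grouped column returns one row per author with
# count, mean, std, min, quartiles and max as columns; keep only two of them.
summary = df.groupby('Author')['Label'].describe()[['count', 'mean']]
print(summary)
```

Here count is the number of posts per author and mean is the fraction of posts labelled 1.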
Answer 1 (score: 1)
The following will give you what you need:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'Author': ['Jessica', 'Adams', 'Adams', 'Mike', 'James', 'Mike', 'Mike', 'Paul', 'Jessica', 'Adams'],
   ...:                    'News_post': ['xxxxxxxxx', 'xxxxxxxxx', 'xxxxxxxxx', 'xxxxxxxxx', 'xxxxxxxxx', 'xxxxxxxxx', 'xxxxxxxxx', 'xxxxxxxxx', 'xxxxxxxxx', 'xxxxxxxxx'],
   ...:                    'Label': [1,1,1,0,1,1,0,0,0,0]})
In [3]: num_true_label_df = df.groupby(by=['Author'])[['Label']].sum().rename(columns={'Label': 'Num_True_Label'}).reset_index()  # select Label so the string column News_post is not summed
In [4]: num_post_df = df.groupby(by=['Author']).count().rename(columns={'News_post': 'Num_Post'})[['Num_Post']].reset_index()
In [5]: df = pd.merge(num_post_df, num_true_label_df, how='left', on='Author').reset_index().rename(columns={'index': 'Id'})
In [6]: df['Id'] = df['Id'] + 1
In [7]: df['Num_False_Label'] = df['Num_Post'] - df['Num_True_Label']
In [8]: df
Out[8]:
Id Author Num_Post Num_True_Label Num_False_Label
0 1 Adams 3 2 1
1 2 James 1 1 0
2 3 Jessica 2 1 1
3 4 Mike 3 1 2
4 5 Paul 1 0 1
Please specify further what your Mean column should represent.
Some resources that may be useful: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
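One possible reading of Mean, shown here only as an assumption since the question does not define it, is the fraction of an author's posts that are labelled true. A minimal sketch under that assumption; the name `out` is illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'Author': ['Jessica', 'Adams', 'Adams', 'Mike', 'James',
               'Mike', 'Mike', 'Paul', 'Jessica', 'Adams'],
    'News_post': ['xxxxxxxxx'] * 10,
    'Label': [1, 1, 1, 0, 1, 1, 0, 0, 0, 0],
})

# Count posts and sum the 0/1 labels per author in a single pass.
out = (
    df.groupby('Author')['Label']
      .agg(['size', 'sum'])
      .rename(columns={'size': 'Num_Post', 'sum': 'Num_True_Label'})
      .reset_index()
)
out['Num_False_Label'] = out['Num_Post'] - out['Num_True_Label']

# Assumption: Mean = share of posts labelled true (not confirmed by the question).
out['Mean'] = out['Num_True_Label'] / out['Num_Post']
print(out)
```

Because Num_Post is never zero for an author that appears in the data, this definition cannot divide by zero.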
Answer 2 (score: 1)
Use pandas 0.25+ with aggregation relabeling (named aggregation):
df.groupby('Author')['Label'].agg(Num_Post = 'size',
Num_True = 'sum',
Num_False = lambda x: x.eq(0).sum(),
Mean = 'mean')
Output:
Num_Post Num_True Num_False Mean
Author
Adams 3 2 1 0.666667
James 1 1 0 1.000000
Jessica 2 1 1 0.500000
Mike 3 1 2 0.333333
Paul 1 0 1 0.000000
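For a copy-and-paste version, the same aggregation can be sketched end to end with the column names from the question. This assumes pandas 0.25 or later and that Mean is the mean of Label; the name `result` and the 1-based Id column are illustrative additions:

```python
import pandas as pd

df = pd.DataFrame({
    'Author': ['Jessica', 'Adams', 'Adams', 'Mike', 'James',
               'Mike', 'Mike', 'Paul', 'Jessica', 'Adams'],
    'News_post': ['xxxxxxxxx'] * 10,
    'Label': [1, 1, 1, 0, 1, 1, 0, 0, 0, 0],
})

# Named aggregation: each keyword becomes an output column.
result = (
    df.groupby('Author')['Label']
      .agg(Num_Post='size',
           Num_True_Label='sum',
           Num_False_Label=lambda s: s.eq(0).sum(),
           Mean='mean')
      .reset_index()
)

# Add a 1-based Id column at the front, as in the question's desired layout.
result.insert(0, 'Id', list(range(1, len(result) + 1)))
print(result)
```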
Answer 3 (score: 0)
Use transform, then drop the duplicates, like so:
df['Num_Post']= df.groupby(['Author'])['Label'].transform('count')
df['Num_True_Label']= df.groupby(['Author'])['Label'].transform('sum')
df['Num_False_Label']= df['Num_Post']-df['Num_True_Label']
df['Mean']= df['Num_Post']/df['Num_True_Label']
Finally: drop the duplicates and remove News_post
df.drop(columns=['News_post'], inplace=True)
df.drop_duplicates(subset='Author', keep='first').sort_values(by=['Author'])
Result:
Id Author Label Num_Post Num_True_Label Num_False_Label Mean
1 2 Adams 1 3 2 1 1.500000
4 5 James 1 1 1 0 1.000000
0 1 Jessica 1 2 1 1 2.000000
3 4 Mike 0 3 1 2 3.000000
7 8 Paul 0 1 0 1 inf
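Note: Mean is computed here as Num_Post / Num_True_Label, which gives inf for an author with no true labels (Paul); change the definition as needed.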
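The reason duplicates have to be dropped is that transform returns one value per original row, not one per group. A minimal sketch of that behaviour, assuming Mean is taken as the per-author mean of Label (a different definition from the ratio used in this answer); the names `counts` and `summary` are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'Id': range(1, 11),
    'Author': ['Jessica', 'Adams', 'Adams', 'Mike', 'James',
               'Mike', 'Mike', 'Paul', 'Jessica', 'Adams'],
    'News_post': ['xxxxxxxxx'] * 10,
    'Label': [1, 1, 1, 0, 1, 1, 0, 0, 0, 0],
})

label_by_author = df.groupby('Author')['Label']

# transform broadcasts each group's result back to every row of that group,
# so the returned Series has the same length and index as df.
counts = label_by_author.transform('count')
df['Num_Post'] = counts
df['Num_True_Label'] = label_by_author.transform('sum')
df['Num_False_Label'] = df['Num_Post'] - df['Num_True_Label']
# Assumption: Mean = mean of the 0/1 labels, which avoids dividing by zero.
df['Mean'] = label_by_author.transform('mean')

# Every row of an author now carries identical summary values, so keep one row each.
summary = (
    df.drop(columns=['Id', 'News_post', 'Label'])
      .drop_duplicates(subset='Author')
      .sort_values('Author')
      .reset_index(drop=True)
)
print(summary)
```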
Answer 4 (score: 0)
You can try:
# Dict-based renaming in agg was removed in pandas 1.0; named aggregation is the supported form.
agg_df = df.groupby('Author')['Label'].agg(
    Num_post='count',
    Num_True_Label=lambda x: x.eq(1).sum(),
    Num_False_Label=lambda x: x.eq(0).sum(),
    Mean='mean',
).reset_index()