I have a DataFrame:
movie | cast
------------------------------
movie1 | cast1,cast2,cast3
movie2 | cast4,cast1,cast6,cast7
movie3 | cast4,cast3,cast5
import pandas as pd
df = pd.DataFrame({'movie': ['movie1','movie2','movie3'], 'cast': ['cast1,cast2,cast3','cast4,cast1,cast6,cast7','cast4,cast3,cast5']})
The result I want is:
cast | count
------------------------------
cast1 | 5
cast2 | 2
cast3 | 4
cast4 | 5
cast5 | 2
cast6 | 3
cast7 | 3
To get there, I wrote:
df_cast = df.join(df.cast
.str.strip(',')
.str.split(',',expand=True)
.stack()
.reset_index(level=1,drop=True)
.rename('cast_member')).reset_index(drop=True)
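This adds a new column, cast_member, in which each cell holds a single cast member's name. I tried using groupby('cast_member'), but I'm not sure how to proceed from there.
I'm new to pandas, so I'd really appreciate an answer even if it turns out to be simple.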
Answer 0 (score: 3)
First use GroupBy.transform to add a new column holding the number of cast members in each movie:
df_cast['cast_count'] = df_cast.groupby('movie')['movie'].transform('size')
print (df_cast)
    movie                     cast cast_member  cast_count
0  movie1        cast1,cast2,cast3       cast1           3
1  movie1        cast1,cast2,cast3       cast2           3
2  movie1        cast1,cast2,cast3       cast3           3
3  movie2  cast4,cast1,cast6,cast7       cast4           4
4  movie2  cast4,cast1,cast6,cast7       cast1           4
5  movie2  cast4,cast1,cast6,cast7       cast6           4
6  movie2  cast4,cast1,cast6,cast7       cast7           4
7  movie3        cast4,cast3,cast5       cast4           3
8  movie3        cast4,cast3,cast5       cast3           3
9  movie3        cast4,cast3,cast5       cast5           3
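To make the difference between a plain group size and transform('size') concrete, here is a minimal sketch. The toy frame and its values (m1, m2, a to d) are hypothetical and only serve as illustration; they are not the data from the question.

```python
import pandas as pd

# Hypothetical toy data: one row per (movie, cast_member) pair.
toy = pd.DataFrame({
    'movie': ['m1', 'm1', 'm1', 'm2', 'm2'],
    'cast_member': ['a', 'b', 'c', 'a', 'd'],
})

# size() collapses each group into a single number (one entry per movie).
per_movie = toy.groupby('movie').size()

# transform('size') keeps the original shape: every row receives
# the size of the group it belongs to, aligned to toy's index.
toy['cast_count'] = toy.groupby('movie')['movie'].transform('size')

print(per_movie)
print(toy)
```

Because the transformed result has the same index as the original frame, it can be assigned directly as a new column, which is what the answer relies on.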
Then group by cast_member and aggregate cast_count with both size and sum; subtracting size from sum gives the final count:
df = df_cast.groupby('cast_member')['cast_count'].agg(['size','sum'])
df1 = df['sum'].sub(df['size']).rename('count').reset_index()
print (df1)
  cast_member  count
0       cast1      5
1       cast2      2
2       cast3      4
3       cast4      5
4       cast5      2
5       cast6      3
6       cast7      3
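For reference, the same numbers can also be reached in one pass. The sketch below is an alternative, not the approach from the answer: it assumes pandas 0.25 or later (for DataFrame.explode), and the names long_df, others and result are made up for illustration. It subtracts 1 per row up front (the member themself), which is equivalent to the sum minus size step. For example, cast1 appears in movie1 (3 members) and movie2 (4 members), so its count is (3 - 1) + (4 - 1) = 5.

```python
import pandas as pd

df = pd.DataFrame({
    'movie': ['movie1', 'movie2', 'movie3'],
    'cast': ['cast1,cast2,cast3', 'cast4,cast1,cast6,cast7', 'cast4,cast3,cast5'],
})

# One row per (movie, cast_member) pair; explode requires pandas >= 0.25.
long_df = (df.assign(cast_member=df['cast'].str.split(','))
             .explode('cast_member')
             .reset_index(drop=True))

# Number of *other* cast members in the same movie:
# group size minus the member themself.
long_df['others'] = long_df.groupby('movie')['cast_member'].transform('size') - 1

# Sum those per cast member across all movies they appear in.
result = (long_df.groupby('cast_member')['others']
                 .sum()
                 .rename('count')
                 .reset_index())
print(result)
```

Like the answer's version, this counts co-appearances: two people who share several movies would be counted once per shared movie, not once overall.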