Getting the mean of a column in a groupby clause in Python pandas

Time: 2018-06-07 18:13:00

Tags: python pandas pandas-groupby

I have a dataset of actors and directors, along with the popularity of the movies they worked on together.

print (actors_director_df.head(3))

                 actor         director  popularity counter
0          Chris Pratt  Colin Trevorrow   32.985763       0
1  Bryce Dallas Howard  Colin Trevorrow   32.985763       0
2          Irrfan Khan  Colin Trevorrow   32.985763       0

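I want to group by actor and director, since a pair may have worked together on more than one movie. I managed to do that with the following query.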

actor_director_grouped = actors_director_df.groupby(['actor','director']) \
                         .size() \
                         .reset_index(name='count') \
                         .sort_values(['count'], ascending=False) \
                         .head(10)

print (actor_director_grouped)

                      actor            director  count
3619         Clint Eastwood      Clint Eastwood     14
19272           Woody Allen         Woody Allen     12
9606            Johnny Depp          Tim Burton      8

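But the popularity column is missing from this DataFrame.

What I want is, after the groupby, a column holding the average popularity, displayed next to the actor and director together with the number of movies they made together.

I.e., my ideal output would look like this.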

                      actor            director  popularity count
3619         Clint Eastwood      Clint Eastwood   32.985763    14
19272           Woody Allen         Woody Allen   5.1231231    12
9606            Johnny Depp          Tim Burton   3.1231231    8

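2 answers:

Answer 0: (score: 3)

Looking at your dataframe, the counter column seems unnecessary. Instead, we use the popularity column to build a mean column and a count column: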

import pandas as pd
import numpy as np

np.random.seed(444)

names = [
    'Robert Baratheon',
    'Jon Snow',
    'Daenerys Targaryen',
    'Theon Greyjoy',
    'Tyrion Lannister'
]

df = pd.DataFrame({
    'actor': np.random.choice(names, size=10, p = [0.2,0.2,0.2,0.1,0.3]),
    'director': np.random.choice(names, size=10, p = [0.4,0.1,0.1,0.1,0.3]),
    'popularity': np.random.randint(0,100, size=10),
    'counter': 0
})

df2 = df.groupby(['actor','director'])['popularity']\
        .agg(['count', 'mean'])\
        .reset_index()\
        .sort_values(by='mean', ascending=False)

print(df2)

Returns:

              actor          director  count  mean
0          Jon Snow  Robert Baratheon      2  53.5
5  Tyrion Lannister  Tyrion Lannister      2  49.0
2  Robert Baratheon  Tyrion Lannister      2  48.5
1  Robert Baratheon          Jon Snow      2  40.5
4     Theon Greyjoy  Tyrion Lannister      1  13.0
3     Theon Greyjoy  Robert Baratheon      1   7.0
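
To get column names that match the output the question asks for (popularity and count), a minimal sketch using pandas named aggregation (available from pandas 0.25) could look like the following. The sample rows and the variable name summary are made up for illustration; only the column names come from the question. Note that "count" counts non-null popularity values, which may differ from size() if popularity contains NaN.

```python
import pandas as pd

# Hypothetical sample rows; only the column names come from the question.
actors_director_df = pd.DataFrame({
    "actor": ["A", "A", "B", "B", "B"],
    "director": ["X", "X", "Y", "Y", "Y"],
    "popularity": [10.0, 20.0, 3.0, 6.0, 9.0],
})

summary = (
    actors_director_df
    .groupby(["actor", "director"])
    # Named aggregation: output_column=(input_column, aggregation)
    .agg(popularity=("popularity", "mean"), count=("popularity", "count"))
    .reset_index()
    .sort_values("count", ascending=False)
    .head(10)
)

print(summary)
```

Sorting by count rather than by mean keeps the ordering used in the question's original query.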

Answer 1: (score: 2)

I took the liberty of adding some dummy data that helps to understand the groupby clause better.

print(df)

Output:

                   actor           director  popularity  counter
0           Chris Pratt    Colin Trevorrow   32.985763        0
1   Bryce Dallas Howard    Colin Trevorrow   32.985763        0
2           Irrfan Khan    Colin Trevorrow   32.985763        0
3           Irrfan Khan    Colin Trevorrow   60.000000       12
4           Irrfan Khan       John Markson   10.000000       10
5           Irrfan Khan       Mark Johnson  100.000000        4

Then you need to groupby on actor and director, and then find the mean of popularity and the sum of count.
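
The grouping step in this answer can be sketched roughly as below. This is an illustration under assumptions, not the answer's original code: it rebuilds the dummy data shown above, and it assumes the column to be summed is the one named counter in that table. The variable name result is arbitrary.

```python
import pandas as pd

# Dummy data copied from the table printed in this answer.
df = pd.DataFrame({
    "actor": ["Chris Pratt", "Bryce Dallas Howard", "Irrfan Khan",
              "Irrfan Khan", "Irrfan Khan", "Irrfan Khan"],
    "director": ["Colin Trevorrow", "Colin Trevorrow", "Colin Trevorrow",
                 "Colin Trevorrow", "John Markson", "Mark Johnson"],
    "popularity": [32.985763, 32.985763, 32.985763, 60.0, 10.0, 100.0],
    "counter": [0, 0, 0, 12, 10, 4],
})

# One row per (actor, director) pair: mean of popularity, sum of counter.
# Assumption: "counter" is the column the answer means to sum.
result = (
    df.groupby(["actor", "director"])
      .agg({"popularity": "mean", "counter": "sum"})
      .reset_index()
)

print(result)
```

With this data, Irrfan Khan and Colin Trevorrow appear in two rows, so their pair collapses into a single row with the averaged popularity and the summed counter.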