Question

I'm having trouble with a DataFrame of IMDB movie data.

After cleaning, the DataFrame looks like this.

   popularity     budget     revenue      original_title  \
0   32.985763  150000000  1513528810      Jurassic World   
1   28.419936  150000000   378436354  Mad Max: Fury Road   
2   13.112507  110000000   295238201           Insurgent   

                                            cast            director  \
0  [Chris Pratt, Bryce Dallas Howard, Irrfan Khan...   [Colin Trevorrow]   
1  [Tom Hardy, Charlize Theron, Hugh Keays-Byrne,...     [George Miller]   
2  [Shailene Woodley, Theo James, Kate Winslet, A...  [Robert Schwentke]   

                                        overview  runtime  \
0  Twenty-two years after the events of Jurassic ...      124   
1  An apocalyptic story set in the furthest reach...      120   
2  Beatrice Prior must confront her inner demons ...      119   

                                       genres release_date  vote_count  \
0  [Action, Adventure, Science Fiction, Thriller]   2015-06-09        5562   
1  [Action, Adventure, Science Fiction, Thriller]   2015-05-13        6185   
2          [Adventure, Science Fiction, Thriller]   2015-03-18        2480   

   vote_average  release_year    budget_adj   revenue_adj  
0           6.5          2015  1.379999e+08  1.392446e+09  
1           7.1          2015  1.379999e+08  3.481613e+08  
2           6.3          2015  1.012000e+08  2.716190e+08

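The genres column has been converted to a list of elements for each entry.

The goal is to get the count of each genre, grouped by year.

Something like this.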

              index  count  year
0            Action    106  2015
1         Adventure     69  2015
2   Science Fiction     84  2015
3          Thriller    171  2015
4           Fantasy     33  2015
5             Crime     51  2015
6           Western      6  2015
7             Drama    260  2015
8            Family     44  2015
9         Animation     37  2015
10           Comedy    160  2015
11          Mystery     42  2015
12          Romance     56  2015
13              War      9  2015
14          History     15  2015
15            Music     33  2015
16           Horror    125  2015
17      Documentary     51  2015
18         TV Movie     20  2015

To get there for a single year, my approach is:

import functools
import operator
from collections import Counter

import pandas as pd

df_year = df[df.release_date.dt.year == 2015]
list_flat = functools.reduce(operator.iconcat,list(df_year.genres.values), [])
df_years = pd.DataFrame(dict(Counter(list_flat)),range(1)).T
df_years['year'] = 2015
df_years.rename(columns={0:'count'},inplace=True)
df_years.reset_index(inplace=True)

But I can't seem to implement this in a for loop so that it runs for every year.

df_years.append(df_years_temp,sort=False).reset_index(inplace=True)

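I tried appending the temp df to the main df above, but it returns the same df with no changes, and nothing is appended.

Doing this would let me visualize how genres change over time.

Any suggestions are welcome.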

Answer 1

Just use .explode to split the lists into more rows, then .groupby with .transform('count') to create a new column with the count:

Input:

import pandas as pd

df = pd.DataFrame({'index': {0: ['Action', ' Adventure', ' Science Fiction', ' Thriller'], 1: ['Action', ' Adventure', ' Science Fiction', ' Thriller'], 2: ['Adventure', ' Science Fiction', ' Thriller']}, 'year': {0: 2015, 1: 2015, 2: 2015}})

Code:

df = df.explode('index')
df['count'] = df.groupby('index')['index'].transform('count')
df

Output:

    index           year    count
0   Action          2015    2
0   Adventure       2015    2
0   Science Fiction 2015    3
0   Thriller        2015    3
1   Action          2015    2
1   Adventure       2015    2
1   Science Fiction 2015    3
1   Thriller        2015    3
2   Adventure       2015    1
2   Science Fiction 2015    3
2   Thriller        2015    3
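The transform('count') step in this answer keeps one row per movie and genre, and it counts across all years at once. To get one row per genre per year, as in the table the question asks for, the same explode idea can be combined with a groupby on both year and genre. The following is only a sketch under assumptions: the toy frame is made up for illustration, it reuses the question's column names genres and release_year, and it assumes the genre strings are already stripped. The answer's sample input has leading spaces in some genre names, which is why 'Adventure' and ' Adventure' are counted separately in its output.

```python
import pandas as pd

# Toy stand-in for the cleaned frame (illustrative data, not the real dataset).
df = pd.DataFrame({
    "genres": [
        ["Action", "Adventure", "Science Fiction", "Thriller"],
        ["Action", "Adventure", "Science Fiction", "Thriller"],
        ["Adventure", "Science Fiction", "Thriller"],
        ["Drama"],
    ],
    "release_year": [2015, 2015, 2015, 2014],
})

genre_counts = (
    df.explode("genres")                    # one row per (movie, genre) pair
      .groupby(["release_year", "genres"])  # group by year and genre together
      .size()                               # number of movies in each group
      .reset_index(name="count")            # back to a flat DataFrame
)
print(genre_counts)

# Optional wide layout for plotting: one row per year, one column per genre.
wide = genre_counts.pivot(
    index="release_year", columns="genres", values="count"
).fillna(0)
print(wide)
```

With the long table in hand, no explicit loop over years is needed. The wide layout is one possible starting point for a line or area plot of genre counts over time.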

Append a df to another df in a for loop with specific columns?
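On the literal question of appending inside a for loop: the attempt in the question leaves df_years unchanged because DataFrame.append returned a new DataFrame instead of modifying the original, and the chained reset_index(inplace=True) acts on that temporary result and returns None. DataFrame.append was also removed in pandas 2.0. A common alternative is to collect the per-year frames in a Python list and call pd.concat once at the end. The sketch below assumes a toy frame with the question's genres and release_date columns; the variable names frames and df_temp are illustrative.

```python
from collections import Counter
from itertools import chain

import pandas as pd

# Toy stand-in for the cleaned frame (illustrative data, not the real dataset).
df = pd.DataFrame({
    "genres": [
        ["Action", "Adventure", "Science Fiction", "Thriller"],
        ["Action", "Adventure", "Science Fiction", "Thriller"],
        ["Adventure", "Science Fiction", "Thriller"],
        ["Drama"],
    ],
    "release_date": pd.to_datetime(
        ["2015-06-09", "2015-05-13", "2015-03-18", "2014-01-01"]
    ),
})

frames = []  # per-year results are collected here
for year, df_year in df.groupby(df["release_date"].dt.year):
    # Flatten the lists of genres for this year and count each genre.
    counts = Counter(chain.from_iterable(df_year["genres"]))
    df_temp = pd.DataFrame(
        {"index": list(counts.keys()), "count": list(counts.values())}
    )
    df_temp["year"] = year
    frames.append(df_temp)  # list.append mutates the list in place

# One concat at the end builds the combined frame with a fresh 0..n-1 index.
df_years = pd.concat(frames, ignore_index=True)
print(df_years)
```

Concatenating once after the loop also avoids copying the growing frame on every iteration.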
