I have a pandas DataFrame of movies that looks like this:
id, name, genre, release_year
1 A [a,b,c] 2017
2 B [b,c] 2017
3 C [a,c] 2010
4 D [d,c] 2010
....
I want to group the movies by the values in the genre list. My expected output is:
year, genre, number_of_movies
2017 a 1
2017 b 2
2017 c 2
2010 a 1
2010 c 2
...
Can anyone help me achieve this?
Answer 0: (score: 1)
You can build a new DataFrame with the constructor, reshape it with stack, and then count the rows per group with groupby and size:
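# Pass the Series itself (not .values) as the index so the index level
# keeps the name 'release_year', which the groupby below relies on.
df1 = (pd.DataFrame(df['genre'].values.tolist(), index=df['release_year'])
         .stack()
         .reset_index(name='genre')
         .groupby(['release_year', 'genre'])
         .size()
         .reset_index(name='number_of_movies'))
print(df1)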
release_year genre number_of_movies
0 2010 a 1
1 2010 c 2
2 2010 d 1
3 2017 a 1
4 2017 b 2
5 2017 c 2
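For comparison, the same reshape-then-count idea can be sketched with DataFrame.explode, which is not part of the answer above and assumes pandas 0.25 or later. The sample frame is rebuilt from the question's four rows so the snippet runs on its own.

```python
import pandas as pd

# Sample data mirroring the four rows shown in the question.
df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['A', 'B', 'C', 'D'],
    'genre': [['a', 'b', 'c'], ['b', 'c'], ['a', 'c'], ['d', 'c']],
    'release_year': [2017, 2017, 2010, 2010],
})

# explode() gives every list element its own row and repeats release_year,
# so a plain groupby/size can count movies per (year, genre) pair.
counts = (df[['release_year', 'genre']]
            .explode('genre')
            .groupby(['release_year', 'genre'])
            .size()
            .reset_index(name='number_of_movies'))
print(counts)
```

Because groupby sorts its keys by default, the rows come back ordered by year and then by genre.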
Answer 1: (score: 1)
For better performance, use itertools.chain to flatten the genre column:
from itertools import chain
df = pd.DataFrame({
    'genre' : list(
        chain.from_iterable(df.genre.tolist())
    ),
    'release_year' : df.release_year.repeat(df.genre.str.len())
})
df
genre release_year
0 a 2017
0 b 2017
0 c 2017
1 b 2017
1 c 2017
2 a 2010
2 c 2010
3 d 2010
3 c 2010
Now group on genre and release_year, and find the size of each group:
df.groupby(
['genre', 'release_year'], sort=False
).size()\
.reset_index(name='number_of_movies')
genre release_year number_of_movies
0 a 2017 1
1 b 2017 2
2 c 2017 2
3 a 2010 1
4 c 2010 2
5 d 2010 1
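Note that the snippet above overwrites df with the flattened frame. As a rough sketch, the two steps can be wrapped in a helper that leaves the original frame untouched; the function name count_genres and the sample data are illustrative assumptions, not part of the answer.

```python
from itertools import chain

import pandas as pd


def count_genres(movies):
    """Count movies per (genre, release_year) without modifying `movies`."""
    flat = pd.DataFrame({
        # Flatten the per-movie genre lists into one long list.
        'genre': list(chain.from_iterable(movies['genre'])),
        # Repeat each year once per genre in that movie's list.
        'release_year': movies['release_year']
                        .repeat(movies['genre'].str.len())
                        .to_numpy(),
    })
    # sort=False keeps the groups in order of first appearance.
    return (flat.groupby(['genre', 'release_year'], sort=False)
                .size()
                .reset_index(name='number_of_movies'))


# Hypothetical sample frame matching the question's rows.
movies = pd.DataFrame({
    'genre': [['a', 'b', 'c'], ['b', 'c'], ['a', 'c'], ['d', 'c']],
    'release_year': [2017, 2017, 2010, 2010],
})
result = count_genres(movies)
print(result)
```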
Answer 2: (score: 1)
Another neat approach is to use Counter, i.e.:
from collections import Counter
import numpy as np
ndf = df.groupby('release_year')['genre'].apply(lambda x : Counter(np.concatenate(x.values))).reset_index()
# set_axis no longer accepts inplace in recent pandas; it already returns a new frame
ndf = ndf.set_axis('release_year,genre,number_of_movies'.split(','), axis=1)
Output:
release_year genre number_of_movies
0 2010 a 1.0
1 2010 c 2.0
2 2010 d 1.0
3 2017 a 1.0
4 2017 b 2.0
5 2017 c 2.0
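The counts above come back as floats. If integer counts are preferred, one possible variant is to loop over the groups explicitly and build the rows by hand; this is only a sketch that swaps apply for a plain Python loop, with the sample frame assumed from the question.

```python
from collections import Counter
from itertools import chain

import pandas as pd

# Sample data assumed from the question.
df = pd.DataFrame({
    'genre': [['a', 'b', 'c'], ['b', 'c'], ['a', 'c'], ['d', 'c']],
    'release_year': [2017, 2017, 2010, 2010],
})

rows = []
for year, genres in df.groupby('release_year')['genre']:
    # `genres` is a Series of lists; flatten it and count each genre.
    per_year = Counter(chain.from_iterable(genres))
    for genre, n in sorted(per_year.items()):
        rows.append((year, genre, n))

ndf = pd.DataFrame(rows, columns=['release_year', 'genre', 'number_of_movies'])
print(ndf)
```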
Answer 3: (score: 0)
Here is a collections.Counter approach with O(n) complexity that needs neither df.groupby nor df.apply:
from collections import Counter
from itertools import product, chain
import pandas as pd
df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'name': ['A', 'B', 'C', 'D'],
                   'genre': [['a', 'b', 'c'], ['b', 'c'], ['a', 'c'], ['d', 'c']],
                   'year': [2017, 2017, 2010, 2010]})
c = Counter(chain.from_iterable([list(product([x['year']], x['genre']))
                                 for idx, x in df.iterrows()]))
# Counter({(2010, 'a'): 1,
# (2010, 'c'): 2,
# (2010, 'd'): 1,
# (2017, 'a'): 1,
# (2017, 'b'): 2,
# (2017, 'c'): 2})
df = pd.DataFrame.from_dict(c, orient='index')
# 0
# (2017, a) 1
# (2017, b) 2
# (2017, c) 2
# (2010, a) 1
# (2010, c) 2
# (2010, d) 1
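The frame above keeps the (year, genre) tuples as its index. To get the three-column layout the question asks for, the Counter can be unpacked into rows instead. The sketch below is one way to do that; it uses zip over the two columns rather than iterrows, and the name tidy is an illustrative choice.

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'name': ['A', 'B', 'C', 'D'],
                   'genre': [['a', 'b', 'c'], ['b', 'c'], ['a', 'c'], ['d', 'c']],
                   'year': [2017, 2017, 2010, 2010]})

# Count every (year, genre) pair in a single pass over the rows.
c = Counter((year, genre)
            for year, genres in zip(df['year'], df['genre'])
            for genre in genres)

# Unpack each ((year, genre), count) item into one row with named columns.
tidy = pd.DataFrame(
    [(year, genre, n) for (year, genre), n in c.items()],
    columns=['year', 'genre', 'number_of_movies'],
)
print(tidy)
```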