Suppose my DataFrame df is created like this:
import pandas as pd

df = pd.DataFrame({"title": ["Robin Hood", "Madagaskar"],
                   "genres": ["Action, Adventure", "Family, Animation, Comedy"]},
                  columns=["title", "genres"])
It looks like this:
        title                     genres
0  Robin Hood          Action, Adventure
1  Madagaskar  Family, Animation, Comedy
Let's assume each movie can have any number of genres. How can I expand the DataFrame into the following?
        title      genre
0  Robin Hood     Action
1  Robin Hood  Adventure
2  Madagaskar     Family
3  Madagaskar  Animation
4  Madagaskar     Comedy
Answer 0 (score: 6)
In [33]: (df.set_index('title')
            ['genres'].str.split(r',\s*', expand=True)
            .stack()
            .reset_index(name='genre')
            .drop(columns='level_1'))
Out[33]:
        title      genre
0  Robin Hood     Action
1  Robin Hood  Adventure
2  Madagaskar     Family
3  Madagaskar  Animation
4  Madagaskar     Comedy
PS: you can find a more generic approach here.
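For readers who want to run the split-and-stack idea end to end, here is a minimal self-contained sketch. It assumes a reasonably recent pandas; the intermediate names `wide` and `long_df` are only illustrative, and the extra `dropna()` is my addition so that the padding values created by `expand=True` are removed regardless of how the installed pandas version's `stack()` treats missing values.

```python
import pandas as pd

df = pd.DataFrame({
    "title": ["Robin Hood", "Madagaskar"],
    "genres": ["Action, Adventure", "Family, Animation, Comedy"],
})

# One column per genre position; shorter rows are padded with missing values.
wide = df.set_index("title")["genres"].str.split(r",\s*", expand=True)

long_df = (
    wide.stack()                     # move the genre columns into the index
        .dropna()                    # discard padding (no-op if stack already dropped it)
        .reset_index(name="genre")   # columns: title, level_1, genre
        .drop(columns="level_1")     # level_1 is just the genre position
)
print(long_df)
```

The keyword form `drop(columns=...)` is used because the positional `axis` argument is not accepted by newer pandas releases.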
Answer 1 (score: 4)
You can use np.repeat together with numpy.concatenate to flatten the lists.
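import numpy as np

# split each genre string into a list, then count the items per row
splitted = df['genres'].str.split(r',\s*')
l = splitted.str.len()
# repeat each title once per genre and flatten the lists of genres
df1 = pd.DataFrame({'title': np.repeat(df['title'].values, l),
                    'genres': np.concatenate(splitted.values)},
                   columns=['title', 'genres'])
print(df1)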
        title     genres
0  Robin Hood     Action
1  Robin Hood  Adventure
2  Madagaskar     Family
3  Madagaskar  Animation
4  Madagaskar     Comedy
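A standalone version of the repeat-and-concatenate idea might look like the sketch below. It assumes that no row has a missing genres value (a NaN would break both the length count and the concatenation); the names `parts`, `counts` and `flat` are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "title": ["Robin Hood", "Madagaskar"],
    "genres": ["Action, Adventure", "Family, Animation, Comedy"],
})

parts = df["genres"].str.split(r",\s*")   # Series of lists
counts = parts.str.len().to_numpy()       # number of genres per row

flat = pd.DataFrame({
    # repeat each title once for every genre in its row
    "title": np.repeat(df["title"].to_numpy(), counts),
    # join all per-row lists into one flat array
    "genre": np.concatenate(parts.tolist()),
})
print(flat)
```

Because both arrays are built in row order, the repeated titles line up with the flattened genres without any explicit join.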
Timings:
df = pd.concat([df]*100000).reset_index(drop=True)
In [95]: %%timeit
    ...: splitted = df['genres'].str.split(r',\s*')
    ...: l = splitted.str.len()
    ...:
    ...: df1 = pd.DataFrame({'title': np.repeat(df['title'].values, l),
    ...:                     'genres': np.concatenate(splitted.values)},
    ...:                    columns=['title', 'genres'])
    ...:
1 loop, best of 3: 709 ms per loop
In [96]: %timeit (df.set_index('title')['genres'].str.split(r',\s*', expand=True).stack().reset_index(name='genre').drop(columns='level_1'))
1 loop, best of 3: 750 ms per loop
Answer 2 (score: 1)
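Since pandas 0.25.0 there is a built-in method called explode.
It unnests each element of a list into its own row and repeats the values of the other columns.
So we first have to call Series.str.split on the string values to split each string into a list of elements.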
>>> df.assign(genres=df['genres'].str.split(', ')).explode('genres')
        title     genres
0  Robin Hood     Action
0  Robin Hood  Adventure
1  Madagaskar     Family
1  Madagaskar  Animation
1  Madagaskar     Comedy
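Note that explode keeps the original row labels, so the index above reads 0, 0, 1, 1, 1. If you want the exact layout asked for in the question (a fresh 0..4 index and a column named genre), one possible variation is sketched below. It assumes pandas 0.25.0 or later; the regex separator and the name `out` are my choices, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    "title": ["Robin Hood", "Madagaskar"],
    "genres": ["Action, Adventure", "Family, Animation, Comedy"],
})

out = (
    df.assign(genre=df["genres"].str.split(r",\s*"))  # list of genres per row
      .drop(columns="genres")                         # keep only the new column
      .explode("genre")                               # one row per list element
      .reset_index(drop=True)                         # renumber rows 0..n-1
)
print(out)
```

Splitting on the pattern `,\s*` instead of the literal `', '` also tolerates input where the space after the comma is missing or doubled.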