How to split/expand a string value into several pandas DataFrame rows?

Asked: 2017-11-30 10:51:10

Tags: pandas

Suppose my DataFrame df is created like this:

import pandas as pd

df = pd.DataFrame({"title" : ["Robin Hood", "Madagaskar"],
                  "genres" : ["Action, Adventure", "Family, Animation, Comedy"]},
                 columns=["title", "genres"])

It looks like this:

        title                     genres
0  Robin Hood          Action, Adventure
1  Madagaskar  Family, Animation, Comedy

Assume each movie can have an arbitrary number of genres. How can I expand the DataFrame into

        title      genre
0  Robin Hood     Action
1  Robin Hood  Adventure
2  Madagaskar     Family
3  Madagaskar  Animation
4  Madagaskar     Comedy

3 answers:

Answer 0: (score: 6)

In [33]: (df.set_index('title')
            ['genres'].str.split(r',\s*', expand=True)
            .stack()
            .reset_index(name='genre')
            .drop(columns='level_1'))
Out[33]:
        title      genre
0  Robin Hood     Action
1  Robin Hood  Adventure
2  Madagaskar     Family
3  Madagaskar  Animation
4  Madagaskar     Comedy
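
The chained expression above can be hard to follow, so here is the same idea unpacked into named steps. This is only a sketch: the intermediate names wide, long_form and result are mine, and the explicit dropna() is an addition that is not in the original answer. It is meant to guard against pandas versions in which stack() keeps the padding values instead of dropping them.

```python
import pandas as pd

df = pd.DataFrame({"title": ["Robin Hood", "Madagaskar"],
                   "genres": ["Action, Adventure", "Family, Animation, Comedy"]},
                  columns=["title", "genres"])

# Step 1: split on a comma followed by optional whitespace.
# expand=True yields one column per genre; shorter rows are padded with missing values.
wide = df.set_index("title")["genres"].str.split(r",\s*", expand=True)

# Step 2: stack the columns into one long Series indexed by (title, position).
# dropna() is an assumption added here so padding never survives as a row.
long_form = wide.stack().dropna()

# Step 3: turn the index back into columns and discard the position level,
# which reset_index names 'level_1' because it has no name of its own.
result = long_form.reset_index(name="genre").drop(columns="level_1")

print(result)
```

Keeping the title in the index during the split is what lets stack() carry it along to every genre row.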

PS: you can find a more generic approach here.

Answer 1: (score: 4)

You can use np.repeat together with numpy.concatenate to flatten the lists.

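import numpy as np

# Split each genres string into a list, then count the items in each row
splitted = df['genres'].str.split(r',\s*')
l = splitted.str.len()

# Repeat each title once per genre and flatten the genre lists to match
df1 = pd.DataFrame({'title': np.repeat(df['title'].values, l),
                     'genres':np.concatenate(splitted.values)}, columns=['title','genres'])
print(df1)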
        title      genres
0  Robin Hood      Action
1  Robin Hood   Adventure
2  Madagaskar      Family
3  Madagaskar   Animation
4  Madagaskar      Comedy
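
To see why the two arrays line up, it may help to run np.repeat and np.concatenate on plain Python data, without pandas. The sketch below uses illustrative names (titles, genre_lists, counts, pairs) that do not appear in the answer, and it assumes every row has at least a non-missing genres string, since a missing value would have no list to concatenate.

```python
import numpy as np

titles = np.array(["Robin Hood", "Madagaskar"])
genre_lists = [["Action", "Adventure"], ["Family", "Animation", "Comedy"]]

# Number of genres per movie, used as the per-element repeat count
counts = [len(genres) for genres in genre_lists]

# Each title is repeated as many times as its movie has genres
repeated_titles = np.repeat(titles, counts)

# The nested lists are flattened in the same row order
flat_genres = np.concatenate(genre_lists)

# Both arrays now have one entry per (movie, genre) pair
pairs = list(zip(repeated_titles.tolist(), flat_genres.tolist()))
print(pairs)
```

Because both operations walk the rows in the same order, position i of the repeated titles always belongs to position i of the flattened genres.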

Timings

df = pd.concat([df]*100000).reset_index(drop=True)

In [95]: %%timeit
    ...: splitted = df['genres'].str.split(',\s*')
    ...: l = splitted.str.len()
    ...: 
    ...: df1 = pd.DataFrame({'title': np.repeat(df['title'].values, l),
    ...:                      'genres':np.concatenate(splitted.values)}, columns=['title','genres'])
    ...: 
    ...: 
1 loop, best of 3: 709 ms per loop

In [96]: %timeit (df.set_index('title')['genres'].str.split(',\s*', expand=True).stack().reset_index(name='genre').drop('level_1',1))
1 loop, best of 3: 750 ms per loop

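Answer 2: (score: 1)

Since pandas 0.25.0 there is a built-in method called explode.

This method unnests each element of a list into its own row and repeats the values of the other columns.

So we first have to call Series.str.split on the string values to turn each string into a list of elements.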

>>> df.assign(genres=df['genres'].str.split(', ')).explode('genres')

        title     genres
0  Robin Hood     Action
0  Robin Hood  Adventure
1  Madagaskar     Family
1  Madagaskar  Animation
1  Madagaskar     Comedy
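
Note that the output above keeps the original row labels (0, 0, 1, 1, 1) and the column name genres, whereas the question asked for a fresh 0..4 index and a column called genre. A possible variant that matches the requested shape is sketched below. It assumes pandas 0.25.0 or later, and the regex separator and the name result are my choices, not part of the answer.

```python
import pandas as pd

df = pd.DataFrame({"title": ["Robin Hood", "Madagaskar"],
                   "genres": ["Action, Adventure", "Family, Animation, Comedy"]},
                  columns=["title", "genres"])

result = (
    df.assign(genre=df["genres"].str.split(r",\s*"))  # list of genres per row
      .drop(columns="genres")                         # the original string column is no longer needed
      .explode("genre")                               # one row per list element
      .reset_index(drop=True)                         # replace the repeated labels with 0..n-1
)

print(result)
```

Splitting on the pattern `,\s*` rather than the literal `', '` also tolerates inputs where the space after the comma is missing or doubled.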