Split a Pandas DataFrame column of list values into duplicate rows

Time: 2019-08-22 21:53:00

Tags: python-3.x pandas dataframe

I have a dataframe like the following:

publication_title    authors                             type ...
title 1              ['author1', 'author2', 'author3']   proceedings
title 2              ['author4', 'author5']              collections
title 3              ['author6', 'author7']              books
...

What I want to do is take the authors column and split each list into separate rows, duplicating the values of all the other columns on each row. I also want to store the result in a new column named author while keeping the original authors column. In other words, each author should get its own row.

I tried to do this with the pandas DataFrame explode method, but I could not find a way to store the result in a new column.

Thanks for your help.

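3 answers:

Answer 0: (score: 1)

Since pandas 0.25.0 we have the explode method. First we copy the authors column under a new name (author) with assign, then we explode that new column into rows, which repeats the values of the other columns: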

df.assign(author=df['authors']).explode('author')

Output

  publication_title                      authors         type   author
0           title_1  [author1, author2, author3]  proceedings  author1
0           title_1  [author1, author2, author3]  proceedings  author2
0           title_1  [author1, author2, author3]  proceedings  author3
1           title_2           [author4, author5]  collections  author4
1           title_2           [author4, author5]  collections  author5
2           title_3           [author6, author7]        books  author6
2           title_3           [author6, author7]        books  author7

If you want to get rid of the repeated index values, use reset_index:

df.assign(author=df['authors']).explode('author').reset_index(drop=True)

Output

  publication_title                      authors         type   author
0           title_1  [author1, author2, author3]  proceedings  author1
1           title_1  [author1, author2, author3]  proceedings  author2
2           title_1  [author1, author2, author3]  proceedings  author3
3           title_2           [author4, author5]  collections  author4
4           title_2           [author4, author5]  collections  author5
5           title_3           [author6, author7]        books  author6
6           title_3           [author6, author7]        books  author7
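
For anyone who wants to run the steps above end to end, here is a minimal, self-contained sketch. The sample frame is rebuilt from the question's table (using the underscore titles shown in the output above), so its values are an assumption about the asker's real data; the variable name result is only illustrative.

```python
import pandas as pd

# Sample data reconstructed from the question (assumed values).
df = pd.DataFrame({
    "publication_title": ["title_1", "title_2", "title_3"],
    "authors": [
        ["author1", "author2", "author3"],
        ["author4", "author5"],
        ["author6", "author7"],
    ],
    "type": ["proceedings", "collections", "books"],
})

result = (
    df.assign(author=df["authors"])  # copy the list column under a new name
      .explode("author")             # one row per list element; other columns repeat
      .reset_index(drop=True)        # replace the repeated index with 0..n-1
)
print(result)
```

The original authors column still holds the full list on every row, while author holds a single name per row.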

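Answer 1: (score: 0)

You can first build a new Series from the author lists, with one entry per author (note that this answer assumes the list column is named author and the title column is named title):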

df2 = pd.DataFrame(df['author'].tolist(), index=df.index).stack()

Next, we drop the second index level:

df2.index = df2.index.droplevel(1)

Next, we can concatenate along the second axis (axis=1, the columns):

>>> pd.concat([df, df2], axis=1)
     title                       author         type        0
0  title 1  [author1, author2, author3]  proceedings  author1
0  title 1  [author1, author2, author3]  proceedings  author2
0  title 1  [author1, author2, author3]  proceedings  author3
1  title 2           [author4, author5]  collections  author4
1  title 2           [author4, author5]  collections  author5
2  title 3           [author6, author7]        books  author6
2  title 3           [author6, author7]        books  author7

Or as a one-liner:

>>> pd.concat([df, pd.DataFrame(df['author'].tolist(), index=df.index).stack().reset_index(level=1, drop=True)], axis=1)
     title                       author         type        0
0  title 1  [author1, author2, author3]  proceedings  author1
0  title 1  [author1, author2, author3]  proceedings  author2
0  title 1  [author1, author2, author3]  proceedings  author3
1  title 2           [author4, author5]  collections  author4
1  title 2           [author4, author5]  collections  author5
2  title 3           [author6, author7]        books  author6
2  title 3           [author6, author7]        books  author7
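
Depending on the pandas version, pd.concat along axis=1 may refuse to align a frame with a Series whose index contains repeated values. A hedged variant of the same idea, sketched below, joins on the index instead. The sample data and the column name single_author are assumptions made for illustration; the dropna() call is only there so the ragged padding is discarded regardless of how stack treats missing values.

```python
import pandas as pd

# Sample data using this answer's column names (assumed values).
df = pd.DataFrame({
    "title": ["title 1", "title 2", "title 3"],
    "author": [
        ["author1", "author2", "author3"],
        ["author4", "author5"],
        ["author6", "author7"],
    ],
    "type": ["proceedings", "collections", "books"],
})

authors_long = (
    pd.DataFrame(df["author"].tolist(), index=df.index)  # one column per list position
      .stack()                                           # long format: (row, position) -> name
      .dropna()                                          # discard padding from shorter lists
      .reset_index(level=1, drop=True)                   # keep only the original row label
      .rename("single_author")                           # hypothetical name; join needs a named Series
)

# An index-based left join repeats each row of df once per matching author.
result = df.join(authors_long)
print(result)
```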

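Answer 2: (score: 0)

You have already found explode, which means you are almost there! Just merge the exploded data back with the original data; see the code below.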

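import pandas as pd

# sample data
df = pd.DataFrame({'publication_title':['title_1','title_2','title_3'],
              'authors':[['author1', 'author2', 'author3'],['author4', 'author5'],['author6', 'author7']],
              'type':['proceedings','collections','books']})

# explode the lists, rename the exploded column to 'author', then merge with the
# original frame on the shared columns (publication_title, type) to restore 'authors'
(df.explode(column='authors')
   .rename(columns={'authors':'author'})
   .merge(df))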
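
One caveat with this approach: merge(df) without an on= argument matches on every column name the two frames share, here publication_title and type. If two publications share both values, their rows get cross-matched. The sketch below uses made-up data with such a collision (an assumption, not the asker's data) and compares the merge with an index-based join on the exploded Series; the names by_merge and by_index are illustrative.

```python
import pandas as pd

# Hypothetical data: the first two rows share both title and type.
df = pd.DataFrame({
    "publication_title": ["title_1", "title_1", "title_2"],
    "authors": [["author1", "author2"], ["author3"], ["author4"]],
    "type": ["proceedings", "proceedings", "books"],
})

# Key-based merge: each exploded title_1 row matches both original title_1 rows.
by_merge = (
    df.explode("authors")
      .rename(columns={"authors": "author"})
      .merge(df)
)

# Index-based join: each exploded value is attached only to the row it came from.
by_index = df.join(df["authors"].explode().rename("author"))

print(len(by_merge), len(by_index))
```

When the shared columns uniquely identify each row, as in the question's sample, both approaches give the same rows.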