I have a dataframe like the following:
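What I want to do is take the authors column and split each list it contains into separate rows, repeating all the other columns. I also want to store the result in a new column called author and keep the original column.
The following describes exactly what I want to achieve: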
publication_title  authors                            type         ...
title 1            ['author1', 'author2', 'author3']  proceedings
title 2            ['author4', 'author5']             collections
title 3            ['author6', 'author7']             books
...
I tried to do this with the pandas DataFrame explode method, but I could not find a way to store the result in a new column.
Thanks for your help.
Answer 0 (score: 1)
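As of pandas 0.25.0 we have the explode method. First we use assign to duplicate the authors column under a new name, author; then we explode that new column into rows, which repeats the values of the other columns: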
df.assign(author=df['authors']).explode('author')
Output
   publication_title                      authors         type   author
0            title_1  [author1, author2, author3]  proceedings  author1
0            title_1  [author1, author2, author3]  proceedings  author2
0            title_1  [author1, author2, author3]  proceedings  author3
1            title_2           [author4, author5]  collections  author4
1            title_2           [author4, author5]  collections  author5
2            title_3           [author6, author7]        books  author6
2            title_3           [author6, author7]        books  author7
If you want to get rid of the repeated index values, use reset_index:
df.assign(author=df['authors']).explode('author').reset_index(drop=True)
Output
   publication_title                      authors         type   author
0            title_1  [author1, author2, author3]  proceedings  author1
1            title_1  [author1, author2, author3]  proceedings  author2
2            title_1  [author1, author2, author3]  proceedings  author3
3            title_2           [author4, author5]  collections  author4
4            title_2           [author4, author5]  collections  author5
5            title_3           [author6, author7]        books  author6
6            title_3           [author6, author7]        books  author7
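For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same assign + explode approach. The sample frame is rebuilt from the values shown in the question, and the variable name result is only illustrative. It assumes the authors column holds real Python lists, not their string representation.

```python
import pandas as pd

# Sample data mirroring the question; the authors column holds real lists.
df = pd.DataFrame({
    "publication_title": ["title_1", "title_2", "title_3"],
    "authors": [["author1", "author2", "author3"],
                ["author4", "author5"],
                ["author6", "author7"]],
    "type": ["proceedings", "collections", "books"],
})

# Copy the list column under a new name, explode only the copy,
# then renumber the rows so the index is unique again.
result = (
    df.assign(author=df["authors"])
      .explode("author")
      .reset_index(drop=True)
)

print(result)
```

Because assign returns a new frame, df itself is left untouched; only result carries the extra author column.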
Answer 1 (score: 0)
You can first build a new DataFrame from the author lists and stack it:
df2 = pd.DataFrame(df['author'].tolist(), index=df.index).stack()
Next, we drop the second level of the index:
df2.index = df2.index.droplevel(1)
Next, we can concatenate along the second axis (the columns):
>>> pd.concat([df, df2], axis=1)
     title                       author         type        0
0  title 1  [author1, author2, author3]  proceedings  author1
0  title 1  [author1, author2, author3]  proceedings  author2
0  title 1  [author1, author2, author3]  proceedings  author3
1  title 2           [author4, author5]  collections  author4
1  title 2           [author4, author5]  collections  author5
2  title 3           [author6, author7]        books  author6
2  title 3           [author6, author7]        books  author7
Or as a one-liner:
>>> pd.concat([df, pd.DataFrame(df['author'].tolist(), index=df.index).stack().reset_index(level=1, drop=True)], axis=1)
     title                       author         type        0
0  title 1  [author1, author2, author3]  proceedings  author1
0  title 1  [author1, author2, author3]  proceedings  author2
0  title 1  [author1, author2, author3]  proceedings  author3
1  title 2           [author4, author5]  collections  author4
1  title 2           [author4, author5]  collections  author5
2  title 3           [author6, author7]        books  author6
2  title 3           [author6, author7]        books  author7
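A variation on the same idea, sketched here under a few assumptions: the frame uses this answer's column names (title, author, type), and the new column is given the hypothetical name single_author so that it is not labelled 0. The explicit dropna() is a defensive addition, since whether stack() discards the padding values for shorter lists on its own may depend on the pandas version. The final step uses join instead of concat, which likewise aligns on the index.

```python
import pandas as pd

# Sample data using this answer's column names.
df = pd.DataFrame({
    "title": ["title 1", "title 2", "title 3"],
    "author": [["author1", "author2", "author3"],
               ["author4", "author5"],
               ["author6", "author7"]],
    "type": ["proceedings", "collections", "books"],
})

# One row per list, one column per position; shorter lists are padded.
wide = pd.DataFrame(df["author"].tolist(), index=df.index)

# Stack into a Series indexed by (original row, position) and drop the padding.
stacked = wide.stack().dropna()

# Remove the position level and name the Series (hypothetical column name).
single = stacked.reset_index(level=1, drop=True).rename("single_author")

# join aligns on the index, repeating each original row once per author.
result = df.join(single)

print(result)
```

Naming the Series before joining avoids having to rename a column called 0 afterwards.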
Answer 2 (score: 0)
You have already found explode, which means you are almost there! Just merge the original data with the exploded data; see the code below.
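import pandas as pd

# Sample data
df = pd.DataFrame({'publication_title': ['title_1', 'title_2', 'title_3'],
                   'authors': [['author1', 'author2', 'author3'], ['author4', 'author5'], ['author6', 'author7']],
                   'type': ['proceedings', 'collections', 'books']})

# Explode, rename the exploded column to author, then merge the original frame
# back in; merge uses the shared columns (publication_title, type) as keys
(df.explode(column='authors')
   .rename(columns={'authors': 'author'})
   .merge(df))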
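One thing to watch with the merge approach: without an on argument, merge joins on every column the two frames share, so it relies on publication_title and type together identifying each original row. The sketch below spells that out; the name key_cols and the uniqueness assumption are mine, not part of the answer. Passing validate makes pandas raise an error if the assumption does not hold, so the result cannot silently gain extra rows.

```python
import pandas as pd

df = pd.DataFrame({
    "publication_title": ["title_1", "title_2", "title_3"],
    "authors": [["author1", "author2", "author3"],
                ["author4", "author5"],
                ["author6", "author7"]],
    "type": ["proceedings", "collections", "books"],
})

# Assumption: these columns together identify each original row uniquely.
key_cols = ["publication_title", "type"]

# One row per author, with the exploded column renamed to author.
exploded = df.explode("authors").rename(columns={"authors": "author"})

# Bring the original list column back; validate raises if the keys
# on the right-hand side (df) turn out not to be unique.
result = exploded.merge(df, on=key_cols, validate="many_to_one")

print(result)
```

If the key columns are not unique in your data, the assign-then-explode approach from the first answer avoids the merge altogether.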