Here is my DataFrame (the values in the authors column are comma-separated strings):
authors         book
Jim, Charles    The Greatest Book in the World
Jim             An OK book
Charlotte       A book about books
Charlotte, Jim  The last book
How can I convert it to long format, like this:
authors    book
Jim        The Greatest Book in the World
Jim        An OK book
Jim        The last book
Charles    The Greatest Book in the World
Charlotte  A book about books
Charlotte  The last book
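I tried extracting the individual authors into a list with authors = list(df['authors'].str.split(',')), flattening that list, matching each author to each book, and building a new list of dicts from every match. That doesn't seem very Pythonic to me, though, and I'm guessing pandas has a cleaner way to do this.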
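For concreteness, the manual approach described above might look roughly like the sketch below. The sample frame is rebuilt from the table in the question, and the names records and long_df are illustrative assumptions.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    'authors': ['Jim, Charles', 'Jim', 'Charlotte', 'Charlotte, Jim'],
    'book': ['The Greatest Book in the World', 'An OK book',
             'A book about books', 'The last book'],
})

# Build one dict per (author, book) pair by looping over the rows.
records = []
for authors, book in zip(df['authors'], df['book']):
    for author in authors.split(','):
        # strip() removes the space left behind after each comma
        records.append({'authors': author.strip(), 'book': book})

long_df = pd.DataFrame(records)
print(long_df)
```

This works, but it loops in plain Python and rebuilds the frame from scratch, which is what the answers below avoid.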
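Answer 0 (score: 6)
You can split the authors into separate columns after setting the book as the index, which gets you almost all the way there. Rename and reorder the columns, then sort, to finish.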
df.set_index('book').authors.str.split(', ', expand=True).stack().reset_index('book')
                             book          0
0  The Greatest Book in the World        Jim
1  The Greatest Book in the World    Charles
0                      An OK book        Jim
0              A book about books  Charlotte
0                   The last book  Charlotte
1                   The last book        Jim
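To see why stacking works here, it can help to look at the intermediate wide frame that the split produces. This is a small sketch on the question's sample data; the name wide is an illustrative assumption.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    'authors': ['Jim, Charles', 'Jim', 'Charlotte', 'Charlotte, Jim'],
    'book': ['The Greatest Book in the World', 'An OK book',
             'A book about books', 'The last book'],
})

# expand=True spreads the split pieces over numbered columns (0, 1, ...),
# one row per book; books with fewer authors are padded with missing values.
wide = df.set_index('book')['authors'].str.split(', ', expand=True)
print(wide)
```

Stacking then folds the numbered columns into rows, so each book ends up with one row per author it actually has.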
To get you all the way home:
df.set_index('book')\
  .authors.str.split(', ', expand=True)\
  .stack()\
  .reset_index('book')\
  .rename(columns={0:'authors'})\
  .sort_values('authors')[['authors', 'book']]\
  .reset_index(drop=True)
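The full chain can also be run end to end as a self-contained sketch. Three things here are assumptions added for illustration, not part of the answer: the split pattern `\s*,\s*` (so stray spaces around commas never reach the names), the explicit dropna() after stack() (a harmless safeguard in case the padding rows are kept), and the stable sort (so books keep their original order within each author).

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    'authors': ['Jim, Charles', 'Jim', 'Charlotte', 'Charlotte, Jim'],
    'book': ['The Greatest Book in the World', 'An OK book',
             'A book about books', 'The last book'],
})

long_df = (
    df.set_index('book')['authors']
      .str.split(r'\s*,\s*', expand=True)   # one column per author position
      .stack()                              # fold the columns into rows
      .dropna()                             # safeguard: discard padding rows, if any
      .reset_index('book')                  # bring the book back as a column
      .rename(columns={0: 'authors'})
      .sort_values('authors', kind='stable')[['authors', 'book']]
      .reset_index(drop=True)
)
print(long_df)
```

Wrapping the chain in parentheses avoids the trailing backslashes and leaves room for a comment on each step.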
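Answer 1 (score: 2)
Use .str.split, then .explode the resulting lists (DataFrame.explode is available from pandas 0.25.0 onward).
Split on ', ', otherwise the values after a comma will start with a space (e.g. ' Charles').
import pandas as pd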
data = {'authors': ['Jim, Charles', 'Jim', 'Charlotte', 'Charlotte, Jim'], 'book': ['The Greatest Book in the World', 'An OK book', 'A book about books', 'The last book']}
df = pd.DataFrame(data)
# display(df)
          authors                            book
0    Jim, Charles  The Greatest Book in the World
1             Jim                      An OK book
2       Charlotte              A book about books
3  Charlotte, Jim                   The last book
# split authors
df.authors = df.authors.str.split(', ')
# explode the column
df = df.explode('authors').reset_index(drop=True)
# display(df)
     authors                            book
0        Jim  The Greatest Book in the World
1    Charles  The Greatest Book in the World
2        Jim                      An OK book
3  Charlotte              A book about books
4  Charlotte                   The last book
5        Jim                   The last book
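The same idea can be written as a single chain that leaves the original frame untouched. This is a minimal sketch rather than part of the answer: the use of assign and the whitespace-tolerant pattern `\s*,\s*` are assumptions added here, so that inputs such as 'Jim,Charles' and 'Jim , Charles' are handled the same way as 'Jim, Charles'.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    'authors': ['Jim, Charles', 'Jim', 'Charlotte', 'Charlotte, Jim'],
    'book': ['The Greatest Book in the World', 'An OK book',
             'A book about books', 'The last book'],
})

long_df = (
    df.assign(authors=df['authors'].str.split(r'\s*,\s*'))  # strings -> lists
      .explode('authors')                                   # one row per list element
      .reset_index(drop=True)                               # renumber rows 0..n-1
)
print(long_df)
```

Because assign returns a new frame, df still holds the original comma-separated strings afterwards, which is convenient if both shapes are needed.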