Extract comma-separated values into individual rows in pandas

Date: 2016-12-20 14:45:43

Tags: python-3.x pandas

This is my dataframe (the values in the authors column are comma-separated strings):

authors            book

Jim, Charles       The Greatest Book in the World
Jim                An OK book
Charlotte          A book about books
Charlotte, Jim     The last book

How can I convert it to long format, like this:

authors            book

Jim                The Greatest Book in the World
Jim                An OK book
Jim                The last book
Charles            The Greatest Book in the World
Charlotte          A book about books
Charlotte          The last book

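I tried extracting the individual authors into a list with authors = list(df['authors'].str.split(',')), flattening that list, matching each author to each book, and building a new list of dicts from the matches. That doesn't seem very pythonic to me, and I'm guessing pandas has a cleaner way to do this.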
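A minimal sketch of the manual approach described above, assuming the frame shown in the question. The names records and long_df are hypothetical and only serve the illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'authors': ['Jim, Charles', 'Jim', 'Charlotte', 'Charlotte, Jim'],
    'book': ['The Greatest Book in the World', 'An OK book',
             'A book about books', 'The last book'],
})

# Build one dict per (author, book) pair by looping over the rows by hand.
records = []
for authors, book in zip(df['authors'], df['book']):
    for author in authors.split(','):
        # strip() removes the space that follows each comma
        records.append({'authors': author.strip(), 'book': book})

long_df = pd.DataFrame(records)
```

This works, but the row-by-row loop is exactly the part that the pandas string and reshaping methods can replace.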

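2 Answers:

Answer 0: (score: 6)

You can split the authors into separate columns after setting the index to book, then stack them, which gets you almost all the way there. Rename and sort the columns to finish.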

df.set_index('book').authors.str.split(', ', expand=True).stack().reset_index('book')

                             book          0
0  The Greatest Book in the World        Jim
1  The Greatest Book in the World    Charles
0                      An OK book        Jim
0              A book about books  Charlotte
0                   The last book  Charlotte
1                   The last book        Jim

To get you all the way home:

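(
    df.set_index('book')
    .authors.str.split(', ', expand=True)  # split on ', ' so names carry no leading space
    .stack()
    .dropna()  # newer pandas versions may keep the empty cells when stacking
    .reset_index('book')
    .rename(columns={0: 'authors'})
    .sort_values('authors')[['authors', 'book']]
    .reset_index(drop=True)
)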
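The chain above can be unpacked step by step to see what each call contributes. This is a sketch under the assumption that the data matches the question; the names wide, stacked and long_df are illustrative. Because the handling of missing cells by stack() can differ between pandas versions, the sketch drops them explicitly.

```python
import pandas as pd

df = pd.DataFrame({
    'authors': ['Jim, Charles', 'Jim', 'Charlotte', 'Charlotte, Jim'],
    'book': ['The Greatest Book in the World', 'An OK book',
             'A book about books', 'The last book'],
})

# Step 1: one column per author position, indexed by book.
# Books with a single author get a missing value in the second column.
wide = df.set_index('book')['authors'].str.split(', ', expand=True)

# Step 2: stack the columns into one long Series indexed by (book, position),
# then drop the missing cells explicitly so the result does not depend on
# how the installed pandas version treats them.
stacked = wide.stack().dropna()

# Step 3: move 'book' back into a column, name the values column,
# then sort by author and renumber the rows.
long_df = (
    stacked.reset_index('book')
    .rename(columns={0: 'authors'})
    .sort_values('authors')[['authors', 'book']]
    .reset_index(drop=True)
)
```

Inspecting wide and stacked in an interactive session shows where the single-author books pick up their padding and where it disappears again.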

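Answer 1: (score: 2)

  • The best option is to use .str.split and then .explode the resulting lists (DataFrame.explode is available from pandas 0.25).
    • Split on ', '; otherwise the values after a comma will start with a space (e.g. ' Charles').
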
import pandas as pd

data = {'authors': ['Jim, Charles', 'Jim', 'Charlotte', 'Charlotte, Jim'], 'book': ['The Greatest Book in the World', 'An OK book', 'A book about books', 'The last book']}

df = pd.DataFrame(data)

# display(df)
          authors                            book
0    Jim, Charles  The Greatest Book in the World
1             Jim                      An OK book
2       Charlotte              A book about books
3  Charlotte, Jim                   The last book

# split authors
df.authors = df.authors.str.split(', ')

# explode the column
df = df.explode('authors').reset_index(drop=True)

# display(df)
     authors                            book
0        Jim  The Greatest Book in the World
1    Charles  The Greatest Book in the World
2        Jim                      An OK book
3  Charlotte              A book about books
4  Charlotte                   The last book
5        Jim                   The last book
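If the spacing around the commas is not consistent, a possible variant is to split on the bare comma and strip whitespace after exploding. The sketch below also leaves the original frame untouched by using assign; the sample data with irregular spacing and the name long_df are assumptions made for the illustration.

```python
import pandas as pd

# Hypothetical input with inconsistent spacing around the commas
df = pd.DataFrame({
    'authors': ['Jim,Charles', 'Jim', 'Charlotte', 'Charlotte ,  Jim'],
    'book': ['The Greatest Book in the World', 'An OK book',
             'A book about books', 'The last book'],
})

# Split on the bare comma, explode the lists into rows, and renumber the index.
long_df = (
    df.assign(authors=df['authors'].str.split(','))
    .explode('authors')
    .reset_index(drop=True)
)

# Remove whatever whitespace surrounded each name.
long_df['authors'] = long_df['authors'].str.strip()
```

Stripping after the split keeps the split pattern simple and handles any number of spaces on either side of a comma.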