Splitting a list into multiple columns in a pandas DataFrame

Time: 2017-12-14 16:43:38

Tags: python pandas dataframe pivot multiple-columns

I have a source system that gives me data like this:

Name    |Hobbies
----------------------------------
"Han"   |"Art;Soccer;Writing"
"Leia"  |"Art;Baking;Golf;Singing"
"Luke"  |"Baking;Writing"

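Each list of hobbies is semicolon-delimited. I want to turn this into a table-like structure with one column per hobby and a flag indicating whether a person selected that hobby: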

Name    |Art     |Baking  |Golf    |Singing |Soccer  |Writing  
--------------------------------------------------------------
"Han"   |1       |0       |0       |0       |1       |1
"Leia"  |1       |1       |1       |1       |0       |0
"Luke"  |0       |1       |0       |0       |0       |1

Here is code to generate the sample data in a pandas DataFrame:

>>> import pandas as pd
>>> df = pd.DataFrame(
...     [
...         {'name': 'Han',   'hobbies': 'Art;Soccer;Writing'},
...         {'name': 'Leia',  'hobbies': 'Art;Baking;Golf;Singing'},
...         {'name': 'Luke',  'hobbies': 'Baking;Writing'},
...     ]
... )
>>> df
                   hobbies  name
0       Art;Soccer;Writing   Han
1  Art;Baking;Golf;Singing  Leia
2           Baking;Writing  Luke

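Right now I'm using the following code to get the data into a DataFrame with the structure I want, but it is really slow (my actual data set has about 1.5 million rows):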

>>> df2 = pd.DataFrame(columns=['name', 'value'])
>>>
>>> for index, row in df.iterrows():
...     for value in str(row['hobbies']).split(';'):
...         d = {'name':row['name'], 'value':value}
...         df2 = df2.append(d, ignore_index=True)
...
>>> df2 = df2.groupby('name')['value'].value_counts()
>>> df2 = df2.unstack(level=-1).fillna(0)
>>>
>>> df2
value  Art  Baking  Golf  Singing  Soccer  Writing
name
Han    1.0     0.0   0.0      0.0     1.0      1.0
Leia   1.0     1.0   1.0      1.0     0.0      0.0
Luke   0.0     1.0   0.0      0.0     0.0      1.0

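Is there a more efficient way to do this?

3 Answers:

Answer 0: (Score: 1)

Instead of appending to the DataFrame on every iteration, you can collect the rows in a list and add them all at once after the loop has run: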

d_list = []

# Collect plain dicts first; this avoids copying the DataFrame on every iteration.
for index, row in df.iterrows():
    for value in str(row['hobbies']).split(';'):
        d_list.append({'name': row['name'],
                       'value': value})
# DataFrame.append was removed in pandas 2.0, so build the frame from the list in one step.
df3 = pd.DataFrame(d_list, columns=['name', 'value'])
df3 = df3.groupby('name')['value'].value_counts()
df3 = df3.unstack(level=-1).fillna(0)
df3

I timed this on the sample DataFrame. With the improvement I suggest, it is about 50 times faster.
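The comparison can be reproduced with a small timing sketch like the one below. It is only an illustration: the helper names `grow_per_row`, `build_once` and `timed` are made up here, `pd.concat` stands in for the per-row `append` call (which no longer exists in pandas 2.0 and later), and the ratio you see will depend on your data size and pandas version.

```python
import time

import pandas as pd

df = pd.DataFrame({
    'name': ['Han', 'Leia', 'Luke'],
    'hobbies': ['Art;Soccer;Writing', 'Art;Baking;Golf;Singing', 'Baking;Writing'],
})


def grow_per_row(frame):
    """Grow the result one row at a time (the whole frame is copied on each step)."""
    out = None
    for _, row in frame.iterrows():
        for value in str(row['hobbies']).split(';'):
            new_row = pd.DataFrame([{'name': row['name'], 'value': value}])
            out = new_row if out is None else pd.concat([out, new_row], ignore_index=True)
    return out


def build_once(frame):
    """Collect plain dicts first, then build the frame a single time."""
    rows = []
    for _, row in frame.iterrows():
        for value in str(row['hobbies']).split(';'):
            rows.append({'name': row['name'], 'value': value})
    return pd.DataFrame(rows, columns=['name', 'value'])


def timed(func, frame, repeat=20):
    """Return the total wall-clock seconds spent on `repeat` calls of func(frame)."""
    start = time.perf_counter()
    for _ in range(repeat):
        func(frame)
    return time.perf_counter() - start


slow = grow_per_row(df)
fast = build_once(df)
print('per-row concat:', timed(grow_per_row, df))
print('build once:    ', timed(build_once, df))
```

Both functions produce the same long table of (name, value) pairs; only the way the frame is assembled differs.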

Answer 1: (Score: 1)

Why not modify the DataFrame directly?

for idx, row in df.iterrows():
    for hobby in row.hobbies.split(";"):
        df.loc[idx, hobby] = True

df.fillna(False, inplace=True)
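If 0/1 flags indexed by name are wanted, rather than True/False columns added to the original frame, the same row-by-row idea can be sketched without any `.loc` writes. This is a variant for illustration, not the code from the answer above; the names `rows` and `flags` are chosen here.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Han', 'Leia', 'Luke'],
    'hobbies': ['Art;Soccer;Writing', 'Art;Baking;Golf;Singing', 'Baking;Writing'],
})

# One dict per person: {hobby: 1} for every hobby that person listed.
rows = [{hobby: 1 for hobby in hobbies.split(';')} for hobbies in df['hobbies']]

# Hobbies missing from a person's dict become NaN when the frame is built,
# so fill those with 0 and convert the result to integers.
flags = pd.DataFrame(rows, index=pd.Index(df['name'], name='name')).fillna(0).astype(int)

# Put the hobby columns in alphabetical order.
flags = flags.sort_index(axis=1)
print(flags)
```

Because the frame is built once from a list of dicts, this avoids enlarging the DataFrame cell by cell inside the loop.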

Answer 2: (Score: 0)

In fact, using .str.split and .melt should be somewhat faster than looping with iterrows.

  1. Split into multiple columns:

    >>> df = pd.DataFrame([{'name': 'Han', 'hobbies': 'Art;Soccer;Writing'},
    ...                    {'name': 'Leia', 'hobbies': 'Art;Baking;Golf;Singing'},
    ...                    {'name': 'Luke', 'hobbies': 'Baking;Writing'}])
    >>> hobbies = df['hobbies'].str.split(';', expand=True)
    >>> hobbies
            0        1        2        3
    0     Art   Soccer  Writing     None
    1     Art   Baking     Golf  Singing
    2  Baking  Writing     None     None
    
  2. Melt the hobbies by name:

    >>> df = df.drop('hobbies', axis=1)
    >>> df = df.join(hobbies)
    >>> stacked = df.melt('name', value_name='hobby').drop('variable', axis=1)
    >>> stacked
        name    hobby
    0    Han      Art
    1   Leia      Art
    2   Luke   Baking
    3    Han   Soccer
    4   Leia   Baking
    5   Luke  Writing
    6    Han  Writing
    7   Leia     Golf
    8   Luke     None
    9    Han     None
    10  Leia  Singing
    11  Luke     None
    
  3. Count the values:

    >>> counts = stacked.groupby('name')['hobby'].value_counts()
    >>> result = counts.unstack(level=-1).fillna(0).astype(int)
    >>> result
    hobby  Art  Baking  Golf  Singing  Soccer  Writing
    name
    Han      1       0     0        0       1        1
    Leia     1       1     1        1       0        0
    Luke     0       1     0        0       0        1
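The three steps above can also be sketched as one function. This version swaps the split-with-expand plus melt combination for `DataFrame.explode` (available since pandas 0.25), which goes from the list column to the long table directly; the function name `hobby_flags` and its `sep` parameter are assumptions made for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Han', 'Leia', 'Luke'],
    'hobbies': ['Art;Soccer;Writing', 'Art;Baking;Golf;Singing', 'Baking;Writing'],
})


def hobby_flags(frame, sep=';'):
    """Return one row per name and one count column per hobby."""
    # Steps 1 and 2: split each string into a list, then emit one row per list element.
    long = frame.assign(hobby=frame['hobbies'].str.split(sep)).explode('hobby')
    # Step 3: count each (name, hobby) pair and pivot the hobbies into columns.
    counts = long.groupby(['name', 'hobby']).size()
    return counts.unstack(fill_value=0)


result = hobby_flags(df)
print(result)
```

Passing `fill_value=0` to `unstack` keeps the counts as integers, so no `fillna(0).astype(int)` step is needed afterwards.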
    

There are alternatives for steps 2 and 3, such as get_dummies or crosstab, as discussed in "Pandas get_dummies on multiple columns", but the first will eat your memory and the second is much slower.
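For reference, here is a minimal sketch of what those two alternatives could look like on the sample data. It uses `Series.str.get_dummies`, which splits on a separator itself, rather than the top-level `pd.get_dummies` discussed in the linked question, and the variable names `dummies`, `long` and `table` are chosen here. The memory and speed caveats above are the answer's; this sketch does not measure them.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Han', 'Leia', 'Luke'],
    'hobbies': ['Art;Soccer;Writing', 'Art;Baking;Golf;Singing', 'Baking;Writing'],
})

# Alternative 1: str.get_dummies splits each string on `sep` and returns
# one indicator column per distinct token, keeping the name index.
dummies = df.set_index('name')['hobbies'].str.get_dummies(sep=';')

# Alternative 2: build a long (name, hobby) table, then let crosstab count the pairs.
long = (
    df.assign(hobby=df['hobbies'].str.split(';'))
      .explode('hobby')
      .reset_index(drop=True)  # give every exploded row its own index label
)
table = pd.crosstab(long['name'], long['hobby'])

print(dummies)
print(table)
```

On the sample data both give the same name-by-hobby table; which one is practical at 1.5 million rows is something to measure on the real data.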


References:
Pandas split column into multiple columns by comma
Pandas DataFrame stack multiple column values into single column