pandas DataFrame: split a list column and expand it into rows

Date: 2015-12-16 14:24:04

Tags: python pandas

How can I turn a DataFrame like this:

     A  B  C  D
  1  1  2  3  ['a','b']   
  2  4  6  7  ['b','c']
  3  1  0  1  ['a']
  4  2  1  1  ['b']
  5  1  2  3  [] 

into this:

     A  B  C  D
  1  1  2  3  ['a']
  2  1  2  3  ['b']
  3  4  6  7  ['b']
  4  4  6  7  ['c']
  5  1  0  1  ['a']
  6  2  1  1  ['b']
  7  1  2  3  [] 

P.S. The goal is to split the lists in column "D" so that each item gets its own row.
I am using a pandas DataFrame to process the data.

2 answers:

Answer 0 (score: 3)

One approach is a list comprehension with two nested for clauses:

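 >>> # items() replaces iteritems(), which pandas 2.0 removed. An empty
 >>> # list is falsy, so `or [[]]` keeps that row with an empty list.
 >>> [(key + (item,))
 ...  for key, val in df.set_index(['A','B','C'])['D'].items()
 ...  for item in ([[v] for v in val] or [[]])]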

# [(1, 2, 3, ['a']),
#  (1, 2, 3, ['b']),
#  (4, 6, 7, ['b']),
#  (4, 6, 7, ['c']),
#  (1, 0, 1, ['a']),
#  (2, 1, 1, ['b']),
#  (1, 2, 3, [])]

Passing data in this form to pd.DataFrame produces the desired result:

import pandas as pd
df = pd.DataFrame({'A': {1: 1, 2: 4, 3: 1, 4: 2, 5: 1},
 'B': {1: 2, 2: 6, 3: 0, 4: 1, 5: 2},
 'C': {1: 3, 2: 7, 3: 1, 4: 1, 5: 3},
 'D': {1: ['a', 'b'], 2: ['b', 'c'], 3: ['a'], 4: ['b'], 5: []}})
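# items() replaces iteritems(), which pandas 2.0 removed. Each item is
# wrapped in its own list; `or [[]]` keeps rows whose list is empty.
result = pd.DataFrame(
    [(key + (item,))
     for key, val in df.set_index(['A','B','C'])['D'].items()
     for item in ([[v] for v in val] or [[]])])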

which yields

   0  1  2    3
0  1  2  3  [a]
1  1  2  3  [b]
2  4  6  7  [b]
3  4  6  7  [c]
4  1  0  1  [a]
5  2  1  1  [b]
6  1  2  3   []
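The result above carries integer column labels (0 to 3) because no names were passed to the constructor. A minimal sketch of one way to keep the original names, assuming the same sample data as the question; the `rows` variable and the explicit `columns` list are illustrative additions, not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame({'A': {1: 1, 2: 4, 3: 1, 4: 2, 5: 1},
                   'B': {1: 2, 2: 6, 3: 0, 4: 1, 5: 2},
                   'C': {1: 3, 2: 7, 3: 1, 4: 1, 5: 3},
                   'D': {1: ['a', 'b'], 2: ['b', 'c'], 3: ['a'], 4: ['b'], 5: []}})

# One tuple per output row: the (A, B, C) key plus a one-item list.
# Rows whose list is empty fall back to a single empty list.
rows = [key + (item,)
        for key, val in df.set_index(['A', 'B', 'C'])['D'].items()
        for item in ([[v] for v in val] or [[]])]

# Passing column names explicitly (an assumption about the desired labels).
result = pd.DataFrame(rows, columns=['A', 'B', 'C', 'D'])
print(result)
```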

Another option is to use df['D'].apply to spread the list items across separate columns, then use stack to turn those columns into rows:

df = pd.DataFrame({'A': {1: 1, 2: 4, 3: 1, 4: 2, 5: 1},
 'B': {1: 2, 2: 6, 3: 0, 4: 1, 5: 2},
 'C': {1: 3, 2: 7, 3: 1, 4: 1, 5: 3},
 'D': {1: ['a', 'b'], 2: ['b', 'c'], 3: ['a'], 4: ['b'], 5: []}})
df = df.set_index(['A', 'B', 'C'])
# Wrap each item in its own list; an empty list becomes [[]].
result = df['D'].apply(lambda x: pd.Series([[v] for v in x] if x else [[]]))
#          0    1
# A B C          
# 1 2 3  [a]  [b]
# 4 6 7  [b]  [c]
# 1 0 1  [a]  NaN
# 2 1 1  [b]  NaN
# 1 2 3   []  NaN

result = result.stack()
# A  B  C   
# 1  2  3  0    [a]
#          1    [b]
# 4  6  7  0    [b]
#          1    [c]
# 1  0  1  0    [a]
# 2  1  1  0    [b]
# 1  2  3  0     []
# dtype: object

result.index = result.index.droplevel(-1)
result = result.reset_index()
#    A  B  C    0
# 0  1  2  3  [a]
# 1  1  2  3  [b]
# 2  4  6  7  [b]
# 3  4  6  7  [c]
# 4  1  0  1  [a]
# 5  2  1  1  [b]
# 6  1  2  3   []

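Although this uses no explicit for loop or list comprehension, an implicit loop is hidden inside the call to apply. In fact, it is much slower than the list comprehension: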

In [170]: df = pd.concat([df]*10)

In [171]: %%timeit
   .....: result = df['D'].apply(lambda x: pd.Series(map(list, x) if x else [[]]))
   .....: result = result.stack()
   .....: result.index = result.index.droplevel(-1)
   .....: result = result.reset_index()
100 loops, best of 3: 11.5 ms per loop

In [172]: %%timeit
   .....: result = pd.DataFrame(
   .....:     [(key + (item,))
   .....:      for key, val in df['D'].iteritems()
   .....:      for item in map(list, val) or [[]]])
1000 loops, best of 3: 618 µs per loop
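On pandas 0.25 or later, DataFrame.explode is a third route. It is not part of the original answer and is not included in the timings above. The sketch below assumes the sample data from the question; the `wrapped` name is illustrative. Because explode turns an empty list into NaN, each list is first rewritten as a list of one-item lists, with `[[]]` standing in for an empty list, so that the output matches the format requested in the question.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 4, 1, 2, 1],
    'B': [2, 6, 0, 1, 2],
    'C': [3, 7, 1, 1, 3],
    'D': [['a', 'b'], ['b', 'c'], ['a'], ['b'], []],
})

# ['a', 'b'] -> [['a'], ['b']]; [] -> [[]] so the empty row survives.
wrapped = df.assign(D=df['D'].apply(lambda items: [[v] for v in items] or [[]]))

# explode unpacks one level only, so every cell of D ends up as a
# one-item (or empty) list.
result = wrapped.explode('D').reset_index(drop=True)
print(result)
```

If plain scalars in D are acceptable (with NaN for the empty list), `df.explode('D')` alone should be enough.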

Answer 1 (score: 1)

Assuming the contents of your column D are of type str:

print(type(df.loc[0, 'D']))

<class 'str'>

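df = df.set_index(['A', 'B', 'C']).sort_index()  # sort_index() replaces the older sortlevel()
# Drop the surrounding brackets, split on commas, and stack the pieces into rows.
df.loc[:, 'D'] = df.loc[:, 'D'].str.strip('[').str.strip(']')
df = df.loc[:, 'D'].str.split(',', expand=True).stack()
# Trim whitespace around each piece, then wrap it in brackets again.
df = df.str.strip().apply(lambda x: '[{}]'.format(x)).reset_index().drop('level_3', axis=1).rename(columns={0: 'D'})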


   A  B  C      D
0  1  0  1  ['a']
1  1  2  3  ['a']
2  1  2  3  ['b']
3  1  2  3     []
4  2  1  1  ['b']
5  4  6  7  ['b']
6  4  6  7  ['c']
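If column D really does hold strings that look like Python lists, another possibility is to parse them into real lists with ast.literal_eval from the standard library. The list-based approaches from the other answer then apply. This is only a sketch under that assumption: the sample strings below are made up to mirror the question's data, and `parsed` is an illustrative name.

```python
import ast

import pandas as pd

df = pd.DataFrame({
    'A': [1, 4, 1, 2, 1],
    'B': [2, 6, 0, 1, 2],
    'C': [3, 7, 1, 1, 3],
    # Hypothetical string contents, e.g. as read back from a CSV file.
    'D': ["['a','b']", "['b','c']", "['a']", "['b']", "[]"],
})

# literal_eval safely evaluates Python literals such as "['a','b']".
parsed = df.assign(D=df['D'].apply(ast.literal_eval))
print(type(parsed.loc[0, 'D']))
```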