Importing a Python dict with variable-length list items into pandas

Time: 2013-08-16 18:50:54

Tags: python dictionary pandas

I have the following dictionary:

d = {'col1': ['a', 'b', 'c'],
     'col2': [[1,2], [4,3,2], []],
}

I would like a pandas DataFrame like this:

idx, col1, col2
  0,  'a',  1
  1,  'a',  2
  2,  'b',  4
  3,  'b',  3
  4,  'b',  2
  5,  'c',  nan

How can I build this? If I just pass the dict, it does not unpack the list items in col2 or repeat the matching col1 values. Thanks!
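To make the problem concrete, here is a minimal sketch of what passing the dict straight to the constructor does. It only assumes that pandas is imported as `pd`; the data is the dict from the question.

```python
import pandas as pd

d = {'col1': ['a', 'b', 'c'],
     'col2': [[1, 2], [4, 3, 2], []]}

# Passing the dict directly keeps each list as a single cell value,
# so the frame has one row per col1 entry instead of one row per list item.
df = pd.DataFrame(d)
print(df)
```

The frame has three rows, and each cell of col2 holds a whole list (including the empty one) rather than one item per row.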

1 answer:

Answer 0 (score: 3)

You just have to build it yourself. Here is one way:

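from numpy import nan
from pandas import Series, concat

col1 = ['a', 'b', 'c']
col2 = [[1,2], [4,3,2], []]
col2_lens = [len(el) for el in col2]

# flatten col2, substituting a single nan for each empty list
s2 = Series([eli for el in col2 for eli in (el or [nan])])

# repeat each element of col1 len(col2[i]) times (once if col2[i] is empty);
# the join/list round trip assumes single-character strings
s1 = Series(list(''.join(el * (col2_len or 1) for el, col2_len in zip(col1, col2_lens))))
concat([s1, s2], axis=1)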

which produces

   0   1
0  a   1
1  a   2
2  b   4
3  b   3
4  b   2
5  c NaN
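One caveat: the `''.join(...)` followed by `list(...)` step splits the joined string back into single characters, so it only works when every entry of col1 is one character long. Below is a sketch of a variant that should handle arbitrary labels. It swaps in `numpy.repeat`, which is not part of the answer above; the label `'alpha'` and the column names passed through `keys` are made up for illustration.

```python
import numpy as np
import pandas as pd

col1 = ['alpha', 'b', 'c']            # 'alpha' is a hypothetical multi-character label
col2 = [[1, 2], [4, 3, 2], []]

# Repeat count per row: one per list item, or once if the list is empty.
repeats = [len(el) or 1 for el in col2]

# np.repeat repeats whole elements, so multi-character strings stay intact.
s1 = pd.Series(np.repeat(col1, repeats))

# Flatten col2, substituting a single NaN for each empty list.
s2 = pd.Series([item for el in col2 for item in (el or [np.nan])])

# keys= names the resulting columns instead of leaving them as 0 and 1.
df = pd.concat([s1, s2], axis=1, keys=['col1', 'col2'])
print(df)
```

Because the NaN forces a float column, col2 comes out as floats here, just as in the output above.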

Here are the %%timeit results for the three methods shown here:

1

%%timeit
col2_lens = map(len, col2)

# flatten col2
s2 = Series([eli for el in col2 for eli in (el or [nan])])

# repeat each element of col1 len(col2[i]) times (once if col2[i] is empty)
s1 = Series(list(''.join(el * (col2_len or 1) for el, col2_len in zip(col1, col2_lens))))
concat([s1, s2], axis=1)

1000 loops, best of 3: 646 µs per loop

2

%%timeit
df = pd.DataFrame()
for a, b in zip(col1, col2):
    df = pd.concat([df, pd.DataFrame({'col1': a, 'col2': b or [np.nan]})])

100 loops, best of 3: 2.52 ms per loop

3

%%timeit
frames = []
for a, b in zip(col1, col2):
    frames.append(pd.DataFrame({'col1': a, 'col2': b or [np.nan]}))
df = pd.concat(frames)

1000 loops, best of 3: 1.58 ms per loop
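For readers on a newer pandas, a possible alternative to all three loops is `DataFrame.explode`. This is a sketch that assumes pandas 0.25 or later, where that method is available. It is not one of the methods timed above, and no timing is claimed for it.

```python
import pandas as pd

d = {'col1': ['a', 'b', 'c'],
     'col2': [[1, 2], [4, 3, 2], []]}

# explode() gives each list item its own row and repeats the other columns;
# an empty list becomes a single row holding NaN.
# reset_index(drop=True) replaces the repeated original index with 0..n-1.
df = pd.DataFrame(d).explode('col2').reset_index(drop=True)
print(df)
```

Note that col2 keeps the object dtype after `explode`, so a cast may be needed before doing numeric work on it.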