Python / pandas:来自dict系列的数据框:优化

时间:2016-02-24 10:52:32

标签: python pandas python-3.4

我有一个pandas系列词典,我想将它转换为具有相同索引的数据框。

我找到的唯一方法是通过系列的to_dict方法,这不是很有效,因为它回到了纯python模式而不是numpy / pandas / cython。

您对更好的方法有什么建议吗?

非常感谢。

>>> import pandas as pd
>>> flagInfoSeries = pd.Series(({'a': 1, 'b': 2}, {'a': 10, 'b': 20}))
>>> flagInfoSeries
0      {'a': 1, 'b': 2}
1    {'a': 10, 'b': 20}
dtype: object
>>> pd.DataFrame(flagInfoSeries.to_dict()).T
    a   b
0   1   2
1  10  20

2 个答案:

答案 0 :(得分:3)

我认为你可以使用理解:

import pandas as pd

flagInfoSeries = pd.Series(({'a': 1, 'b': 2}, {'a': 10, 'b': 20}))
print flagInfoSeries
0      {u'a': 1, u'b': 2}
1    {u'a': 10, u'b': 20}
dtype: object

print pd.DataFrame(flagInfoSeries.to_dict()).T
    a   b
0   1   2
1  10  20

print pd.DataFrame([x for x in flagInfoSeries])
    a   b
0   1   2
1  10  20

<强>时序

In [203]: %timeit pd.DataFrame(flagInfoSeries.to_dict()).T
The slowest run took 4.46 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 554 µs per loop

In [204]: %timeit pd.DataFrame([x for x in flagInfoSeries])
The slowest run took 5.11 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 361 µs per loop

In [209]: %timeit flagInfoSeries.apply(lambda dict: pd.Series(dict))
The slowest run took 4.76 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 751 µs per loop

编辑:

如果您需要保留索引,请尝试将index=flagInfoSeries.index添加到DataFrame构造函数:

print pd.DataFrame([x for x in flagInfoSeries], index=flagInfoSeries.index)

<强>计时

In [257]: %timeit pd.DataFrame([x for x in flagInfoSeries], index=flagInfoSeries.index)
1000 loops, best of 3: 350 µs per loop

<强>示例

import pandas as pd

flagInfoSeries = pd.Series(({'a': 1, 'b': 2}, {'a': 10, 'b': 20}))
flagInfoSeries.index = [2,8]
print flagInfoSeries
2      {u'a': 1, u'b': 2}
8    {u'a': 10, u'b': 20}

print pd.DataFrame(flagInfoSeries.to_dict()).T
    a   b
2   1   2
8  10  20

print pd.DataFrame([x for x in flagInfoSeries], index=flagInfoSeries.index)
    a   b
2   1   2
8  10  20

答案 1 :(得分:0)

这可以避免to_dict,但apply也可能会很慢:

flagInfoSeries.apply(lambda dict: pd.Series(dict))

修改:我看到jezrael添加了时间比较。这是我的:

%timeit flagInfoSeries.apply(lambda dict: pd.Series(dict))
1000 loops, best of 3: 935 µs per loop