Question

I am trying to build an ETL toolkit with pandas and HDF5.

My plan was:

1. extract a table from MySQL into a DataFrame;
2. put this DataFrame into an HDFStore.

But when I got to step 2, I found that putting the DataFrame into a *.h5 file takes far too much time.

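Size of the table on the source MySQL server: 498 MB
- 52 columns
- 924,624 records
Size of the *.h5 file after putting the DataFrame into it: 513 MB
- the 'put' operation takes 849.345677137 seconds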

My questions are:
Is this time cost normal? Is there any way to make it faster?

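Update 1

Thanks, Jeff.

My code is pretty simple:

from pandas import HDFStore

# df_staff is the DataFrame extracted from MySQL
extract_store = HDFStore('extract_store.h5')
extract_store['df_staff'] = df_staff

When I tried 'ptdump -av file.h5' I got an error, although I can still load the DataFrame object from this h5 file: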

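tables.exceptions.HDF5ExtError: HDF5 error back trace

  File "../../../src/H5F.c", line 1512, in H5Fopen
    unable to open file
  File "../../../src/H5F.c", line 1307, in H5F_open
    unable to read superblock
  File "../../../src/H5Fsuper.c", line 305, in H5F_super_read
    unable to find file signature
  File "../../../src/H5Fsuper.c", line 153, in H5F_locate_signature
    unable to find a valid file signature

End of HDF5 error back trace

Unable to open/create file 'extract_store.h5'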

Some other info:
- pandas version: '0.10.0'
- OS: Ubuntu Server 10.04 x86_64
- CPU: 8 * Intel(R) Xeon(R) CPU X5670 @ 2.93GHz
- MemTotal: 51634016 kB

I will update pandas to 0.10.1-dev and try again.

Update 2

I have updated pandas to '0.10.1.dev-6e2b6ea',
but the time cost did not decrease; this time it took 884.15 s.
The output of 'ptdump -av file.h5' is:

    / (RootGroup) ''  
      /._v_attrs (AttributeSet), 4 attributes:  
       [CLASS := 'GROUP',  
        PYTABLES_FORMAT_VERSION := '2.0',  
        TITLE := '',  
        VERSION := '1.0']  
    /df_bugs (Group) ''  
      /df_bugs._v_attrs (AttributeSet), 12 attributes:  
       [CLASS := 'GROUP',  
        TITLE := '',  
        VERSION := '1.0',  
        axis0_variety := 'regular',  
        axis1_variety := 'regular',  
        block0_items_variety := 'regular',  
        block1_items_variety := 'regular',  
        block2_items_variety := 'regular',  
        nblocks := 3,  
        ndim := 2,  
        pandas_type := 'frame',  
        pandas_version := '0.10.1']  
    /df_bugs/axis0 (Array(52,)) ''  
      atom := StringAtom(itemsize=19, shape=(), dflt='')  
      maindim := 0  
      flavor := 'numpy'  
      byteorder := 'irrelevant'  
      chunkshape := None  
      /df_bugs/axis0._v_attrs (AttributeSet), 7 attributes:  
       [CLASS := 'ARRAY',  
        FLAVOR := 'numpy',  
        TITLE := '',  
        VERSION := '2.3',  
        kind := 'string',  
        name := None,  
        transposed := True]  
    /df_bugs/axis1 (Array(924624,)) ''  
      atom := Int64Atom(shape=(), dflt=0)  
      maindim := 0  
      flavor := 'numpy'  
      byteorder := 'little'  
      chunkshape := None  
      /df_bugs/axis1._v_attrs (AttributeSet), 7 attributes:  
       [CLASS := 'ARRAY',  
        FLAVOR := 'numpy',  
        TITLE := '',  
        VERSION := '2.3',  
        kind := 'integer',  
        name := None,  
        transposed := True]  
    /df_bugs/block0_items (Array(5,)) ''  
      atom := StringAtom(itemsize=12, shape=(), dflt='')  
      maindim := 0   
      flavor := 'numpy'  
      byteorder := 'irrelevant'  
      chunkshape := None  
      /df_bugs/block0_items._v_attrs (AttributeSet), 7 attributes:  
       [CLASS := 'ARRAY',  
        FLAVOR := 'numpy',  
        TITLE := '',  
        VERSION := '2.3',  
        kind := 'string',  
        name := None,  
        transposed := True]  
    /df_bugs/block0_values (Array(924624, 5)) ''  
      atom := Float64Atom(shape=(), dflt=0.0)  
      maindim := 0  
      flavor := 'numpy'  
      byteorder := 'little'  
      chunkshape := None  
      /df_bugs/block0_values._v_attrs (AttributeSet), 5 attributes:  
       [CLASS := 'ARRAY',  
        FLAVOR := 'numpy',  
        TITLE := '',  
        VERSION := '2.3',  
        transposed := True]  
    /df_bugs/block1_items (Array(19,)) ''  
      atom := StringAtom(itemsize=19, shape=(), dflt='')  
      maindim := 0  
      flavor := 'numpy'  
      byteorder := 'irrelevant'  
      chunkshape := None  
      /df_bugs/block1_items._v_attrs (AttributeSet), 7 attributes:  
       [CLASS := 'ARRAY',  
        FLAVOR := 'numpy',  
        TITLE := '',  
        VERSION := '2.3',  
        kind := 'string',  
        name := None,  
        transposed := True]  
    /df_bugs/block1_values (Array(924624, 19)) ''  
      atom := Int64Atom(shape=(), dflt=0)  
      maindim := 0  
      flavor := 'numpy'  
      byteorder := 'little'  
      chunkshape := None  
      /df_bugs/block1_values._v_attrs (AttributeSet), 5 attributes:  
       [CLASS := 'ARRAY',  
        FLAVOR := 'numpy',  
        TITLE := '',   
        VERSION := '2.3',  
        transposed := True]  
    /df_bugs/block2_items (Array(28,)) ''  
      atom := StringAtom(itemsize=18, shape=(), dflt='')  
      maindim := 0  
      flavor := 'numpy'  
      byteorder := 'irrelevant'  
      chunkshape := None  
      /df_bugs/block2_items._v_attrs (AttributeSet), 7 attributes:  
       [CLASS := 'ARRAY',  
        FLAVOR := 'numpy',  
        TITLE := '',  
        VERSION := '2.3',
        kind := 'string',  
        name := None,  
        transposed := True]  
    /df_bugs/block2_values (VLArray(1,)) ''  
      atom = ObjectAtom()  
      byteorder = 'irrelevant'  
      nrows = 1  
      flavor = 'numpy'  
      /df_bugs/block2_values._v_attrs (AttributeSet), 5 attributes:  
       [CLASS := 'VLARRAY',  
        PSEUDOATOM := 'object',  
        TITLE := '',   
        VERSION := '1.3',  
        transposed := True]

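I also tried putting the DataFrame into the HDFStore with the param 'table' set to True, but got an error instead; it seems that Python's datetime type is not supported:

Exception: cannot find the correct atom type -> [dtype->object] object of type 'datetime.datetime' has no len()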

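Update 3

Thanks, Jeff. Sorry for the delay.

tables.__version__: '2.4.0'.
Yes, the 884 seconds is the cost of the put operation alone, without the pull from MySQL.
One row of the DataFrame (df.ix[0]):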

bug_id                                   1
assigned_to                            185
bug_file_loc                          None
bug_severity                      critical
bug_status                          closed
creation_ts            1998-05-06 21:27:00
delta_ts               2012-05-09 14:41:41
short_desc                    Two cursors.
host_op_sys                        Unknown
guest_op_sys                       Unknown
priority                                P3
rep_platform                          IA32
reporter                                56
product_id                               7
category_id                            983
component_id                         12925
resolution                           fixed
target_milestone                       ws1
qa_contact                             412
status_whiteboard                         
votes                                    0
keywords                                SR
lastdiffed             2012-05-09 14:41:41
everconfirmed                            1
reporter_accessible                      1
cclist_accessible                        1
estimated_time                        0.00
remaining_time                        0.00
deadline                              None
alias                                 None
found_in_product_id                      0
found_in_version_id                      0
found_in_phase_id                        0
cf_type                             Defect
cf_reported_by                 Development
cf_attempted                           NaN
cf_failed                              NaN
cf_public_summary                         
cf_doc_impact                            0
cf_security                              0
cf_build                               NaN
cf_branch                                 
cf_change                              NaN
cf_test_id                             NaN
cf_regression                      Unknown
cf_reviewer                              0
cf_on_hold                               0
cf_public_severity                     ---
cf_i18n_impact                           0
cf_eta                                None
cf_bug_source                          ---
cf_viss                               None
Name: 0, Length: 52

A summary of the DataFrame (just type 'df' in the IPython notebook):


Int64Index: 924624 entries, 0 to 924623
Data columns:
bug_id                 924624  non-null values
assigned_to            924624  non-null values
bug_file_loc           427318  non-null values
bug_severity           924624  non-null values
bug_status             924624  non-null values
creation_ts            924624  non-null values
delta_ts               924624  non-null values
short_desc             924624  non-null values
host_op_sys            924624  non-null values
guest_op_sys           924624  non-null values
priority               924624  non-null values
rep_platform           924624  non-null values
reporter               924624  non-null values
product_id             924624  non-null values
category_id            924624  non-null values
component_id           924624  non-null values
resolution             924624  non-null values
target_milestone       924624  non-null values
qa_contact             924624  non-null values
status_whiteboard      924624  non-null values
votes                  924624  non-null values
keywords               924624  non-null values
lastdiffed             924509  non-null values
everconfirmed          924624  non-null values
reporter_accessible    924624  non-null values
cclist_accessible      924624  non-null values
estimated_time         924624  non-null values
remaining_time         924624  non-null values
deadline               0  non-null values
alias                  0  non-null values
found_in_product_id    924624  non-null values
found_in_version_id    924624  non-null values
found_in_phase_id      924624  non-null values
cf_type                924624  non-null values
cf_reported_by         924624  non-null values
cf_attempted           89622  non-null values
cf_failed              89587  non-null values
cf_public_summary      510799  non-null values
cf_doc_impact          924624  non-null values
cf_security            924624  non-null values
cf_build               327460  non-null values
cf_branch              614929  non-null values
cf_change              300612  non-null values
cf_test_id             12610  non-null values
cf_regression          924624  non-null values
cf_reviewer            924624  non-null values
cf_on_hold             924624  non-null values
cf_public_severity     924624  non-null values
cf_i18n_impact         924624  non-null values
cf_eta                 3910  non-null values
cf_bug_source          924624  non-null values
cf_viss                725  non-null values
dtypes: float64(5), int64(19), object(28)

After 'convert_objects()':

dtypes: datetime64[ns](2), float64(5), int64(19), object(26)

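Putting the converted DataFrame into the HDFStore now costs 749.50 s :)
- It seems that reducing the number of 'object' dtypes is the key to reducing the time cost.
Putting the converted DataFrame into the HDFStore with the param 'table' set to True still returns this error: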

/usr/local/lib/python2.6/dist-packages/pandas-0.10.1.dev_6e2b6ea-py2.6-linux-x86_64.egg/pandas/io/pytables.pyc in create_axes(self, axes, obj, validate, nan_rep, data_columns, min_itemsize, **kwargs)
   2203                 raise
   2204             except (Exception), detail:
-> 2205                 raise Exception("cannot find the correct atom type -> [dtype->%s] %s" % (b.dtype.name, str(detail)))
   2206             j += 1
   2207 
Exception: cannot find the correct atom type -> [dtype->object] object of type 'datetime.datetime' has no len()

I am now trying to put the DataFrame without the datetime columns.
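
Since the 'object' dtype appears to drive the cost, it may help to see which Python types actually sit behind each object column. The following is a minimal sketch and not part of the original post: `object_column_types` and the miniature `df_demo` frame are hypothetical, and the frame only imitates a few columns of the table above.

```python
import datetime
from decimal import Decimal

import pandas as pd


def object_column_types(df):
    """Map each object-dtype column to the sorted names of the Python types it holds."""
    report = {}
    for name in df.columns:
        col = df[name]
        if col.dtype == object:
            report[name] = sorted({type(v).__name__ for v in col})
    return report


# Hypothetical miniature of the bug table; dtype=object is forced so the
# example does not depend on pandas' type inference.
df_demo = pd.DataFrame({
    "bug_id": pd.Series([1, 2]),
    "short_desc": pd.Series(["Two cursors.", "Another bug"], dtype=object),
    "estimated_time": pd.Series([Decimal("0.00"), Decimal("1.50")], dtype=object),
    "lastdiffed": pd.Series(
        [datetime.datetime(2012, 5, 9, 14, 41, 41), None], dtype=object
    ),
})

report = object_column_types(df_demo)
for column, type_names in report.items():
    print(column, type_names)
```

Columns that report anything other than plain strings (for example Decimal or datetime values) are the likely candidates for conversion before the put.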

Update 4

There are 4 columns in MySQL whose type is datetime:
- creation_ts
- delta_ts
- lastdiffed
- deadline

After calling convert_objects():

creation_ts:

Timestamp: 1998-05-06 21:27:00

delta_ts:

Timestamp: 2012-05-09 14:41:41

lastdiffed:

datetime.datetime(2012, 5, 9, 14, 41, 41)

deadline is always None, both before and after calling 'convert_objects':

None

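Putting the DataFrame without the column 'lastdiffed' costs 691.75 s.
When putting the DataFrame without the column 'lastdiffed' and with the param 'table' set to True, I got a new error: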

/usr/local/lib/python2.6/dist-packages/pandas-0.10.1.dev_6e2b6ea-py2.6-linux-x86_64.egg/pandas/io/pytables.pyc in create_axes(self, axes, obj, validate, nan_rep, data_columns, min_itemsize, **kwargs)
   2203                 raise
   2204             except (Exception), detail:
-> 2205                 raise Exception("cannot find the correct atom type -> [dtype->%s] %s" % (b.dtype.name, str(detail)))
   2206             j += 1
   2207 

Exception: cannot find the correct atom type -> [dtype->object] object of type 'Decimal' has no len()

In MySQL, the columns 'estimated_time', 'remaining_time' and 'cf_viss' are of type 'decimal'.
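
A small sketch of how such Decimal columns might be cast to float64 in one pass, including rows where the value is missing. The helper name `decimals_to_float` and the sample frame are hypothetical; only the column name comes from the table above.

```python
from decimal import Decimal

import pandas as pd


def decimals_to_float(df, columns):
    """Return a copy of df with the named Decimal columns cast to float64."""
    out = df.copy()
    for name in columns:
        # Missing entries become NaN; Decimal values become ordinary floats.
        out[name] = out[name].map(
            lambda v: float("nan") if pd.isna(v) else float(v)
        ).astype("float64")
    return out


# Hypothetical sample: a Decimal column with one missing value.
df_demo = pd.DataFrame({
    "bug_id": [1, 2, 3],
    "estimated_time": pd.Series(
        [Decimal("0.00"), Decimal("1.50"), None], dtype=object
    ),
})

converted = decimals_to_float(df_demo, ["estimated_time"])
print(converted.dtypes)
```

Mapping through `float` loses the exact decimal representation, which is usually acceptable for time estimates but may not be for monetary values.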

Update 5

I have converted these 'decimal' type columns to 'float' with code like the following:

no_diffed_converted_df_bugs.estimated_time = no_diffed_converted_df_bugs.estimated_time.map(float)

Now the time cost is 372.84 s,
but the 'table' version still raises an error:

/usr/local/lib/python2.6/dist-packages/pandas-0.10.1.dev_6e2b6ea-py2.6-linux-x86_64.egg/pandas/io/pytables.pyc in create_axes(self, axes, obj, validate, nan_rep, data_columns, min_itemsize, **kwargs)
   2203                 raise
   2204             except (Exception), detail:
-> 2205                 raise Exception("cannot find the correct atom type -> [dtype->%s] %s" % (b.dtype.name, str(detail)))
   2206             j += 1
   2207 

Exception: cannot find the correct atom type -> [dtype->object] object of type 'datetime.date' has no len()
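
The last error points at columns that hold plain `datetime.date` objects. One possible way to deal with them, sketched here under the assumption that converting them to a datetime64 column is acceptable, is to run such columns through `pd.to_datetime`. The helper `dates_to_datetime64` and the sample data are hypothetical.

```python
import datetime

import pandas as pd


def dates_to_datetime64(df):
    """Convert object columns whose non-null values are all dates to datetime64."""
    out = df.copy()
    for name in out.columns:
        col = out[name]
        if col.dtype != object:
            continue
        non_null = col.dropna()
        # datetime.datetime is a subclass of datetime.date, so both are matched.
        # Columns with no non-null values at all are left untouched.
        if len(non_null) and all(isinstance(v, datetime.date) for v in non_null):
            out[name] = pd.to_datetime(col)
    return out


# Hypothetical sample: a date column with a missing entry plus a string column.
df_demo = pd.DataFrame({
    "deadline": pd.Series([datetime.date(2012, 5, 9), None], dtype=object),
    "short_desc": pd.Series(["Two cursors.", "Another bug"], dtype=object),
})

fixed = dates_to_datetime64(df_demo)
print(fixed.dtypes)
```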

Answer 1

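I am pretty convinced that your issue is related to the type mapping between the actual types in your DataFrame and how PyTables stores them.

- Simple types (floats/ints/bools) that have a fixed representation are mapped to fixed C types.
- Datetimes are handled if they can be converted properly (e.g. they have a dtype of 'datetime64[ns]'); notably, datetime.date objects are NOT handled. (NaN is a different story and, depending on usage, can cause the entire column type to be mishandled.)
- Strings are mapped (in Storer objects to the Object type; Tables map them to String types).
- Unicode is not handled.
- All other types are handled as Object in Storers, or raise an exception for Tables.

What this means is that if you are doing a put to a Storer (a fixed representation), then all of the non-mappable types become Object, and PyTables pickles these columns. See the following reference for ObjectAtom: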

http://pytables.github.com/usersguide/libref/declarative_classes.html#the-atom-class-and-its-descendants
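
To give a feel for why pickled Object columns are expensive, here is a rough illustration using only NumPy and the standard `pickle` module. It is not PyTables code and does not reproduce its internals; it only contrasts an array of separate Python objects with a fixed-width array that is one contiguous buffer. The sample strings are made up.

```python
import pickle

import numpy as np

strings = ["string%03d" % i for i in range(1000)]

# Object array: every element is a separate Python object, so it can only be
# serialized generically, element by element (here with pickle).
obj_arr = np.array(strings, dtype=object)
pickled = pickle.dumps(obj_arr, protocol=pickle.HIGHEST_PROTOCOL)
restored = pickle.loads(pickled)

# Fixed-width array: one contiguous buffer of 9-byte items that can be
# written to disk as-is.
fixed_arr = np.array(strings, dtype="S9")
raw = fixed_arr.tobytes()

print(len(pickled), len(raw))
```

The fixed-width buffer has a size known in advance (items times item size), which is what lets a storage layer write it in large chunks instead of walking Python objects one at a time.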

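Tables will raise on an invalid type (I should provide a better error message here). I think I will also provide a warning if you try to store a type that is mapped to ObjectAtom (for performance reasons).

To force some types, try some of these: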

import pandas as pd

# convert None to NaN (it is currently Object);
# this converts to float64 (or the type of the other objects)
x = pd.Series([None])
x = x.where(pd.notnull(x)).convert_objects()

# convert datetime-like values with embedded NaNs to datetime64[ns]
df['foo'] = pd.Series(df['foo'].values, dtype = 'M8[ns]')
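
`convert_objects` was deprecated and later removed from pandas, so the snippet above only runs on old releases such as the 0.10.x used in this thread. A rough equivalent for newer versions, assuming `pd.to_numeric` and `pd.to_datetime` are acceptable substitutes, might look like this (the sample values are made up):

```python
import datetime

import pandas as pd

# None -> NaN: the object column becomes float64.
x = pd.Series([None, 1.5], dtype=object)
x_float = pd.to_numeric(x)

# datetime values with embedded missing entries -> datetime64, NaT for the gaps.
foo = pd.Series(
    [datetime.datetime(2012, 5, 9, 14, 41, 41), None], dtype=object
)
foo_ts = pd.to_datetime(foo)

print(x_float.dtype, foo_ts.dtype)
```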

Here is a sample on 64-bit Linux (the file has 1M rows and is about 1 GB on disk):

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: pd.__version__
Out[3]: '0.10.1.dev'

In [3]: import tables

In [4]: tables.__version__
Out[4]: '2.3.1'

In [4]: df = pd.DataFrame(np.random.randn(1000 * 1000, 100), index=range(int(
   ...: 1000 * 1000)), columns=['E%03d' % i for i in xrange(100)])

In [5]: for x in range(20):
   ...:     df['String%03d' % x] = 'string%03d' % x

In [6]: df
Out[6]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 0 to 999999
Columns: 120 entries, E000 to String019
dtypes: float64(100), object(20)

# storer put (cannot query) 
In [9]: def test_put():
   ...:     store = pd.HDFStore('test_put.h5','w')
   ...:     store['df'] = df
   ...:     store.close()

In [10]: %timeit test_put()
1 loops, best of 3: 7.65 s per loop

# table put (can query)
In [7]: def test_put():
      ....:     store = pd.HDFStore('test_put.h5','w')
      ....:     store.put('df',df,table=True)
      ....:     store.close()


In [8]: %timeit test_put()
1 loops, best of 3: 21.4 s per loop

Answer 2

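How to make this faster?

1. Use 'io.sql.read_frame' to load the data from the SQL database into a DataFrame, because 'read_frame' takes care of columns of type 'decimal' by turning them into float.
2. Fill in the missing data for each column.
3. [...] before performing the put operation.
4. If the DataFrame has string-type columns, use 'table' instead of 'storer':

store.put('key', df, table=True)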
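
The fill-and-put steps could be sketched as follows. This is an illustration under stated assumptions, not the code used for the measurements in this answer: `fill_missing` and `put_as_table` are hypothetical helpers, the fill values ('' for object columns, 0 for numeric ones) are arbitrary choices, and `format="table"` is the spelling newer pandas versions use in place of `table=True`.

```python
import pandas as pd


def fill_missing(df):
    """Fill missing values per column: '' for object columns, 0 for numeric ones."""
    out = df.copy()
    for name in out.columns:
        col = out[name]
        if col.dtype == object:
            out[name] = col.fillna("")
        elif col.dtype.kind in "if":
            out[name] = col.fillna(0)
    return out


def put_as_table(df, path, key="df"):
    """Write df in table format; requires PyTables to be installed."""
    with pd.HDFStore(path, mode="w") as store:
        store.put(key, df, format="table")


# Hypothetical sample with gaps in a string column and a float column.
df_demo = pd.DataFrame({
    "cf_branch": pd.Series(["main", None], dtype=object),
    "cf_attempted": [1.0, float("nan")],
})

filled = fill_missing(df_demo)
print(filled)
```

With PyTables available, `put_as_table(filled, 'bugs.h5')` would then write the cleaned frame; the file name is only an example.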

After doing these things, with the same dataset, the performance of the put operation improved a lot:

CPU times: user 42.07 s, sys: 28.17 s, total: 70.24 s
Wall time: 98.97 s

Profile log of the second test:

95984 function calls (95958 primitive calls) in 68.688 CPU seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      445   16.757    0.038   16.757    0.038 {numpy.core.multiarray.array}
       19   16.250    0.855   16.250    0.855 {method '_append_records' of 'tables.tableExtension.Table' objects}
       16    7.958    0.497    7.958    0.497 {method 'astype' of 'numpy.ndarray' objects}
       19    6.533    0.344    6.533    0.344 {pandas.lib.create_hdf_rows_2d}
        4    6.284    1.571    6.388    1.597 {method '_fillCol' of 'tables.tableExtension.Row' objects}
       20    2.640    0.132    2.641    0.132 {pandas.lib.maybe_convert_objects}
        1    1.785    1.785    1.785    1.785 {pandas.lib.isnullobj}
        7    1.619    0.231    1.619    0.231 {method 'flatten' of 'numpy.ndarray' objects}
       11    1.059    0.096    1.059    0.096 {pandas.lib.infer_dtype}
        1    0.997    0.997   41.952   41.952 pytables.py:2468(write_data)
       19    0.985    0.052   40.590    2.136 pytables.py:2504(write_data_chunk)
        1    0.827    0.827   60.617   60.617 pytables.py:2433(write)
     1504    0.592    0.000    0.592    0.000 {method '_g_readSlice' of 'tables.hdf5Extension.Array' objects}
        4    0.534    0.133   13.676    3.419 pytables.py:1038(set_atom)
        1    0.528    0.528    0.528    0.528 {pandas.lib.max_len_string_array}
        4    0.441    0.110    0.571    0.143 internals.py:1409(_stack_arrays)
       35    0.358    0.010    0.358    0.010 {method 'copy' of 'numpy.ndarray' objects}
        1    0.276    0.276    3.135    3.135 internals.py:208(fillna)
        5    0.263    0.053    2.054    0.411 common.py:128(_isnull_ndarraylike)
       48    0.253    0.005    0.253    0.005 {method '_append' of 'tables.hdf5Extension.Array' objects}
        4    0.240    0.060    1.500    0.375 internals.py:1400(_simple_blockify)
        1    0.234    0.234   12.145   12.145 pytables.py:1066(set_atom_string)
       28    0.225    0.008    0.225    0.008 {method '_createCArray' of 'tables.hdf5Extension.Array' objects}
       36    0.218    0.006    0.218    0.006 {method '_g_writeSlice' of 'tables.hdf5Extension.Array' objects}
     6110    0.155    0.000    0.155    0.000 {numpy.core.multiarray.empty}
        4    0.097    0.024    0.097    0.024 {method 'all' of 'numpy.ndarray' objects}
        6    0.084    0.014    0.084    0.014 {tables.indexesExtension.keysort}
       18    0.084    0.005    0.084    0.005 {method '_g_close' of 'tables.hdf5Extension.Leaf' objects}
    11816    0.064    0.000    0.108    0.000 file.py:1036(_getNode)
       19    0.053    0.003    0.053    0.003 {method '_g_flush' of 'tables.hdf5Extension.Leaf' objects}
     1528    0.045    0.000    0.098    0.000 array.py:342(_interpret_indexing)
    11709    0.040    0.000    0.042    0.000 file.py:248(__getitem__)
        2    0.027    0.013    0.383    0.192 index.py:1099(get_neworder)
        1    0.018    0.018    0.018    0.018 {numpy.core.multiarray.putmask}
        4    0.013    0.003    0.017    0.004 index.py:607(final_idx32)

How to make pandas HDFStore 'put' operation faster
