How to make the pandas HDFStore 'put' operation faster

Time: 2013-01-16 09:35:36

Tags: python pandas

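I am trying to build an ETL toolkit with pandas and HDF5.

My plan was:

  1. extract a table from MySQL into a DataFrame;
  2. put this DataFrame into an HDFStore.

But when executing step 2, I found that putting the DataFrame into a *.h5 file takes far too long.

    • size of the table on the source MySQL server: 498 MB
      • 52 columns
      • 924,624 records
    • size of the *.h5 file after putting the DataFrame into it: 513 MB
      • the 'put' operation took 849.345677137 seconds

    My question is:
    Is this time cost normal? Is there any way to make it faster?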


    Update 1

    Thanks Jeff.

    • My code is very simple:

      from pandas import HDFStore

      extract_store = HDFStore('extract_store.h5')
      extract_store['df_staff'] = df_staff

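    • When I tried 'ptdump -av file.h5' I got an error, but I can still load the DataFrame object from this h5 file:

    tables.exceptions.HDF5ExtError: HDF5 error back trace

      File "../../../src/H5F.c", line 1512, in H5Fopen
        unable to open file
      File "../../../src/H5F.c", line 1307, in H5F_open
        unable to read superblock
      File "../../../src/H5Fsuper.c", line 305, in H5F_super_read
        unable to find file signature
      File "../../../src/H5Fsuper.c", line 153, in H5F_locate_signature
        unable to find a valid file signature

    End of HDF5 error back trace

    Unable to open/create file 'extract_store.h5'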

    • Some other info:
      • pandas version: '0.10.0'
      • OS: Ubuntu Server 10.04 x86_64
      • CPU: 8 * Intel(R) Xeon(R) CPU X5670 @ 2.93GHz
      • MemTotal: 51634016 kB

    I will update pandas to 0.10.1-dev and try again.


    Update 2

    • I have updated pandas to '0.10.1.dev-6e2b6ea'
    • but the time cost did not decrease; this time it took 884.15 s
    • the output of 'ptdump -av file.h5' is:
        / (RootGroup) ''  
          /._v_attrs (AttributeSet), 4 attributes:  
           [CLASS := 'GROUP',  
            PYTABLES_FORMAT_VERSION := '2.0',  
            TITLE := '',  
            VERSION := '1.0']  
        /df_bugs (Group) ''  
          /df_bugs._v_attrs (AttributeSet), 12 attributes:  
           [CLASS := 'GROUP',  
            TITLE := '',  
            VERSION := '1.0',  
            axis0_variety := 'regular',  
            axis1_variety := 'regular',  
            block0_items_variety := 'regular',  
            block1_items_variety := 'regular',  
            block2_items_variety := 'regular',  
            nblocks := 3,  
            ndim := 2,  
            pandas_type := 'frame',  
            pandas_version := '0.10.1']  
        /df_bugs/axis0 (Array(52,)) ''  
          atom := StringAtom(itemsize=19, shape=(), dflt='')  
          maindim := 0  
          flavor := 'numpy'  
          byteorder := 'irrelevant'  
          chunkshape := None  
          /df_bugs/axis0._v_attrs (AttributeSet), 7 attributes:  
           [CLASS := 'ARRAY',  
            FLAVOR := 'numpy',  
            TITLE := '',  
            VERSION := '2.3',  
            kind := 'string',  
            name := None,  
            transposed := True]  
        /df_bugs/axis1 (Array(924624,)) ''  
          atom := Int64Atom(shape=(), dflt=0)  
          maindim := 0  
          flavor := 'numpy'  
          byteorder := 'little'  
          chunkshape := None  
          /df_bugs/axis1._v_attrs (AttributeSet), 7 attributes:  
           [CLASS := 'ARRAY',  
            FLAVOR := 'numpy',  
            TITLE := '',  
            VERSION := '2.3',  
            kind := 'integer',  
            name := None,  
            transposed := True]  
        /df_bugs/block0_items (Array(5,)) ''  
          atom := StringAtom(itemsize=12, shape=(), dflt='')  
          maindim := 0   
          flavor := 'numpy'  
          byteorder := 'irrelevant'  
          chunkshape := None  
          /df_bugs/block0_items._v_attrs (AttributeSet), 7 attributes:  
           [CLASS := 'ARRAY',  
            FLAVOR := 'numpy',  
            TITLE := '',  
            VERSION := '2.3',  
            kind := 'string',  
            name := None,  
            transposed := True]  
        /df_bugs/block0_values (Array(924624, 5)) ''  
          atom := Float64Atom(shape=(), dflt=0.0)  
          maindim := 0  
          flavor := 'numpy'  
          byteorder := 'little'  
          chunkshape := None  
          /df_bugs/block0_values._v_attrs (AttributeSet), 5 attributes:  
           [CLASS := 'ARRAY',  
            FLAVOR := 'numpy',  
            TITLE := '',  
            VERSION := '2.3',  
            transposed := True]  
        /df_bugs/block1_items (Array(19,)) ''  
          atom := StringAtom(itemsize=19, shape=(), dflt='')  
          maindim := 0  
          flavor := 'numpy'  
          byteorder := 'irrelevant'  
          chunkshape := None  
          /df_bugs/block1_items._v_attrs (AttributeSet), 7 attributes:  
           [CLASS := 'ARRAY',  
            FLAVOR := 'numpy',  
            TITLE := '',  
            VERSION := '2.3',  
            kind := 'string',  
            name := None,  
            transposed := True]  
        /df_bugs/block1_values (Array(924624, 19)) ''  
          atom := Int64Atom(shape=(), dflt=0)  
          maindim := 0  
          flavor := 'numpy'  
          byteorder := 'little'  
          chunkshape := None  
          /df_bugs/block1_values._v_attrs (AttributeSet), 5 attributes:  
           [CLASS := 'ARRAY',  
            FLAVOR := 'numpy',  
            TITLE := '',   
            VERSION := '2.3',  
            transposed := True]  
        /df_bugs/block2_items (Array(28,)) ''  
          atom := StringAtom(itemsize=18, shape=(), dflt='')  
          maindim := 0  
          flavor := 'numpy'  
          byteorder := 'irrelevant'  
          chunkshape := None  
          /df_bugs/block2_items._v_attrs (AttributeSet), 7 attributes:  
           [CLASS := 'ARRAY',  
            FLAVOR := 'numpy',  
            TITLE := '',  
            VERSION := '2.3',
            kind := 'string',  
            name := None,  
            transposed := True]  
        /df_bugs/block2_values (VLArray(1,)) ''  
          atom = ObjectAtom()  
          byteorder = 'irrelevant'  
          nrows = 1  
          flavor = 'numpy'  
          /df_bugs/block2_values._v_attrs (AttributeSet), 5 attributes:  
           [CLASS := 'VLARRAY',  
            PSEUDOATOM := 'object',  
            TITLE := '',   
            VERSION := '1.3',  
            transposed := True]  
    
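    • I tried putting the DataFrame into the HDFStore with the param 'table' set to True, but got an error; it seems Python's datetime type is not supported:

    Exception: cannot find the correct atom type -> [dtype->object] object of type 'datetime.datetime' has no len()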


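    Update 3

    Thanks Jeff. Sorry for the delay.

    • PyTables (tables) version: '2.4.0'.
    • Yes, the 884 seconds is only the cost of the put operation, without the pull from MySQL
    • one row of the DataFrame (df.ix[0]):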
    bug_id                                   1
    assigned_to                            185
    bug_file_loc                          None
    bug_severity                      critical
    bug_status                          closed
    creation_ts            1998-05-06 21:27:00
    delta_ts               2012-05-09 14:41:41
    short_desc                    Two cursors.
    host_op_sys                        Unknown
    guest_op_sys                       Unknown
    priority                                P3
    rep_platform                          IA32
    reporter                                56
    product_id                               7
    category_id                            983
    component_id                         12925
    resolution                           fixed
    target_milestone                       ws1
    qa_contact                             412
    status_whiteboard                         
    votes                                    0
    keywords                                SR
    lastdiffed             2012-05-09 14:41:41
    everconfirmed                            1
    reporter_accessible                      1
    cclist_accessible                        1
    estimated_time                        0.00
    remaining_time                        0.00
    deadline                              None
    alias                                 None
    found_in_product_id                      0
    found_in_version_id                      0
    found_in_phase_id                        0
    cf_type                             Defect
    cf_reported_by                 Development
    cf_attempted                           NaN
    cf_failed                              NaN
    cf_public_summary                         
    cf_doc_impact                            0
    cf_security                              0
    cf_build                               NaN
    cf_branch                                 
    cf_change                              NaN
    cf_test_id                             NaN
    cf_regression                      Unknown
    cf_reviewer                              0
    cf_on_hold                               0
    cf_public_severity                     ---
    cf_i18n_impact                           0
    cf_eta                                None
    cf_bug_source                          ---
    cf_viss                               None
    Name: 0, Length: 52
    
    • a summary of the DataFrame (just type 'df' in the IPython notebook):
    
    Int64Index: 924624 entries, 0 to 924623
    Data columns:
    bug_id                 924624  non-null values
    assigned_to            924624  non-null values
    bug_file_loc           427318  non-null values
    bug_severity           924624  non-null values
    bug_status             924624  non-null values
    creation_ts            924624  non-null values
    delta_ts               924624  non-null values
    short_desc             924624  non-null values
    host_op_sys            924624  non-null values
    guest_op_sys           924624  non-null values
    priority               924624  non-null values
    rep_platform           924624  non-null values
    reporter               924624  non-null values
    product_id             924624  non-null values
    category_id            924624  non-null values
    component_id           924624  non-null values
    resolution             924624  non-null values
    target_milestone       924624  non-null values
    qa_contact             924624  non-null values
    status_whiteboard      924624  non-null values
    votes                  924624  non-null values
    keywords               924624  non-null values
    lastdiffed             924509  non-null values
    everconfirmed          924624  non-null values
    reporter_accessible    924624  non-null values
    cclist_accessible      924624  non-null values
    estimated_time         924624  non-null values
    remaining_time         924624  non-null values
    deadline               0  non-null values
    alias                  0  non-null values
    found_in_product_id    924624  non-null values
    found_in_version_id    924624  non-null values
    found_in_phase_id      924624  non-null values
    cf_type                924624  non-null values
    cf_reported_by         924624  non-null values
    cf_attempted           89622  non-null values
    cf_failed              89587  non-null values
    cf_public_summary      510799  non-null values
    cf_doc_impact          924624  non-null values
    cf_security            924624  non-null values
    cf_build               327460  non-null values
    cf_branch              614929  non-null values
    cf_change              300612  non-null values
    cf_test_id             12610  non-null values
    cf_regression          924624  non-null values
    cf_reviewer            924624  non-null values
    cf_on_hold             924624  non-null values
    cf_public_severity     924624  non-null values
    cf_i18n_impact         924624  non-null values
    cf_eta                 3910  non-null values
    cf_bug_source          924624  non-null values
    cf_viss                725  non-null values
    dtypes: float64(5), int64(19), object(28)
    
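    • after 'convert_objects()':
    dtypes: datetime64[ns](2), float64(5), int64(19), object(26)
    
    • putting the converted DataFrame into the HDFStore took 749.50 s :)
      • it seems that reducing the number of 'object' dtypes is the key to lowering the time cost
    • putting the converted DataFrame into the HDFStore with param 'table' set to True still returns that error: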
    /usr/local/lib/python2.6/dist-packages/pandas-0.10.1.dev_6e2b6ea-py2.6-linux-x86_64.egg/pandas/io/pytables.pyc in create_axes(self, axes, obj, validate, nan_rep, data_columns, min_itemsize, **kwargs)
       2203                 raise
       2204             except (Exception), detail:
    -> 2205                 raise Exception("cannot find the correct atom type -> [dtype->%s] %s" % (b.dtype.name, str(detail)))
       2206             j += 1
       2207 
    Exception: cannot find the correct atom type -> [dtype->object] object of type 'datetime.datetime' has no len()
    
    • I am now trying to put the DataFrame without the datetime columns

    Update 4

    • There are 4 columns in MySQL whose type is datetime:
      • creation_ts
      • delta_ts
      • lastdiffed
      • deadline

    After calling convert_objects():

    • creation_ts:
    Timestamp: 1998-05-06 21:27:00
    
    • delta_ts:
    Timestamp: 2012-05-09 14:41:41
    
    • lastdiffed
    datetime.datetime(2012, 5, 9, 14, 41, 41)
    
    • deadline is always None, both before and after calling 'convert_objects'
    None
    
    • putting the DataFrame without the column 'lastdiffed' took 691.75 s
    • when putting the DataFrame without the column 'lastdiffed' and with param 'table' set to True, I got a new error:
    /usr/local/lib/python2.6/dist-packages/pandas-0.10.1.dev_6e2b6ea-py2.6-linux-x86_64.egg/pandas/io/pytables.pyc in create_axes(self, axes, obj, validate, nan_rep, data_columns, min_itemsize, **kwargs)
       2203                 raise
       2204             except (Exception), detail:
    -> 2205                 raise Exception("cannot find the correct atom type -> [dtype->%s] %s" % (b.dtype.name, str(detail)))
       2206             j += 1
       2207 
    
    Exception: cannot find the correct atom type -> [dtype->object] object of type 'Decimal' has no len()
    
    • the columns 'estimated_time', 'remaining_time' and 'cf_viss' have type 'decimal' in MySQL

    Update 5

    • I have converted these 'decimal' columns to 'float' with code like this:
    no_diffed_converted_df_bugs.estimated_time = no_diffed_converted_df_bugs.estimated_time.map(float)
    
    • now the time cost is 372.84 s
    • but the 'table' version still raises an error:
    /usr/local/lib/python2.6/dist-packages/pandas-0.10.1.dev_6e2b6ea-py2.6-linux-x86_64.egg/pandas/io/pytables.pyc in create_axes(self, axes, obj, validate, nan_rep, data_columns, min_itemsize, **kwargs)
       2203                 raise
       2204             except (Exception), detail:
    -> 2205                 raise Exception("cannot find the correct atom type -> [dtype->%s] %s" % (b.dtype.name, str(detail)))
       2206             j += 1
       2207 
    
    Exception: cannot find the correct atom type -> [dtype->object] object of type 'datetime.date' has no len()
    

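2 Answers:

Answer 0: (score: 4)

I am pretty convinced that your issue is related to how the actual types in the DataFrame are mapped to the types PyTables stores.

  • Simple types (floats/ints/bools) that have a fixed representation are mapped to fixed C types.
  • Datetimes are handled if they can be converted properly (i.e. they have a dtype of 'datetime64[ns]'); notably, datetime.date is NOT handled. (NaN is a different story and, depending on usage, can cause the entire column type to be mishandled.)
  • Strings are mapped (to Object type in Storer objects; Table maps them to String types).
  • Unicode is not handled.
  • All other types are handled as Object in Storers, or raise an Exception for Tables.

This means that if you are doing a put to a Storer (a fixed representation), all of the non-mappable types become Object. PyTables pickles these columns. See the reference below for ObjectAtom: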

http://pytables.github.com/usersguide/libref/declarative_classes.html#the-atom-class-and-its-descendants

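Table will raise on an invalid type (I should provide a better error message here). I think I will also add a warning if you try to store a type that is mapped to ObjectAtom (for performance reasons).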
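One way to see which columns would fall into that pickled Object path is to list the Python types hiding inside each object-dtype column. This is only a minimal sketch: the helper name `object_column_types` and the sample frame are made up for illustration and are not part of pandas.

```python
import datetime
from decimal import Decimal

import pandas as pd


def object_column_types(df):
    """Map each object-dtype column to the sorted Python type names it holds."""
    report = {}
    for col in df.columns:
        if df[col].dtype != object:
            continue  # fixed-width dtypes (int64, float64, datetime64) are fine
        # dropna() skips None/NaN so only real values are inspected
        report[col] = sorted({type(v).__name__ for v in df[col].dropna()})
    return report


# Hypothetical sample resembling the frame in the question
df = pd.DataFrame({
    "bug_id": [1, 2],
    "estimated_time": [Decimal("0.00"), Decimal("1.50")],
    "deadline": [datetime.date(2013, 1, 16), None],
})
print(object_column_types(df))
```

Columns that report `Decimal`, `date` or `datetime` here are the ones worth converting before the put.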

To force some types, try some of these:

import pandas as pd

# convert None to NaN (the Series is currently object dtype);
# convert_objects() then yields float64 (or the type of the other values)
x = pd.Series([None])
x = x.where(pd.notnull(x)).convert_objects()

# convert datetime-like values with embedded NaNs to datetime64[ns]
df['foo'] = pd.Series(df['foo'].values, dtype='M8[ns]')

A sample on 64-bit Linux (the file has 1M rows and is about 1 GB on disk):

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: pd.__version__
Out[3]: '0.10.1.dev'

In [3]: import tables

In [4]: tables.__version__
Out[4]: '2.3.1'

In [4]: df = pd.DataFrame(np.random.randn(1000 * 1000, 100), index=range(int(
   ...: 1000 * 1000)), columns=['E%03d' % i for i in xrange(100)])

In [5]: for x in range(20):
   ...:     df['String%03d' % x] = 'string%03d' % x

In [6]: df
Out[6]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 0 to 999999
Columns: 120 entries, E000 to String019
dtypes: float64(100), object(20)

# storer put (cannot query) 
In [9]: def test_put():
   ...:     store = pd.HDFStore('test_put.h5','w')
   ...:     store['df'] = df
   ...:     store.close()

In [10]: %timeit test_put()
1 loops, best of 3: 7.65 s per loop

# table put (can query)
In [7]: def test_put():
      ....:     store = pd.HDFStore('test_put.h5','w')
      ....:     store.put('df',df,table=True)
      ....:     store.close()


In [8]: %timeit test_put()
1 loops, best of 3: 21.4 s per loop

Answer 1: (score: 2)

How to make this faster?

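  1. Use 'io.sql.read_frame' to load the data from the SQL database into a DataFrame, because 'read_frame' takes care of columns of type 'decimal' by turning them into float.
  2. Fill in the missing data for each column.
  3. Call 'DataFrame.convert_objects' before the put operation.
  4. If the DataFrame has string columns, use 'table' instead of 'storer':

     store.put('key', df, table=True)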
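The type-cleaning part of these steps can be sketched roughly as follows. This assumes a current pandas release, where `convert_objects` no longer exists, so an explicit float conversion and `pd.to_datetime` stand in for it. `prepare_for_hdf` and the sample columns are hypothetical. Filling missing data (step 2) is left out because the right fill value depends on the column, and the write itself is shown only as a comment because it needs PyTables installed (newer releases spell `table=True` as `format='table'`).

```python
import datetime
from decimal import Decimal

import pandas as pd


def prepare_for_hdf(df, decimal_cols=(), datetime_cols=()):
    """Return a copy in which the listed columns use fixed-width dtypes."""
    out = df.copy()
    for col in decimal_cols:
        # Decimal -> float; None becomes NaN so the column ends up float64
        out[col] = out[col].map(
            lambda v: float("nan") if v is None else float(v)
        ).astype("float64")
    for col in datetime_cols:
        # datetime.date / datetime.datetime -> datetime64; None becomes NaT
        out[col] = pd.to_datetime(out[col])
    return out


# Hypothetical sample with the problem types from the question
raw = pd.DataFrame({
    "bug_id": [1, 2],
    "estimated_time": [Decimal("0.00"), None],
    "lastdiffed": [datetime.date(2012, 5, 9), None],
})
clean = prepare_for_hdf(raw, ["estimated_time"], ["lastdiffed"])
print(clean.dtypes)

# With PyTables available, the cleaned frame could then be written like this:
# with pd.HDFStore("extract_store.h5", mode="w") as store:
#     store.put("df_bugs", clean, format="table")
```

After this the frame has no Decimal or datetime.date values left in object columns, which are the types that triggered the "cannot find the correct atom type" errors in the question.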

    After doing this, the performance of the put operation improved a lot on the same data set:

    CPU times: user 42.07 s, sys: 28.17 s, total: 70.24 s
    Wall time: 98.97 s
    

    The profile log of the second test:

    95984 function calls (95958 primitive calls) in 68.688 CPU seconds
    
       Ordered by: internal time
    
       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
          445   16.757    0.038   16.757    0.038 {numpy.core.multiarray.array}
           19   16.250    0.855   16.250    0.855 {method '_append_records' of 'tables.tableExtension.Table' objects}
           16    7.958    0.497    7.958    0.497 {method 'astype' of 'numpy.ndarray' objects}
           19    6.533    0.344    6.533    0.344 {pandas.lib.create_hdf_rows_2d}
            4    6.284    1.571    6.388    1.597 {method '_fillCol' of 'tables.tableExtension.Row' objects}
           20    2.640    0.132    2.641    0.132 {pandas.lib.maybe_convert_objects}
            1    1.785    1.785    1.785    1.785 {pandas.lib.isnullobj}
            7    1.619    0.231    1.619    0.231 {method 'flatten' of 'numpy.ndarray' objects}
           11    1.059    0.096    1.059    0.096 {pandas.lib.infer_dtype}
            1    0.997    0.997   41.952   41.952 pytables.py:2468(write_data)
           19    0.985    0.052   40.590    2.136 pytables.py:2504(write_data_chunk)
            1    0.827    0.827   60.617   60.617 pytables.py:2433(write)
         1504    0.592    0.000    0.592    0.000 {method '_g_readSlice' of 'tables.hdf5Extension.Array' objects}
            4    0.534    0.133   13.676    3.419 pytables.py:1038(set_atom)
            1    0.528    0.528    0.528    0.528 {pandas.lib.max_len_string_array}
            4    0.441    0.110    0.571    0.143 internals.py:1409(_stack_arrays)
           35    0.358    0.010    0.358    0.010 {method 'copy' of 'numpy.ndarray' objects}
            1    0.276    0.276    3.135    3.135 internals.py:208(fillna)
            5    0.263    0.053    2.054    0.411 common.py:128(_isnull_ndarraylike)
           48    0.253    0.005    0.253    0.005 {method '_append' of 'tables.hdf5Extension.Array' objects}
            4    0.240    0.060    1.500    0.375 internals.py:1400(_simple_blockify)
            1    0.234    0.234   12.145   12.145 pytables.py:1066(set_atom_string)
           28    0.225    0.008    0.225    0.008 {method '_createCArray' of 'tables.hdf5Extension.Array' objects}
           36    0.218    0.006    0.218    0.006 {method '_g_writeSlice' of 'tables.hdf5Extension.Array' objects}
         6110    0.155    0.000    0.155    0.000 {numpy.core.multiarray.empty}
            4    0.097    0.024    0.097    0.024 {method 'all' of 'numpy.ndarray' objects}
            6    0.084    0.014    0.084    0.014 {tables.indexesExtension.keysort}
           18    0.084    0.005    0.084    0.005 {method '_g_close' of 'tables.hdf5Extension.Leaf' objects}
        11816    0.064    0.000    0.108    0.000 file.py:1036(_getNode)
           19    0.053    0.003    0.053    0.003 {method '_g_flush' of 'tables.hdf5Extension.Leaf' objects}
         1528    0.045    0.000    0.098    0.000 array.py:342(_interpret_indexing)
        11709    0.040    0.000    0.042    0.000 file.py:248(__getitem__)
            2    0.027    0.013    0.383    0.192 index.py:1099(get_neworder)
            1    0.018    0.018    0.018    0.018 {numpy.core.multiarray.putmask}
            4    0.013    0.003    0.017    0.004 index.py:607(final_idx32)