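I am trying to build an ETL toolkit using pandas and HDF5.
My plan was:
But when I got to step 2, I found that putting a DataFrame into the *.h5 file takes far too much time.
My questions are:
Is this time cost normal?
Is there a way to make it faster?
Thanks, Jeff.
My code is very simple: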
from pandas import HDFStore

extract_store = HDFStore('extract_store.h5')
extract_store['df_staff'] = df_staff
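tables.exceptions.HDF5ExtError: HDF5 error back trace
  File "../../../src/H5F.c", line 1512, in H5Fopen
    unable to open file
  File "../../../src/H5F.c", line 1307, in H5F_open
    unable to read superblock
  File "../../../src/H5Fsuper.c", line 305, in H5F_super_read
    unable to find file signature
  File "../../../src/H5Fsuper.c", line 153, in H5F_locate_signature
    unable to find a valid file signature
End of HDF5 error back trace
Unable to open/create file 'extract_store.h5'
I will update pandas to 0.10.1-dev and try again.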
/ (RootGroup) ''
  /._v_attrs (AttributeSet), 4 attributes:
   [CLASS := 'GROUP', PYTABLES_FORMAT_VERSION := '2.0', TITLE := '', VERSION := '1.0']
/df_bugs (Group) ''
  /df_bugs._v_attrs (AttributeSet), 12 attributes:
   [CLASS := 'GROUP', TITLE := '', VERSION := '1.0', axis0_variety := 'regular', axis1_variety := 'regular', block0_items_variety := 'regular', block1_items_variety := 'regular', block2_items_variety := 'regular', nblocks := 3, ndim := 2, pandas_type := 'frame', pandas_version := '0.10.1']
/df_bugs/axis0 (Array(52,)) ''
  atom := StringAtom(itemsize=19, shape=(), dflt='')
  maindim := 0
  flavor := 'numpy'
  byteorder := 'irrelevant'
  chunkshape := None
  /df_bugs/axis0._v_attrs (AttributeSet), 7 attributes:
   [CLASS := 'ARRAY', FLAVOR := 'numpy', TITLE := '', VERSION := '2.3', kind := 'string', name := None, transposed := True]
/df_bugs/axis1 (Array(924624,)) ''
  atom := Int64Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := None
  /df_bugs/axis1._v_attrs (AttributeSet), 7 attributes:
   [CLASS := 'ARRAY', FLAVOR := 'numpy', TITLE := '', VERSION := '2.3', kind := 'integer', name := None, transposed := True]
/df_bugs/block0_items (Array(5,)) ''
  atom := StringAtom(itemsize=12, shape=(), dflt='')
  maindim := 0
  flavor := 'numpy'
  byteorder := 'irrelevant'
  chunkshape := None
  /df_bugs/block0_items._v_attrs (AttributeSet), 7 attributes:
   [CLASS := 'ARRAY', FLAVOR := 'numpy', TITLE := '', VERSION := '2.3', kind := 'string', name := None, transposed := True]
/df_bugs/block0_values (Array(924624, 5)) ''
  atom := Float64Atom(shape=(), dflt=0.0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := None
  /df_bugs/block0_values._v_attrs (AttributeSet), 5 attributes:
   [CLASS := 'ARRAY', FLAVOR := 'numpy', TITLE := '', VERSION := '2.3', transposed := True]
/df_bugs/block1_items (Array(19,)) ''
  atom := StringAtom(itemsize=19, shape=(), dflt='')
  maindim := 0
  flavor := 'numpy'
  byteorder := 'irrelevant'
  chunkshape := None
  /df_bugs/block1_items._v_attrs (AttributeSet), 7 attributes:
   [CLASS := 'ARRAY', FLAVOR := 'numpy', TITLE := '', VERSION := '2.3', kind := 'string', name := None, transposed := True]
/df_bugs/block1_values (Array(924624, 19)) ''
  atom := Int64Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := None
  /df_bugs/block1_values._v_attrs (AttributeSet), 5 attributes:
   [CLASS := 'ARRAY', FLAVOR := 'numpy', TITLE := '', VERSION := '2.3', transposed := True]
/df_bugs/block2_items (Array(28,)) ''
  atom := StringAtom(itemsize=18, shape=(), dflt='')
  maindim := 0
  flavor := 'numpy'
  byteorder := 'irrelevant'
  chunkshape := None
  /df_bugs/block2_items._v_attrs (AttributeSet), 7 attributes:
   [CLASS := 'ARRAY', FLAVOR := 'numpy', TITLE := '', VERSION := '2.3', kind := 'string', name := None, transposed := True]
/df_bugs/block2_values (VLArray(1,)) ''
  atom = ObjectAtom()
  byteorder = 'irrelevant'
  nrows = 1
  flavor = 'numpy'
  /df_bugs/block2_values._v_attrs (AttributeSet), 5 attributes:
   [CLASS := 'VLARRAY', PSEUDOATOM := 'object', TITLE := '', VERSION := '1.3', transposed := True]
Exception: cannot find the correct atom type -> [dtype->object] object of type 'datetime.datetime' has no len()
Thanks, Jeff. Sorry for the delay.
bug_id                 1
assigned_to            185
bug_file_loc           None
bug_severity           critical
bug_status             closed
creation_ts            1998-05-06 21:27:00
delta_ts               2012-05-09 14:41:41
short_desc             Two cursors.
host_op_sys            Unknown
guest_op_sys           Unknown
priority               P3
rep_platform           IA32
reporter               56
product_id             7
category_id            983
component_id           12925
resolution             fixed
target_milestone       ws1
qa_contact             412
status_whiteboard
votes                  0
keywords               SR
lastdiffed             2012-05-09 14:41:41
everconfirmed          1
reporter_accessible    1
cclist_accessible      1
estimated_time         0.00
remaining_time         0.00
deadline               None
alias                  None
found_in_product_id    0
found_in_version_id    0
found_in_phase_id      0
cf_type                Defect
cf_reported_by         Development
cf_attempted           NaN
cf_failed              NaN
cf_public_summary
cf_doc_impact          0
cf_security            0
cf_build               NaN
cf_branch
cf_change              NaN
cf_test_id             NaN
cf_regression          Unknown
cf_reviewer            0
cf_on_hold             0
cf_public_severity     ---
cf_i18n_impact         0
cf_eta                 None
cf_bug_source          ---
cf_viss                None
Name: 0, Length: 52
Int64Index: 924624 entries, 0 to 924623
Data columns:
bug_id                 924624  non-null values
assigned_to            924624  non-null values
bug_file_loc           427318  non-null values
bug_severity           924624  non-null values
bug_status             924624  non-null values
creation_ts            924624  non-null values
delta_ts               924624  non-null values
short_desc             924624  non-null values
host_op_sys            924624  non-null values
guest_op_sys           924624  non-null values
priority               924624  non-null values
rep_platform           924624  non-null values
reporter               924624  non-null values
product_id             924624  non-null values
category_id            924624  non-null values
component_id           924624  non-null values
resolution             924624  non-null values
target_milestone       924624  non-null values
qa_contact             924624  non-null values
status_whiteboard      924624  non-null values
votes                  924624  non-null values
keywords               924624  non-null values
lastdiffed             924509  non-null values
everconfirmed          924624  non-null values
reporter_accessible    924624  non-null values
cclist_accessible      924624  non-null values
estimated_time         924624  non-null values
remaining_time         924624  non-null values
deadline               0  non-null values
alias                  0  non-null values
found_in_product_id    924624  non-null values
found_in_version_id    924624  non-null values
found_in_phase_id      924624  non-null values
cf_type                924624  non-null values
cf_reported_by         924624  non-null values
cf_attempted           89622  non-null values
cf_failed              89587  non-null values
cf_public_summary      510799  non-null values
cf_doc_impact          924624  non-null values
cf_security            924624  non-null values
cf_build               327460  non-null values
cf_branch              614929  non-null values
cf_change              300612  non-null values
cf_test_id             12610  non-null values
cf_regression          924624  non-null values
cf_reviewer            924624  non-null values
cf_on_hold             924624  non-null values
cf_public_severity     924624  non-null values
cf_i18n_impact         924624  non-null values
cf_eta                 3910  non-null values
cf_bug_source          924624  non-null values
cf_viss                725  non-null values
dtypes: float64(5), int64(19), object(28)
dtypes: datetime64[ns](2), float64(5), int64(19), object(26)
/usr/local/lib/python2.6/dist-packages/pandas-0.10.1.dev_6e2b6ea-py2.6-linux-x86_64.egg/pandas/io/pytables.pyc in create_axes(self, axes, obj, validate, nan_rep, data_columns, min_itemsize, **kwargs)
   2203                     raise
   2204                 except (Exception), detail:
-> 2205                     raise Exception("cannot find the correct atom type -> [dtype->%s] %s" % (b.dtype.name, str(detail)))
   2206             j += 1
   2207

Exception: cannot find the correct atom type -> [dtype->object] object of type 'datetime.datetime' has no len()
After calling convert_objects():
Timestamp: 1998-05-06 21:27:00
Timestamp: 2012-05-09 14:41:41
datetime.datetime(2012, 5, 9, 14, 41, 41)
None
/usr/local/lib/python2.6/dist-packages/pandas-0.10.1.dev_6e2b6ea-py2.6-linux-x86_64.egg/pandas/io/pytables.pyc in create_axes(self, axes, obj, validate, nan_rep, data_columns, min_itemsize, **kwargs)
   2203                     raise
   2204                 except (Exception), detail:
-> 2205                     raise Exception("cannot find the correct atom type -> [dtype->%s] %s" % (b.dtype.name, str(detail)))
   2206             j += 1
   2207

Exception: cannot find the correct atom type -> [dtype->object] object of type 'Decimal' has no len()
no_diffed_converted_df_bugs.estimated_time = no_diffed_converted_df_bugs.estimated_time.map(float)
/usr/local/lib/python2.6/dist-packages/pandas-0.10.1.dev_6e2b6ea-py2.6-linux-x86_64.egg/pandas/io/pytables.pyc in create_axes(self, axes, obj, validate, nan_rep, data_columns, min_itemsize, **kwargs)
   2203                     raise
   2204                 except (Exception), detail:
-> 2205                     raise Exception("cannot find the correct atom type -> [dtype->%s] %s" % (b.dtype.name, str(detail)))
   2206             j += 1
   2207

Exception: cannot find the correct atom type -> [dtype->object] object of type 'datetime.date' has no len()
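Answer 0 (score: 4)
I am pretty sure your issue is related to the mapping between the actual types in your DataFrame and the way PyTables stores them.
This means that if you do a put to a storer (the fixed representation), every type that cannot be mapped becomes Object. PyTables pickles these columns, which is slow; see the PyTables documentation for ObjectAtom.
A Table will instead raise on an invalid type (I should provide a better error message here). I think I will also add a warning when you try to store a type that maps to ObjectAtom, for performance reasons.
To force some types, try some of these: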
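To see which columns would fall back to ObjectAtom, it can help to list the object-dtype columns together with the Python types they actually hold. What follows is only a minimal sketch and not part of pandas: the helper name `object_column_types` and the sample frame are made up for illustration, with column names borrowed from the question.

```python
import datetime
from decimal import Decimal

import pandas as pd


def object_column_types(frame):
    """Map each object-dtype column to the sorted names of the Python types it holds."""
    report = {}
    for name, dtype in frame.dtypes.items():
        if dtype != object:
            continue  # numeric and datetime64 columns already map to fixed types
        non_null = frame[name].dropna()
        report[name] = sorted({type(value).__name__ for value in non_null})
    return report


# Hypothetical sample; the real frame in the question has 52 columns.
sample = pd.DataFrame({
    "bug_id": [1, 2],
    "estimated_time": [Decimal("0.00"), Decimal("1.50")],
    "deadline": [datetime.date(2012, 5, 9), None],
})

report = object_column_types(sample)
print(report)
```

Columns reported as holding `Decimal` or `date` values are the ones to convert before writing.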
import pandas as pd

# convert None to NaN (the Series is currently object dtype);
# convert_objects() then converts it to float64 (or the type of the other objects)
x = pd.Series([None])
x = x.where(pd.notnull(x)).convert_objects()

# convert datetime-likes with embedded NaNs to datetime64[ns]
df['foo'] = pd.Series(df['foo'].values, dtype='M8[ns]')
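The snippet above targets pandas 0.10.x; `convert_objects()` was removed in pandas 1.0, so on a current install the same coercions have to be spelled differently. A rough sketch, assuming a recent pandas and using `pd.to_datetime` and `astype(float)`; the sample frame is invented, reusing column names from the question:

```python
import datetime
from decimal import Decimal

import pandas as pd

# Hypothetical sample holding the three types that caused the errors above.
frame = pd.DataFrame({
    "lastdiffed": [datetime.datetime(2012, 5, 9, 14, 41, 41), None],
    "deadline": [datetime.date(2012, 5, 9), None],
    "estimated_time": [Decimal("0.00"), Decimal("1.50")],
})

# datetime.datetime and datetime.date values become datetime64; None becomes NaT
frame["lastdiffed"] = pd.to_datetime(frame["lastdiffed"])
frame["deadline"] = pd.to_datetime(frame["deadline"])

# Decimal values become float64 (same effect as .map(float) in the question)
frame["estimated_time"] = frame["estimated_time"].astype(float)

print(frame.dtypes)
```

After this, none of the three columns is object dtype any more, so they should no longer hit the "cannot find the correct atom type" path.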
Here is a sample on 64-bit Linux (the file is 1M rows, about 1 GB on disk):
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: pd.__version__
Out[3]: '0.10.1.dev'
In [3]: import tables
In [4]: tables.__version__
Out[4]: '2.3.1'
In [4]: df = pd.DataFrame(np.random.randn(1000 * 1000, 100), index=range(int(
   ...:     1000 * 1000)), columns=['E%03d' % i for i in xrange(100)])
In [5]: for x in range(20):
   ...:     df['String%03d' % x] = 'string%03d' % x
In [6]: df
Out[6]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 0 to 999999
Columns: 120 entries, E000 to String019
dtypes: float64(100), object(20)
# storer put (cannot query)
In [9]: def test_put():
   ...:     store = pd.HDFStore('test_put.h5','w')
   ...:     store['df'] = df
   ...:     store.close()
In [10]: %timeit test_put()
1 loops, best of 3: 7.65 s per loop
# table put (can query)
In [7]: def test_put():
   ....:     store = pd.HDFStore('test_put.h5','w')
   ....:     store.put('df',df,table=True)
   ....:     store.close()
In [8]: %timeit test_put()
1 loops, best of 3: 21.4 s per loop
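The `table=True` keyword used above belongs to pandas 0.10.x; in current pandas the same choice is made with `format='fixed'` or `format='table'`. Here is a small sketch of the same comparison under that assumption. The helper `time_put` is a made-up name, and the frame is deliberately much smaller than the 1M-row one above and holds floats only, so the numbers are not comparable to the timings shown.

```python
import os
import tempfile
import time

import numpy as np
import pandas as pd


def time_put(frame, path, fmt):
    """Write frame to a fresh HDF5 file in the given format; return elapsed seconds."""
    start = time.perf_counter()
    with pd.HDFStore(path, mode="w") as store:
        store.put("df", frame, format=fmt)
    return time.perf_counter() - start


# Small float-only frame, to keep the run short.
frame = pd.DataFrame(np.random.randn(10000, 10),
                     columns=["E%03d" % i for i in range(10)])

timings = {}
with tempfile.TemporaryDirectory() as tmp:
    for fmt in ("fixed", "table"):
        try:
            timings[fmt] = time_put(frame, os.path.join(tmp, fmt + ".h5"), fmt)
        except ImportError:
            # PyTables is an optional dependency of pandas; skip if it is missing
            break

print(timings)
```

The fixed format is the one that cannot be queried; the table format costs more on write but supports queries, which matches the two timings above.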
Answer 1 (score: 2)
store.put('key', df, table=True)
After doing this, with the same data set, the performance of the put operation improved a lot:
CPU times: user 42.07 s, sys: 28.17 s, total: 70.24 s
Wall time: 98.97 s
Profile log of the second test:
95984 function calls (95958 primitive calls) in 68.688 CPU seconds

Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      445   16.757    0.038   16.757    0.038 {numpy.core.multiarray.array}
       19   16.250    0.855   16.250    0.855 {method '_append_records' of 'tables.tableExtension.Table' objects}
       16    7.958    0.497    7.958    0.497 {method 'astype' of 'numpy.ndarray' objects}
       19    6.533    0.344    6.533    0.344 {pandas.lib.create_hdf_rows_2d}
        4    6.284    1.571    6.388    1.597 {method '_fillCol' of 'tables.tableExtension.Row' objects}
       20    2.640    0.132    2.641    0.132 {pandas.lib.maybe_convert_objects}
        1    1.785    1.785    1.785    1.785 {pandas.lib.isnullobj}
        7    1.619    0.231    1.619    0.231 {method 'flatten' of 'numpy.ndarray' objects}
       11    1.059    0.096    1.059    0.096 {pandas.lib.infer_dtype}
        1    0.997    0.997   41.952   41.952 pytables.py:2468(write_data)
       19    0.985    0.052   40.590    2.136 pytables.py:2504(write_data_chunk)
        1    0.827    0.827   60.617   60.617 pytables.py:2433(write)
     1504    0.592    0.000    0.592    0.000 {method '_g_readSlice' of 'tables.hdf5Extension.Array' objects}
        4    0.534    0.133   13.676    3.419 pytables.py:1038(set_atom)
        1    0.528    0.528    0.528    0.528 {pandas.lib.max_len_string_array}
        4    0.441    0.110    0.571    0.143 internals.py:1409(_stack_arrays)
       35    0.358    0.010    0.358    0.010 {method 'copy' of 'numpy.ndarray' objects}
        1    0.276    0.276    3.135    3.135 internals.py:208(fillna)
        5    0.263    0.053    2.054    0.411 common.py:128(_isnull_ndarraylike)
       48    0.253    0.005    0.253    0.005 {method '_append' of 'tables.hdf5Extension.Array' objects}
        4    0.240    0.060    1.500    0.375 internals.py:1400(_simple_blockify)
        1    0.234    0.234   12.145   12.145 pytables.py:1066(set_atom_string)
       28    0.225    0.008    0.225    0.008 {method '_createCArray' of 'tables.hdf5Extension.Array' objects}
       36    0.218    0.006    0.218    0.006 {method '_g_writeSlice' of 'tables.hdf5Extension.Array' objects}
     6110    0.155    0.000    0.155    0.000 {numpy.core.multiarray.empty}
        4    0.097    0.024    0.097    0.024 {method 'all' of 'numpy.ndarray' objects}
        6    0.084    0.014    0.084    0.014 {tables.indexesExtension.keysort}
       18    0.084    0.005    0.084    0.005 {method '_g_close' of 'tables.hdf5Extension.Leaf' objects}
    11816    0.064    0.000    0.108    0.000 file.py:1036(_getNode)
       19    0.053    0.003    0.053    0.003 {method '_g_flush' of 'tables.hdf5Extension.Leaf' objects}
     1528    0.045    0.000    0.098    0.000 array.py:342(_interpret_indexing)
    11709    0.040    0.000    0.042    0.000 file.py:248(__getitem__)
        2    0.027    0.013    0.383    0.192 index.py:1099(get_neworder)
        1    0.018    0.018    0.018    0.018 {numpy.core.multiarray.putmask}
        4    0.013    0.003    0.017    0.004 index.py:607(final_idx32)
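A log ordered by internal time, like the one above, can be collected with the standard-library `cProfile` and `pstats` modules. The sketch below shows one way to do it. `profile_call` and `convert_estimated_time` are hypothetical names, and the profiled function is only a stand-in workload (the Decimal-to-float conversion from the question), not the actual `store.put` call.

```python
import cProfile
import io
import pstats
from decimal import Decimal

import pandas as pd


def profile_call(func, *args, **kwargs):
    """Run func under cProfile and return the report text, sorted by internal time."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        func(*args, **kwargs)
    finally:
        profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # "tottime" is the key that pstats labels "internal time"
    stats.sort_stats("tottime").print_stats(10)
    return buffer.getvalue()


def convert_estimated_time(frame):
    # Stand-in workload: swap in the real HDFStore write to profile that instead.
    return frame["estimated_time"].map(float)


frame = pd.DataFrame({"estimated_time": [Decimal("0.00")] * 1000})
log = profile_call(convert_estimated_time, frame)
print(log)
```

Passing a function that performs the real put in place of `convert_estimated_time` should show whether time goes into type conversion (`astype`, `maybe_convert_objects`) or into the PyTables append itself.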