Question

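In my data-processing application, about 80% of the processing time is spent in the function pandas.HDFStore.put alone. Although there are various SO questions about similar problems, I have not found a clear guide on how to use HDFStore in the most efficient way.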

What options do I have to reduce the write time?

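My data consists only of float64 columns plus a few int columns. It may contain duplicate indices and/or column names, and it is a priori unsorted. It will hold data collected over several decades (at one-second resolution), so the solution needs to scale.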

My basic use case looks like this:

import pandas as pd

# 1. Store creation
store = pd.HDFStore(pro['hdf_path'], complevel=7,
                    complib='blosc', fletcher32=True)

# 2. Iterative addition of new data
store.put('/table/T1', data, format='table', data_columns=True,
          append=True, index=False)

# 3. Basic queries of certain columns (I only need 'index' in 'where')
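# `sign` is assumed to be '' or '=' (exclusive or inclusive upper bound);
# it must be passed to format(), otherwise '{sign}' raises a KeyError
store.select('/table/T1', columns=['A', 'B', ...],
             where='index>="{}" & index<{sign}"{}"'.format(_t1, _t2, sign=sign))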

# 4. Retrieving a tree with all tables and all column
#    names in that table (without loading it)
for path, groups, leaves in store.walk():
    ...
    for lv in sorted(leaves):
        _item_path = '/'.join([path, lv])
        columns = store.get_node('{}/table'.format(_item_path)).description._v_names

In particular, I would like to know how the following could be changed to optimize the write time:

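complib, complevel
Making indexing more efficient (perhaps by calling create_table_index only once at the end?)
The parameters of store.put / store.append
I read something about index levels, e.g. ('medium', 6). Does that have an impact?
Could I gain anything by reducing the precision of the datetime index (which is then stored as Int64), e.g. by dropping the milliseconds?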
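
To make the indexing question above concrete, here is a minimal sketch of what "calling create_table_index only at the end" could look like. The file path, the synthetic data, the chunk size, and the helper name write_then_index are all made up for illustration; the ('medium', 6) settings are the ones mentioned in the list. Whether this is actually faster than indexing on every append is exactly what I am unsure about, so treat it as a starting point for timing, not as a recommendation.

```python
import os
import tempfile

import numpy as np
import pandas as pd  # HDFStore additionally requires the PyTables package


def write_then_index(path, chunks, key="/table/T1"):
    """Append all chunks without indexing, then build the index once."""
    with pd.HDFStore(path, mode="w", complevel=7, complib="blosc") as store:
        for chunk in chunks:
            # index=False skips index maintenance on each append
            store.append(key, chunk, format="table",
                         data_columns=True, index=False)
        # Build the PyTables index a single time, after the last append
        store.create_table_index(key, optlevel=6, kind="medium")
        return store.get_storer(key).nrows


# Hypothetical demo data: 3000 one-second samples split into three chunks
rng = np.random.default_rng(0)
idx = pd.date_range("2020-01-01", periods=3000, freq="s")
frame = pd.DataFrame({"A": rng.random(3000), "B": rng.random(3000)}, index=idx)
chunks = [frame.iloc[i:i + 1000] for i in range(0, 3000, 1000)]

demo_path = os.path.join(tempfile.mkdtemp(), "demo.h5")
n_rows = write_then_index(demo_path, chunks)
```

Wrapping the call in time.perf_counter() and repeating it with index=True (and without the final create_table_index) would give a direct comparison for a given chunk size.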

(Reading is not much of a problem, since store.select with where=... is quite efficient.)

Thanks for your help, it is much appreciated!

Guidelines for efficient use of HDFStore

0 answers: