So at first I thought I had run into yet another bug, but after all the tests below I'm not sure it is a bug at all, and I now question my understanding of how to build data-processing pipelines with pandas HDF storage.
Sit back and let's go for a ride together. Hopefully someone can clarify things for me at the end.
Preparation:
import numpy as np
import pandas as pd

SIZE_0 = 10**6
SIZE_1 = 4
df = pd.DataFrame(np.random.rand(SIZE_0, SIZE_1))
print(df.head())
0 1 2 3
0 0.327362 0.084638 0.124322 0.116745
1 0.606545 0.484079 0.977239 0.120613
2 0.014407 0.973912 0.464409 0.959907
3 0.357551 0.641503 0.889408 0.776769
4 0.770845 0.548562 0.587054 0.569719
Put it into the store in two parts:
cols1 = list(df.columns[:SIZE_1//2])
cols2 = list(df.columns[SIZE_1//2:])
with pd.HDFStore('test.h5') as store:
    store.put('df1', df[cols1], format='t')
    store.put('df2', df[cols2], format='t')
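As a quick sanity check (my addition, not part of the original timing runs), you can confirm that both halves landed in the store as table-format objects; keys(), get_storer() and info() are standard HDFStore APIs:

with pd.HDFStore('test.h5') as store:
    print(store.keys())                   # ['/df1', '/df2']
    print(store.get_storer('df1').nrows)  # 1000000 rows in each table
    print(store.info())                   # both listed as frame_table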
Now for the problem itself. Reading the whole df back from the HDFStore with select_as_multiple is much slower than with select:
%%time
with pd.HDFStore('test.h5') as store:
    out = store.select_as_multiple(['df1', 'df2'])
    print(out.shape)
(1000000, 4)
CPU times: user 24.3 s, sys: 38.6 ms, total: 24.3 s
Wall time: 24.3 s
And with a plain select:
%%time
with pd.HDFStore('test.h5') as store:
    df1 = store.select('df1')
    df2 = store.select('df2')
    out = pd.concat([df1, df2], axis=1)
    print(out.shape)
(1000000, 4)
CPU times: user 48.1 ms, sys: 23.9 ms, total: 72 ms
Wall time: 68.3 ms
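For anyone reproducing this outside Jupyter, a minimal script-level harness (my sketch, using only calls already shown above) gives the same comparison without the %%time magic:

import time

def timed(label, fn):
    # Run fn once and report its wall time, a stand-in for the %%time cell magic.
    start = time.perf_counter()
    out = fn()
    print(f'{label}: {time.perf_counter() - start:.3f} s, shape {out.shape}')
    return out

def read_multiple():
    with pd.HDFStore('test.h5') as store:
        return store.select_as_multiple(['df1', 'df2'])

def read_separate():
    with pd.HDFStore('test.h5') as store:
        return pd.concat([store.select('df1'), store.select('df2')], axis=1)

timed('select_as_multiple', read_multiple)
timed('select + concat', read_separate)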
At this point I was about to file it as a performance issue, but after some preliminary "dumb checks" (as I thought of them) I got results that were surprising, at least to me.
Let's increase the number of columns and see what happens.
SIZE_1 = 8
df = pd.DataFrame(np.random.rand(SIZE_0, SIZE_1))
cols1 = list(df.columns[:SIZE_1//2])
cols2 = list(df.columns[SIZE_1//2:])
with pd.HDFStore('test.h5') as store:
    store.put('df1', df[cols1], format='t')
    store.put('df2', df[cols2], format='t')
Running the same select_as_multiple code as before now gives:
(1000000, 8)
CPU times: user 14.7 s, sys: 87.3 ms, total: 14.8 s
Wall time: 14.8 s
Strange. We doubled the data size, yet the wall time is now about 10 s shorter.
Meanwhile, the select-based code runs a bit slower, as expected:
(1000000, 8)
CPU times: user 90.6 ms, sys: 27.9 ms, total: 119 ms
Wall time: 115 ms
After that I couldn't contain my curiosity and tried again :). This time the df is instantiated with SIZE_1 = 16 (again, all other lines of code stay the same; they are omitted here for brevity).
Now select_as_multiple runs faster still:
(1000000, 16)
CPU times: user 8.27 s, sys: 184 ms, total: 8.45 s
Wall time: 8.45 s
But for the plain select everything is as expected: execution time goes up:
(1000000, 16)
CPU times: user 181 ms, sys: 124 ms, total: 306 ms
Wall time: 302 ms
Even so, select is still much faster.
By the way, this isn't only an issue for selects without a where condition:
%%time
with pd.HDFStore('test.h5') as store:
    out = store.select_as_multiple(['df1', 'df2'], where='index < 500000')
    print(out.shape)
(500000, 16)
CPU times: user 4.65 s, sys: 56.7 ms, total: 4.7 s
Wall time: 4.69 s
And for select:
%%time
with pd.HDFStore('test.h5') as store:
    df1 = store.select('df1', where='index < 500000')
    df2 = store.select('df2', where='index < 500000')
    out = pd.concat([df1, df2], axis=1)
    print(out.shape)
(500000, 16)
CPU times: user 871 ms, sys: 89 ms, total: 960 ms
Wall time: 927 ms
Still much faster. The decrease in time with this particular where is to be expected, since we only have to read half of the rows. But note that the where clause decreases the time for select_as_multiple while increasing it for select. That is another oddity: for a given where, I would expect both either to increase or to decrease, not to move in opposite directions.
So we increase the data size, and the select executes several times faster? Really strange. Maybe it's a "feature" by design that says: don't use an HDF store until your df has a really large number of columns? But I don't remember anything like that in the docs. Quite the opposite: precisely in the section on select_as_multiple, the docs recommend splitting the data into "query" columns and "other" columns (thus reducing the number of columns in the stored dfs) to speed up queries (see the sketch below).
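For reference, here is a minimal sketch of that recommended layout (the column names c0/c1 and the file name test_split.h5 are hypothetical, mine): the queryable columns go into a small selector table stored with data_columns=True, the bulk goes into a second table, and select_as_multiple joins them row-wise:

df = pd.DataFrame(np.random.rand(SIZE_0, 16))
df.columns = [f'c{i}' for i in range(16)]          # hypothetical names
query_cols, other_cols = ['c0', 'c1'], list(df.columns[2:])
with pd.HDFStore('test_split.h5') as store:
    # Small, indexed 'query' table plus a wide 'payload' table.
    store.put('df_query', df[query_cols], format='t', data_columns=True)
    store.put('df_other', df[other_cols], format='t')
    out = store.select_as_multiple(['df_query', 'df_other'],
                                   where='c0 > 0.5', selector='df_query')
print(out.shape)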
Let's run a few more tests.
Back to reads without a where clause. With SIZE_0 = 10**6 and SIZE_1 = 16:
%%time
with pd.HDFStore('test.h5') as store:
    out = store.select_as_multiple(['df1', 'df2'])
    print(out.shape)
(1000000, 16)
CPU times: user 8.39 s, sys: 232 ms, total: 8.62 s
Wall time: 8.64 s
Now increase the df size twice along axis=0, with SIZE_0 = 2*10**6 and SIZE_1 = 16:
%%time
with pd.HDFStore('test.h5') as store:
    out = store.select_as_multiple(['df1', 'df2'])
    print(out.shape)
(2000000, 16)
CPU times: user 32.3 s, sys: 370 ms, total: 32.6 s
Wall time: 32.6 s
Versus increasing the df size twice along axis=1, with SIZE_0 = 10**6 and SIZE_1 = 2*16:
%%time
with pd.HDFStore('test.h5') as store:
    out = store.select_as_multiple(['df1', 'df2'])
    print(out.shape)
(1000000, 32)
CPU times: user 9.05 s, sys: 384 ms, total: 9.43 s
Wall time: 9.43 s
So it's 32 s vs 10 s. That's really confusing. Isn't it backwards, given that, as far as I understand, PyTables is row-oriented?
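One way to see what "row-oriented" means here (my addition): print the on-disk PyTables description of one of the stored tables. For a plain table-format frame, pandas packs all same-typed columns into a single values block, so each stored row is the index plus one wide block of floats:

with pd.HDFStore('test.h5') as store:
    # get_storer() exposes the underlying PyTables Table object; its
    # description shows the physical row layout (index + values_block_0).
    print(store.get_storer('df1').table.description)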
As one may notice from the last tests, after extending the data far enough in the axis=1 direction we do eventually end up with increasing execution time. Let's find out exactly where that starts:
SIZE_0 = 10**6
SIZE_1s = list(range(4, 40))  # List of SIZE_1 values to iterate over
# Iterating
timings = []
for SIZE_1 in SIZE_1s:
    df = pd.DataFrame(np.random.rand(SIZE_0, SIZE_1))
    cols1 = list(df.columns[:SIZE_1//2])
    cols2 = list(df.columns[SIZE_1//2:])
    with pd.HDFStore('test.h5') as store:
        # Put to store
        store.put('df1', df[cols1], format='t')
        store.put('df2', df[cols2], format='t')
        # Read from store, noting the time
        start = pd.Timestamp.now()
        out = store.select_as_multiple(['df1', 'df2'])
        # Appending timings
        timings.append((pd.Timestamp.now() - start).total_seconds())
# Plotting
to_plot = pd.DataFrame(timings,
                       index=pd.Index(SIZE_1s, name='column_C'),
                       columns=['read time'])
_ = to_plot.plot(figsize=(14, 6),
                 title='DF read time from HDF store by DFs column count',
                 color='blue')

So the initial dynamic (up to 23-24 columns) is quite straightforward: more columns = faster reads.
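If you wanted the same sweep to also track the select-plus-concat path for comparison (my extension, not part of the original run), the timing block inside the loop above could be replaced with something like:

# Time both read paths on the same stored data.
start = pd.Timestamp.now()
out_multi = store.select_as_multiple(['df1', 'df2'])
t_multi = (pd.Timestamp.now() - start).total_seconds()

start = pd.Timestamp.now()
out_plain = pd.concat([store.select('df1'), store.select('df2')], axis=1)
t_plain = (pd.Timestamp.now() - start).total_seconds()

timings.append((t_multi, t_plain))

with columns=['select_as_multiple', 'select + concat'] passed to the plotting DataFrame so both curves show up.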
Some system info:
pd.__version__
'0.24.2'
tables.__version__
'3.5.1'
Everything is installed on 64-bit Ubuntu 19.04. The machine has about 24 GB of memory, while the largest df used in the tests is much smaller, so resources should not be an issue.