Understanding pandas HDFStore data retrieval performance

Date: 2019-06-02 01:36:09

Tags: python pandas dataframe hdfstore

So at first it looked like I had run into yet another bug, but after all the tests below I'm not sure it really is one - and I still don't understand how a data-processing pipeline on top of pandas HDF is supposed to be built.

Sit down and let's take this ride together. Hopefully by the end you can clear things up for me.



Preparation

import numpy as np
import pandas as pd

SIZE_0 = 10**6
SIZE_1 = 4
df = pd.DataFrame(np.random.rand(SIZE_0, SIZE_1))
print (df.head())
          0         1         2         3
0  0.327362  0.084638  0.124322  0.116745
1  0.606545  0.484079  0.977239  0.120613
2  0.014407  0.973912  0.464409  0.959907
3  0.357551  0.641503  0.889408  0.776769
4  0.770845  0.548562  0.587054  0.569719

Putting it into the store in two parts

cols1 = list(df.columns[:SIZE_1//2])
cols2 = list(df.columns[SIZE_1//2:])
with pd.HDFStore('test.h5') as store:
    store.put('df1', df[cols1], 't')  # 't' = 'table' format, needed for querying
    store.put('df2', df[cols2], 't')
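
(A quick sanity check, not part of the original flow: select_as_multiple only works on 'table'-format nodes, so it's worth confirming what actually landed in the store.)

with pd.HDFStore('test.h5') as store:
    print(store.keys())  # ['/df1', '/df2']
    print(store.info())  # per-node type/format details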


Now here is the problem. Reading the whole df from the HDFStore with select_as_multiple is much slower than with select:

%%time
with pd.HDFStore('test.h5') as store:
    out = store.select_as_multiple(['df1', 'df2'])
print (out.shape)
(1000000, 4)
CPU times: user 24.3 s, sys: 38.6 ms, total: 24.3 s
Wall time: 24.3 s

And the plain select:

%%time
with pd.HDFStore('test.h5') as store:
    df1 = store.select('df1')
    df2 = store.select('df2')
    out = pd.concat([df1, df2], axis=1)
print (out.shape)
(1000000, 4)
CPU times: user 48.1 ms, sys: 23.9 ms, total: 72 ms
Wall time: 68.3 ms
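
(If you want to reproduce the timings outside Jupyter, where %%time isn't available, the same measurement can be sketched with the standard-library time.perf_counter:)

import time

with pd.HDFStore('test.h5') as store:
    t0 = time.perf_counter()
    out = store.select_as_multiple(['df1', 'df2'])
    t_multi = time.perf_counter() - t0

    t0 = time.perf_counter()
    out2 = pd.concat([store.select('df1'), store.select('df2')], axis=1)
    t_plain = time.perf_counter() - t0

print(f'select_as_multiple: {t_multi:.2f}s, select+concat: {t_plain:.2f}s')
print(out.equals(out2))  # sanity: both paths should return the same frame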

So at this point I was about to post this as a performance issue, but after some preliminary "dumb checks" (as I thought of them) I got surprising results (surprising to me, at least).



Let's increase the number of columns and see what happens.

SIZE_1 = 8
df = pd.DataFrame(np.random.rand(SIZE_0, SIZE_1))

cols1 = list(df.columns[:SIZE_1//2])
cols2 = list(df.columns[SIZE_1//2:])
with pd.HDFStore('test.h5') as store:
    store.put('df1', df[cols1], 't')
    store.put('df2', df[cols2], 't')

Now, running the same code for select_as_multiple, we get this output:

(1000000, 8)
CPU times: user 14.7 s, sys: 87.3 ms, total: 14.8 s
Wall time: 14.8 s

Weird stuff. We doubled the data size, yet the wall time is now about 10 s shorter.

Meanwhile the select-based retrieval code got a bit slower:

(1000000, 8)
CPU times: user 90.6 ms, sys: 27.9 ms, total: 119 ms
Wall time: 115 ms


After that I couldn't hold back my curiosity and tried again :). This time with SIZE_1 = 16 (again, all other code lines stay the same - for brevity I won't copy them here).

Now select_as_multiple runs even faster:

(1000000, 16)
CPU times: user 8.27 s, sys: 184 ms, total: 8.45 s
Wall time: 8.45 s

But for the simple select everything is as expected - the execution time grew:

(1000000, 16)
CPU times: user 181 ms, sys: 124 ms, total: 306 ms
Wall time: 302 ms

Still, select remains much faster.



And finally, the questions:

1. Why does `select_as_multiple` perform so poorly compared to `select`?

By the way, this is not only an issue for selects without a where condition:

%%time
with pd.HDFStore('test.h5') as store:
    out = store.select_as_multiple(['df1', 'df2'], where='index < 500000')
print (out.shape)
(500000, 16)
CPU times: user 4.65 s, sys: 56.7 ms, total: 4.7 s
Wall time: 4.69 s
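
(An aside, for completeness: select_as_multiple also accepts a selector argument naming the table the where clause is evaluated against. I haven't verified whether it changes these timings, but the variant would look like this:)

with pd.HDFStore('test.h5') as store:
    out = store.select_as_multiple(['df1', 'df2'],
                                   where='index < 500000',
                                   selector='df1')  # evaluate where against df1
print(out.shape)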

And for select:

%%time
with pd.HDFStore('test.h5') as store:
    df1 = store.select('df1', where='index < 500000')
    df2 = store.select('df2', where='index < 500000')
    out = pd.concat([df1, df2], axis=1)
print (out.shape)
(500000, 16)
CPU times: user 871 ms, sys: 89 ms, total: 960 ms
Wall time: 927 ms

Still faster. The drop in time for select_as_multiple with such a particular where is the expected part, since we only have to read half the rows. But note that where decreases the time for select_as_multiple while increasing it for select. So here is another question:


2. Why does specifying a where clause decrease the time for `select_as_multiple` while increasing it for `select`?

For a particular method, either an increase or a decrease on its own would be expected behavior - but not opposite directions for the two.


3. Why does increasing the data size along `axis=1` decrease the read time of `select_as_multiple`?

We are increasing the data size, yet the selection executes several times faster? Really strange. Maybe it is a "feature" by design, saying: don't use an HDF store until your df has a really large column count? But I don't remember anything like that in the docs. Quite the opposite - exactly in the select_as_multiple section the docs recommend splitting the data into a table of "queryable" columns and a table with the rest (thereby reducing the column count of the stored dfs) to speed up queries. A sketch of that pattern follows below.
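
(My reading of that docs pattern, as a self-contained sketch - the column names, sizes and file name here are made up for illustration:)

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10**5, 4), columns=['a', 'b', 'c', 'd'])
with pd.HDFStore('test_split.h5') as store:
    # Narrow table with only the columns we query on, indexed as data columns
    store.put('meta', df[['a']], format='table', data_columns=True)
    # Wide table with everything else; row counts must match
    store.put('payload', df[['b', 'c', 'd']], format='table')
    # where is evaluated against 'meta'; matching rows come from both tables
    out = store.select_as_multiple(['meta', 'payload'],
                                   where='a < 0.5', selector='meta')
print(out.shape)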


Let's do a few more tests.

SIZE_0 = 10**6
SIZE_1 = 16
%%time
with pd.HDFStore('test.h5') as store:
    out = store.select_as_multiple(['df1', 'df2'])
print (out.shape)

(1000000, 16)
CPU times: user 8.39 s, sys: 232 ms, total: 8.62 s
Wall time: 8.64 s

Increasing the df size two times along axis=0:

SIZE_0 = 2*10**6

SIZE_1 = 16
%%time
with pd.HDFStore('test.h5') as store:
    out = store.select_as_multiple(['df1', 'df2'])
print (out.shape)

(2000000, 16)
CPU times: user 32.3 s, sys: 370 ms, total: 32.6 s
Wall time: 32.6 s

versus increasing it two times along axis=1:

SIZE_0 = 10**6

SIZE_1 = 2*16
%%time
with pd.HDFStore('test.h5') as store:
    out = store.select_as_multiple(['df1', 'df2'])
print (out.shape)

(1000000, 32)
CPU times: user 9.05 s, sys: 384 ms, total: 9.43 s
Wall time: 9.43 s

So 32 s vs 10 s.

4. Does this mean that, for an HDF store, appending columns rather than rows is much more efficient?!

That is really confusing. Isn't it just wrong? As far as I understand, PyTables is row-oriented, isn't it?
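
(One way to poke at that is to open the same file with PyTables directly. My understanding - worth double-checking - is that each pandas 'table' node is a single PyTables Table of compound records, with the non-index columns packed together into values_block_* fields:)

import tables

with tables.open_file('test.h5') as h5:
    t = h5.get_node('/df1/table')
    print(t.description)  # compound record: index plus a values block
    print(t.nrows)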


From those last tests one may notice that after extending the data far enough along axis=1, we do eventually end up with increasing execution time. Let's find out exactly where that starts:

SIZE_0 = 10**6
SIZE_1s = list(range(4, 40)) # List of SIZE_1 values to iterate over

# Iterating
timings = []
for SIZE_1 in SIZE_1s:
    df = pd.DataFrame(np.random.rand(SIZE_0, SIZE_1))

    cols1 = list(df.columns[:SIZE_1//2])
    cols2 = list(df.columns[SIZE_1//2:])
    with pd.HDFStore('test.h5') as store:
        # Put to store
        store.put('df1', df[cols1], 't')
        store.put('df2', df[cols2], 't')
        # Read from store, note the time
        start = pd.Timestamp.now()
        out = store.select_as_multiple(['df1', 'df2'])
        # Appending timings
        timings.append((pd.Timestamp.now()-start).total_seconds())
# Plotting
to_plot = pd.DataFrame(timings,
                       index=pd.Index(SIZE_1s, name='column_C'),
                       columns=['read time'])
_ = to_plot.plot(figsize = (14, 6),
                 title = 'DF read time from HDF store by DFs column count',
                 color = 'blue')

[Plot: 'DF read time from HDF store by DFs column count' - read time falls as the column count grows up to roughly 23-24 columns, then starts rising]

So the initial dynamic (up to 23-24 columns) is pretty straightforward: more columns = faster reads.
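
(Rather than eyeballing the plot, the turning point can be read off the to_plot frame built above - a trivial check:)

print(to_plot['read time'].idxmin())  # total column count with the fastest read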

5. Is that 23-24 column count (which we should divide between the 2 dfs here, so roughly 12 columns each) some kind of by-design threshold? And should one only consider using an HDF store after crossing it?



Some system info:

pd.__version__
tables.__version__

'0.24.2' and '3.5.1' respectively, installed on 64-bit Ubuntu 19.04. Meanwhile, the machine used for the tests has about 24 GB of RAM, while the largest df in the tests is only a few hundred MB, so that shouldn't be an issue.

0 Answers
