So at first I thought I had run into yet another bug, but after all the tests below I'm not sure it is a bug at all, and I now question my understanding of how to build data-processing pipelines with pandas HDF storage.
Sit back and let's go for a ride together. Hopefully someone can clarify things for me at the end.
Preparation:
import numpy as np
import pandas as pd

SIZE_0 = 10**6
SIZE_1 = 4
df = pd.DataFrame(np.random.rand(SIZE_0, SIZE_1))
print(df.head())
0 1 2 3
0 0.327362 0.084638 0.124322 0.116745
1 0.606545 0.484079 0.977239 0.120613
2 0.014407 0.973912 0.464409 0.959907
3 0.357551 0.641503 0.889408 0.776769
4 0.770845 0.548562 0.587054 0.569719
Put it into the store in two parts:
cols1 = list(df.columns[:SIZE_1//2])
cols2 = list(df.columns[SIZE_1//2:])
with pd.HDFStore('test.h5') as store:
    store.put('df1', df[cols1], format='t')
    store.put('df2', df[cols2], format='t')
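As a quick sanity check (my addition, not part of the original timing runs), you can confirm that both halves landed in the store as table-format objects; keys(), get_storer() and info() are standard HDFStore APIs:

with pd.HDFStore('test.h5') as store:
    print(store.keys())                   # ['/df1', '/df2']
    print(store.get_storer('df1').nrows)  # 1000000 rows in each table
    print(store.info())                   # both listed as frame_table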
Now for the problem itself. Reading the whole df back from the HDFStore with select_as_multiple is much slower than with select:
%%time
with pd.HDFStore('test.h5') as store:
    out = store.select_as_multiple(['df1', 'df2'])
    print(out.shape)
(1000000, 4)
CPU times: user 24.3 s, sys: 38.6 ms, total: 24.3 s
Wall time: 24.3 s
And with a plain select:
%%time
with pd.HDFStore('test.h5') as store:
    df1 = store.select('df1')
    df2 = store.select('df2')
    out = pd.concat([df1, df2], axis=1)
    print(out.shape)
(1000000, 4)
CPU times: user 48.1 ms, sys: 23.9 ms, total: 72 ms
Wall time: 68.3 ms
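For anyone reproducing this outside Jupyter, a minimal script-level harness (my sketch, using only calls already shown above) gives the same comparison without the %%time magic:

import time

def timed(label, fn):
    # Run fn once and report its wall time, a stand-in for the %%time cell magic.
    start = time.perf_counter()
    out = fn()
    print(f'{label}: {time.perf_counter() - start:.3f} s, shape {out.shape}')
    return out

def read_multiple():
    with pd.HDFStore('test.h5') as store:
        return store.select_as_multiple(['df1', 'df2'])

def read_separate():
    with pd.HDFStore('test.h5') as store:
        return pd.concat([store.select('df1'), store.select('df2')], axis=1)

timed('select_as_multiple', read_multiple)
timed('select + concat', read_separate)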
At this point I was about to file it as a performance issue, but after some preliminary "dumb checks" (as I thought of them) I got results that were surprising, at least to me.
Let's increase the number of columns and see what happens.
SIZE_1 = 8
df = pd.DataFrame(np.random.rand(SIZE_0, SIZE_1))
cols1 = list(df.columns[:SIZE_1//2])
cols2 = list(df.columns[SIZE_1//2:])
with pd.HDFStore('test.h5') as store:
    store.put('df1', df[cols1], format='t')
    store.put('df2', df[cols2], format='t')
Running the same select_as_multiple code as before now gives:
(1000000, 8)
CPU times: user 14.7 s, sys: 87.3 ms, total: 14.8 s
Wall time: 14.8 s
Strange. We doubled the data size, yet the wall time is now about 10 s shorter.
Meanwhile, the select-based code runs a bit slower, as expected:
(1000000, 8)
CPU times: user 90.6 ms, sys: 27.9 ms, total: 119 ms
Wall time: 115 ms
After that I couldn't contain my curiosity and tried again :). This time the df is instantiated with SIZE_1 = 16 (again, all other lines of code stay the same; they are omitted here for brevity).
Now select_as_multiple runs faster still:
(1000000, 16)
CPU times: user 8.27 s, sys: 184 ms, total: 8.45 s
Wall time: 8.45 s
But for the plain select everything is as expected: execution time goes up:
(1000000, 16)
CPU times: user 181 ms, sys: 124 ms, total: 306 ms
Wall time: 302 ms
Even so, select is still much faster.
By the way, this isn't only an issue for selects without a where condition:
%%time
with pd.HDFStore('test.h5') as store:
    out = store.select_as_multiple(['df1', 'df2'], where='index < 500000')
    print(out.shape)
(500000, 16)
CPU times: user 4.65 s, sys: 56.7 ms, total: 4.7 s
Wall time: 4.69 s
And for select:
%%time
with pd.HDFStore('test.h5') as store:
    df1 = store.select('df1', where='index < 500000')
    df2 = store.select('df2', where='index < 500000')
    out = pd.concat([df1, df2], axis=1)
    print(out.shape)
(500000, 16)
CPU times: user 871 ms, sys: 89 ms, total: 960 ms
Wall time: 927 ms
Still much faster. The decrease in time with this particular where is to be expected, since we only have to read half of the rows. But note that the where clause decreases the time for select_as_multiple while increasing it for select. That is another oddity: for a given where, I would expect both either to increase or to decrease, not to move in opposite directions.
So we increase the data size, and the select executes several times faster? Really strange. Maybe it's a "feature" by design that says: don't use an HDF store until your df has a really large number of columns? But I don't remember anything like that in the docs. Quite the opposite: precisely in the section on select_as_multiple, the docs recommend splitting the data into "query" columns and "other" columns (thus reducing the number of columns in the stored dfs) to speed up queries (see the sketch below).
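For reference, here is a minimal sketch of that recommended layout (the column names c0/c1 and the file name test_split.h5 are hypothetical, mine): the queryable columns go into a small selector table stored with data_columns=True, the bulk goes into a second table, and select_as_multiple joins them row-wise:

df = pd.DataFrame(np.random.rand(SIZE_0, 16))
df.columns = [f'c{i}' for i in range(16)]          # hypothetical names
query_cols, other_cols = ['c0', 'c1'], list(df.columns[2:])
with pd.HDFStore('test_split.h5') as store:
    # Small, indexed 'query' table plus a wide 'payload' table.
    store.put('df_query', df[query_cols], format='t', data_columns=True)
    store.put('df_other', df[other_cols], format='t')
    out = store.select_as_multiple(['df_query', 'df_other'],
                                   where='c0 > 0.5', selector='df_query')
print(out.shape)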
Let's run a few more tests.
Back to reads without a where clause. With SIZE_0 = 10**6 and SIZE_1 = 16:
%%time
with pd.HDFStore('test.h5') as store:
    out = store.select_as_multiple(['df1', 'df2'])
    print(out.shape)
(1000000, 16)
CPU times: user 8.39 s, sys: 232 ms, total: 8.62 s
Wall time: 8.64 s
Now increase the df size twice along axis=0, with SIZE_0 = 2*10**6 and SIZE_1 = 16:
%%time
with pd.HDFStore('test.h5') as store:
    out = store.select_as_multiple(['df1', 'df2'])
    print(out.shape)
(2000000, 16)
CPU times: user 32.3 s, sys: 370 ms, total: 32.6 s
Wall time: 32.6 s
Versus increasing the df size twice along axis=1, with SIZE_0 = 10**6 and SIZE_1 = 2*16:
%%time
with pd.HDFStore('test.h5') as store:
    out = store.select_as_multiple(['df1', 'df2'])
    print(out.shape)
(1000000, 32)
CPU times: user 9.05 s, sys: 384 ms, total: 9.43 s
Wall time: 9.43 s
So it's 32 s vs 10 s. That's really confusing. Isn't it backwards, given that, as far as I understand, PyTables is row-oriented?
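One way to see what "row-oriented" means here (my addition): print the on-disk PyTables description of one of the stored tables. For a plain table-format frame, pandas packs all same-typed columns into a single values block, so each stored row is the index plus one wide block of floats:

with pd.HDFStore('test.h5') as store:
    # get_storer() exposes the underlying PyTables Table object; its
    # description shows the physical row layout (index + values_block_0).
    print(store.get_storer('df1').table.description)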
As one may notice from the last tests, after extending the data far enough in the axis=1 direction we do eventually end up with increasing execution time. Let's find out exactly where that starts:
SIZE_0 = 10**6
SIZE_1s = list(range(4, 40))  # List of SIZE_1 values to iterate over
# Iterating
timings = []
for SIZE_1 in SIZE_1s:
    df = pd.DataFrame(np.random.rand(SIZE_0, SIZE_1))
    cols1 = list(df.columns[:SIZE_1//2])
    cols2 = list(df.columns[SIZE_1//2:])
    with pd.HDFStore('test.h5') as store:
        # Put to store
        store.put('df1', df[cols1], format='t')
        store.put('df2', df[cols2], format='t')
        # Read from store, noting the time
        start = pd.Timestamp.now()
        out = store.select_as_multiple(['df1', 'df2'])
        # Appending timings
        timings.append((pd.Timestamp.now() - start).total_seconds())
# Plotting
to_plot = pd.DataFrame(timings,
                       index=pd.Index(SIZE_1s, name='column_C'),
                       columns=['read time'])
_ = to_plot.plot(figsize=(14, 6),
                 title='DF read time from HDF store by DFs column count',
                 color='blue')

So the initial dynamic (up to 23-24 columns) is quite straightforward: more columns = faster reads.
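If you wanted the same sweep to also track the select-plus-concat path for comparison (my extension, not part of the original run), the timing block inside the loop above could be replaced with something like:

# Time both read paths on the same stored data.
start = pd.Timestamp.now()
out_multi = store.select_as_multiple(['df1', 'df2'])
t_multi = (pd.Timestamp.now() - start).total_seconds()

start = pd.Timestamp.now()
out_plain = pd.concat([store.select('df1'), store.select('df2')], axis=1)
t_plain = (pd.Timestamp.now() - start).total_seconds()

timings.append((t_multi, t_plain))

with columns=['select_as_multiple', 'select + concat'] passed to the plotting DataFrame so both curves show up.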
Some system info:
pd.__version__
'0.24.2'
tables.__version__
'3.5.1'
Everything is installed on 64-bit Ubuntu 19.04. The machine has about 24 GB of memory, while the largest df used in the tests is much smaller, so resources should not be an issue.