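The following code demonstrates a problem in the interaction between PyTables and threads. I create an HDF file and then read it from 100 concurrent threads: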
import threading
import pandas as pd
from pandas.io.pytables import HDFStore, get_store

filename = 'test.hdf'

# Write a small test frame once, before any reader thread starts.
with get_store(filename, mode='w') as store:
    store['x'] = pd.DataFrame({'y': range(10000)})

def process(i, filename):
    # print 'start', i
    # Each thread opens its own read-only handle on the same file.
    with get_store(filename, mode='r') as store:
        df = store['x']
    # print 'end', i
    return df['y'].max()

threads = []
for i in range(100):
    t = threading.Thread(target=process, args=(i, filename))
    t.daemon = True
    t.start()
    threads.append(t)

for t in threads:
    t.join()
The program usually runs cleanly, but occasionally I get an exception like this:
Exception in thread Thread-27:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 504, in run
    self.__target(*self.__args, **self.__kwargs)
  File "crash.py", line 13, in process
    with get_store(filename,mode='r') as store:
  File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.py", line 259, in get_store
    store = HDFStore(path, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.py", line 398, in __init__
    self.open(mode=mode, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.py", line 528, in open
    self._handle = tables.openFile(self._path, self._mode, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tables/_past.py", line 35, in oldfunc
    return obj(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tables/file.py", line 298, in open_file
    for filehandle in _open_files.get_handlers_by_name(filename):
RuntimeError: Set changed size during iteration
or
[...]
  File "/usr/local/lib/python2.7/dist-packages/tables/_past.py", line 35, in oldfunc
    return obj(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tables/file.py", line 299, in open_file
    omode = filehandle.mode
AttributeError: 'File' object has no attribute 'mode'
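While reducing the code to this minimal example, I got very different error messages, some of which pointed to memory corruption.
Here are my library versions: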
>>> pd.__version__
'0.13.1'
>>> tables.__version__
'3.1.0'
I also got errors when writing the file, which I solved by recompiling hdf5 with the options --enable-threadsafe --with-pthread
Can anyone reproduce this problem? How can it be solved?
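Answer 0 (score: 1)
Anthony already pointed out that hdf5 (PyTables is basically a wrapper around the hdf5 C library) is not thread-safe. If you want to access an hdf5 file from a web application, you basically have two options:
Note that with a mod_wsgi application, depending on the configuration, you have to deal with both threads and processes!
I am currently also struggling to use hdf5 as the database backend for a web application. I think the second approach above provides a decent solution. However, hdf5 is not yet a database system. If you want a real array database server with a Python interface, take a look at http://www.scidb.org. It is not as lightweight as an hdf5-based solution, though.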
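Because the library is not thread-safe, one way to stop the threaded readers in the question from racing each other is to put every open/read/close behind a single lock. The following is only a minimal sketch under that assumption: `HDF_LOCK`, `read_serialized` and `run_threads` are hypothetical names (not part of pandas or PyTables), and `reader` stands for whatever function opens the store, reads from it and closes it again.

```python
import threading

# One process-wide lock guarding every HDF5 open/read/close.
# HDF_LOCK, read_serialized and run_threads are hypothetical helper
# names used for illustration only.
HDF_LOCK = threading.Lock()


def read_serialized(reader, *args, **kwargs):
    """Call reader(*args, **kwargs) while holding HDF_LOCK.

    `reader` should open the store, read what it needs and close the
    store again before returning, so no handle outlives the lock.
    """
    with HDF_LOCK:
        return reader(*args, **kwargs)


def run_threads(reader, arg_list):
    """Start one thread per argument tuple; return results in input order."""
    results = [None] * len(arg_list)

    def worker(index, args):
        results[index] = read_serialized(reader, *args)

    threads = [threading.Thread(target=worker, args=(i, args))
               for i, args in enumerate(arg_list)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

With the code from the question, `reader` would be a function taking `filename` that opens the store and returns `df['y'].max()`, called as `run_threads(reader, [(filename,)] * 100)`. The reads then run one at a time, so this trades concurrency for safety, and the lock only covers threads inside one process. It does nothing for several processes touching the same file, which matters for the mod_wsgi case mentioned above.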
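Answer 1 (score: 0)
PyTables is not fully thread-safe. Use a multiprocessing pool instead.
Answer 2 (score: 0)
One point that has not been mentioned yet: recompile HDF5 to be thread-safe using the following options: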
--enable-threadsafe --with-pthread=DIR
https://support.hdfgroup.org/HDF5/faq/threadsafe.html
I had some hard-to-find bugs in my keras code, which uses HDF5, and this fixed them.