import pickle
import time

def save_dict(name, dict_to_save):
    stime = time.time()
    with open(name, 'wb') as output:
        pickle.dump(dict_to_save, output, 1)
    print 'done. (%.3f secs)' % (time.time() - stime)

class SimpleObject(object):
    def __init__(self, name):
        self.name = name

obj_dict1 = {}
obj_dict2 = {}
obj_dict3 = {}
for i in range(90000):
    if i < 30000:
        obj_dict1[i] = SimpleObject(i)
    elif i < 60000:
        obj_dict2[i] = SimpleObject(i)
    else:
        obj_dict3[i] = SimpleObject(i)

save_dict('zzz.1', obj_dict1)
save_dict('zzz.2', obj_dict2)
save_dict('zzz.3', obj_dict3)
Output:
done. (1.997 secs)
done. (2.067 secs)
done. (2.020 secs)
I want the writes to happen in parallel, so I tried using threads:
import pickle
import time
import threading

def save_dict(name, dict_to_save):
    stime = time.time()
    with open(name, 'wb') as output:
        pickle.dump(dict_to_save, output, 1)
    print 'done. (%.3f secs)' % (time.time() - stime)

class SimpleObject(object):
    def __init__(self, name):
        self.name = name

obj_dict1 = {}
obj_dict2 = {}
obj_dict3 = {}
for i in range(90000):
    if i < 30000:
        obj_dict1[i] = SimpleObject(i)
    elif i < 60000:
        obj_dict2[i] = SimpleObject(i)
    else:
        obj_dict3[i] = SimpleObject(i)

names = ['zzz.1', 'zzz.2', 'zzz.3']
dicts = [obj_dict1, obj_dict2, obj_dict3]
thrs = [threading.Thread(target=save_dict, args=(name, data))
        for (name, data) in zip(names, dicts)]
for thr in thrs:
    thr.start()
for thr in thrs:
    thr.join()
Output:
done. (10.761 secs)
done. (11.283 secs)
done. (11.286 secs)
But it took even longer; I assume because of the GIL?
I tried using multiprocessing, but then I got:
File "multiwrite.py", line 30, in <module>
pool = multiprocessing.Pool(processes=4)
File "/usr/lib64/python2.6/multiprocessing/__init__.py", line 227, in Pool
return Pool(processes, initializer, initargs)
File "/usr/lib64/python2.6/multiprocessing/pool.py", line 84, in __init__
self._setup_queues()
File "/usr/lib64/python2.6/multiprocessing/pool.py", line 131, in _setup_queues
self._inqueue = SimpleQueue()
File "/usr/lib64/python2.6/multiprocessing/queues.py", line 328, in __init__
self._rlock = Lock()
File "/usr/lib64/python2.6/multiprocessing/synchronize.py", line 117, in __init__
SemLock.__init__(self, SEMAPHORE, 1, 1)
File "/usr/lib64/python2.6/multiprocessing/synchronize.py", line 49, in __init__
sl = self._semlock = _multiprocessing.SemLock(kind, value, maxvalue)
OSError: [Errno 13] Permission denied
So I tried the os.fork() approach instead, but I had no success with it.
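A fork-based version might look something like this sketch (an illustrative reconstruction, not the original attempt; it reuses save_dict, names and dicts from above):

import os

# One child process per dict: each child writes its file and exits.
pids = []
for name, d in zip(names, dicts):
    pid = os.fork()
    if pid == 0:           # child process
        save_dict(name, d)
        os._exit(0)        # leave the child without running parent cleanup
    pids.append(pid)       # parent remembers each child's pid
for pid in pids:
    os.waitpid(pid, 0)     # parent waits for all children to finish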
Are there any suggestions for getting the writes to happen in parallel?
Answer 0 (score: 1)
Writing several files at the same time only makes sense if you spend more time computing the data than writing it, or if the files live on different physical devices.
Both HDDs and SSDs work best with sequential access; interleaved I/O hurts performance (think of the write head constantly repositioning on an HDD). This is the most likely cause here. Use sequential, streaming I/O whenever you can.
Besides that, your task is not I/O-bound but CPU-bound, so Python's threads only add GIL lock contention on top of it.
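If you want real parallelism for the CPU-bound pickling, processes are the usual workaround, since each one has its own interpreter and GIL. A minimal sketch with multiprocessing.Process, reusing save_dict, names and dicts from the question (note that your Pool traceback, an OSError from SemLock, usually points at a missing or non-writable /dev/shm, which may affect this too):

import multiprocessing

# One process per file: separate processes can pickle truly in parallel.
procs = [multiprocessing.Process(target=save_dict, args=(name, d))
         for name, d in zip(names, dicts)]
for p in procs:
    p.start()
for p in procs:
    p.join()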
Your program creates a relatively small amount of data and writes it to files, so chances are your operating system absorbs the writes into the filesystem cache first and flushes them to disk later. Most of the time in your code is probably spent in pickle, which is CPU-bound and runs only one thread at a time. I have seen this in practice; your data is simple, but with complex object graphs it becomes very noticeable.
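You can verify this by timing the serialization and the raw write separately, e.g. with a sketch like this (using obj_dict1 from the question; the exact split will vary on your machine):

import pickle
import time

# Time the CPU-bound pickling on its own...
stime = time.time()
data = pickle.dumps(obj_dict1, 1)
print 'pickle: %.3f secs' % (time.time() - stime)

# ...then the raw file write, which the OS will likely just cache.
stime = time.time()
with open('zzz.1', 'wb') as output:
    output.write(data)
print 'write:  %.3f secs' % (time.time() - stime)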