Python: multiple writes to different files

Date: 2015-06-19 15:15:00

Tags: python file-io

import pickle
import time

def save_dict(name, dict_to_save):
    # Pickle the whole dict to one file and report the elapsed time.
    stime = time.time()
    with open(name, 'wb') as output:
        pickle.dump(dict_to_save, output, 1)
    print 'done. (%.3f secs)' % (time.time() - stime)

class SimpleObject(object):

    def __init__(self, name):
        self.name = name

obj_dict1 = {}
obj_dict2 = {}
obj_dict3 = {}
for i in range(90000):
    if i < 30000:
        obj_dict1[i] = SimpleObject(i)
    elif i < 60000:
        obj_dict2[i] = SimpleObject(i)
    else:
        obj_dict3[i] = SimpleObject(i)

save_dict('zzz.1', obj_dict1)
save_dict('zzz.2', obj_dict2)
save_dict('zzz.3', obj_dict3)

Output:

done. (1.997 secs)
done. (2.067 secs)
done. (2.020 secs)

I wanted the writes to happen in parallel, so I tried using threads:

import pickle
import time
import threading

def save_dict(name, dict_to_save):
    # Pickle the whole dict to one file and report the elapsed time.
    stime = time.time()
    with open(name, 'wb') as output:
        pickle.dump(dict_to_save, output, 1)
    print 'done. (%.3f secs)' % (time.time() - stime)

class SimpleObject(object):

    def __init__(self, name):
        self.name = name

obj_dict1 = {}
obj_dict2 = {}
obj_dict3 = {}
for i in range(90000):
    if i < 30000:
        obj_dict1[i] = SimpleObject(i)
    elif i < 60000:
        obj_dict2[i] = SimpleObject(i)
    else:
        obj_dict3[i] = SimpleObject(i)


names = ['zzz.1', 'zzz.2', 'zzz.3']
dicts = [obj_dict1, obj_dict2, obj_dict3]
thrs = [threading.Thread(target=save_dict, args=(name, data))
        for (name, data) in zip(names, dicts)]
for thr in thrs:
    thr.start()
for thr in thrs:
    thr.join()

Output:

done. (10.761 secs)
done. (11.283 secs)
done. (11.286 secs)

But it took even more time; I assume that is because of the GIL?

I tried using multiprocessing, but I got:

  File "multiwrite.py", line 30, in <module>
    pool = multiprocessing.Pool(processes=4)
  File "/usr/lib64/python2.6/multiprocessing/__init__.py", line 227, in Pool
    return Pool(processes, initializer, initargs)
  File "/usr/lib64/python2.6/multiprocessing/pool.py", line 84, in __init__
    self._setup_queues()
  File "/usr/lib64/python2.6/multiprocessing/pool.py", line 131, in _setup_queues
    self._inqueue = SimpleQueue()
  File "/usr/lib64/python2.6/multiprocessing/queues.py", line 328, in __init__
    self._rlock = Lock()
  File "/usr/lib64/python2.6/multiprocessing/synchronize.py", line 117, in __init__
    SemLock.__init__(self, SEMAPHORE, 1, 1)
  File "/usr/lib64/python2.6/multiprocessing/synchronize.py", line 49, in __init__
    sl = self._semlock = _multiprocessing.SemLock(kind, value, maxvalue)
OSError: [Errno 13] Permission denied
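
The traceback shows the failure happens while the Pool creates its internal semaphore, before any of the saving code runs, so it points at the environment (commonly a permission problem with /dev/shm on Linux) rather than at the script. For reference, the attempt presumably looked something like the hypothetical reconstruction below; the original multiwrite.py is not shown, and only pool = multiprocessing.Pool(processes=4) appears in the traceback:

import multiprocessing

def save_dict_args(args):
    # Hypothetical helper: Pool.map passes a single argument,
    # so unpack the (name, dict) pair here.
    save_dict(*args)

names = ['zzz.1', 'zzz.2', 'zzz.3']
dicts = [obj_dict1, obj_dict2, obj_dict3]
pool = multiprocessing.Pool(processes=4)
# Note: the pool pickles each dict once more just to ship it to a
# worker process, which doubles the serialization work per dict.
pool.map(save_dict_args, zip(names, dicts))
pool.close()
pool.join()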

So I tried the os.fork() approach, but I had no success with it.
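
For completeness, a fork-based variant might look like the sketch below (hypothetical; the failed attempt itself is not shown). Each child process writes one file and exits, and the parent waits for all children:

import os

names = ['zzz.1', 'zzz.2', 'zzz.3']
dicts = [obj_dict1, obj_dict2, obj_dict3]
pids = []
for name, data in zip(names, dicts):
    pid = os.fork()
    if pid == 0:        # child: write one file, then exit immediately
        save_dict(name, data)
        os._exit(0)     # skip the parent's cleanup handlers in the child
    pids.append(pid)    # parent: remember the child's pid
for pid in pids:
    os.waitpid(pid, 0)  # wait for every child to finish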

Are there any suggestions for getting the writes done in parallel?

1 answer:

Answer 0 (score: 1)

Writing to several files at once only makes sense if you spend more time computing the data than writing it, or if the files are on different physical devices.

Both HDDs and SSDs perform better with sequential access. Interleaved I/O hurts performance (think of the write head constantly repositioning).

That is the most likely cause here. Use sequential, streaming I/O whenever possible.

Besides, your task is CPU-bound rather than I/O-bound, and for CPU-bound work Python threads only add lock contention.
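
One way to check which part dominates (a hypothetical measurement, not part of the original post) is to serialize in memory first, so the CPU-bound pickling and the actual file write can be timed separately:

import pickle
import time

def save_dict_timed(name, dict_to_save):
    t0 = time.time()
    data = pickle.dumps(dict_to_save, 1)   # CPU-bound serialization
    t1 = time.time()
    with open(name, 'wb') as output:
        output.write(data)                 # the actual I/O
    t2 = time.time()
    print 'pickle: %.3f secs, write: %.3f secs' % (t1 - t0, t2 - t1)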

Your program creates a relatively small amount of data and writes it to files. Chances are your OS absorbs the data into the filesystem cache first and performs the physical write later. Most of the time in your code is probably also spent in pickle, which is CPU-bound and so executes only one thread at a time. I have seen this in practice; it is very noticeable with complex object graphs, even though your data is simple.
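
A side note not made in the original answer: on Python 2 the pure-Python pickle module is much slower than its C counterpart cPickle, so if pickling dominates, swapping it in (and using the highest protocol) can shrink the CPU-bound part without any parallelism:

try:
    import cPickle as pickle   # C implementation, much faster on Python 2
except ImportError:
    import pickle              # on Python 3 this already is the C version

def save_dict(name, dict_to_save):
    with open(name, 'wb') as output:
        # HIGHEST_PROTOCOL is also typically faster than protocol 1
        pickle.dump(dict_to_save, output, pickle.HIGHEST_PROTOCOL)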