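I have a function that takes a list of URLs and adds a header to each one. url_list can be about 25,000 entries long, so I want to use multiprocessing. I have tried two approaches, and both failed:
First attempt: url_list is not passed in correctly. The function only receives the first letter, 'h', of a URL from url_list: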
headers = {}
header_token = {}

def do_it(url_list):
    for i in url_list:
        print "adding header to: \n" + i
        requests.post(i, headers=headers)
    print "done!"

value = raw_input("Proceed? Enter [Y] for yes: ")
if value == "Y":
    pool = multiprocessing.Pool(processes=8)
    pool.map(do_it, url_list)
    pool.close()
    pool.join()
Traceback (most recent call last):
  File "head.py", line 95, in <module>
    pool.map(do_it, url_list)
  File "/usr/lib64/python2.7/multiprocessing/pool.py", line 250, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/usr/lib64/python2.7/multiprocessing/pool.py", line 554, in get
    raise self._value
requests.exceptions.MissingSchema: Invalid URL u'h': No schema supplied
Second attempt: I prefer this one because I don't have to make the headers dictionary global. But I get a pickle error:
def wrapper(headers):
    def do_it(url_list):
        for i in url_list:
            print "adding header to: \n" + i
            requests.post(i, headers=headers)
        print "done!"
    return do_it

value = raw_input("Proceed? Enter [Y] for yes: ")
if value == "Y":
    pool = multiprocessing.Pool(processes=8)
    pool.map(wrapper(headers), url_list)
    pool.close()
    pool.join()
Traceback (most recent call last):
  File "/usr/lib64/python2.7/threading.py", line 808, in __bootstrap_inner
    self.run()
  File "/usr/lib64/python2.7/threading.py", line 761, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/lib64/python2.7/multiprocessing/pool.py", line 342, in _handle_tasks
    put(task)
PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed
Traceback (most recent call last):
  File "/usr/lib64/python2.7/threading.py", line 808, in __bootstrap_inner
    self.run()
  File "/usr/lib64/python2.7/threading.py", line 761, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/lib64/python2.7/multiprocessing/pool.py", line 342, in _handle_tasks
    put(task)
PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed
Answer 0 (score: 1)
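If you wish to go with your second implementation, then I think you should be able to use dill to serialize your wrapper function. dill can serialize almost anything in Python. It also has some good tools to help you understand what is causing pickling to fail when your code fails. dill has the same interface as Python's pickle, but also provides some additional methods. If you want to use dill for serialization with multiprocessing, all you have to do is: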
>>> import dill
>>> # your code goes here (as above)
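And if for some reason that doesn't work, you can swap multiprocessing for pathos, which is built for doing multiprocessing with dill and provides a map function that accepts multiple *args (exactly like the standard Python map).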
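For comparison, here is a minimal standard-library sketch (not the approach above, and written for Python 3 rather than the Python 2.7 shown in the question). It rests on two facts: pool.map calls the worker once per element of the list, so the worker receives a single URL string (iterating over that string is what produced the lone 'h' in the first attempt), and the standard pickle module can serialize a functools.partial of a module-level function but not a function defined inside another function. The names add_header, can_pickle, and bound, the header value, and the example URLs are all hypothetical, and the requests.post call is replaced by a stand-in so the sketch runs offline.

```python
import functools
import multiprocessing
import pickle


def add_header(url, headers):
    """Process ONE url: pool.map calls the worker once per list element."""
    # Stand-in for requests.post(url, headers=headers) so the sketch runs offline.
    return "%s <- %s" % (url, sorted(headers))


def wrapper(headers):
    """Closure style from the question, adapted to take a single url."""
    def do_it(url):
        return add_header(url, headers)
    return do_it


def can_pickle(obj):
    """Report whether the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


headers = {"X-Example": "token"}  # hypothetical header
# Bind headers without a global and without a nested function.
bound = functools.partial(add_header, headers=headers)

if __name__ == "__main__":
    url_list = ["http://a.example", "http://b.example"]  # hypothetical URLs
    # Compare how pickle treats the partial and the closure.
    print(can_pickle(bound), can_pickle(wrapper(headers)))
    pool = multiprocessing.Pool(processes=2)
    try:
        print(pool.map(bound, url_list))
    finally:
        pool.close()
        pool.join()
```

The partial keeps the headers dictionary out of global scope, which was the stated reason for preferring the second attempt, while still giving multiprocessing something it can send to the worker processes.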
Answer 1 (score: 0)
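You need to use the Queue from the multiprocessing package. The data type you pull from or add to needs to be both thread-safe and process-safe; a Queue is both.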
http://docs.python.org/2/library/queue.html
http://docs.python.org/2/library/multiprocessing.html#exchanging-objects-between-processes
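A rough sketch of what a Queue-based version might look like, written for Python 3 and assuming a fixed number of worker processes that each pull URLs from a shared task queue until they see a sentinel. The names worker, run, and SENTINEL, the header value, and the example URLs are hypothetical, and the requests.post call is replaced by a stand-in so the sketch runs offline.

```python
import multiprocessing

SENTINEL = None  # hypothetical marker telling a worker to stop


def worker(tasks, results, headers):
    """Pull urls from tasks until the sentinel arrives; report to results."""
    while True:
        url = tasks.get()
        if url is SENTINEL:
            break
        # Stand-in for requests.post(url, headers=headers).
        results.put((url, sorted(headers)))


def run(url_list, headers, processes=4):
    """Fan url_list out to worker processes through multiprocessing queues."""
    tasks = multiprocessing.Queue()
    results = multiprocessing.Queue()
    procs = [
        multiprocessing.Process(target=worker, args=(tasks, results, headers))
        for _ in range(processes)
    ]
    for p in procs:
        p.start()
    for url in url_list:
        tasks.put(url)
    for _ in procs:
        tasks.put(SENTINEL)  # one stop marker per worker
    # Drain the results before joining so no worker blocks on a full pipe.
    collected = [results.get() for _ in url_list]
    for p in procs:
        p.join()
    return collected


if __name__ == "__main__":
    urls = ["http://a.example", "http://b.example"]  # hypothetical URLs
    print(run(urls, {"X-Example": "token"}, processes=2))
```

Here headers is passed to each worker as an argument, so it does not need to be global. Results arrive in completion order, not in the order of url_list.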