I'm using the lxml module with this code (imports shown for completeness):
import urllib2
import Queue
import multiprocessing
import lxml.html

def read_and_parse_url(url, queue):
    """ Read and parse the url """
    data = urllib2.urlopen(url).read()
    root = lxml.html.fromstring(data)
    queue.put(root)

def fetch_parallel(urls_to_load):
    """ Read and parse urls in parallel """
    result = Queue.Queue()
    processes = [multiprocessing.Process(target=read_and_parse_url, args=(url, result))
                 for url in urls_to_load]
    for p in processes:
        p.start()
    for p in processes:
        p.join(15)  # 15 seconds timeout
    return result
Using the Queue module (result = Queue.Queue()), after running this I check qsize() and it is zero, as if I had never put any data in there (it should be 50+).
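Here is a stripped-down sketch that reproduces what I'm seeing; my assumption is that on a fork-based platform each child process gets its own copy of the Queue.Queue, so the parent's copy never changes:

import multiprocessing
import Queue

def worker(q):
    q.put("some data")  # lands in the child's copy of the queue

if __name__ == "__main__":
    q = Queue.Queue()
    p = multiprocessing.Process(target=worker, args=(q,))
    p.start()
    p.join()
    print q.qsize()  # prints 0 -- the parent's queue was never touched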
If I create the queue with result = multiprocessing.Queue() instead, qsize() reports the size correctly, but I run into a new problem: calling get on the queue gives me this error:
Traceback (most recent call last):
File "test.py", line 329, in <module>
d = scrape()
File "test.py", line 172, in scrape
print parsed_urls.get()
File "lxml.etree.pyx", line 1021, in lxml.etree._Element.__repr__ (src/lxml/lxml.etree.c:37950)
File "lxml.etree.pyx", line 863, in lxml.etree._Element.tag.__get__ (src/lxml/lxml.etree.c:36699)
File "apihelpers.pxi", line 15, in lxml.etree._assertValidNode (src/lxml/lxml.etree.c:10557)
AssertionError: invalid Element proxy at 36856848
A few notes:
- parsed_urls is just the queue
- when I use the threading module, everything works perfectly; the only problem is that I can't easily kill threads, which is why I switched to the multiprocessing module (see the sketch after these notes)
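For reference, that kill-ability is the whole reason I switched; a minimal sketch of what I mean, using Process.terminate() (threads have no equivalent):

import time
import multiprocessing

def slow_worker():
    time.sleep(60)  # stand-in for a hung fetch

if __name__ == "__main__":
    p = multiprocessing.Process(target=slow_worker)
    p.start()
    p.join(15)         # wait up to 15 seconds, as in my code above
    if p.is_alive():
        p.terminate()  # hard-kill the stuck worker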
What is wrong with using the Queue module together with the multiprocessing module? It doesn't seem to work. Any clues? I've searched just about everywhere and can't find an answer.
Answer 0 (score: 1)
Queue.Queue is intended for multithreaded applications (https://docs.python.org/2/library/queue.html); it does not work across processes.
multiprocessing.Queue is intended for multiprocess applications: https://docs.python.org/2/library/multiprocessing.html#exchanging-objects-between-processes
See my full answer here: Python Queue usage works in threading but (apparently) not in multiprocessing
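A minimal sketch of how the worker could be adapted, under the assumption that serializing the tree with lxml.html.tostring() and re-parsing it in the parent side-steps the invalid-proxy error (plain strings pickle cleanly, lxml Element proxies do not):

import multiprocessing
import urllib2
import lxml.html

def read_and_parse_url(url, queue):
    """ Fetch and parse in the child, but ship bytes, not Element proxies """
    data = urllib2.urlopen(url).read()
    root = lxml.html.fromstring(data)
    # multiprocessing.Queue pickles whatever is put on it; a serialized
    # string survives the round trip, an lxml proxy object does not.
    queue.put(lxml.html.tostring(root))

def fetch_parallel(urls_to_load):
    """ Read and parse urls in parallel, using a process-safe queue """
    result = multiprocessing.Queue()
    processes = [multiprocessing.Process(target=read_and_parse_url, args=(url, result))
                 for url in urls_to_load]
    for p in processes:
        p.start()
    for p in processes:
        p.join(15)  # 15 seconds timeout
    return result

The consumer then re-parses each item, e.g. root = lxml.html.fromstring(parsed_urls.get()).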