我为自己制造了刮板机。在同一页面上有多个目标,我想创建一个包含所有“ URL”的列表,然后将其抓取。抓取需要一些时间,我需要同时抓取它们。因为我不想为x url“维护” x Skripts,所以我要进行多重处理并为“列表”中的每个url生成一个进程。经过一番duckduckgo并在此处
https://keyboardinterrupt.org/multithreading-in-python-2-7/和此处
When should we call multiprocessing.Pool.join?阅读例如,我想到了提供的代码。
在cmd行中执行的代码将执行主循环,但不进入scrape()函数(内部将显示一些未输出的打印消息)。没有给出错误消息,脚本正常退出。
我想念什么?
我在Win64上使用Python 2.7。
我已经读过:
Threading pool similar to the multiprocessing Pool?
https://docs.python.org/2/library/threading.html
https://keyboardinterrupt.org/multithreading-in-python-2-7/
但是我没有帮助。
def main():
try:
from multiprocessing import process
from multiprocessing.pool import ThreadPool
from multiprocessing import pool
thread_count = 10 # Define the limit of concurrent running threads
thread_pool = ThreadPool(processes=thread_count) # Define the thread pool to keep track of the sub processes
known_threads = {}
list=[]
list=def_list() # Just assigns the url's to the list
for entry in range(len(list)):
print 'starting to scrape'
print list[entry]
known_threads[entry] = thread_pool.apply_async(scrape, args=(list[entry]))
thread_pool.close() # After all threads started we close the pool
thread_pool.join() # And wait until all threads are done
except Exception, err:
print Exception, err, 'Failed in main loop'
pass