I'm trying to build a scraper that fetches the first 100 pages of a website.
My code looks like this:
import urllib2
from bs4 import BeautifulSoup

def extractproducts(pagenumber):
    contenturl = "http://websiteurl/page/" + str(pagenumber)
    content = BeautifulSoup(urllib2.urlopen(contenturl).read())
    print content

pagenumberlist = range(1, 101)
for pagenumber in pagenumberlist:
    extractproducts(pagenumber)
How can I use the threading module here so that urllib fetches several URLs at once with multiple threads?
/ newb out
Answer 0 (score: 0)
Most likely, you want to use multiprocessing. You can use a Pool to run several calls in parallel:
from multiprocessing import Pool

# Note: this many worker processes may make your system unresponsive for a while
p = Pool(100)
# First argument is the function to call,
# second argument is a list of arguments
# (the function is called on each item in the list)
p.map(extractproducts, pagenumberlist)
If your function returns anything, Pool.map will return a list of the return values:
def f(x):
    return x + 1

results = Pool().map(f, [1, 4, 5])
print(results)  # [2, 5, 6]
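Since the question asked specifically about threads: multiprocessing.dummy exposes the exact same Pool API but backed by threads instead of processes, which is usually the better fit for I/O-bound work like fetching URLs. Below is a minimal sketch; extractproducts is stubbed out to just return the URL string (an assumption, stand-in for the real fetch-and-parse), so you can see that map preserves input order:

```python
# multiprocessing.dummy.Pool has the same interface as multiprocessing.Pool,
# but its workers are threads, not processes (good for network-bound tasks)
from multiprocessing.dummy import Pool as ThreadPool

def extractproducts(pagenumber):
    # Hypothetical stand-in: the real version would fetch and parse the page
    return "http://websiteurl/page/" + str(pagenumber)

pool = ThreadPool(10)  # 10 worker threads; far fewer than 100 is usually enough
results = pool.map(extractproducts, range(1, 101))
pool.close()
pool.join()

print(results[0])  # results come back in the same order as the inputs
```

A modest pool size (around 10) keeps the target server and your own machine happy; 100 concurrent workers rarely fetches faster and may look like abuse to the site.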