I am using Python's multiprocessing module to scrape a website. The site has more than 100,000 pages. What I want to do is put every 500 pages I retrieve into a separate folder. The problem is that although I successfully create a new folder, my script only populates the previous folder. Here is the code:
global a = 1
global b = 500

def fetchAfter(y):
    global a
    global b

    strfile = "E:\\A\\B\\" + str(a) + "-" + str(b) + "\\" + str(y) + ".html"

    if (os.path.exists( os.path.join( "E:\\A\\B\\" + str(a) + "-" + str(b) + "\\", str(y) + ".html" )) == 0):
        f = open(strfile, "w")

if __name__ == '__main__':
    start = time.time()
    for i in range(1,3):
        os.makedirs("E:\\Results\\Class 9\\" + str(a) + "-" + str(b))

        pool = Pool(processes=12)
        pool.map(fetchAfter, range(a,b))
        pool.close()
        pool.join()

        a = b
        b = b + 500

    print time.time()-start
Answer 0 (score: 1)
It is best to have the worker function rely only on the single argument it receives to determine what to do, because that is the only information it gets from the parent process on each call. This argument can be almost any Python object (including a tuple, dict, or list), so you are not really limited in the amount of information you can pass to the worker.

So make a list of 2-tuples. Each 2-tuple should contain (1) the file to fetch and (2) the directory to store it in. Feed that list of tuples to map() and let it rip.
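For example, here is a minimal sketch of how such a worklist could be built for roughly 100,000 pages with 500 pages per directory; the URL pattern and base directory below are assumptions, not taken from the question:

    import os

    # Hypothetical sketch: build (url, savedir) 2-tuples, grouping every
    # 500 pages into its own directory. URL pattern and base directory
    # are assumptions.
    BASE_URL = 'http://www.example.com/page%d.html'
    BASE_DIR = 'E:\\A\\B'
    PAGES_PER_DIR = 500

    worklist = []
    for start in range(1, 100001, PAGES_PER_DIR):
        savedir = os.path.join(BASE_DIR, '%d-%d' % (start, start + PAGES_PER_DIR - 1))
        if not os.path.isdir(savedir):
            os.makedirs(savedir)
        for page in range(start, start + PAGES_PER_DIR):
            worklist.append((BASE_URL % page, savedir))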
I'm not sure it is useful to specify the number of processes to use. By default, a Pool uses as many processes as your CPU has cores. That is usually enough to max out all the cores. :-)
By the way, you should only call map() once. And since map() blocks until everything is done, there is no need to call join().
Edit: added example code below.
import multiprocessing
import requests
import os

def processfile(arg):
    """Worker function to scrape the pages and write them to a file.

    Keyword arguments:
    arg -- 2-tuple containing the URL of the page and the directory
           where to save it.
    """
    # Unpack the arguments
    url, savedir = arg

    # It might be a good idea to put a random delay of a few seconds here,
    # so we don't hammer the webserver!

    # Scrape the page. Requests rules ;-)
    r = requests.get(url)
    # Write it, keep the original HTML file name.
    fname = url.split('/')[-1]
    with open(savedir + '/' + fname, 'w+') as outfile:
        outfile.write(r.text)

def main():
    """Main program.
    """
    # This list of tuples should hold all the pages...
    # Up to you how to generate it, this is just an example.
    worklist = [('http://www.foo.org/page1.html', 'dir1'),
                ('http://www.foo.org/page2.html', 'dir1'),
                ('http://www.foo.org/page3.html', 'dir2'),
                ('http://www.foo.org/page4.html', 'dir2')]
    # Create output directories
    dirlist = ['dir1', 'dir2']
    for d in dirlist:
        os.makedirs(d)

    p = multiprocessing.Pool()
    # Let 'er rip!
    p.map(processfile, worklist)
    p.close()

if __name__ == '__main__':
    main()
Answer 1 (score: 0)
As the name implies, multiprocessing uses separate processes. The processes you create with Pool do not have access to the original values of a and b, which you keep incrementing by 500 in the main program. See this previous question.

The easiest solution is to refactor your code so that you pass a and b to fetchAfter (in addition to passing y).
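A minimal sketch of that refactoring, under the assumption of the question's directory layout: bundle y, a and b into one tuple, since Pool.map() passes a single argument to the worker.

    import os
    from multiprocessing import Pool

    def fetchAfter(args):
        # a and b now arrive with every call instead of being read from globals
        y, a, b = args
        strfile = os.path.join('E:\\A\\B', '%d-%d' % (a, b), '%d.html' % y)
        if not os.path.exists(strfile):
            with open(strfile, 'w') as f:
                f.write('')  # placeholder; the real fetch-and-write goes here

    if __name__ == '__main__':
        a, b = 1, 500
        for i in range(1, 3):
            os.makedirs(os.path.join('E:\\A\\B', '%d-%d' % (a, b)))

            pool = Pool(processes=12)
            pool.map(fetchAfter, [(y, a, b) for y in range(a, b)])
            pool.close()
            pool.join()

            a = b
            b = b + 500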
Answer 2 (score: 0)
Here's one way to implement it:
#!/usr/bin/env python
import logging
import multiprocessing as mp
import os
import urllib

def download_page(url_path):
    try:
        urllib.urlretrieve(*url_path)
        mp.get_logger().info('done %s' % (url_path,))
    except Exception as e:
        mp.get_logger().error('failed %s: %s' % (url_path, e))

def generate_url_path(rootdir, urls_per_dir=500):
    for i in xrange(100*1000):
        if i % urls_per_dir == 0:  # make new dir
            dirpath = os.path.join(rootdir, '%d-%d' % (i, i+urls_per_dir))
            if not os.path.isdir(dirpath):
                os.makedirs(dirpath)  # stop if it fails
        url = 'http://example.com/page?' + urllib.urlencode(dict(number=i))
        path = os.path.join(dirpath, '%d.html' % (i,))
        yield url, path

def main():
    mp.log_to_stderr().setLevel(logging.INFO)

    pool = mp.Pool(4)  # the number of processes is unrelated to the number
                       # of CPUs because the task is IO-bound
    for _ in pool.imap_unordered(download_page, generate_url_path(r'E:\A\B')):
        pass

if __name__ == '__main__':
    main()
See also Python multiprocessing pool.map for multiple arguments and Brute force basic http authorization using httplib and multiprocessing.