I am using Python's multiprocessing module to scrape a website. The site has more than 100,000 pages. What I want to do is put every 500 pages I retrieve into a separate folder. The problem is that although I successfully create a new folder, my script only populates the previous folder. Here is the code:
global a = 1
global b = 500

def fetchAfter(y):
    global a
    global b

    strfile = "E:\\A\\B\\" + str(a) + "-" + str(b) + "\\" + str(y) + ".html"

    if (os.path.exists( os.path.join( "E:\\A\\B\\" + str(a) + "-" + str(b) + "\\", str(y) + ".html" )) == 0):
        f = open(strfile, "w")

if __name__ == '__main__':
    start = time.time()
    for i in range(1,3):
        os.makedirs("E:\\Results\\Class 9\\" + str(a) + "-" + str(b))

        pool = Pool(processes=12)
        pool.map(fetchAfter, range(a,b))
        pool.close()
        pool.join()

        a = b
        b = b + 500

    print time.time()-start
Answer 0 (score: 1)
It is best to have the worker function rely only on the single argument it receives to determine what to do, because that is the only information it gets from the parent process on each call. This argument can be almost any Python object (including a tuple, dict, or list), so you are not really limited in the amount of information you can pass to the worker.

So make a list of 2-tuples. Each 2-tuple should contain (1) the file to fetch and (2) the directory to store it in. Feed that list of tuples to map() and let it rip.
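For example, here is a minimal sketch of how such a worklist could be built for roughly 100,000 pages with 500 pages per directory; the URL pattern and base directory below are assumptions, not taken from the question:

    import os

    # Hypothetical sketch: build (url, savedir) 2-tuples, grouping every
    # 500 pages into its own directory. URL pattern and base directory
    # are assumptions.
    BASE_URL = 'http://www.example.com/page%d.html'
    BASE_DIR = 'E:\\A\\B'
    PAGES_PER_DIR = 500

    worklist = []
    for start in range(1, 100001, PAGES_PER_DIR):
        savedir = os.path.join(BASE_DIR, '%d-%d' % (start, start + PAGES_PER_DIR - 1))
        if not os.path.isdir(savedir):
            os.makedirs(savedir)
        for page in range(start, start + PAGES_PER_DIR):
            worklist.append((BASE_URL % page, savedir))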
I'm not sure it is useful to specify the number of processes to use. By default, a Pool uses as many processes as your CPU has cores. That is usually enough to max out all the cores. :-)
By the way, you should only call map() once. And since map() blocks until everything is done, there is no need to call join().
Edit: added example code below.
import multiprocessing
import requests
import os

def processfile(arg):
    """Worker function to scrape the pages and write them to a file.

    Keyword arguments:
    arg -- 2-tuple containing the URL of the page and the directory
           where to save it.
    """
    # Unpack the arguments
    url, savedir = arg

    # It might be a good idea to put a random delay of a few seconds here,
    # so we don't hammer the webserver!

    # Scrape the page. Requests rules ;-)
    r = requests.get(url)
    # Write it, keep the original HTML file name.
    fname = url.split('/')[-1]
    with open(savedir + '/' + fname, 'w+') as outfile:
        outfile.write(r.text)

def main():
    """Main program.
    """
    # This list of tuples should hold all the pages...
    # Up to you how to generate it, this is just an example.
    worklist = [('http://www.foo.org/page1.html', 'dir1'),
                ('http://www.foo.org/page2.html', 'dir1'),
                ('http://www.foo.org/page3.html', 'dir2'),
                ('http://www.foo.org/page4.html', 'dir2')]
    # Create output directories
    dirlist = ['dir1', 'dir2']
    for d in dirlist:
        os.makedirs(d)

    p = multiprocessing.Pool()
    # Let 'er rip!
    p.map(processfile, worklist)
    p.close()

if __name__ == '__main__':
    main()
Answer 1 (score: 0)
As the name implies, multiprocessing uses separate processes. The processes you create with Pool do not have access to the original values of a and b, which you keep incrementing by 500 in the main program. See this previous question.

The easiest solution is to refactor your code so that you pass a and b to fetchAfter (in addition to passing y).
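A minimal sketch of that refactoring, under the assumption of the question's directory layout: bundle y, a and b into one tuple, since Pool.map() passes a single argument to the worker.

    import os
    from multiprocessing import Pool

    def fetchAfter(args):
        # a and b now arrive with every call instead of being read from globals
        y, a, b = args
        strfile = os.path.join('E:\\A\\B', '%d-%d' % (a, b), '%d.html' % y)
        if not os.path.exists(strfile):
            with open(strfile, 'w') as f:
                f.write('')  # placeholder; the real fetch-and-write goes here

    if __name__ == '__main__':
        a, b = 1, 500
        for i in range(1, 3):
            os.makedirs(os.path.join('E:\\A\\B', '%d-%d' % (a, b)))

            pool = Pool(processes=12)
            pool.map(fetchAfter, [(y, a, b) for y in range(a, b)])
            pool.close()
            pool.join()

            a = b
            b = b + 500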
Answer 2 (score: 0)
Here's one way to implement it:
#!/usr/bin/env python
import logging
import multiprocessing as mp
import os
import urllib

def download_page(url_path):
    try:
        urllib.urlretrieve(*url_path)
        mp.get_logger().info('done %s' % (url_path,))
    except Exception as e:
        mp.get_logger().error('failed %s: %s' % (url_path, e))

def generate_url_path(rootdir, urls_per_dir=500):
    for i in xrange(100*1000):
        if i % urls_per_dir == 0:  # make new dir
            dirpath = os.path.join(rootdir, '%d-%d' % (i, i+urls_per_dir))
            if not os.path.isdir(dirpath):
                os.makedirs(dirpath)  # stop if it fails
        url = 'http://example.com/page?' + urllib.urlencode(dict(number=i))
        path = os.path.join(dirpath, '%d.html' % (i,))
        yield url, path

def main():
    mp.log_to_stderr().setLevel(logging.INFO)

    pool = mp.Pool(4)  # the number of processes is unrelated to the number
                       # of CPUs because the task is IO-bound
    for _ in pool.imap_unordered(download_page, generate_url_path(r'E:\A\B')):
        pass

if __name__ == '__main__':
    main()
See also Python multiprocessing pool.map for multiple arguments and Brute force basic http authorization using httplib and multiprocessing.