Python multiprocessing and directory creation

Asked: 2012-08-23 19:10:23

Tags: python file directory multiprocessing

I am using Python's multiprocessing module to scrape a website. The site has more than 100,000 pages. What I want to do is put every 500 pages I retrieve into a separate folder. The problem is that although I successfully create each new folder, my script only ever populates the previous one. Here is the code:

import os
import time
from multiprocessing import Pool

a = 1
b = 500

def fetchAfter(y):
    global a
    global b
    strfile = "E:\\A\\B\\" + str(a) + "-" + str(b) + "\\" + str(y) + ".html"
    if os.path.exists(strfile) == 0:
        f = open(strfile, "w")

if __name__ == '__main__':
    start = time.time()
    for i in range(1, 3):
        os.makedirs("E:\\Results\\Class 9\\" + str(a) + "-" + str(b))

        pool = Pool(processes=12)
        pool.map(fetchAfter, range(a, b))
        pool.close()
        pool.join()
        a = b
        b = b + 500

    print time.time() - start

3 Answers:

Answer 0 (score: 1)

It is best if the worker function relies only on the single argument it receives to decide what to do, because that argument is the only information it gets from the parent process on each call. The argument can be almost any Python object (including a tuple, dict, or list), so you are not really limited in how much information you pass to the workers.

So build a list of 2-tuples. Each 2-tuple should contain (1) the file to fetch and (2) the directory to store it in. Feed that list of tuples to map() and let it rip.

I am not sure that specifying the number of processes is useful. A Pool normally uses as many processes as your CPU has cores, which is usually enough to saturate all of them. :-)

By the way, you only need to call map() once. And since map() blocks until everything is done, there is no need to call join().

Edit: added example code below.

import multiprocessing
import requests
import os

def processfile(arg):
    """Worker function to scrape the pages and write them to a file.

    Keyword arguments:
    arg -- 2-tuple containing the URL of the page and the directory
           where to save it.
    """
    # Unpack the arguments
    url, savedir = arg

    # It might be a good idea to put a random delay of a few seconds here, 
    # so we don't hammer the webserver!

    # Scrape the page. Requests rules ;-)
    r = requests.get(url)
    # Write it, keep the original HTML file name.
    fname = url.split('/')[-1]
    with open(savedir + '/' + fname, 'w+') as outfile:
        outfile.write(r.text)

def main():
    """Main program.
    """
    # This list of tuples should hold all the pages... 
    # Up to you how to generate it, this is just an example.
    worklist = [('http://www.foo.org/page1.html', 'dir1'), 
                ('http://www.foo.org/page2.html', 'dir1'), 
                ('http://www.foo.org/page3.html', 'dir2'), 
                ('http://www.foo.org/page4.html', 'dir2')]
    # Create output directories
    dirlist = ['dir1', 'dir2']
    for d in dirlist:
        os.makedirs(d)
    p = multiprocessing.Pool()
    # Let'er rip!
    p.map(processfile, worklist)
    p.close()

if __name__ == '__main__':
    main()

Answer 1 (score: 0)

As the name implies, multiprocessing uses separate processes. The processes you create with Pool cannot access the original values of a and b, to which you add 500 in the main program. See this previous question.
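
For illustration, a minimal sketch of that behaviour (hypothetical names, not the question's code): the parent updates a module-level variable, but the workers report the value their process started with, not the parent's later update.

from multiprocessing import Pool

counter = 0

def show(_):
    # Each worker process has its own copy of the module globals; with the
    # spawn start method (the default on Windows) the child re-imports the
    # module and sees 0 here, regardless of what the parent assigns below.
    return counter

if __name__ == '__main__':
    counter = 99                      # changed in the parent process only
    pool = Pool(2)
    print pool.map(show, range(4))    # typically [0, 0, 0, 0] on Windows
    pool.close()
    pool.join()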

The simplest solution is to refactor your code so that a and b are passed to fetchAfter (in addition to y), as sketched below.
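
A minimal sketch of that refactor (assuming the target folders live under E:\A\B, matching the path the worker writes to, and with the actual page download left as a placeholder): each task tuple carries y together with the a and b it should use.

import os
from multiprocessing import Pool

def fetchAfter(args):
    # Everything the worker needs arrives in one tuple, so it no longer
    # depends on globals that the worker processes cannot see.
    y, a, b = args
    folder = "E:\\A\\B\\%d-%d" % (a, b)
    strfile = os.path.join(folder, "%d.html" % y)
    if not os.path.exists(strfile):
        f = open(strfile, "w")  # fetch the page and write it here
        f.close()

if __name__ == '__main__':
    a, b = 1, 500
    for i in range(1, 3):
        os.makedirs("E:\\A\\B\\%d-%d" % (a, b))
        pool = Pool(processes=12)
        pool.map(fetchAfter, [(y, a, b) for y in range(a, b)])
        pool.close()
        pool.join()
        a, b = b, b + 500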

Answer 2 (score: 0)

Here is one way to implement it:

#!/usr/bin/env python
import logging
import multiprocessing as mp
import os
import urllib

def download_page(url_path):
    try:
        urllib.urlretrieve(*url_path)
        mp.get_logger().info('done %s' % (url_path,))
    except Exception as e:
        mp.get_logger().error('failed %s: %s' % (url_path, e))

def generate_url_path(rootdir, urls_per_dir=500):
    for i in xrange(100*1000):
        if i % urls_per_dir == 0: # make new dir
            dirpath = os.path.join(rootdir, '%d-%d' % (i, i+urls_per_dir))
            if not os.path.isdir(dirpath):
                os.makedirs(dirpath) # stop if it fails
        url = 'http://example.com/page?' + urllib.urlencode(dict(number=i))
        path = os.path.join(dirpath, '%d.html' % (i,))
        yield url, path

def main():
    mp.log_to_stderr().setLevel(logging.INFO)

    pool = mp.Pool(4) # the number of processes is unrelated to the number of
                      # CPUs because the task is I/O-bound
    for _ in pool.imap_unordered(download_page, generate_url_path(r'E:\A\B')):
        pass

if __name__ == '__main__':
    main()

See also the code in:
Python multiprocessing pool.map for multiple arguments
Brute force basic http authorization using httplib and multiprocessing
how to make HTTP in Python faster?