Python: building a simple web scraper with multiprocessing and requests

Time: 2017-12-05 11:36:21

Tags: python multiprocessing python-requests screen-scraping

I am writing a very simple web scraper to download images for a social science project. So far the scraper works fine, but I am wondering whether there is a way to speed it up using multiprocessing.

What I have done so far:

from multiprocessing import Pool
import os, errno
import urllib.request

import pandas as pd

df = pd.read_csv('/Path/to/source-file.csv')
urls = [x for y in df.values.tolist() for x in y]

links = urls[10::6]
names = urls[6::6]
location = urls[7::6]
date = urls[8::6]
gale = urls[12::6]

def picture_getter(picture):
    global page            # page number within the current article
    global article         # how many articles it has iterated through
    global page_num        # index into the per-article metadata lists
    global page_num_total  # how many pages it has iterated through overall
    print('working on page {} in article {}'.format(page, article))
    page = page + 1
    try:
        os.makedirs("/New/directory/for-each/Article/{}/{}/{}".format(location[page_num], date[page_num], gale[page_num]))
        page = 1
        article = article + 1
    except OSError as e:
        if e.errno != errno.EEXIST:
            raise
    urllib.request.urlretrieve(picture, "/Path/to/new-page/of/Articleinimageform{}/{}/{}/{}{}.jpg".format(location[page_num], date[page_num], gale[page_num], names[page_num], str(page)))
    page_num = page_num + 1

if __name__ == '__main__':
    page_num = 0
    page = 0
    article = 1
    pool = Pool(4)
    pool.map(picture_getter, links)

When I run this, I get the following:

output
working on page 0 in article 1
working on page 0 in article 1
working on page 0 in article 1
working on page 0 in article 1
working on page 1 in article 1
None
working on page 1 in article 1
working on page 1 in article 1
working on page 1 in article 1
None
None
None
working on page 2 in article 1
working on page 2 in article 1
working on page 2 in article 1
None
working on page 2 in article 1
None
None
None
working on page 3 in article 1
working on page 3 in article 1
working on page 3 in article 1
working on page 3 in article 1
None
None
None
None

My assumption is that instead of having each process work on a different article, all of the processes are working on the same article at the same time. So it is the difference between a pin factory with a division of labor and a factory in which every worker performs every step of making a pin. This is not what I want, but I assume the way I iterate over the source lists inside my function is to blame here. What I want is more like:

worker process 1: working on page 1 article 1
worker process 2: working on page 2 article 1
worker process 3: working on page 1 article 2
worker process 4: working on page 1 article 3
worker process 1: working on page 2 article 3
worker process 2: working on page 3 article 3
...
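The repeated counts above come from the fact that each worker process gets its own private copy of module-level globals, so increments in one worker are invisible to the others and to the parent. A minimal demo (hypothetical names, not from the code above) makes this visible:

```python
from multiprocessing import Pool

counter = 0  # module-level global; each worker process gets its own copy


def bump(_):
    """Increment this process's private copy of `counter` and return it."""
    global counter
    counter += 1
    return counter


if __name__ == '__main__':
    with Pool(4) as pool:
        results = pool.map(bump, range(8))
    # The parent's `counter` is untouched: only the workers' private
    # copies were incremented, which is why several workers can all
    # report the same low counts.
    print(counter, sorted(results))
```

This is the same effect as four workers all printing "working on page 0 in article 1": each process starts from its own copy of `page` and `article`.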

I have a few questions:

Am I using the map and Pool functions correctly?
Is there a better way to iterate over my lists? It works fine with a normal for loop.
Am I reading the output correctly? Are the processes really all working on the same page at the same time?

Complete amateur here.
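One common restructuring, sketched here with hypothetical names rather than as a definitive fix, is to stop mutating globals altogether and instead pass each task everything it needs as an argument, e.g. an (index, url) pair built with enumerate, so that no worker depends on shared counters:

```python
from multiprocessing import Pool


def download_page(task):
    """Handle one (index, url) pair; the index replaces the global counters."""
    index, link = task
    # Build directory names / file names from `index` here, then fetch
    # `link` (e.g. with urllib.request.urlretrieve) in the real scraper.
    return index  # report which item this worker handled


if __name__ == '__main__':
    links = ['http://example.com/a.jpg', 'http://example.com/b.jpg']
    with Pool(2) as pool:
        done = pool.map(download_page, enumerate(links))
    print(done)  # pool.map returns results in input order: [0, 1]
```

Because pool.map returns results in the order of the input iterable, the index is enough to look up the matching entries in the names, location, date, and gale lists without any shared state.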

0 Answers:

No answers yet