Script raises an error when run with multiprocessing

Asked: 2018-11-21 16:43:13

标签: python python-3.x web-scraping multiprocessing openpyxl

I've written a script in Python, in combination with BeautifulSoup, to extract the titles of books that are populated after supplying some ISBN numbers to the Amazon search box. I'm reading those ISBN numbers from an Excel file named amazon.xlsx. When I run the following script, it parses the titles accordingly and writes them back to the Excel file as expected.

The link where I supply the ISBN numbers to populate the results:

import requests
from bs4 import BeautifulSoup
from openpyxl import load_workbook

wb = load_workbook('amazon.xlsx')
ws = wb['content']

def get_info(num):
    params = {
        'url': 'search-alias=aps',
        'field-keywords': num
    }
    res = requests.get("https://www.amazon.com/s/ref=nb_sb_noss?",params=params)
    soup = BeautifulSoup(res.text,"lxml")
    itemlink = soup.select_one("a.s-access-detail-page")
    if itemlink:
        get_data(itemlink['href'])

def get_data(link):
    res = requests.get(link)
    soup = BeautifulSoup(res.text,"lxml")
    try:
        itmtitle = soup.select_one("#productTitle").get_text(strip=True)
    except AttributeError: itmtitle = "N\A"

    print(itmtitle)

    ws.cell(row=row, column=2).value = itmtitle
    wb.save("amazon.xlsx")

if __name__ == '__main__':
    for row in range(2, ws.max_row + 1):
        if ws.cell(row=row,column=1).value==None:break
        val = ws["A" + str(row)].value
        get_info(val)

However, when I try to do the same thing using multiprocessing, I get the following error:

ws.cell(row=row, column=2).value = itmtitle
NameError: name 'row' is not defined

The changes I made to the script for multiprocessing are:

from multiprocessing import Pool

if __name__ == '__main__':
    isbnlist = []
    for row in range(2, ws.max_row + 1):
        if ws.cell(row=row,column=1).value==None:break
        val = ws["A" + str(row)].value
        isbnlist.append(val)

    with Pool(10) as p:
        p.map(get_info,isbnlist)
        p.terminate()
        p.join()

A few of the ISBNs I tried:

9781584806844
9780917360664
9780134715308
9781285858265
9780986615108
9780393646399
9780134612966
9781285857589
9781453385982
9780134683461

How can I get rid of that error and get the desired results using multiprocessing?

1 answer:

Answer 0 (score: 1)

Referencing the global variable row in get_data() makes no sense, because:

  1. It is a global variable and is not shared between each "thread" in the multiprocessing pool, since they are actually separate Python processes that do not share globals.

  2. Even if it were shared, because you build the entire ISBN list before executing get_info(), the value of row would always be ws.max_row + 1, since the loop has completed.

So you would need to supply the row value as part of the data passed as the second argument to p.map(). But even if you did that, writing to the spreadsheet and saving it from multiple processes is a bad idea due to file locking on Windows, race conditions, etc. You're better off building the list of titles via multiprocessing and then writing them all out once at the end, like this:

import requests
from bs4 import BeautifulSoup
from openpyxl import load_workbook
from multiprocessing import Pool


def get_info(isbn):
    params = {
        'url': 'search-alias=aps',
        'field-keywords': isbn
    }
    res = requests.get("https://www.amazon.com/s/ref=nb_sb_noss?", params=params)
    soup = BeautifulSoup(res.text, "lxml")
    itemlink = soup.select_one("a.s-access-detail-page")
    if itemlink:
        return get_data(itemlink['href'])


def get_data(link):
    res = requests.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    try:
        itmtitle = soup.select_one("#productTitle").get_text(strip=True)
    except AttributeError:
        itmtitle = "N/A"

    return itmtitle


def main():
    wb = load_workbook('amazon.xlsx')
    ws = wb['content']

    isbnlist = []
    for row in range(2, ws.max_row + 1):
        if ws.cell(row=row, column=1).value is None:
            break
        val = ws["A" + str(row)].value
        isbnlist.append(val)

    with Pool(10) as p:
        titles = p.map(get_info, isbnlist)
        p.terminate()
        p.join()

    for row in range(2, ws.max_row + 1):
        ws.cell(row=row, column=2).value = titles[row - 2]

    wb.save("amazon.xlsx")


if __name__ == '__main__':
    main()