Multiprocessing with BeautifulSoup bs4.element.Tag

Asked: 2015-08-08 01:45:20

Tags: python-2.7 beautifulsoup pickle python-multiprocessing

I am trying to use multiprocessing together with BeautifulSoup, but I run into a maximum recursion depth exceeded error:

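import multiprocessing

from bs4 import BeautifulSoup


def process_card(card):
    result = card.find("p")
    # Do some more parsing with beautifulsoup

    return result


articles = []
pool = multiprocessing.Pool(processes=4)
# BeautifulSoup parses markup, so url must hold the page's HTML, not its address
soup = BeautifulSoup(url, 'html.parser')
cards = soup.findAll("li")
for card in cards:
    # card is a bs4.element.Tag and has to be pickled to reach the worker process
    result = pool.apply_async(process_card, [card])
    article = result.get()
    if article is not None:
        print article
        articles.append(article)
pool.close()
pool.join()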

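From what I have gathered, card is of type <class bs4.element.Tag>, and the problem may be related to pickling this object. It is not clear to me how to modify my code to fix this.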
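As a rough illustration of why pickling could be the culprit: multiprocessing pickles every argument it sends to a worker, and pickle follows object references recursively. A Tag keeps references to its parent and to neighboring elements, so pickling a single card can plausibly pull in a long chain of linked objects. The sketch below reproduces that effect with the standard library only. The Node class and the helper names are hypothetical stand-ins, not bs4 internals.

```python
import pickle


class Node(object):
    """Hypothetical stand-in for a parse-tree element that links to a neighbor."""

    def __init__(self, text):
        self.text = text
        self.next_node = None


def build_chain(length):
    """Build `length` nodes, each one referencing the next."""
    head = Node("item 0")
    current = head
    for i in range(1, length):
        current.next_node = Node("item %d" % i)
        current = current.next_node
    return head


def can_pickle(obj):
    """Return True if pickle.dumps succeeds, False if it runs out of recursion depth."""
    try:
        pickle.dumps(obj)
        return True
    except RuntimeError:  # RecursionError is a RuntimeError subclass on Python 3
        return False


chain = build_chain(50000)
chain_ok = can_pickle(chain)      # pickle has to follow every link in the chain
text_ok = can_pickle(chain.text)  # a plain string has no links to follow
```

Here chain_ok ends up False while text_ok is True, which is the same contrast as sending a linked Tag versus sending its string form.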

1 Answer:

Answer 0: (score: 2)

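It was pointed out in the comments that one can simply convert card to unicode. However, that made process_card fail with the error slice indices must be integers or None or have an __index__ method. It turns out this error occurs because card is then no longer a bs4 object, so the bs4 methods are not available on it. Instead, card is just a unicode string, and the error is a unicode-related one. So card first has to be turned back into soup inside the worker, and the parsing proceeds from there. This works!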
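One plausible source of that message, sketched with the standard library only: a string also has a find method, but its extra arguments are slice bounds rather than bs4 filters. The parsing code that raised the error is not shown in the question, so the second call below (and the class name "summary") is an assumed example, not the asker's actual code.

```python
# What the worker receives once the card has been converted to a string.
card_text = u"<li><p>text</p></li>"

# str.find(sub[, start[, end]]) looks for a substring and returns its index,
# not an element.
index = card_text.find("p")

# A bs4-style call with a second filter argument is interpreted as a slice start.
try:
    card_text.find("p", "summary")  # "summary" is a hypothetical class name
    error = None
except TypeError as exc:
    error = exc
```

After this runs, index is an integer position in the string and error holds the TypeError, which shows that the call never reached bs4 at all.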

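import multiprocessing

from bs4 import BeautifulSoup


def process_card(unicode_card):
    # Rebuild a bs4 object from the string so the bs4 methods are available again
    card = BeautifulSoup(unicode_card, 'html.parser')
    result = card.find("p")
    # Do some more parsing with beautifulsoup

    return result


articles = []
pool = multiprocessing.Pool(processes=4)
# BeautifulSoup parses markup, so url must hold the page's HTML, not its address
soup = BeautifulSoup(url, 'html.parser')
cards = soup.findAll("li")
for card in cards:
    # Send a plain unicode string to the worker instead of the Tag itself
    result = pool.apply_async(process_card, [unicode(card)])
    article = result.get()
    if article is not None:
        print article
        articles.append(article)
pool.close()
pool.join()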
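A possible refinement, offered as a sketch rather than as part of the accepted fix: calling result.get() right after apply_async blocks until that one task finishes, so the loop above handles the cards one at a time. Handing the whole list to pool.map lets the workers run side by side. The version below is written for Python 3 (on Python 2.7, unicode(card) takes the place of str(card)). SAMPLE_HTML and collect_articles are hypothetical names, and the worker returns plain text because the return value is pickled on the way back as well.

```python
import multiprocessing

from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the downloaded page.
SAMPLE_HTML = u"<ul><li><p>first</p></li><li>no paragraph</li><li><p>third</p></li></ul>"


def process_card(card_html):
    """Re-parse one serialized <li> and return the text of its first <p>, or None."""
    card = BeautifulSoup(card_html, "html.parser")
    paragraph = card.find("p")
    # Return a plain string, not a Tag: results travel back through pickle too.
    return paragraph.get_text() if paragraph is not None else None


def collect_articles(html, processes=2):
    """Parse `html`, process every <li> in a worker pool, and drop empty results."""
    soup = BeautifulSoup(html, "html.parser")
    # Serialize each Tag before it crosses the process boundary.
    card_strings = [str(card) for card in soup.find_all("li")]
    pool = multiprocessing.Pool(processes=processes)
    try:
        # map submits all cards at once and returns results in input order.
        results = pool.map(process_card, card_strings)
    finally:
        pool.close()
        pool.join()
    return [article for article in results if article is not None]
```

When run as a script, the call collect_articles(SAMPLE_HTML) belongs under an if __name__ == "__main__": guard so that worker processes can import the module without starting a pool of their own.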