BoilerPy3 returns HTTP Error 403: Forbidden

Time: 2020-07-30 21:03:59

Tags: python web-scraping http-error

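When using an extractor in BoilerPy3, I get "HTTP Error 403: Forbidden" on some websites. Looking at the code, it appears to call urllib and accepts only a URL, with no way to pass headers. How can I work around this?

Maybe someone could create a 'boilerpy' tag?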

from urllib.error import HTTPError

from boilerpy3 import extractors

extractor = extractors.ArticleExtractor()

url = 'https://www.enca.com/south-africa/benghazi-hospital-security-tightened-following-car-bombing'
try:
    # BoilerPy3 fetches the page itself via urllib, without custom headers
    doc = extractor.get_doc_from_url(url)
except HTTPError as e:
    print(e)

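1 Answer:

Answer 0: (score: 3)

Rather than trying to modify the urllib call, you can make the request yourself, e.g. with the requests library, and then call BoilerPy3 with the result. For example: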

import requests
from boilerpy3 import extractors


headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/50.0.2661.102 Safari/537.36'
}
url = 'https://www.enca.com/south-africa/benghazi-hospital-security-tightened-following-car-bombing'
extractor = extractors.ArticleExtractor()

resp = requests.get(url, headers=headers)
if resp.ok:
    doc = extractor.get_content(resp.text)
else:
    raise Exception(f'Failed to get URL: {resp.status_code}')

This should give you the expected text.
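If adding requests as a dependency is not an option, the same idea can be sketched with the standard library alone. By default urllib identifies itself with a "Python-urllib/3.x" User-Agent, which some sites reject; wrapping the URL in a Request object lets you send your own headers. The helper names build_request and fetch_html below are hypothetical and not part of BoilerPy3, and a browser-like User-Agent is not guaranteed to satisfy every site.

```python
from urllib.request import Request, urlopen

# Browser-like headers, same User-Agent string as in the requests example.
BROWSER_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/50.0.2661.102 Safari/537.36'
}


def build_request(url, headers=None):
    """Hypothetical helper: wrap the URL in a Request that carries custom headers."""
    return Request(url, headers=headers or BROWSER_HEADERS)


def fetch_html(url, timeout=10.0):
    """Hypothetical helper: download the page and return its HTML as a string.

    urlopen raises urllib.error.HTTPError for responses such as 403.
    """
    with urlopen(build_request(url), timeout=timeout) as resp:
        # Use the charset declared by the server, falling back to UTF-8.
        charset = resp.headers.get_content_charset() or 'utf-8'
        return resp.read().decode(charset, errors='replace')
```

The returned HTML can then be passed to the extractor exactly as in the answer above, e.g. extractor.get_content(fetch_html(url)).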