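When using the BoilerPy3 extractors, I get "HTTP Error 403: Forbidden" on some websites. Looking at the code, it appears to call urllib and only accepts a bare URL, with no way to pass headers. How can I work around this?
Could someone also create a 'boilerpy' tag?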
from urllib.error import HTTPError

from boilerpy3 import extractors

extractor = extractors.ArticleExtractor()
url = 'https://www.enca.com/south-africa/benghazi-hospital-security-tightened-following-car-bombing'
try:
    doc = extractor.get_doc_from_url(url)
except HTTPError as e:
    # On some sites this prints "HTTP Error 403: Forbidden"
    print(e)
Answer 0 (score: 3)
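Rather than trying to modify the urllib call, you can handle the request yourself, for example with the requests library, and then pass the result to BoilerPy3. For example: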
import requests
from boilerpy3 import extractors

# Send a browser-like User-Agent so the site does not reject the request
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/50.0.2661.102 Safari/537.36'
}
url = 'https://www.enca.com/south-africa/benghazi-hospital-security-tightened-following-car-bombing'
extractor = extractors.ArticleExtractor()
resp = requests.get(url, headers=headers)
if resp.ok:
    # Extract the article text from the HTML that was already downloaded
    doc = extractor.get_content(resp.text)
else:
    raise Exception(f'Failed to get URL: {resp.status_code}')
This should give you the expected text.
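If adding the requests dependency is not an option, the same idea can probably be sketched with the standard library alone: build a `urllib.request.Request` that carries a User-Agent header, download the page yourself, and hand the HTML to the extractor. The helper names `build_request` and `fetch_html`, the `DEFAULT_USER_AGENT` constant, and the 10-second timeout are hypothetical choices for this illustration and are not part of BoilerPy3.

```python
import urllib.request

# Assumed User-Agent value (the same string used in the requests example);
# any browser-like value should serve the same purpose.
DEFAULT_USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36"
)


def build_request(url, user_agent=DEFAULT_USER_AGENT):
    """Return a urllib Request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=10):
    """Download the page and decode it, falling back to UTF-8 if no charset is declared.

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    """
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```

The string returned by `fetch_html(url)` can then be passed to `extractor.get_content(...)` in place of `resp.text`. Whether a given site accepts the request still depends on that site's own filtering, so a 403 remains possible even with the header set.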