I am trying to use the get_text function to extract the text of a web page, as described here.
import urllib.request
from inscriptis import get_text
url = "http://www.informationscience.ch"
html = urllib.request.urlopen(url).read().decode('utf-8')
text = get_text(html)
print(text)
This works for this particular website, but when I try to scrape another site I get a 403 error:
import urllib.request
from inscriptis import get_text
url = "https://economictimes.indiatimes.com/markets/stocks/news/birla-group-enters-the-fray-to-acquire-idbi-federal-life/articleshow/64251332.cms"
html = urllib.request.urlopen(url).read().decode('utf-8')
text = get_text(html)
print(text)
This raises the following error on the line html = urllib.request.urlopen(url).read().decode('utf-8'):
HTTPError: HTTP Error 403: Forbidden
I tried to fix it by specifying a user agent, as follows:
import urllib.request
from inscriptis import get_text
url = "https://economictimes.indiatimes.com/markets/stocks/news/birla-group-enters-the-fray-to-acquire-idbi-federal-life/articleshow/64251332.cms"
html = urllib.request.urlopen(url, headers={'User-Agent': 'Mozilla/5.0'}).read().decode('utf-8')
text = get_text(html)
print(text)
But then I get the following error:
TypeError: urlopen() got an unexpected keyword argument 'headers'
Since urlopen does not accept a headers argument, I tried specifying the user agent with the requests module instead, as follows:
from inscriptis import get_text
import requests
url = requests.get('https://economictimes.indiatimes.com/markets/stocks/news/birla-group-enters-the-fray-to-acquire-idbi-federal-life/articleshow/64251332.cms', "lxml", headers={'User-Agent': 'Mozilla/5.0'})
print(get_text(url))
But this produces the following error:
AttributeError: 'Response' object has no attribute 'strip'
How can I get this damn server to stop blocking my web scraping?
Answer 0 (score: 1)
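You need to pass the body of the response (response.text) to get_text, not the Response object itself. Also drop the stray "lxml" argument: the second positional parameter of requests.get is params, so that string would end up in the query string rather than selecting a parser.
import requests
from inscriptis import get_text
response = requests.get('https://economictimes.indiatimes.com/markets/stocks/news/birla-group-enters-the-fray-to-acquire-idbi-federal-life/articleshow/64251332.cms', headers={'User-Agent': 'Mozilla/5.0'})
print(get_text(response.text))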
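If you would rather stay with urllib, the TypeError in the question comes from passing headers to urlopen, which does not take that keyword. Headers belong on a urllib.request.Request object, which is then handed to urlopen. The following is a minimal sketch of that approach, not a guaranteed fix: build_request and fetch_html are hypothetical helper names, the 10-second timeout is an arbitrary choice, and the server may still answer 403 if it checks more than the User-Agent header.

```python
import urllib.request

# Assumption: the same generic browser-like User-Agent used in the question.
USER_AGENT = "Mozilla/5.0"


def build_request(url, user_agent=USER_AGENT):
    """Return a Request that carries a User-Agent header (hypothetical helper)."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=10):
    """Download a page and decode it with the charset the server declares."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        # Fall back to UTF-8 when the response does not declare a charset.
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


# Usage (performs a network request, so it is left commented out):
# from inscriptis import get_text
# print(get_text(fetch_html(url)))
```

As with the requests version, get_text receives a plain string here, which is what avoids the AttributeError about 'strip'.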