How do I resolve these errors caused by a server blocking web scraping?

Time: 2018-05-21 07:54:36

Tags: python web-crawler

I am trying to get the text of a web page using the get_text function from inscriptis:

import urllib.request
from inscriptis import get_text

url = "http://www.informationscience.ch"
html = urllib.request.urlopen(url).read().decode('utf-8')

text = get_text(html)

print(text)

This works for that particular website, but when I try to scrape a different site, I get a 403 error:

import urllib.request
from inscriptis import get_text

url = "https://economictimes.indiatimes.com/markets/stocks/news/birla-group-enters-the-fray-to-acquire-idbi-federal-life/articleshow/64251332.cms"
html = urllib.request.urlopen(url).read().decode('utf-8')

text = get_text(html)

print(text)

This raises the following error on the line html = urllib.request.urlopen(url).read().decode('utf-8'):

HTTPError: HTTP Error 403: Forbidden

I tried to fix it by specifying a user agent, as follows:

import urllib.request
from inscriptis import get_text

url = "https://economictimes.indiatimes.com/markets/stocks/news/birla-group-enters-the-fray-to-acquire-idbi-federal-life/articleshow/64251332.cms"
html = urllib.request.urlopen(url, headers={'User-Agent': 'Mozilla/5.0'}).read().decode('utf-8')

text = get_text(html)

print(text)

But then I get the following error:

TypeError: urlopen() got an unexpected keyword argument 'headers'

Since the error shows that urlopen does not accept a headers argument, I tried specifying the user agent with the requests module instead:

from inscriptis import get_text
import requests
url = requests.get('https://economictimes.indiatimes.com/markets/stocks/news/birla-group-enters-the-fray-to-acquire-idbi-federal-life/articleshow/64251332.cms', "lxml", headers={'User-Agent': 'Mozilla/5.0'})

print(get_text(url))

But this produces the following error:

AttributeError: 'Response' object has no attribute 'strip'

How can I get this damn server to stop blocking my web scraping?

1 Answer:

Answer 0: (score: 1)

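You need to process the body of the response, not the response object itself. get_text expects an HTML string, which requests exposes as response.text. The stray "lxml" argument is also dropped here: requests.get treats a second positional argument as query parameters, not as a parser name.

response = requests.get('https://economictimes.indiatimes.com/markets/stocks/news/birla-group-enters-the-fray-to-acquire-idbi-federal-life/articleshow/64251332.cms', headers={'User-Agent': 'Mozilla/5.0'})

# Pass the decoded HTML body, not the Response object
print(get_text(response.text))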
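
The urlopen attempt from the question can probably be made to work as well. The TypeError arises because urlopen has no headers keyword; with the standard library, headers are attached to a urllib.request.Request object, which is then passed to urlopen. The following is a minimal sketch under that assumption. The helper names build_request and fetch_html, the default timeout, and the UTF-8 fallback are illustrative choices, not part of the original answer.

```python
import urllib.request


def build_request(url, user_agent="Mozilla/5.0"):
    """Return a Request carrying a User-Agent header (hypothetical helper)."""
    # urlopen() has no headers parameter; headers belong on the Request object.
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=10):
    """Download the page and decode it (hypothetical helper)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        # Use the charset declared by the server; fall back to UTF-8 (an assumption).
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)
```

The string returned by fetch_html(url) can then be passed to get_text exactly as in the first snippet of the question. A server that filters on criteria other than the user agent may still answer with 403, so this is not guaranteed to work for every site.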