Question

I am trying to get the text of a web page using the get_text function, as described here.

import urllib.request
from inscriptis import get_text

url = "http://www.informationscience.ch"
html = urllib.request.urlopen(url).read().decode('utf-8')

text = get_text(html)

print(text)

This works for this particular website, but when I try to scrape a different site, I get a 403 error:

import urllib.request
from inscriptis import get_text

url = "https://economictimes.indiatimes.com/markets/stocks/news/birla-group-enters-the-fray-to-acquire-idbi-federal-life/articleshow/64251332.cms"
html = urllib.request.urlopen(url).read().decode('utf-8')

text = get_text(html)

print(text)

This raises the following error on the line html = urllib.request.urlopen(url).read().decode('utf-8'):

HTTPError: HTTP Error 403: Forbidden

I tried to fix it by specifying a user agent, as follows:

import urllib.request
from inscriptis import get_text

url = "https://economictimes.indiatimes.com/markets/stocks/news/birla-group-enters-the-fray-to-acquire-idbi-federal-life/articleshow/64251332.cms"
html = urllib.request.urlopen(url, headers={'User-Agent': 'Mozilla/5.0'}).read().decode('utf-8')

text = get_text(html)

print(text)

But I get the following error:

TypeError: urlopen() got an unexpected keyword argument 'headers'

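Since the error says that urlopen does not accept a headers argument, I tried specifying the user agent with the requests module instead, as follows: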

from inscriptis import get_text
import requests
url = requests.get('https://economictimes.indiatimes.com/markets/stocks/news/birla-group-enters-the-fray-to-acquire-idbi-federal-life/articleshow/64251332.cms', "lxml", headers={'User-Agent': 'Mozilla/5.0'})

print(get_text(url))

But this produces the following error:

AttributeError: 'Response' object has no attribute 'strip'

How can I get this damn server to stop blocking my web scraping?

Answer 1

You need to pass the body of the response to get_text, not the response object itself:

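import requests
from inscriptis import get_text

# "lxml" is dropped here: the second positional argument of requests.get is params, not a parser name.
response = requests.get('https://economictimes.indiatimes.com/markets/stocks/news/birla-group-enters-the-fray-to-acquire-idbi-federal-life/articleshow/64251332.cms', headers={'User-Agent': 'Mozilla/5.0'})

# response.text is the decoded HTML body as a str, which is what get_text expects.
print(get_text(response.text))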
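
The TypeError from the urllib attempt in the question can also be avoided without switching libraries: urlopen() takes no headers keyword, but it accepts a urllib.request.Request object that carries them. Below is a minimal sketch of that approach. The helper names build_request and fetch_html and the 10-second timeout are my own choices for illustration, and it assumes the page is UTF-8 encoded, as the question's code does. Whether a given server accepts this user agent is not guaranteed.

```python
import urllib.request


def build_request(url, user_agent="Mozilla/5.0"):
    # urlopen() has no headers parameter; headers are attached to a
    # Request object, which urlopen() accepts in place of a URL string.
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=10):
    # Assumption: the page is UTF-8 encoded, as in the question's code.
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With that in place, the original script becomes text = get_text(fetch_html(url)).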
