Question

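I am scraping odds from a website. A simple query for the title text returns ['OddsPortal: Page not found'], but in the browser console this ['OddsPortal: Page not found'] does not appear. I noticed that the response when the shell loads is: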

[s]   response   <404 https://www.oddsportal.com/darts/europe/european-championship/results/>

In my terminal:

scrapy shell 'https://www.oddsportal.com/darts/europe/european-championship/results/' --set="ROBOTSTXT_OBEY=False"

response.css('title::text').extract()
['OddsPortal: Page not found']

I expected the selector above to return:

European Championship Results & Historical Odds, Darts Europe Archive

Answer 1

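I also get this error when running my own request. As shown here, this site does not allow scraping. My guess is that they have some protection in place to block such attempts. I had success using a non-headless version of Selenium, and I suggest you scrape it that way. Most of the site also appears to be dynamic JavaScript, which is another +1 for Selenium. In this example I use Beautiful Soup for parsing, which I highly recommend.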

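from bs4 import BeautifulSoup
from selenium import webdriver

# Open a visible (non-headless) Chrome window; Chrome must be installed
driver = webdriver.Chrome()
driver.get('https://www.oddsportal.com/darts/europe/european-championship/results/')
# Parse the rendered HTML with Beautiful Soup
soup = BeautifulSoup(driver.page_source, 'html.parser')
print(soup.title.text)
driver.quit()  # close the browser when done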

#output
#European Championship Results & Historical Odds, Darts Europe Archive
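
If you want to reuse this, here is a minimal sketch that wraps the same approach in functions and checks whether you got the real page or the error page. The helper names (extract_title, is_not_found, fetch_title) and the NOT_FOUND_MARKER constant are my own, not part of Selenium or Beautiful Soup, and the marker is only an assumption based on the title Scrapy received in the question. Selenium is imported inside fetch_title so the parsing helpers can be tried without a browser.

```python
from bs4 import BeautifulSoup

# Assumption: the error page can be recognised by this phrase in its <title>,
# as seen in the question's Scrapy output.
NOT_FOUND_MARKER = "Page not found"


def extract_title(html: str) -> str:
    """Return the stripped <title> text, or an empty string if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    if soup.title is None:
        return ""
    return soup.title.get_text(strip=True)


def is_not_found(title: str) -> bool:
    """True when the title looks like the site's error page."""
    return NOT_FOUND_MARKER.lower() in title.lower()


def fetch_title(url: str) -> str:
    """Load url in a visible Chrome window and return the page title."""
    # Imported here so the helpers above work without Selenium installed.
    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        return extract_title(driver.page_source)
    finally:
        driver.quit()  # always close the browser, even if loading fails
```

Call fetch_title with the URL from the question and pass the result to is_not_found to see whether the site served the error page. This does not explain why Scrapy gets a 404; it only makes the difference easy to detect.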

404 response in scrapy shell, different result in browser
