Question

I want to scrape the following website, https://www.shopee.sg:

~$ scrapy shell https://www.shopee.sg

But I get a 404 error:

[s]   request    <GET https://www.shopee.sg>
[s]   response   <404 https://shopee.sg/>

However, urllib2 can open the same URL without any problem:

import urllib2
response = urllib2.urlopen('https://www.shopee.sg')
print len(response.read())


Answer 1

The site appears to check the User-Agent string and blocks Scrapy. If you set it to, for example, a Chromium user agent string via the USER_AGENT setting, it works:

scrapy shell -s USER_AGENT="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.52 Safari/537.36" "https://www.shopee.sg"
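If you want the same override to apply to every run rather than a single shell session, one option is to put it in the project's settings.py. This is only a minimal sketch: it assumes a standard Scrapy project layout and that the site keeps accepting this particular browser string, which is simply the one from the command above.

```python
# settings.py (project-level Scrapy settings)
# Assumption: the site accepts this browser User-Agent string; it is the
# same value passed with -s USER_AGENT on the command line above.
USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/28.0.1500.52 Safari/537.36"
)
```

A single spider can also set the same key in its custom_settings class attribute if the rest of the project should keep the default.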
