Scrapy gives a 500 Internal Server Error when trying to load nepalstock.com

Asked: 2019-04-07 06:47:50

Tags: python scrapy

When I try to load the URL http://nepalstock.com/todaysprice into the Scrapy shell, it returns a 500 Internal Server Error. Why does this site in particular throw an error like this?

I have tried loading other sites, and they all load fine in the shell. I have also tried fetching the URL both with and without the http prefix.

scrapy shell 'http://nepalstock.com'

2019-04-07 12:09:41 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-04-07 12:09:41 [scrapy.core.engine] INFO: Spider opened
2019-04-07 12:09:41 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://nepalstock.com/robots.txt> (failed 1 times): 500 Internal Server Error
2019-04-07 12:09:41 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://nepalstock.com/robots.txt> (failed 2 times): 500 Internal Server Error
2019-04-07 12:09:42 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://nepalstock.com/robots.txt> (failed 3 times): 500 Internal Server Error
2019-04-07 12:09:42 [scrapy.core.engine] DEBUG: Crawled (500) <GET http://nepalstock.com/robots.txt> (referer: None)
2019-04-07 12:09:42 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://nepalstock.com> (failed 1 times): 500 Internal Server Error
2019-04-07 12:09:42 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://nepalstock.com> (failed 2 times): 500 Internal Server Error
2019-04-07 12:09:42 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://nepalstock.com> (failed 3 times): 500 Internal Server Error
2019-04-07 12:09:42 [scrapy.core.engine] DEBUG: Crawled (500) <GET http://nepalstock.com> (referer: None)

1 Answer:

Answer 0 (score: 0)


> Why does this site in particular throw an error like this?

Because of the User-Agent header.

Many sites respond with an error to requests that use a user agent commonly associated with bots. Scrapy's default user agent is Scrapy/VERSION (+https://scrapy.org), but you can set a different value on the request.

$ scrapy shell
...
>>> req = scrapy.Request(
...     'http://nepalstock.com/todaysprice',
...     headers={'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:66.0) Gecko/20100101 Firefox/66.0'},
... )
>>> fetch(req)
2019-04-07 12:08:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://nepalstock.com/todaysprice> (referer: None)
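If every request in a project needs the browser-like user agent, setting it per request gets repetitive. A sketch of a project-wide alternative, using Scrapy's standard USER_AGENT setting in settings.py (the exact user-agent string here is just an example, not a requirement):

```python
# settings.py of your Scrapy project.
# USER_AGENT overrides Scrapy's default "Scrapy/VERSION (+https://scrapy.org)"
# for all requests made by spiders in this project.
USER_AGENT = 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:66.0) Gecko/20100101 Firefox/66.0'
```

The same setting can also be passed ad hoc on the command line with `-s`, e.g. `scrapy shell -s USER_AGENT='Mozilla/5.0 ...' 'http://nepalstock.com/todaysprice'`, which avoids building a Request object by hand inside the shell.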