I've run into a strange problem with Scrapy spiders on PythonAnywhere.
I'm running several of these spiders. I launch them from a scheduled task that runs a Python script; the script checks whether an instance of the spider is already running and, if not, starts it. The script looks like this:
from tendo import singleton
# Exit immediately if another instance of this script is already running,
# so overlapping scheduled-task runs don't launch the same spider twice.
me = singleton.SingleInstance()

import os

# Context manager that temporarily changes the working directory
# and restores the previous one on exit.
class cd:
    def __init__(self, newPath):
        self.newPath = os.path.expanduser(newPath)

    def __enter__(self):
        self.savedPath = os.getcwd()
        os.chdir(self.newPath)

    def __exit__(self, etype, value, traceback):
        os.chdir(self.savedPath)

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Run the crawl from inside the project directory so that
# get_project_settings() can find the project's scrapy.cfg.
with cd("/home/username/notebooksbilliger"):
    process = CrawlerProcess(get_project_settings())
    process.crawl('notebooksbilliger')
    process.start()
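(The cd context manager is there because get_project_settings() locates the project through the scrapy.cfg in the current working directory. As far as I understand, an equivalent that avoids changing directories would be to point Scrapy at the settings module explicitly; this is only a sketch, and 'notebooksbilliger.settings' assumes the default project layout:)

import os
import sys

# Make the project package importable, then tell Scrapy which settings
# module to use instead of relying on the scrapy.cfg lookup in the cwd.
# ('notebooksbilliger.settings' assumes the default Scrapy project layout.)
sys.path.insert(0, '/home/username/notebooksbilliger')
os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'notebooksbilliger.settings')

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl('notebooksbilliger')
process.start()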
For most of the spiders this script works fine, but one particular spider consistently gets 500 Internal Server Error responses as soon as it starts, as shown below, which makes the crawl abort:
2015-09-03 14:07:19 [scrapy] DEBUG: Retrying <GET http://www.notebooksbilliger.de/handys+smartphones/> (failed 1 times): 500 Internal Server Error
2015-09-03 14:08:13 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-09-03 14:08:31 [scrapy] DEBUG: Retrying <GET http://www.notebooksbilliger.de/handys+smartphones/> (failed 2 times): 500 Internal Server Error
2015-09-03 14:09:12 [scrapy] DEBUG: Gave up retrying <GET http://www.notebooksbilliger.de/handys+smartphones/> (failed 3 times): 500 Internal Server Error
2015-09-03 14:09:12 [scrapy] DEBUG: Crawled (500) <GET http://www.notebooksbilliger.de/handys+smartphones/> (referer: None)
2015-09-03 14:09:12 [scrapy] DEBUG: Ignoring response <500 http://www.notebooksbilliger.de/handys+smartphones/>: HTTP status code is not handled or not allowed
2015-09-03 14:09:12 [scrapy] INFO: Closing spider (finished)
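(To see what the server is actually returning: by default Scrapy's HttpError middleware discards non-2xx responses before they reach a callback, so one way to inspect the error page would be a throwaway spider along these lines; this is just a sketch, and the spider name and class are made up:)

import scrapy

class Debug500Spider(scrapy.Spider):
    # Hypothetical debugging spider, not part of the real project.
    name = 'debug500'
    start_urls = ['http://www.notebooksbilliger.de/handys+smartphones/']
    # Let 500 responses through to parse() instead of letting the
    # HttpError middleware drop them.
    handle_httpstatus_list = [500]

    def parse(self, response):
        # The error page body and headers may show why the server refuses
        # requests from this host (e.g. a blocked IP or user agent).
        self.logger.info('status: %s', response.status)
        self.logger.info('headers: %s', response.headers)
        self.logger.info('body starts with: %r', response.body[:500])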
If I run the spider by simply calling "scrapy crawl notebooksbilliger" in a shell, everything works fine.
Does anyone know why this happens, or can anyone point me toward how to track down the cause?