I'm using Scrapy 0.24 to scrape data from a website. However, I can't get any request to fire from my callback method parse_summary.
class ExampleSpider(scrapy.Spider):
    name = "tfrrs"
    allowed_domains = ["example.org"]
    start_urls = (
        'http://www.example.org/results_search.html?page=0&sport=track&title=1&go=1',
    )

    def __init__(self, *args, **kwargs):
        super(TfrrsSpider, self).__init__(*args, **kwargs)
        self.start_urls = ['http://www.example.org/results_search.html?page=0&sport=track'&title=1&go=1',]
        pass

    # works without issue
    def parse(self, response):
        races = response.xpath("//table[@width='100%']").xpath(".//a[starts-with(@href, 'http://www.tfrrs.org/results/')]/@href").extract()
        callback = self.parse_trackfieldsummary
        for race in races:
            yield scrapy.Request(race, callback=self.parse_summary)
        pass

    # works without issue
    def parse_summary(self, response):
        baseurl = 'http://www.example.org/results/'
        results = response.xpath("//div[@class='data']").xpath('.//a[@style="margin-left: 20px;"]/@href').extract()
        for result in results:
            print(baseurl+result)  # shows that url is correct everytime
            yield scrapy.Request(baseurl+result, callback=self.parse_compiled)

    # is never fired or shown in terminal
    def parse_compiled(self, response):
        print('test')
        results = response.xpath("//table[@style='width: 935px;']")
        print(results)
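When a request made in parse_summary fails (because of a domain error, for example), I can see the error in the terminal, but when I use a correct URL, it is as if the request was never made at all. I also tested the URLs requested in parse_summary from within the parse method, and they work as expected. What could cause them not to fire from parse_summary yet succeed from parse? Thanks for your help.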
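After making some changes to the spider, I still get the same result. However, it works if I use a brand-new project, so I guess it has something to do with my project settings.
Here are my project settings (raceretrieval is the name of my project):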
BOT_NAME = 'raceretrieval'
DOWNLOAD_DELAY = 1
CONCURRENT_REQUESTS = 100
SPIDER_MODULES = ['raceretrieval.spiders']
NEWSPIDER_MODULE = 'raceretrieval.spiders'
ITEM_PIPELINES = {
    'raceretrieval.pipelines.RaceValidationPipeline': 1,
    'raceretrieval.pipelines.RaceDistanceValidationPipeline': 2,
    # 'raceretrieval.pipelines.RaceUploadPipeline': 9999
}
If I comment out DOWNLOAD_DELAY = 1 and CONCURRENT_REQUESTS = 100, the spider works as expected. Why is that? I don't understand how they could affect this.
Answer 0 (score: 4)
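I corrected some typos and set the allowed domain correctly, and parse_summary seems to run fine. The URLs are extracted and the parse_compiled results are displayed correctly in the terminal.
The output lines look like this: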
2014-12-29 12:19:05+0100 [example] DEBUG: Crawled (200) <GET http://www.tfrrs.org/results/36288_f.html> (referer: http://www.tfrrs.org/results/36288.html)
<200 http://www.tfrrs.org/results/36288_f.html>
[<Selector xpath="//table[@style='width: 935px;']" data=u'<table width="0" border="0" cellspacing='>, <Selector xpath="//table[@style='width: 935px;']" data=u'<table width="0" border="0" cellspacing='> .....
Here is the corrected code:
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["tfrrs.org"]
    start_urls = (
        'http://www.tfrrs.org/results_search.html?page=0&sport=track&title=1&go=1',
    )

    def __init__(self, *args, **kwargs):
        super(ExampleSpider, self).__init__(*args, **kwargs)
        self.start_urls = ['http://www.tfrrs.org/results_search.html?page=0&sport=track&title=1&go=1',]

    # works without issue
    def parse(self, response):
        races = response.xpath("//table[@width='100%']").xpath(".//a[starts-with(@href, 'http://www.tfrrs.org/results/')]/@href").extract()
        #callback = self.parse_trackfieldsummary
        for race in races:
            yield scrapy.Request(race, callback=self.parse_summary)
        pass

    # works without issue
    def parse_summary(self, response):
        baseurl = 'http://www.tfrrs.org/results/'
        results = response.xpath("//div[@class='data']").xpath('.//a[@style="margin-left: 20px;"]/@href').extract()
        for result in results:
            #print(baseurl+result) # shows that url is correct everytime
            yield scrapy.Request(baseurl+result, callback=self.parse_compiled)

    # is never fired or shown in terminal
    def parse_compiled(self, response):
        print(response)
        results = response.xpath("//table[@style='width: 935px;']")
        print(results)
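One possible reason the allowed_domains change matters: Scrapy's offsite filtering drops requests whose host is neither an allowed domain nor a subdomain of one. The sketch below is a simplified imitation of that check in plain Python, not Scrapy's own code. The function names build_host_regex and is_offsite are made up for illustration, and it is written for Python 3 (under Python 2, as used with Scrapy 0.24, urlparse would come from the urlparse module instead).

```python
import re
from urllib.parse import urlparse


def build_host_regex(allowed_domains):
    """Compile a regex matching each allowed domain and any of its subdomains."""
    if not allowed_domains:
        # No restriction configured: the empty pattern matches every host.
        return re.compile("")
    escaped = "|".join(re.escape(domain) for domain in allowed_domains)
    return re.compile(r"^(.*\.)?(%s)$" % escaped)


def is_offsite(url, allowed_domains):
    """Return True if a request to `url` would be treated as offsite."""
    host = urlparse(url).hostname or ""
    return not build_host_regex(allowed_domains).search(host)


# The question's spider allowed only example.org but requested tfrrs.org pages.
print(is_offsite("http://www.tfrrs.org/results/36288_f.html", ["example.org"]))
print(is_offsite("http://www.tfrrs.org/results/36288_f.html", ["tfrrs.org"]))
```

With allowed_domains set to ["example.org"], the tfrrs.org URL counts as offsite; with ["tfrrs.org"], the www subdomain is accepted.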
Answer 1 (score: -1)
These may solve the problem you're having:
find / -type d -name "__pycache__" -delete 2>/dev/null
find / -name '*.pyc' -delete
find / -name '*.egg'
Edit:
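If that doesn't solve it, the problem may actually be that the download delay is being applied to every request right up to the last one, so the spider just takes a very long time to yield anything ^^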
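To get a feel for how long a delayed crawl might take, here is a rough back-of-the-envelope sketch. It assumes all requests go to the same domain and are spaced by DOWNLOAD_DELAY, and that Scrapy's default randomization (a wait between 0.5 and 1.5 times the delay) is in effect. The function name estimated_crawl_seconds and the figure of 600 requests are hypothetical, and the time spent actually downloading pages is ignored.

```python
def estimated_crawl_seconds(num_requests, download_delay, randomize=True):
    """Return approximate (low, high) bounds, in seconds, spent waiting on delays."""
    if randomize:
        # Each wait is assumed to fall between 0.5x and 1.5x the configured delay.
        low = 0.5 * download_delay * num_requests
        high = 1.5 * download_delay * num_requests
        return (low, high)
    total = download_delay * num_requests
    return (total, total)


# Hypothetical crawl: 600 result pages with DOWNLOAD_DELAY = 1.
low, high = estimated_crawl_seconds(600, 1)
print("between %.0f and %.0f seconds" % (low, high))
```

Under these assumptions a crawl of a few hundred pages spends several minutes just waiting, so callbacks deep in the chain, such as parse_compiled, can appear not to fire if the spider is stopped early.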