What I want to do is run the scrape every ten seconds and have each run output a csv that reflects the time. Eventually I'd like to schedule these scrapers, but until I can get this part working I don't see much point in moving on. Here is my spider:
import scrapy
import datetime as dt


class EspnSpider(scrapy.Spider):
    name = 'espn'
    allowed_domains = ['games.espn.com']
    start_urls = ['http://games.espn.com/ffl/leaders?&scoringPeriodId=1&seasonId=2018']

    today = dt.datetime.today().strftime('%Y%m%d %H%M%S')
    custom_settings = {
        'FEED_URI': 'filename_{}.csv'.format(today),
        'FEED_FORMAT': 'csv',
    }

    def parse(self, response):
        # response.body.decode('utf8')
        week = response.xpath('//select[@id="scoringPeriods"]//option[@selected = "selected"]//text()').extract_first()
        table = response.xpath('//table[@id="playertable_0"]')
        for player in table.css('tr[id]'):
            item = {
                'id': player.css('::attr(id)').extract_first(),
                'tpos': player.css('td.playertablePlayerName::text').extract_first().replace(u'\xa0', u' ').split(',')[-1].strip(),
                'name': player.css('a.flexpop::text').extract_first(),
                'stats': player.css('td.playertableStat::text').extract()[:-1],
                'week': week,
            }
            yield item

        for next_page in response.xpath('//select[@id="scoringPeriods"]//option[contains(text(), "Week")]//@value')[1:].extract():
            yield scrapy.Request(response.urljoin('leaders?' + next_page), callback=self.parse)
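A side note on the timestamp: as far as I understand, today is evaluated only once, when the class body is executed, so repeated crawls in the same process would all reuse the same FEED_URI. A small sketch of what I mean (the Demo class is just a hypothetical illustration of when a class attribute is evaluated):

import datetime as dt

class Demo:
    # evaluated exactly once, when the class body runs
    today = dt.datetime.today().strftime('%Y%m%d %H%M%S')

print(Demo.today)  # e.g. '20180905 101500'
# ... any later access in the same process returns the same string
print(Demo.today)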
The code I'm using to schedule the scrapes is here:
from twisted.internet import reactor
from espn import EspnSpider
from scrapy.crawler import CrawlerRunner


def run_crawl():
    runner = CrawlerRunner({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    })
    deferred = runner.crawl(EspnSpider)
    deferred.addCallback(reactor.callLater, 10, run_crawl)  # I believe that 10 is in seconds
    return deferred


run_crawl()
reactor.run()
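To make the intended behaviour concrete, this is the bare ten-second loop I'm expecting, sketched in plain Twisted with a hypothetical tick() placeholder where the crawl would go (no Scrapy involved):

from twisted.internet import reactor

def tick():
    print('a scrape would run here')  # placeholder for runner.crawl(EspnSpider)
    reactor.callLater(10, tick)       # re-arm the timer for ten seconds from now

tick()
reactor.run()

That is the cycle I'm trying to reproduce with the CrawlerRunner deferred above.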
The first csv scrapes exactly as I want. But it never scrapes again; it just sits idle. Any ideas? Thanks.