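I'm having trouble getting CrawlSpider rules to work properly in Scrapy 0.22.2 on Python 2.7.3. No matter what I do, the callback specified in my rule never seems to be triggered. I have tried setting my rule to allow everything, but the print statement in my callback method (parse_units) is never called. To that end, I have made sure not to try to override the parse method, as that seems to be a common mistake.
The links to the individual pages that I want the crawler to follow and parse look like this: http://training.gov.au/Training/Details/BSBWOR204A
Here is my Python code: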
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.selector import Selector
import urllib2
import os

qualification_code = raw_input("enter the qualification code\n")

class trainingGovSpider(CrawlSpider):
    name = "trainingGov"
    allowed_domains = ["training.gov.au"]
    # refine this to accept user input
    start_urls = ["http://training.gov.au/Training/Details/"+qualification_code+"?tableUnits-page=1&pageSizeKey=Training_Details_tableUnits&pageSize=100"]
    rules = [Rule(SgmlLinkExtractor(restrict_xpaths=('//a[contains(@title, "View details for unit code")]')), callback='parse_units')]

    def parse(self, response):
        sel = Selector(response)
        # first get the unit outline
        qual_outline = sel.xpath('//a[contains(@title, "Download Qualification in PDF format.")]')
        qual_outline_link = qual_outline.xpath('@href').extract()[0]
        url = "http://training.gov.au" + qual_outline_link
        # get url to qualification outline
        print url
        # next, get link elements which point to each individual unit in the qualification
        sites = sel.xpath('//a[contains(@title, "View details for unit code")]')
        for site in sites:
            title = site.xpath('text()').extract()
            # will need to follow each link
            link = site.xpath('@href').extract()[0]
            # will need to combine with http://training.gov.au
            print link

    # this should allow all individual unit links in a qualification to be parsed, but it isn't being called
    def parse_units(self, response):
        print 'parse_unit called'
        hxs = HtmlXPathSelector(response)
        current_status = hxs.xpath('text()').extract()
        #current_status = hxs.xpath('/html/body/div/div/div/div/div/div[@class="outer"]/div[@class="fieldset"]/div[@class="display-row"]/div[@class="display-row"]/div[@class="display-field"]/span[re:test(., "Current", "i")]').extract()
        print current_status
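The comment in parse notes that each relative link will need to be combined with http://training.gov.au. A minimal sketch of that step, assuming the standard-library urljoin is acceptable in place of string concatenation. The helper name absolute_unit_url and the constant BASE_URL are hypothetical, and the import fallback is only there so the snippet runs on both Python 2 and Python 3:

```python
try:
    from urlparse import urljoin  # Python 2
except ImportError:
    from urllib.parse import urljoin  # Python 3

# Assumption: every href in the unit table is relative to the site root.
BASE_URL = "http://training.gov.au"


def absolute_unit_url(href, base=BASE_URL):
    """Resolve a relative href (hypothetical helper) against the site root."""
    return urljoin(base, href)


print(absolute_unit_url("/Training/Details/BSBSUS501A"))
```

Using urljoin rather than "+" should also cope with hrefs that are already absolute, since an absolute href is returned unchanged.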
The log output is as follows:
scrapy crawl trainingGov
enter the qualification code
CHC52212
2014-04-17 15:30:05+0800 [scrapy] INFO: Scrapy 0.22.2 started (bot: tutorial)
2014-04-17 15:30:05+0800 [scrapy] INFO: Optional features available: ssl, http11
2014-04-17 15:30:05+0800 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'SPIDER_MODULES': ['tutorial.spiders'], 'BOT_NAME': 'tutorial'}
2014-04-17 15:30:05+0800 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-04-17 15:30:05+0800 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-04-17 15:30:05+0800 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-04-17 15:30:05+0800 [scrapy] INFO: Enabled item pipelines:
2014-04-17 15:30:05+0800 [trainingGov] INFO: Spider opened
2014-04-17 15:30:05+0800 [trainingGov] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-04-17 15:30:05+0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6024
2014-04-17 15:30:05+0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6081
2014-04-17 15:30:06+0800 [trainingGov] DEBUG: Crawled (200) <GET http://training.gov.au/Training/Details/CHC52212?tableUnits-page=1&pageSizeKey=Training_Details_tableUnits&pageSize=100> (referer: None)
http://training.gov.au/TrainingComponentFiles/CHC08/CHC52212_R1.pdf
/Training/Details/BSBSUS501A
/Training/Details/BSBWOR403A
/Training/Details/CHCAC317A
/Training/Details/CHCAC318B
/Training/Details/CHCAC416A
/Training/Details/CHCAC417A
/Training/Details/CHCAC507E
/Training/Details/CHCAD402D
/Training/Details/CHCAD504B
/Training/Details/CHCADMIN508B
/Training/Details/CHCAL523D
/Training/Details/CHCAOD402B
/Training/Details/CHCCD401E
/Training/Details/CHCCD412B
/Training/Details/CHCCD516B
/Training/Details/CHCCH301C
/Training/Details/CHCCH427B
/Training/Details/CHCCHILD401B
/Training/Details/CHCCHILD505B
/Training/Details/CHCCOM403A
/Training/Details/CHCCOM504B
/Training/Details/CHCCS426B
/Training/Details/CHCCS427B
/Training/Details/CHCCS502C
/Training/Details/CHCCS503B
/Training/Details/CHCCS505B
/Training/Details/CHCCS512C
/Training/Details/CHCCS513C
/Training/Details/CHCDIS301C
/Training/Details/CHCDIS410A
/Training/Details/CHCDIS507C
/Training/Details/CHCES311B
/Training/Details/CHCES415A
/Training/Details/CHCES502C
/Training/Details/CHCES511B
/Training/Details/CHCHC401C
/Training/Details/CHCICS409A
/Training/Details/CHCINF407D
/Training/Details/CHCINF408C
/Training/Details/CHCINF505D
/Training/Details/CHCMH301C
/Training/Details/CHCMH402B
/Training/Details/CHCMH411A
/Training/Details/CHCNET501C
/Training/Details/CHCNET503D
/Training/Details/CHCORG405E
/Training/Details/CHCORG406C
/Training/Details/CHCORG423C
/Training/Details/CHCORG428A
/Training/Details/CHCORG501B
/Training/Details/CHCORG506E
/Training/Details/CHCORG525D
/Training/Details/CHCORG607D
/Training/Details/CHCORG610B
/Training/Details/CHCORG611C
/Training/Details/CHCPA402B
/Training/Details/CHCPR510B
/Training/Details/CHCSD512C
/Training/Details/CHCSW401A
/Training/Details/CHCSW402B
/Training/Details/CHCYTH401B
/Training/Details/CHCYTH402C
/Training/Details/CHCYTH506B
/Training/Details/HLTFA311A
/Training/Details/HLTFA412A
/Training/Details/HLTHIR403C
/Training/Details/HLTHIR404D
/Training/Details/HLTWHS401A
/Training/Details/PSPGOV517A
/Training/Details/PSPMNGT605B
/Training/Details/SISCCRD302A
/Training/Details/SRXGOV004B
/Training/Details/TAEDEL402A
2014-04-17 15:30:06+0800 [trainingGov] INFO: Closing spider (finished)
2014-04-17 15:30:06+0800 [trainingGov] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 310,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 17347,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2014, 4, 17, 7, 30, 6, 96000),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2014, 4, 17, 7, 30, 5, 522000)}
2014-04-17 15:30:06+0800 [trainingGov] INFO: Spider closed (finished)
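A possibly relevant detail in the stats above: downloader/request_count is 1, so only the start URL was ever requested and no requests for the unit links were scheduled. The Scrapy documentation warns that CrawlSpider uses its own parse method to implement the rule logic, so a subclass that defines parse may bypass the rules entirely. The following is a toy model of that kind of dispatch in plain Python, not Scrapy's actual code. All names in it (ToyCrawlSpider, apply_rules, is_unit_link, RulesIntact, ParseOverridden) are hypothetical and only illustrate the mechanism:

```python
class ToyCrawlSpider(object):
    """Toy stand-in for a rule-driven spider; NOT Scrapy's real implementation."""

    rules = []  # list of (link_filter, callback_name) pairs

    def parse(self, links):
        # The base class relies on parse() itself to apply the rules.
        return self.apply_rules(links)

    def apply_rules(self, links):
        results = []
        for link_filter, callback_name in self.rules:
            callback = getattr(self, callback_name)
            for link in links:
                if link_filter(link):
                    results.append(callback(link))
        return results


def is_unit_link(link):
    return link.startswith("/Training/Details/")


class RulesIntact(ToyCrawlSpider):
    rules = [(is_unit_link, "parse_units")]

    def parse_units(self, link):
        return "parse_units called for " + link


class ParseOverridden(RulesIntact):
    def parse(self, links):
        # Overriding parse() replaces the rule dispatch, so no callback runs.
        return []


links = ["/Training/Details/BSBSUS501A", "/Training/Details/BSBWOR403A"]
print(RulesIntact().parse(links))
print(ParseOverridden().parse(links))
```

In this toy model the callback fires only while the inherited parse is left intact. If the same holds for the spider above, moving the start-page handling out of parse (for example into a differently named method) would be one thing to try.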
Thanks!