I'm trying to scrape job data from this xml feed.
I'm running into an issue where, when I launch my spider, I get a valid 200 HTTP response for the start_url, but no data gets scraped; 0 pages and 0 items are crawled.
The nodes I'm trying to iterate over are the e:Entities nodes, which contain the e:Entity nodes that hold the job data.
I really don't know what I'm doing wrong. I followed the scrapy guide on XMLFeedSpiders here to a tee.
I suspect it may have something to do with how heavily structured the XML is, and in particular with the numerous namespaces in the feed. Is there something wrong with my namespaces?
I'm almost positive I've picked the correct iterator value as well as the parse_node XPath selector.
Here is my XMLFeedSpider code:
class Schneider_XML_Spider(XMLFeedSpider):
    name = "Schneider"

    namespaces = [
        ('soap', 'http://schemas.xmlsoap.org/soap/envelope/'),
        ('ns1', 'http://www.taleo.com/ws/tee800/2009/01/find'),
        ('root', 'http://www.taleo.com/ws/tee800/2009/01/find'),
        ('e', 'http://www.taleo.com/ws/tee800/2009/01')
    ]

    allowed_domains = ['http://schneiderjobs.com']
    start_urls = ['http://schneiderjobs.com/driving-requisitions/scrape/1']

    iterator = 'iternodes'  # use an XML iterator, called iternodes
    itertag = 'e:Entities'  # loop over "e:Entities" > "e:Entity" nodes

    # parse_node gets called on every node within itertag
    def parse_node(self, response, node):
        print "we are now scraping"
        item = Schneider_XML_Spider.itemFile.XMLScrapyPrototypeItem()
        item['rid'] = node.xpath('e:Entity/e:ContestNumber').extract()
        print item['rid']
        return item
Here is my execution log:
(Note: I've added some spacing around the important parts of the log.)
C:\PythonFiles\spidersClean\app\spiders\Scrapy\xmlScrapyPrototype\1.0\spiders>scrapy crawl Schneider
2015-02-18 10:31:46-0500 [scrapy] INFO: Scrapy 0.24.4 started (bot: schneiderXml)
2015-02-18 10:31:46-0500 [scrapy] INFO: Optional features available: ssl, http11
2015-02-18 10:31:46-0500 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'spiders', 'SPIDER_MODULES': ['spiders'], 'BOT_NAME': 'schneiderXml'}
2015-02-18 10:31:46-0500 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-02-18 10:31:47-0500 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-02-18 10:31:47-0500 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-02-18 10:31:47-0500 [scrapy] INFO: Enabled item pipelines:
2015-02-18 10:31:47-0500 [Schneider] INFO: Spider opened
2015-02-18 10:31:47-0500 [Schneider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-02-18 10:31:47-0500 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-02-18 10:31:47-0500 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2015-02-18 10:31:52-0500 [Schneider] DEBUG: Crawled (200) <GET http://schneiderjobs.com/driving-requisitions/scrape/1> (referer: None)
2015-02-18 10:31:52-0500 [Schneider] INFO: Closing spider (finished)
2015-02-18 10:31:52-0500 [Schneider] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 245,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 1360566,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 2, 18, 15, 31, 52, 126000),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2015, 2, 18, 15, 31, 47, 89000)}
2015-02-18 10:31:52-0500 [Schneider] INFO: Spider closed (finished)
I've looked at other SO threads on XMLFeedSpiders, but they didn't help:
How to scrape xml feed with xmlfeedspider
How to scrape xml urls with scrapy
Why isn't XMLFeedSpider failing to iterate through the designated nodes?
Has anyone solved a problem like this before?
Answer 0 (score: 0)
I've figured this one out!
I had assumed that itertag finds the node to loop over by applying a //xml-node XPath selector to locate the parent node of the target loop. That is actually not the case.
You need to use an explicit XPath for the itertag you want to loop over. In my case, the following code changes got my spider working:
class Schneider_XML_Spider(XMLFeedSpider):
    name = "Schneider"

    # list the URI of every namespace so that the xml Selector knows how to handle each of them
    namespaces = [
        ('soap', 'http://schemas.xmlsoap.org/soap/envelope/'),
        ('ns1', 'http://www.taleo.com/ws/tee800/2009/01/find'),
        ('root', 'http://www.taleo.com/ws/tee800/2009/01/find'),
        ('e', 'http://www.taleo.com/ws/tee800/2009/01')
    ]

    # returns a scrapy.selector.Selector that processes an XmlResponse
    iterator = 'xml'
    # point to the tag that contains all the inner nodes you want to process
    itertag = "XMP/soap:Envelope/soap:Body/ns1:findPartialEntitiesResponse/root:Entities"

    allowed_domains = ['http://schneiderjobs.com']
    start_urls = ['http://schneiderjobs.com/driving-requisitions/scrape/1']
You must make sure that every namespace appearing in your itertag's XPath is defined in your namespaces array.
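For context, one way to sanity-check those prefix/URI pairings before touching the spider is the scrapy shell; the sketch below assumes the same namespace pairs as above and uses Selector.register_namespace, which binds a prefix to a URI for XPath queries:

scrapy shell "http://schneiderjobs.com/driving-requisitions/scrape/1"
>>> # bind the prefixes used in the itertag to their URIs (same pairs as the namespaces list)
>>> response.selector.register_namespace('soap', 'http://schemas.xmlsoap.org/soap/envelope/')
>>> response.selector.register_namespace('ns1', 'http://www.taleo.com/ws/tee800/2009/01/find')
>>> response.selector.register_namespace('root', 'http://www.taleo.com/ws/tee800/2009/01/find')
>>> response.selector.register_namespace('e', 'http://www.taleo.com/ws/tee800/2009/01')
>>> # an empty list here means the XPath or one of the prefix/URI pairings is wrong
>>> response.selector.xpath('//root:Entities/e:Entity')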
Also, if you're trying to get the inner text of a node in your parse_node method, remember to add /text() to the end of your XPath extractor. For example:
item['rid'] = node.xpath('e:Entity/e:ContestNumber/text()').extract()
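Putting it together, parse_node might look something like this sketch (the item class name is taken from the question's code; each node passed in is one root:Entities element):

    def parse_node(self, response, node):
        item = XMLScrapyPrototypeItem()
        # /text() returns the inner text rather than the element node itself
        item['rid'] = node.xpath('e:Entity/e:ContestNumber/text()').extract()
        return item

Note that extract() returns a list, so with multiple e:Entity children under Entities this collects every ContestNumber at once.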
I hope this answer helps anyone who runs into this problem in the future.