Scrapy XMLFeedSpider gets a 200 response code but does not iterate through the selected nodes

Asked: 2015-02-18 16:22:58

Tags: python xml xpath xml-parsing scrapy

I'm trying to scrape job data from this xml feed.

The problem I'm running into is that when I launch my spider, I get a valid 200 HTTP response for the start_url, but no data gets scraped: 0 pages crawled and 0 items scraped.

The node I'm trying to iterate over is e:Entities, which contains the e:Entity nodes that hold the job data.

I really don't know what I'm doing wrong. I followed the Scrapy guide on XMLFeedSpiders here to a tee.

I suspect it may have something to do with how heavily structured the XML is, and in particular with the numerous namespaces the XML uses. Is there something wrong with my namespaces?

I'm almost positive that I've chosen the correct iternodes value as well as the correct parse_node XPath selector.

Here is my XMLFeedSpider code:

from scrapy.contrib.spiders import XMLFeedSpider


class Schneider_XML_Spider(XMLFeedSpider):
    name = "Schneider"
    namespaces = [
        ('soap', 'http://schemas.xmlsoap.org/soap/envelope/'),
        ('ns1', 'http://www.taleo.com/ws/tee800/2009/01/find'),
        ('root', 'http://www.taleo.com/ws/tee800/2009/01/find'),
        ('e', 'http://www.taleo.com/ws/tee800/2009/01')
    ]

    allowed_domains = ['http://schneiderjobs.com']
    start_urls = ['http://schneiderjobs.com/driving-requisitions/scrape/1']

    iterator = 'iternodes'  # use an XML iterator called iternodes
    itertag = 'e:Entities'  # loop over the "e:Entities" node that contains the "e:Entity" nodes

    # parse_node gets called on every node matched by itertag
    def parse_node(self, response, node):
        print "we are now scraping"

        item = Schneider_XML_Spider.itemFile.XMLScrapyPrototypeItem()

        item['rid'] = node.xpath('e:Entity/e:ContestNumber').extract()
        print item['rid']
        return item

Here is my execution log:

(Note: I've added some whitespace around the important parts of the execution log.)

    C:\PythonFiles\spidersClean\app\spiders\Scrapy\xmlScrapyPrototype\1.0\spiders>scrapy crawl Schneider
2015-02-18 10:31:46-0500 [scrapy] INFO: Scrapy 0.24.4 started (bot: schneiderXml)
2015-02-18 10:31:46-0500 [scrapy] INFO: Optional features available: ssl, http11
2015-02-18 10:31:46-0500 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'spiders', 'SPIDER_MODULES': ['spiders'], 'BOT_NAME': 'schneiderXml'}
2015-02-18 10:31:46-0500 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-02-18 10:31:47-0500 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMi
ddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-02-18 10:31:47-0500 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-02-18 10:31:47-0500 [scrapy] INFO: Enabled item pipelines:

2015-02-18 10:31:47-0500 [Schneider] INFO: Spider opened
2015-02-18 10:31:47-0500 [Schneider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-02-18 10:31:47-0500 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-02-18 10:31:47-0500 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2015-02-18 10:31:52-0500 [Schneider] DEBUG: Crawled (200) <GET http://schneiderjobs.com/driving-requisitions/scrape/1> (referer: None)
2015-02-18 10:31:52-0500 [Schneider] INFO: Closing spider (finished)
2015-02-18 10:31:52-0500 [Schneider] INFO: Dumping Scrapy stats:

        {'downloader/request_bytes': 245,
         'downloader/request_count': 1,
         'downloader/request_method_count/GET': 1,
         'downloader/response_bytes': 1360566,
         'downloader/response_count': 1,
         'downloader/response_status_count/200': 1,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2015, 2, 18, 15, 31, 52, 126000),
         'log_count/DEBUG': 3,
         'log_count/INFO': 7,
         'response_received_count': 1,
         'scheduler/dequeued': 1,
         'scheduler/dequeued/memory': 1,
         'scheduler/enqueued': 1,
         'scheduler/enqueued/memory': 1,
         'start_time': datetime.datetime(2015, 2, 18, 15, 31, 47, 89000)}
2015-02-18 10:31:52-0500 [Schneider] INFO: Spider closed (finished)

I've already looked at other SO threads on XMLFeedSpiders, but they didn't help: How to scrape xml feed with xmlfeedspider

How to scrape xml urls with scrapy

Why is my XMLFeedSpider failing to iterate through the designated nodes?

Has anyone solved a problem like this before?

1 Answer:

Answer 0 (score: 0):

I've figured this one out!

I had assumed that itertag finds the node to use as the parent of the loop by applying a descendant-style //xml-node XPath selector, i.e. searching for it anywhere in the document. In reality, that is not the case.

You need to use an explicit XPath for the itertag you want to loop over. In my case, the following code changes got my spider working:

from scrapy.contrib.spiders import XMLFeedSpider


class Schneider_XML_Spider(XMLFeedSpider):

    name = "Schneider"

    # list the URIs of all the namespaces so that the xml Selector
    # knows how to handle each of them
    namespaces = [
        ('soap', 'http://schemas.xmlsoap.org/soap/envelope/'),
        ('ns1', 'http://www.taleo.com/ws/tee800/2009/01/find'),
        ('root', 'http://www.taleo.com/ws/tee800/2009/01/find'),
        ('e', 'http://www.taleo.com/ws/tee800/2009/01')
    ]

    # returns a scrapy.selector.Selector that processes an XmlResponse
    iterator = 'xml'

    # point to the tag that contains all the inner nodes you want to process
    itertag = "XMP/soap:Envelope/soap:Body/ns1:findPartialEntitiesResponse/root:Entities"

    allowed_domains = ['http://schneiderjobs.com']
    start_urls = ['http://schneiderjobs.com/driving-requisitions/scrape/1']
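
For completeness, here is a parse_node to pair with the class above. This is a sketch adapted from the question's version, already using the /text() fix described below:

    def parse_node(self, response, node):
        # with iterator = 'xml', node is a Selector positioned on the matched
        # root:Entities element, with the prefixes registered above available
        item = Schneider_XML_Spider.itemFile.XMLScrapyPrototypeItem()
        item['rid'] = node.xpath('e:Entity/e:ContestNumber/text()').extract()
        return item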

You have to make sure that every namespace that appears in your itertag's XPath is defined in your namespaces array.
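
If you want to verify the prefixes by hand first, here is a minimal debugging sketch (not part of the original answer; xml_body is an assumed variable holding the raw feed, e.g. response.body), registering the same prefixes on a plain Selector and evaluating the descendant form of the itertag path:

from scrapy.selector import Selector

sel = Selector(text=xml_body, type='xml')  # xml_body: the raw XML feed

# register every prefix that the itertag XPath uses; an XPath that
# mentions an unregistered prefix fails instead of matching any nodes
sel.register_namespace('soap', 'http://schemas.xmlsoap.org/soap/envelope/')
sel.register_namespace('ns1', 'http://www.taleo.com/ws/tee800/2009/01/find')
sel.register_namespace('root', 'http://www.taleo.com/ws/tee800/2009/01/find')

# an empty list here means the path (or one of the namespaces) is wrong
print sel.xpath('//soap:Envelope/soap:Body/ns1:findPartialEntitiesResponse/root:Entities')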

Also, if you're trying to get a node's inner text in your parse_node method, be sure to add /text() to the end of your XPath extractor. For example:

item['rid'] = node.xpath('e:Entity/e:ContestNumber/text()').extract()
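
To illustrate the difference, roughly (the exact serialization depends on the feed), the same XPath without /text() returns the whole serialized element instead of its contents:

node.xpath('e:Entity/e:ContestNumber').extract()
# -> [u'<ContestNumber xmlns="...">...</ContestNumber>']  (whole elements)

node.xpath('e:Entity/e:ContestNumber/text()').extract()
# -> [u'...']  (just the inner text of each node)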

I hope this answer helps anyone who runs into this question in the future.