Scrapy prints the fields but does not populate the XML file

Time: 2015-04-24 22:19:43

Tags: python xml xpath scrapy scrapy-spider

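I have a problem: my spider prints the extracted fields correctly, but it does not write any of them to the XML file.

The terminal output is as follows: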

[u'Tove'] [u'Jani'] [u'Reminder'] [u"Don't forget me this weekend!"]

However, the output file site_products.xml ends up like this (which is wrong, it contains no data):

<?xml version="1.0" encoding="utf-8"?>
<items></items>

spider.py

from scrapy.contrib.spiders import XMLFeedSpider
from crawler.items import CrawlerItem

class SiteSpider(XMLFeedSpider):
    name = 'site'
    allowed_domains = ['www.w3schools.com']
    start_urls = ['http://www.w3schools.com/xml/note.xml']
    itertag = 'note'

    def parse_node(self, response, selector):
        to = selector.xpath('//to/text()').extract()
        who = selector.xpath('//from/text()').extract()
        heading = selector.xpath('//heading/text()').extract()
        body = selector.xpath('//body/text()').extract()
        return item

pipelines.py

from scrapy import signals
from scrapy.contrib.exporter import XmlItemExporter

class XmlExportPipeline(object):

    def __init__(self):
        self.files = {}

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        file = open('%s_products.xml' % spider.name, 'w+b')
        self.files[spider] = file
        self.exporter = XmlItemExporter(file)
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

items.py

import scrapy


class CrawlerItem(scrapy.Item):
    to = scrapy.Field()
    who = scrapy.Field()
    heading = scrapy.Field()
    body = scrapy.Field()
    pass

settings.py

BOT_NAME = 'crawler'
SPIDER_MODULES = ['crawler.spiders']
NEWSPIDER_MODULE = 'crawler.spiders'
ITEM_PIPELINES = {'crawler.pipelines.XmlExportPipeline': 300,}

Any help with this would be greatly appreciated.

1 answer:

Answer 0 (score: 1):

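You need to instantiate a CrawlerItem in your parse_node() method, assign the extracted values to its fields, and return it. As posted, the spider only stores the values in local variables and never creates an item, so no populated item ever reaches the pipeline.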