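Serializing Scrapy items for XML output

Time: 2017-09-22 19:03:14

Tags: python xml scrapy xml-serialization

I am new to Scrapy, and I am looking for a way to serialize Scrapy items so that I can add attributes to my XML output, making it look like this: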

<field name='example'> i have some data scraped here </field>

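I am trying to find a way to add, for example, the "name" attribute. I know this should be possible by overriding the export_item() method of the XmlItemExporter class, but so far I have had no luck. This is what my XmlExportPipeline looks like at the moment: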

from scrapy.exporters import XmlItemExporter

class XmlExportPipeline(object):

    def open_spider(self, spider):
        self.file = open('%s_products.xml' % spider.name, 'w+b')
        self.exporter = XmlItemExporter(self.file, item_element='field', root_element='items')
        self.exporter.start_exporting()

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

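Also, so far all of my data ends up as separate fields of my item, but ideally I would like some of those fields to be attributes of other fields.

1 Answer:

Answer 0: (score: 0)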

You just need to change XmlItemExporter and create a customized one. Create an exporters.py in your project and add the following code:

import six
from scrapy.exporters import XmlItemExporter
from scrapy.utils.python import is_listlike


class MyXmlExportPipeline(XmlItemExporter):

    def _export_xml_field(self, name, serialized_value, depth):
        self._beautify_indent(depth=depth)
        # Emit every field as <field name="..."> instead of <name>
        self.xg.startElement("field", {"name": name})
        if hasattr(serialized_value, 'items'):
            # Dict-like value: export each entry as a nested field
            self._beautify_newline()
            for subname, value in serialized_value.items():
                self._export_xml_field(subname, value, depth=depth+1)
            self._beautify_indent(depth=depth)
        elif is_listlike(serialized_value):
            # List-like value: export each element as a nested 'value' field
            self._beautify_newline()
            for value in serialized_value:
                self._export_xml_field('value', value, depth=depth+1)
            self._beautify_indent(depth=depth)
        elif isinstance(serialized_value, six.text_type):
            self._xg_characters(serialized_value)
        else:
            self._xg_characters(str(serialized_value))
        self.xg.endElement("field")
        self._beautify_newline()

The only two changes I made were to change

self.xg.startElement(name, {})
....
self.xg.endElement(name)

from the original exporter to

self.xg.startElement("field", {"name": name})
....
self.xg.endElement("field")

Then update your settings.py and add the following (here `so` is the project's package name):

FEED_EXPORTERS = {
    'xml': 'so.exporters.MyXmlExportPipeline'
}

Then I created a simple scraper to test the output:

from scrapy import Spider


class XMLExport(Spider):
    name = "xml"

    start_urls = ["http://www.tarunlalwani.com"]

    def parse(self, response):
        yield {"first_name": "tarun", "last_name": "lalwani"}

Test it using

scrapy crawl xml -o test.xml

and the output XML file is

<?xml version="1.0" encoding="utf-8"?>
<items>
<item><field name="first_name">tarun</field><field name="last_name">lalwani</field></item>
</items>
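
The effect of passing an attribute dict to startElement can also be checked without running a crawl. The `xg` object used by XmlItemExporter is an `XMLGenerator` from the standard library, so a rough sketch like the one below reproduces the same element shape. The helper name `export_fields` and its parameters are made up for illustration and are not part of Scrapy; nested and list values are not handled here.

```python
import io
from xml.sax.saxutils import XMLGenerator


def export_fields(item, item_element="item", root_element="items"):
    """Write one flat dict as <field name="key">value</field> elements.

    Hypothetical helper for illustration only; it mimics the shape of the
    customized exporter's output using the standard library alone.
    """
    buf = io.StringIO()
    xg = XMLGenerator(buf, encoding="utf-8")
    xg.startDocument()
    xg.startElement(root_element, {})
    xg.startElement(item_element, {})
    for name, value in item.items():
        # The second argument is the attribute mapping for the element
        xg.startElement("field", {"name": name})
        xg.characters(str(value))
        xg.endElement("field")
    xg.endElement(item_element)
    xg.endElement(root_element)
    xg.endDocument()
    return buf.getvalue()


xml_text = export_fields({"first_name": "tarun", "last_name": "lalwani"})
print(xml_text)
```

Because the attribute mapping is an ordinary dict, further attributes could be added to it in the same way, which is one possible starting point for turning some item fields into attributes of others.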
