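Custom CSV headers with Scrapy's CsvItemExporter

Time: 2019-05-24 09:00:59

Tags: python python-3.x scrapy

I'm trying to parse XML and convert it to CSV. The tricky part is that the headers must exactly match the terms specified in a third-party CSV parser's documentation, and those headers contain spaces between words, e.g. "Item title", "Item description", etc.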

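Because Items are defined as variables in items.py, I can't create Item fields whose names contain spaces, i.e. the following is not valid Python:

Item title = scrapy.Field()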

I tried adding this to settings.py:

FEED_EXPORT_FIELDS = ["Item title", "Item description"]

This changes the CSV headers, but they then no longer match the Item fields, so no data is written to the .csv.
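A minimal standard-library sketch of what seems to be happening (this is not Scrapy's actual code; `export_rows` is a hypothetical helper): the same list of names is used both as the header row and as the keys for looking up each value, so renamed headers find nothing in the item.

```python
import csv
import io

item = {"id": "12345", "title": "Lorem Ipsum", "description": "Lorem ipsum dolor sit amet"}


def export_rows(items, fields_to_export):
    """Write a header row, then look each field name up in every item."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(fields_to_export)
    for it in items:
        # A name that is not a key of the item falls back to an empty string.
        writer.writerow([it.get(name, "") for name in fields_to_export])
    return buffer.getvalue()


# Header names that do not exist as item keys: header is written, data is blank.
mismatched = export_rows([item], ["Item title", "Item description"])
# Header names that match the item keys: data is written as expected.
matched = export_rows([item], ["title", "description"])
```

With the mismatched names the header line is written but every cell under it is empty, which is the symptom described above. The spider that produces the items: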

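    from scrapy.spiders import XMLFeedSpider

    from myproject.items import FeedItem  # "myproject" is an assumed package name


    class MySpider(XMLFeedSpider):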
        name = 'example'
        allowed_domains = ['example.com']
        start_urls = ['http://example.com/feed.xml']
        itertag = 'item'

        def parse_node(self, response, node):
            item = FeedItem()
            item['id'] = node.xpath('//*[name()="g:id"]/text()').get()
            item['title'] = node.xpath('//*[name()="g:title"]/text()').get()
            item['description'] = node.xpath('//*[name()="g:description"]/text()').get()

            return item

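The parser works fine and I get all the data I need. The only problem is the CSV headers.

Is there an easy way to add custom headers that don't match the Item field names and can contain several words?

The output I currently get: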

id, title, description
12345, Lorem Ipsum, Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
12346, Quick Fox, The quick brown fox jumps over the lazy dog.

The desired output should look like this:

ID, Item title, Item description
12345, Lorem Ipsum, Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
12346, Quick Fox, The quick brown fox jumps over the lazy dog.

Test input:

<rss>
<channel>
  <title>Example</title>
  <link>http://www.example.com</link>
  <description>Description of Example.com</description>
        <item>
            <g:id>12345</g:id>
            <g:title>Lorem Ipsum</g:title>
            <g:description>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.</g:description>
        </item>
        <item>
            <g:id>12346</g:id>
            <g:title>Quick Fox</g:title>
            <g:description>The quick brown fox jumps over the lazy dog.</g:description>
        </item>
</channel>
</rss>

Here is the content of items.py:

import scrapy

class FeedItem(scrapy.Item):
    id = scrapy.Field()
    title = scrapy.Field()
    description = scrapy.Field()

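2 answers:

Answer 0: (score: 1)

You can create your own CSV exporter! Ideally, you would extend the current exporter and override the method that writes the header row: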

# exporters.py 
from scrapy.exporters import CsvItemExporter

class MyCsvItemExporter(CsvItemExporter):
    # maps Item field names to the header text written to the CSV
    header_map = {
        'id': 'ID',
        'title': 'Item title',
        'description': 'Item description',
    }

    def _write_headers_and_set_fields_to_export(self, item):
        if not self.include_headers_line:
            return
        # this is the parent logic taken from parent class
        if not self.fields_to_export:
            if isinstance(item, dict):
                # for dicts try using fields of the first item
                self.fields_to_export = list(item.keys())
            else:
                # use fields declared in Item
                self.fields_to_export = list(item.fields.keys())
        headers = list(self._build_row(self.fields_to_export))

        # here we add our own extra mapping
        # map headers to our value
        headers = [self.header_map.get(header, header) for header in headers]
        self.csv_writer.writerow(headers)

Then activate it in your settings:

FEED_EXPORTERS = {
    'csv': 'myproject.exporters.MyCsvItemExporter',
}
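The idea behind the exporter above can be reduced to a standard-library sketch. This is not Scrapy code: `rows_to_csv` and `HEADER_MAP` are hypothetical names, and the full mapping is assumed from the headers the question asks for. Values are still looked up by the internal field names; only the header row is translated.

```python
import csv
import io

# Internal field name -> header text expected by the third-party parser (assumed).
HEADER_MAP = {"id": "ID", "title": "Item title", "description": "Item description"}


def rows_to_csv(items, fields, header_map):
    """Export items to CSV text, renaming only the header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # Header row uses the display names; unmapped fields keep their own name.
    writer.writerow([header_map.get(name, name) for name in fields])
    for item in items:
        # Data rows are still looked up by the internal field names.
        writer.writerow([item.get(name, "") for name in fields])
    return buffer.getvalue()


items = [
    {
        "id": "12346",
        "title": "Quick Fox",
        "description": "The quick brown fox jumps over the lazy dog.",
    }
]
output = rows_to_csv(items, ["id", "title", "description"], HEADER_MAP)
```

Because the lookup keys never change, the spider and items.py can stay exactly as they are in the question.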

Answer 1: (score: 0)

You can use the built-in dict type as the item and use the required CSV header values as the dictionary keys:

    def parse_node(self, response, node):
        item = dict() #item = {}
        item['ID'] = node.xpath('//*[name()="g:id"]/text()').get()
        item['Item title'] = node.xpath('//*[name()="g:title"]/text()').get()
        item['Item description'] = node.xpath('//*[name()="g:description"]/text()').get()

        return item #yield item
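To see why this works, here is a small sketch with the standard library's `csv.DictWriter` standing in for the exporter (an illustration only, not what Scrapy runs internally): when the dict keys already are the header text, the header row and the value lookup agree without any mapping.

```python
import csv
import io

# Items shaped like the dicts returned by parse_node above.
items = [
    {"ID": "12345", "Item title": "Lorem Ipsum", "Item description": "Lorem ipsum dolor sit amet"},
    {"ID": "12346", "Item title": "Quick Fox", "Item description": "The quick brown fox jumps over the lazy dog."},
]

buffer = io.StringIO()
# The keys of the first item serve as both column order and header text.
writer = csv.DictWriter(buffer, fieldnames=list(items[0].keys()), lineterminator="\n")
writer.writeheader()
writer.writerows(items)
csv_text = buffer.getvalue()
```

With dict items, a FEED_EXPORT_FIELDS list such as ["ID", "Item title", "Item description"] should also match the keys, so it can be used to pin the column order.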