Question

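I have a spider (below) that I want to run from a cron job about every 10 days. However, on every run after the first, it writes the field headers again instead of just appending the items under the corresponding fields in the CSV. How can I set it up so that there is a single set of field headers at the top, with all the data below it, no matter how many times I run it?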

import scrapy

class Wotd(scrapy.Item):
    word = scrapy.Field()
    definition = scrapy.Field()
    sentence = scrapy.Field()
    translation = scrapy.Field()


class WotdSpider(scrapy.Spider):
    name = 'wotd'
    # allowed_domains takes bare domain names, not URLs or paths
    allowed_domains = ['www.spanishdict.com']
    start_urls = ['http://www.spanishdict.com/wordoftheday/']
    custom_settings = {
        # specifies exported fields and their order
        'FEED_EXPORT_FIELDS': ['word','definition','sentence','translation']
    }

    # parse must be a method of the spider class, so it is indented under it
    def parse(self, response):
        jobs = response.xpath('//div[@class="sd-wotd-text"]')
        for job in jobs:
            item = Wotd()
            item['word'] = job.xpath('.//a[@class="sd-wotd-headword-link"]/text()').extract_first()
            item['definition'] = job.xpath('.//div[@class="sd-wotd-translation"]/text()').extract_first()
            item['sentence'] = job.xpath('.//div[@class="sd-wotd-example-source"]/text()').extract_first()
            item['translation'] = job.xpath('.//div[@class="sd-wotd-example-translation"]/text()').extract_first()
            yield item

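From what I've read in the Scrapy docs, it looks like I may have to use the CsvItemExporter class and set include_headers_line = False, but I'm not sure where in the project structure to add that class.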

Answer 1

First, you haven't shared your current project structure, so it's hard to say exactly where this should go with a concrete example.

Let's assume your project is named my_project. Under the main project directory (the one containing settings.py), create a file exporters.py with the following content:

import scrapy.exporters

class NoHeaderCsvItemExporter(scrapy.exporters.CsvItemExporter):
    def __init__(self, file, join_multivalued=', ', **kwargs):
        super(NoHeaderCsvItemExporter, self).__init__(file=file, include_headers_line=False, join_multivalued=join_multivalued, **kwargs)

The NoHeaderCsvItemExporter class inherits from the standard CSV exporter and simply specifies that we don't want a header line in the output.

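Next, you have to specify the new exporter class for the CSV format, either in settings.py or in the spider's custom_settings. Following your current approach with the latter option, it would be: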

custom_settings = {
    'FEED_EXPORT_FIELDS': ['word','definition','sentence','translation'],
    'FEED_EXPORTERS': {
        'csv': 'my_project.exporters.NoHeaderCsvItemExporter',
    }
}
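
If you would rather configure this for the whole project, the same two settings can go in settings.py instead. A minimal sketch, assuming the project module is named my_project as above (values in a spider's custom_settings take precedence over settings.py, so remove them from the spider if you go this route):

```python
# my_project/settings.py (only the relevant part)

# Exported fields and their column order in the CSV.
FEED_EXPORT_FIELDS = ['word', 'definition', 'sentence', 'translation']

# Use the header-less exporter for the 'csv' feed format.
FEED_EXPORTERS = {
    'csv': 'my_project.exporters.NoHeaderCsvItemExporter',
}
```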

Note that with this class, no header line is ever written to the CSV, not even on the first export.
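
If you do want the header exactly once, one option is to decide whether to write it by checking if the output file already has content. The sketch below shows that idea with only the standard library, outside of Scrapy. The function name append_items and the file name wotd.csv are hypothetical, and wiring the same check into a custom exporter or item pipeline is left out here.

```python
import csv
import os


def append_items(path, items, fieldnames):
    """Append dict rows to a CSV file, writing the header only once."""
    # A missing or zero-length file means this is the first export.
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    # newline='' lets the csv module control line endings itself.
    with open(path, 'a', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if write_header:
            writer.writeheader()
        writer.writerows(items)
```

Calling append_items('wotd.csv', rows, ['word', 'definition', 'sentence', 'translation']) on each run would then add only data rows after the first call, as long as the field list stays the same between runs.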

Scrapy CSV output repeating fields
