Question

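So far my web scraper works fine. The logic is: it scrapes the target site and, depending on the content, it may follow a link to another site to get more information.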

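On the second site there may be one or more extra pieces of useful data to scrape. So, bottom line, the .csv may sometimes need more columns scraped and saved.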

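I wrote the small piece of code below to check this, and it looks like the first "yield" defines the number of columns. You can test it by changing the < sign to > in the comparison below.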

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'Quotes'
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ('http://quotes.toscrape.com/',
                 )

    def parse(self, response):
        # Each quote on the page sits in an element with class "quote".
        quotes = response.xpath('//*[@class="quote"]')

        i = 0
        for quote in quotes:

            i = i + 1

            text = quote.xpath('.//*[@class="text"]/text()').extract_first()
            author = quote.xpath('.//*[@itemprop="author"]/text()').extract_first()
            tags = quote.xpath('.//*[@itemprop="keywords"]/@content').extract_first()

            if i <= 5:  # change the < to > and evaluate the columns in the new log.
                yield {'Text': text,
                       'Author': author,
                       'Tags': tags,
                       'Extra key': 'dummy!'}
            else:
                yield {'Text': text,
                       'Author': author,
                       'Tags': tags}

So, how can I add more columns later (if that is possible at all), columns I don't know about when my script starts?
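One possible direction, sketched in plain Python rather than as a Scrapy component: buffer the items and write the header only once every key has been seen. The function name `write_csv_with_all_columns` and the sample rows are hypothetical, made up for illustration; only the standard-library `csv` and `io` modules are used.

```python
import csv
import io


def write_csv_with_all_columns(items, out):
    """Write dict items to `out`, using the union of all keys as the header."""
    items = list(items)  # buffer everything so every key is seen before writing
    fieldnames = []
    for item in items:
        for key in item:
            if key not in fieldnames:
                fieldnames.append(key)  # keep first-seen column order
    # restval fills the columns that a given item does not have
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval='')
    writer.writeheader()
    writer.writerows(items)
    return fieldnames


# Hypothetical items: the extra key only shows up in the second one.
rows = [
    {'Text': 'a', 'Author': 'x', 'Tags': 't1'},
    {'Text': 'b', 'Author': 'y', 'Tags': 't2', 'Extra key': 'dummy!'},
]
buffer = io.StringIO()
columns = write_csv_with_all_columns(rows, buffer)
```

In Scrapy the same idea could presumably live in an item pipeline that collects items and writes the file when the spider closes, at the cost of holding all items in memory. If the full set of columns can be listed up front, the `FEED_EXPORT_FIELDS` setting fixes the exported columns instead of relying on the first item.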

Scrapy: dynamically adding columns to the .csv output

0 answers: