Question

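I recently started using Scrapy, and I'm trying to clean up some data that I have scraped and want to export to CSV, namely the following three examples:

Example 1 - removing certain text
Example 2 - removing/replacing unwanted characters
Example 3 - splitting comma-separated text

Example 1 data looks like this:

Text I want,Text I don't want

Using the following code: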

'Scraped 1': response.xpath('//div/div/div/h1/span/text()').extract()

Example 2 data looks like this:

- but I want to change it to £

Using the following code:

'Scraped 2': response.xpath('//html/body/div/div/section/div/form/div/div/em/text()').extract()

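Example 3 data looks like this:

Item 1,Item 2,Item 3,Item 4,Item 4,Item 5 - ultimately I want to split these into separate columns in the CSV file

Using the following code: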

'Scraped 3': response.xpath('//div/div/div/ul/li/p/text()').extract()

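I have tried using str.replace(), but can't seem to get it to work, e.g.: 'Scraped 1': response.xpath('//div/div/div/h1/span/text()').extract((str.replace(",Text I don't want",""))

I'm still looking into this, but if anyone could point me in the right direction I would be very grateful!

Here is the code: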

import scrapy
from scrapy.loader import ItemLoader
from tutorial.items import Product


class QuotesSpider(scrapy.Spider):
    name = "quotes_product"
    start_urls = [
        'http://www.unitestudents.com/',
            ]

    # Step 1
    def parse(self, response):
        for city in response.xpath('//select[@id="frm_homeSelect_city"]/option[not(contains(text(),"Select your city"))]/text()').extract(): # Select all cities listed in the select (exclude the "Select your city" option)
            yield scrapy.Request(response.urljoin("/"+city), callback=self.parse_citypage)

    # Step 2
    def parse_citypage(self, response):
        for url in response.xpath('//div[@class="property-header"]/h3/span/a/@href').extract(): #Select for each property the url
            yield scrapy.Request(response.urljoin(url), callback=self.parse_unitpage)


    # Step 3
    def parse_unitpage(self, response):
        for final in response.xpath('//div/div/div[@class="content__btn"]/a/@href').extract(): #Select final page for data scrape
            yield scrapy.Request(response.urljoin(final), callback=self.parse_final)

    #Step 4 
    def parse_final(self, response):
        unitTypes = response.xpath('//html/body/div').extract()
        for unitType in unitTypes: # There can be multiple unit types so we yield an item for each unit type we can find.
            l = ItemLoader(item=Product(), response=response)
            l.add_xpath('area_name', '//div/ul/li/a/span/text()')
            l.add_xpath('type', '//div/div/div/h1/span/text()')
            l.add_xpath('period', '/html/body/div/div/section/div/form/h4/span/text()')
            l.add_xpath('duration_weekly', '//html/body/div/div/section/div/form/div/div/em/text()')
            l.add_xpath('guide_total', '//html/body/div/div/section/div/form/div/div/p/text()')
            l.add_xpath('amenities','//div/div/div/ul/li/p/text()')
            return l.load_item()

However, I get the following error:

value = self.item.fields[field_name].get(key, default)
KeyError: 'type'

Answer 1

It would be much easier to give a more specific answer if you provided your spider and item definitions. Here are some general guidelines.

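If you want to keep things modular and follow Scrapy's suggested project architecture and separation of concerns, you should clean and prepare the data for export through Item Loaders, using their input and output processors.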

For the first two examples, MapCompose looks like a good fit.
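As a rough sketch of what that could look like for Examples 1 and 2: MapCompose takes a chain of callables and applies each one in turn to every extracted value, so the cleaning itself can be written as ordinary functions that take one string and return one string. The function names and sample strings below are assumptions made up for illustration. Because the unwanted character in Example 2 is not shown in the question, `&pound;` is only a hypothetical stand-in. The list comprehensions at the end merely imitate what the processor would do, so the snippet runs without Scrapy installed.

```python
def keep_wanted_text(value):
    """Example 1: keep only the part before the first comma."""
    return value.split(",", 1)[0].strip()


def fix_pound_sign(value, unwanted="&pound;"):
    """Example 2: swap an unwanted character sequence for a pound sign.

    `unwanted` is a hypothetical placeholder; substitute whatever
    actually shows up in the scraped text.
    """
    return value.replace(unwanted, "£").strip()


# Stand-ins for what .extract() might return (a list of strings).
scraped_1 = ["Text I want,Text I don't want"]
scraped_2 = [" &pound;120 "]

# Imitates a processor that maps a function over every extracted value.
cleaned_1 = [keep_wanted_text(v) for v in scraped_1]
cleaned_2 = [fix_pound_sign(v) for v in scraped_2]

print(cleaned_1)
print(cleaned_2)
```

In a real project the functions would be handed to the processor instead, for instance `MapCompose(keep_wanted_text)` as the `input_processor` of the relevant item field.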

Answer 2

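You have the right idea with str.replace, although I would suggest Python's 're' regular expression library, as it is more powerful. The documentation is excellent, and you can find some useful code examples there.

I'm not familiar with the scrapy library, but it looks like .extract() returns a list of strings. If you want to transform them using str.replace or one of the regex functions, you will need to use a list comprehension: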

'Selector 1': [ x.replace('A', 'B') for x in response.xpath('...').extract() ]
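The same list comprehension works with the 're' module. The sketch below assumes the unwanted text always sits at the end of the string after a comma. The `extracted` list is a made-up stand-in for what .extract() returns, and the second sample string is invented to show the optional whitespace being handled.

```python
import re

# Stand-in for response.xpath('...').extract(), which returns a list of strings.
extracted = [
    "Text I want,Text I don't want",
    "Another title, Text I don't want",
]

# A comma, optional whitespace, then the unwanted phrase at the end of the string.
unwanted = re.compile(r",\s*Text I don't want$")

# Apply the substitution to every string in the list.
cleaned = [unwanted.sub("", text) for text in extracted]
print(cleaned)
```

Compiling the pattern once is optional here; it just avoids repeating the pattern string inside the comprehension.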

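Edit: regarding the separate columns - if the data is already comma-separated, just write it straight to the file! If you want to split the comma-separated data in order to transform it, you can use str.split like this: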

"A,B,C".split(",") # returns [ "A", "B", "C" ]

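In this case, the data returned from .extract() will be a list of comma-separated strings. If you use a list comprehension as above, you will end up with a list of lists.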
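To make the list-of-lists result concrete, here is a small sketch. The `extracted` values are invented stand-ins for the output of .extract(), and flattening is shown only as one possible follow-up if a single row of columns is wanted.

```python
# Stand-in for the list of comma-separated strings returned by .extract().
extracted = ["Item 1,Item 2,Item 3", "Item 4,Item 5"]

# Splitting each string yields one inner list per extracted string.
rows = [[part.strip() for part in text.split(",")] for text in extracted]

# Flatten into a single list if every item should become its own column.
flat = [item for row in rows for item in row]

print(rows)
print(flat)
```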

If you want something more sophisticated than splitting on every comma, you can use Python's csv library.
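A minimal sketch of the csv route, assuming the scraped text may contain quoted fields with commas inside them. The sample string is invented, and an in-memory buffer stands in for a real output file.

```python
import csv
import io

# Invented sample: the middle field contains a comma and is therefore quoted.
extracted = ['Item 1,"Item 2, with a comma",Item 3']

# csv.reader accepts any iterable of lines and respects quoted fields.
rows = list(csv.reader(extracted))

# Write the parsed rows back out; the writer re-quotes fields where needed.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerows(rows)
output = buffer.getvalue()

print(rows)
print(output)
```

A plain str.split(",") would have cut the quoted field in two, which is the case the csv module is meant to handle.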

Cleaning data with Scrapy
