Removing brackets from Scrapy JSON output

Date: 2016-05-21 15:29:56

Tags: python json scrapy

The last part of my code loads the data from my Scrapy pipeline into a pandas DataFrame.

A sample result looks like this:

{"Message": ["\r\n", " Profanity directed toward staff.  ", "\r\n Profanity directed toward warden ", "  \r\n  "], "Desc": "https://www.tdcj.state.tx.us/death_row/dr_info/nicholsjoseph.jpg"}

When loaded into the DataFrame, the [] brackets are still there, along with the \r\n characters. A quick search tells me this is due to how the text is encoded and is very common with scraping.

Could someone show me a Pythonic way to get cleaner output?

I'm expecting something like:
{"Message": "Profanity directed toward staff. Profanity directed toward warden", "Desc": "https://www.tdcj.state.tx.us/death_row/dr_info/nicholsjoseph.jpg"}
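For illustration, one way to collapse the fragment list into a single string is a strip-and-join pass over each record. This is a minimal sketch on a hypothetical sample record mirroring the output above:

```python
import json

# Hypothetical sample record mirroring the scraped output above
record = {
    "Message": ["\r\n", " Profanity directed toward staff.  ",
                "\r\n Profanity directed toward warden ", "  \r\n  "],
    "Desc": "https://www.tdcj.state.tx.us/death_row/dr_info/nicholsjoseph.jpg",
}

def clean_message(parts):
    """Strip whitespace from every fragment and join the non-empty ones."""
    return " ".join(p.strip() for p in parts if p.strip())

record["Message"] = clean_message(record["Message"])
print(json.dumps(record))
```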

Edit: adding the item class and spider:

Item.py

from scrapy.item import Item, Field
from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst, MapCompose, Join


class DeathItem(Item):

    firstName = Field()
    lastName = Field()
    Age = Field()
    Date = Field()
    Race = Field()
    County = Field()
    Message = Field(
        input_processor=MapCompose(unicode.strip),
        output_processor=Join())
    Desc = Field()
    Mid = Field()

spider.py

from urlparse import urljoin
import scrapy
from texasdeath.items import DeathItem


class DeathSpider(scrapy.Spider):
    name = "death"
    allowed_domains = ["tdcj.state.tx.us"]
    start_urls = [
        "https://www.tdcj.state.tx.us/death_row/dr_executed_offenders.html"
    ]
    def parse(self, response):
        sites = response.xpath('//table/tbody/tr')
        for site in sites:
            item = DeathItem()
            item['Mid'] = site.xpath('td[1]/text()').extract()
            item['firstName'] = site.xpath('td[5]/text()').extract()
            item['lastName'] = site.xpath('td[4]/text()').extract()
            item['Age'] = site.xpath('td[7]/text()').extract()
            item['Date'] = site.xpath('td[8]/text()').extract()
            item['Race'] = site.xpath('td[9]/text()').extract()
            item['County'] = site.xpath('td[10]/text()').extract()

            url = urljoin(response.url, site.xpath("td[2]/a/@href").extract_first())
            urlLast = urljoin(response.url, site.xpath("td[3]/a/@href").extract_first())

            if url.endswith(("jpg","no_info_available.html")):
                item['Desc'] = url
                if urlLast.endswith("no_last_statement.html"):
                    item['Message'] = "No last statement"
                    yield item
                else:
                    request = scrapy.Request(urlLast, meta={"item": item}, callback=self.parse_details2)
                    yield request
            else:
                request = scrapy.Request(url, meta={"item": item,"urlLast" : urlLast}, callback=self.parse_details)
                yield request

    def parse_details(self, response):
        item = response.meta["item"]
        urlLast = response.meta["urlLast"]
        item['Desc'] = response.xpath("//*[@id='body']/p[3]/text()").extract()
        if urlLast.endswith("no_last_statement.html"):
            item["Message"] = "No last statement"
            return item
        else:
            request = scrapy.Request(urlLast, meta={"item": item}, callback=self.parse_details2)
            return request

    def parse_details2(self, response):
        item = response.meta["item"]
        item['Message'] = response.xpath("//div/p[contains(., 'Last Statement:')]/following-sibling::node()/descendant-or-self::text()").extract()
        return item

Basically, I want to load the output into my pandas DataFrame as clean text, with all the unwanted characters such as [], \r\n, and \t omitted.

Essentially this is for displaying the data, e.g. on the web.
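The cleanup can also happen on the DataFrame side after loading. A minimal sketch with made-up rows (the real data would come from the Scrapy output), assuming the `Message` column holds either a list of fragments or a plain string:

```python
import pandas as pd

# Made-up rows mirroring the shapes the spider can emit
rows = [
    {"Message": ["\r\n", " Profanity directed toward staff.  ", "  \r\n  "],
     "Desc": "https://www.tdcj.state.tx.us/death_row/dr_info/nicholsjoseph.jpg"},
    {"Message": "No last statement",
     "Desc": "https://www.tdcj.state.tx.us/death_row/dr_info/no_info_available.html"},
]
df = pd.DataFrame(rows)

# Collapse each list of fragments into one clean string;
# leave values that are already plain strings untouched
df["Message"] = df["Message"].apply(
    lambda v: " ".join(p.strip() for p in v if p.strip())
    if isinstance(v, list) else v
)
```

That said, cleaning at scrape time (as the answer below suggests) keeps the DataFrame step trivial.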

1 Answer:

Answer 0 (score: 2)

You need to adjust how the extracted item fields are post-processed. Scrapy has Item Loaders with input and output processors for exactly this. In your case you want MapCompose(unicode.strip) as the input processor and Join() as the output processor:

from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst, MapCompose, Join

class MyItemLoader(ItemLoader):
    default_output_processor = TakeFirst()

    # Strip whitespace from each extracted fragment on input,
    # then join the fragments into a single string on output
    message_in = MapCompose(unicode, unicode.strip)
    message_out = Join()
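To see what these two processors accomplish without running Scrapy, here is a rough pure-Python approximation of MapCompose(unicode.strip) followed by Join() (this sketch also drops fragments that strip down to empty strings, which Join() by itself would keep):

```python
# Rough pure-Python approximation of MapCompose(unicode.strip) + Join()
def process_message(values):
    stripped = [v.strip() for v in values]       # input processor: strip each fragment
    return " ".join(v for v in stripped if v)    # output processor: join non-empty ones

print(process_message(["\r\n", " Profanity directed toward staff.  ",
                       "\r\n Profanity directed toward warden ", "  \r\n  "]))
# prints: Profanity directed toward staff. Profanity directed toward warden
```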