The last part of my code loads the data from my Scrapy pipeline into a pandas DataFrame.
A sample result looks like this:
{"Message": ["\r\n", " Profanity directed toward staff. ", "\r\n Profanity directed toward warden ", " \r\n "], "Desc": "https://www.tdcj.state.tx.us/death_row/dr_info/nicholsjoseph.jpg"}
When this is loaded into the DataFrame, the [] brackets are still there, along with the "\r\n" characters. A quick search tells me this is an encoding issue and that it is common when scraping.
Can someone show me a Pythonic way to get cleaner output?
I'm expecting something like:
{"Message": "Profanity directed toward staff. Profanity directed toward warden", "Desc": "https://www.tdcj.state.tx.us/death_row/dr_info/nicholsjoseph.jpg"}
Edit: adding the item class and spider:
Item.py
from scrapy.item import Item, Field
from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst, MapCompose, Join
class DeathItem(Item):
    firstName = Field()
    lastName = Field()
    Age = Field()
    Date = Field()
    Race = Field()
    County = Field()
    Message = Field(
        input_processor=MapCompose(unicode.strip),
        output_processor=Join())
    Desc = Field()
    Mid = Field()
spider.py
from urlparse import urljoin
import scrapy
from texasdeath.items import DeathItem
class DeathSpider(scrapy.Spider):
    name = "death"
    allowed_domains = ["tdcj.state.tx.us"]
    start_urls = [
        "https://www.tdcj.state.tx.us/death_row/dr_executed_offenders.html"
    ]

    def parse(self, response):
        sites = response.xpath('//table/tbody/tr')
        for site in sites:
            item = DeathItem()
            item['Mid'] = site.xpath('td[1]/text()').extract()
            item['firstName'] = site.xpath('td[5]/text()').extract()
            item['lastName'] = site.xpath('td[4]/text()').extract()
            item['Age'] = site.xpath('td[7]/text()').extract()
            item['Date'] = site.xpath('td[8]/text()').extract()
            item['Race'] = site.xpath('td[9]/text()').extract()
            item['County'] = site.xpath('td[10]/text()').extract()
            url = urljoin(response.url, site.xpath("td[2]/a/@href").extract_first())
            urlLast = urljoin(response.url, site.xpath("td[3]/a/@href").extract_first())
            if url.endswith(("jpg", "no_info_available.html")):
                item['Desc'] = url
                if urlLast.endswith("no_last_statement.html"):
                    item['Message'] = "No last statement"
                    yield item
                else:
                    request = scrapy.Request(urlLast, meta={"item": item}, callback=self.parse_details2)
                    yield request
            else:
                request = scrapy.Request(url, meta={"item": item, "urlLast": urlLast}, callback=self.parse_details)
                yield request

    def parse_details(self, response):
        item = response.meta["item"]
        urlLast = response.meta["urlLast"]
        item['Desc'] = response.xpath("//*[@id='body']/p[3]/text()").extract()
        if urlLast.endswith("no_last_statement.html"):
            item["Message"] = "No last statement"
            return item
        else:
            request = scrapy.Request(urlLast, meta={"item": item}, callback=self.parse_details2)
            return request

    def parse_details2(self, response):
        item = response.meta["item"]
        item['Message'] = response.xpath("//div/p[contains(., 'Last Statement:')]/following-sibling::node()/descendant-or-self::text()").extract()
        return item
I basically want the output loaded into my pandas DataFrame as clean text, with all the unwanted characters such as [], \r, \n, and \t omitted.
This is essentially for displaying the data, e.g. on a web page.
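For reference, a minimal post-processing sketch, independent of Scrapy, that flattens a list-valued field into clean text after building the DataFrame. The `flatten` helper and the sample row are illustrative only; they mirror the example output shown above:

```python
import pandas as pd

# Sample row shaped like the scraped output in the question.
rows = [{"Message": ["\r\n", " Profanity directed toward staff. ",
                     "\r\n Profanity directed toward warden ", " \r\n "],
         "Desc": "https://www.tdcj.state.tx.us/death_row/dr_info/nicholsjoseph.jpg"}]

def flatten(value):
    # Join list fragments into one string, then collapse every run of
    # whitespace (\r, \n, \t, spaces) into a single space.
    if isinstance(value, list):
        value = " ".join(value)
    return " ".join(value.split())

df = pd.DataFrame(rows)
df["Message"] = df["Message"].map(flatten)
print(df.loc[0, "Message"])
# → Profanity directed toward staff. Profanity directed toward warden
```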
Answer (score: 2)
You need to adjust how the extracted item fields are post-processed. Scrapy provides Item Loaders with input and output processors for exactly this; in your case you need Join() and MapCompose(unicode.strip):
from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst, MapCompose, Join
class MyItemLoader(ItemLoader):
    default_output_processor = TakeFirst()

    message_in = MapCompose(unicode, unicode.strip)
    message_out = Join()
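To see what those processors do, here is a plain-Python sketch of the same chain, run on the fragments from the question (no Scrapy required). Note that MapCompose drops only None results, so fragments that strip down to "" would still reach Join(); the empty-string filter below is an extra illustrative step, not part of the Scrapy processors themselves:

```python
# Fragments as scraped, matching the "Message" list in the question.
fragments = ["\r\n", " Profanity directed toward staff. ",
             "\r\n Profanity directed toward warden ", " \r\n "]

stripped = [f.strip() for f in fragments]   # like MapCompose(str.strip)
non_empty = [f for f in stripped if f]      # drop fragments that became ""
message = " ".join(non_empty)               # like Join()
print(message)
# → Profanity directed toward staff. Profanity directed toward warden
```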