New to Python. I'm writing a scraper that yields a set of values, all of which contain unicode characters.
I'd like to know how to remove those unicode characters. My impression is that I'm on Python 3, but I can't tell, because the command is scrapy and I've always used Python 2; I've never used a tool that isn't run with the python command.
The code being run is:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }
How can I remove the unicode characters from the response, or from the items in the resulting set?
Answer 0 (score: 1)
Try it this way:
...
'text': quote.css('span.text::text').extract_first().decode('unicode_escape').encode('ascii', 'ignore')
...
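Note that on Python 3, extract_first() returns a str, which has no decode() method, so the line above raises AttributeError there (running scrapy version -v prints which Python interpreter the scrapy command is using). A rough Python 3 equivalent that simply drops every non-ASCII character would be the following sketch:

    ...
    'text': quote.css('span.text::text').extract_first().encode('ascii', 'ignore').decode('ascii'),
    ...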
Answer 1 (score: 0)
You should use this code:
yield {
    'text': quote.css('span.text::text').extract_first().encode("utf-8").decode('unicode_escape').encode('ascii', 'ignore'),
    'author': quote.css('small.author::text').extract_first().encode("utf-8").decode('unicode_escape').encode('ascii', 'ignore'),
    # extract() returns a list, so convert each tag individually
    'tags': [tag.encode("utf-8").decode('unicode_escape').encode('ascii', 'ignore')
             for tag in quote.css('div.tags a.tag::text').extract()],
}
Or you could create a function that converts the unicode to a string:
def convertToString(encodedString):
    return encodedString.encode("utf-8").decode('unicode_escape').encode('ascii', 'ignore')
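Used inside the parse callback, that helper is applied per field, and per element for the tags list, since extract() returns a list. A sketch of how it could be wired up (note that on Python 3 the final encode('ascii', 'ignore') returns bytes, so a trailing .decode('ascii') can be added inside convertToString if str values are wanted):

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': convertToString(quote.css('span.text::text').extract_first()),
                'author': convertToString(quote.css('small.author::text').extract_first()),
                'tags': [convertToString(tag) for tag in quote.css('div.tags a.tag::text').extract()],
            }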
Answer 2 (score: 0)
I implemented the same thing using Items, to give the data a proper structure, following the Scrapy documentation:
items.py
import scrapy

class QuotesItem(scrapy.Item):
    title = scrapy.Field()
    author = scrapy.Field()
    tag = scrapy.Field()
The spider:
import scrapy
from ..items import QuotesItem

class QuotesSpiderSpider(scrapy.Spider):
    name = 'quotes_spider'
    # allowed_domains = ['http://quotes.toscrape.com/']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        items = QuotesItem()
        all_div_quotes = response.css("div.quote")
        for quote in all_div_quotes:
            title = quote.css('span.text::text').extract_first()
            author = quote.css('.author::text').extract_first()
            tags = quote.css("a.tag::text").extract()

            items['title'] = title.strip(u'\u201c\u201d')  # strip unicode chars u'\u201c\u201d'
            items['author'] = author
            items['tag'] = ", ".join(str(x) for x in tags)  # convert list to string separated by commas
            yield items
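strip(u'\u201c\u201d') only removes the curly quotes at the ends of the title. If the goal is to replace smart punctuation anywhere in the string and drop whatever non-ASCII characters remain, a small helper along these lines could be used instead (an illustrative Python 3 sketch; SMART_PUNCT and clean_text are made-up names, not part of the answer above):

    # Illustrative helper: map common "smart" punctuation to ASCII equivalents,
    # then drop any remaining non-ASCII characters.
    SMART_PUNCT = str.maketrans({
        u'\u201c': '"', u'\u201d': '"',   # curly double quotes
        u'\u2018': "'", u'\u2019': "'",   # curly single quotes / apostrophes
    })

    def clean_text(value):
        value = value.translate(SMART_PUNCT)
        return value.encode('ascii', 'ignore').decode('ascii')

    # usage: items['title'] = clean_text(title)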
To run:
scrapy crawl quotes_spider -o output.json