Question

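I am using a Scrapy pipeline to collect all scraped items into a DataFrame.

The code runs fine, but Unicode text is not displayed correctly in the DataFrame output.

However, the results in the CSV file produced by the feed exporter are fine. Could you please advise?

Here is the code: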

# In pipelines.py
import pandas as pd
from scrapy.utils.project import get_project_settings

class CrawlerPipeline(object):
    def open_spider(self, spider):
        settings = get_project_settings()
        self.df = pd.DataFrame(columns=settings.get('FEED_EXPORT_FIELDS'))
        print('SUCCESS CREATE DATAFRAME', self.df.columns)


    def process_item(self, item, spider):
        # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call
        self.df = pd.concat([self.df, pd.DataFrame([dict(item)])])  # I suspect the problem is in this line
        print('SUCCESS APPEND RECORD TO DATAFRAME, DF LEN:', len(self.df))
        return item

#In spider.py
def parse_detail_page(self, response):
    ads = CrawlerItem()
    ads['body'] = (response.css('#sgg > div > div>  div.car_des > div::text').extract_first() or "").encode('utf-8').strip()
    yield(ads)

Here is the incorrect output for the scraped text:

b'Salon \xc3\xb4 t\xc3\xb4 \xc3\x81nh L\xc3\xbd b\xc3\xa1n xe Kia Carens s\xe1\xba\xa3n xu\xe1\xba\xa5t 2015 m\xc3\xa0u c\xc3\xa1t'

Answer 1

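The incorrect output you mention is the UTF-8 encoded byte string that corresponds to the text string you want.

You have two options:

Remove .encode('utf-8') from your code.
Add .decode('utf-8') when reading the string from the DataFrame.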
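
Both options can be illustrated without Scrapy. The following is a minimal sketch: the `raw` string is a hypothetical stand-in for whatever `extract_first()` returns in the spider, not the asker's actual data.

```python
# Hypothetical stand-in for the text returned by extract_first() in the spider
raw = "  Salon ô tô Ánh Lý bán xe Kia Carens  "

# What the spider currently does: .encode('utf-8') turns the str into bytes,
# so the DataFrame ends up holding a bytes object with a b'...' repr
as_bytes = raw.encode("utf-8").strip()

# Option 1: drop the .encode('utf-8') call and keep the value as str
as_text = raw.strip()

# Option 2: keep the bytes and decode them when reading the value back
decoded = as_bytes.decode("utf-8")

print(repr(as_bytes))
print(as_text)
print(decoded == as_text)
```

If the bytes are already stored in the DataFrame, pandas also offers `Series.str.decode('utf-8')`, which could be applied to the `body` column as a whole, although option 1 avoids the round trip entirely.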

Scrapy item does not return Unicode when appended to a DataFrame?
