Problem writing scraped data containing Slavic characters to CSV (UnicodeEncodeError & TypeError)

Posted: 2017-01-18 21:31:39

Tags: python csv encoding utf-8 scrapy

Intention / desired result:

Scrape the link titles (i.e. the link text of each item) from this Czech website:

https://www.bezrealitky.cz/vypis/nabidka-prodej/byt/praha

and write the results to a CSV file, preferably as a list, so that I can manipulate the data later in another Python data-analysis module.

Result / problem:

I get a UnicodeEncodeError and a TypeError. I suspect this has to do with the accented characters that occur in Czech. See the tracebacks below.

Tracebacks:

TypeError Traceback:

2017-01-19 08:00:18 [scrapy] ERROR: Error processing {'title': b'\n                                Ob\xc4\x9bt\xc3\xad 6. kv\xc4\x9b'
          b'tna, Praha - Kr\xc4\x8d                            '}
Traceback (most recent call last):
  File "C:\Users\phili\Anaconda3\envs\py35\lib\site-packages\twisted\internet\defer.py", line 651, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "C:\Users\phili\Documents\Python Scripts\Scrapy Spiders\bezrealitky\bezrealitky\pipelines.py", line 24, in process_item
    self.exporter.export_item(item)
  File "C:\Users\phili\Anaconda3\envs\py35\lib\site-packages\scrapy\exporters.py", line 193, in export_item
    self._write_headers_and_set_fields_to_export(item)
  File "C:\Users\phili\Anaconda3\envs\py35\lib\site-packages\scrapy\exporters.py", line 217, in _write_headers_and_set_fields_to_export
    self.csv_writer.writerow(row)
  File "C:\Users\phili\Anaconda3\envs\py35\lib\codecs.py", line 718, in write
    return self.writer.write(data)
  File "C:\Users\phili\Anaconda3\envs\py35\lib\codecs.py", line 376, in write
    data, consumed = self.encode(object, self.errors)
TypeError: Can't convert 'bytes' object to str implicitly

UnicodeEncodeError Traceback:

2017-01-19 08:00:18 [scrapy] ERROR: Error processing {'title': b'\n                                Ob\xc4\x9bt\xc3\xad 6. kv\xc4\x9b'
          b'tna, Praha - Kr\xc4\x8d                            '}
Traceback (most recent call last):
  File "C:\Users\phili\Anaconda3\envs\py35\lib\site-packages\twisted\internet\defer.py", line 651, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "C:\Users\phili\Documents\Python Scripts\Scrapy Spiders\bezrealitky\bezrealitky\pipelines.py", line 24, in process_item
    self.exporter.export_item(item)
  File "C:\Users\phili\Anaconda3\envs\py35\lib\site-packages\scrapy\exporters.py", line 198, in export_item
    self.csv_writer.writerow(values)
  File "C:\Users\phili\Anaconda3\envs\py35\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u011b' in position 37: character maps to <undefined>
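
Both failures can be reproduced outside Scrapy; a minimal illustration (cp1252 is the charmap codec that Windows file APIs default to on Western-European locales):

import codecs

# 1) Writing bytes to a codecs text writer raises the TypeError
#    (the exact message varies between Python versions):
with codecs.open('demo.csv', 'w', encoding='utf-8') as f:
    try:
        f.write(b'Praha')  # bytes where str is expected
    except TypeError as e:
        print(e)

# 2) Encoding Czech text as cp1252 raises the UnicodeEncodeError,
#    because cp1252 has no mapping for characters such as 'ě' (U+011B):
try:
    'Obětí 6. května'.encode('cp1252')
except UnicodeEncodeError as e:
    print(e)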

Situation / process:

I am running scrapy crawl bezrealitky (bezrealitky being the name of the spider). I configured the pipeline with a CSVItemExporter I found on the internet, and adapted it to open the file with UTF-8 encoding (at first I tried it without specifying UTF-8, but got the same errors).

My pipeline code:

from scrapy.conf import settings
from scrapy.exporters import CsvItemExporter
import codecs


class CsvPipeline(object):
    def __init__(self):
        # Open the output file through codecs with an explicit UTF-8 encoding.
        self.file = codecs.open("booksdata.csv", 'wb', encoding='UTF-8')
        self.exporter = CsvItemExporter(self.file)
        self.exporter.start_exporting()

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
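
For reference, CsvItemExporter encodes each row to bytes itself (UTF-8 by default in recent Scrapy versions on Python 3), so it is normally handed a file opened in plain binary mode rather than a codecs wrapper; a minimal sketch of that variant, keeping the same file name:

from scrapy.exporters import CsvItemExporter


class CsvPipeline(object):
    def __init__(self):
        # Plain binary handle; the exporter performs the UTF-8 encoding.
        self.file = open("booksdata.csv", 'wb')
        self.exporter = CsvItemExporter(self.file)
        self.exporter.start_exporting()

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

With this arrangement the spider can yield plain str titles, without any .encode('utf-8') call.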

My settings file:

BOT_NAME = 'bezrealitky'

SPIDER_MODULES = ['bezrealitky.spiders']
NEWSPIDER_MODULE = 'bezrealitky.spiders'

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'bezrealitky.pipelines.CsvPipeline': 300,
}

My spider code:

import scrapy

from bezrealitky.items import BezrealitkyItem  # assuming the standard Scrapy project layout


class BezrealitkySpider(scrapy.Spider):
    name = 'bezrealitky'
    start_urls = [
        'https://www.bezrealitky.cz/vypis/nabidka-prodej/byt/praha'
    ]

    def parse(self, response):
        item = BezrealitkyItem()
        items = []
        for records in response.xpath('//*[starts-with(@class,"record")]'):
            # Note: this queries the full response rather than `records`,
            # so every appended item carries the same (second) title.
            item['title'] = response.xpath('.//div[@class="details"]/h2/a[@href]/text()').extract()[1].encode('utf-8')
            items.append(item)
        return(items)

Solutions tried so far:

  • Adding and removing .encode('utf-8') on the extract() call, and likewise in pipelines.py, but it didn't work.
  • Also tried adding # -*- coding: utf-8 -*- at the top of the file; that didn't work either.
  • I tried switching the console to UTF-8 before running the spider (a quick way to check which encodings are actually in effect is sketched right after this list):

    chcp 65001

    set PYTHONIOENCODING=utf-8
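
Since the UnicodeEncodeError above mentions cp1252, it helps to confirm which encodings Python actually picks up; a small diagnostic sketch (assuming CPython on Windows):

import locale
import sys

# Encoding used by print() and the console:
print(sys.stdout.encoding)

# Fallback encoding used by open()/codecs when none is given;
# this is cp1252 on a Western-European Windows locale:
print(locale.getpreferredencoding())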

Conclusion:

I cannot get the scraped data written to a CSV file. The CSV is created, but there is nothing in it. Even though I can see in the shell that the data is being scraped, it is not decoded/encoded correctly and throws the errors before it is written to file.

I am a beginner just trying to pick up Scrapy. Any help I can get is much appreciated!

1 answer:

Answer 0 (score: 0)

What I use when crawling Czech websites, to avoid this error, is the unidecode module. What this module does is an ASCII transliteration of Unicode text.

# -*- coding: utf-8 -*-
import scrapy

from unidecode import unidecode

from bezrealitky.items import BezrealitkyItem  # assuming the standard Scrapy project layout


class BezrealitkySpider(scrapy.Spider):
    name = 'bezrealitky'
    start_urls = [
        'https://www.bezrealitky.cz/vypis/nabidka-prodej/byt/praha'
    ]

    def parse(self, response):
        item = BezrealitkyItem()
        items = []
        for records in response.xpath('//*[starts-with(@class,"record")]'):
            # unidecode expects a str, so the extracted text is passed in
            # directly, without an .encode('utf-8') call.
            item['title'] = unidecode(response.xpath('.//div[@class="details"]/h2/a[@href]/text()').extract()[1])
            items.append(item)
        return(items)
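
For example, applied to the title from the tracebacks above, the transliteration looks like this:

from unidecode import unidecode

print(unidecode('Obětí 6. května, Praha - Krč'))
# prints: Obeti 6. kvetna, Praha - Krc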

Because I use an ItemLoader, my code actually looks like this:

# -*- coding: utf-8 -*-
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose
from unidecode import unidecode


class BaseItemLoader(ItemLoader):
    # Run every extracted title through unidecode on its way into the item.
    title_in = MapCompose(unidecode)
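
A loader defined this way would then be used from the spider's parse() along these lines (a hypothetical sketch; BezrealitkyItem and the XPath are taken from the question):

from bezrealitky.items import BezrealitkyItem

def parse(self, response):
    for record in response.xpath('//*[starts-with(@class,"record")]'):
        loader = BaseItemLoader(item=BezrealitkyItem(), selector=record)
        # title_in = MapCompose(unidecode) transliterates the value here:
        loader.add_xpath('title', './/div[@class="details"]/h2/a[@href]/text()')
        yield loader.load_item()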