Goal / desired result:
To scrape the link titles (i.e. the link text of each listing) from this Czech website:
https://www.bezrealitky.cz/vypis/nabidka-prodej/byt/praha
and write the results to a CSV file, ideally as a list, so that I can later manipulate the data in another Python data-analysis script.
Result / problem:
I get a UnicodeEncodeError and a TypeError. I suspect this has to do with the non-ASCII characters that occur in Czech. See the tracebacks below.
Tracebacks:
TypeError Traceback:
2017-01-19 08:00:18 [scrapy] ERROR: Error processing {'title': b'\n Ob\xc4\x9bt\xc3\xad 6. kv\xc4\x9b'
b'tna, Praha - Kr\xc4\x8d '}
Traceback (most recent call last):
File "C:\Users\phili\Anaconda3\envs\py35\lib\site-packages\twisted\internet\defer.py", line 651, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "C:\Users\phili\Documents\Python Scripts\Scrapy Spiders\bezrealitky\bezrealitky\pipelines.py", line 24, in process_item
self.exporter.export_item(item)
File "C:\Users\phili\Anaconda3\envs\py35\lib\site-packages\scrapy\exporters.py", line 193, in export_item
self._write_headers_and_set_fields_to_export(item)
File "C:\Users\phili\Anaconda3\envs\py35\lib\site-packages\scrapy\exporters.py", line 217, in _write_headers_and_set_fields_to_export
self.csv_writer.writerow(row)
File "C:\Users\phili\Anaconda3\envs\py35\lib\codecs.py", line 718, in write
return self.writer.write(data)
File "C:\Users\phili\Anaconda3\envs\py35\lib\codecs.py", line 376, in write
data, consumed = self.encode(object, self.errors)
TypeError: Can't convert 'bytes' object to str implicitly
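For what it's worth, I think the TypeError can be reproduced outside Scrapy: codecs.open() returns a text-mode writer that does the str-to-bytes encoding itself, so handing it bytes (like my .encode('utf-8') titles) fails. A minimal sketch of my understanding, on Python 3.5 ("demo.csv" is just a throwaway file name):

import codecs

# The codecs wrapper expects str and encodes str -> bytes on write,
# so writing bytes to it is rejected outright.
f = codecs.open("demo.csv", 'wb', encoding='UTF-8')
f.write(b'Ob\xc4\x9bt\xc3\xad 6. kv\xc4\x9btna')
# TypeError: Can't convert 'bytes' object to str implicitly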
UnicodeEncodeError Traceback:
2017-01-19 08:00:18 [scrapy] ERROR: Error processing {'title': b'\n Ob\xc4\x9bt\xc3\xad 6. kv\xc4\x9b'
b'tna, Praha - Kr\xc4\x8d '}
Traceback (most recent call last):
File "C:\Users\phili\Anaconda3\envs\py35\lib\site-packages\twisted\internet\defer.py", line 651, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "C:\Users\phili\Documents\Python Scripts\Scrapy Spiders\bezrealitky\bezrealitky\pipelines.py", line 24, in process_item
self.exporter.export_item(item)
File "C:\Users\phili\Anaconda3\envs\py35\lib\site-packages\scrapy\exporters.py", line 198, in export_item
self.csv_writer.writerow(values)
File "C:\Users\phili\Anaconda3\envs\py35\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u011b' in position 37: character maps to <undefined>
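The UnicodeEncodeError seems to be the same problem from the other side: when no explicit encoding reaches the writer, Windows falls back to cp1252, which has no mapping for Czech characters such as 'ě' (U+011B). A minimal sketch of my understanding:

# cp1252 (the Windows-1252 code page) cannot represent 'ě' (U+011B),
# so encoding a Czech title with it fails.
'Ob\u011bt\u00ed 6. kv\u011btna'.encode('cp1252')
# UnicodeEncodeError: 'charmap' codec can't encode character '\u011b' ...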
Situation / process:
I am running scrapy crawl bezrealitky (the name of my spider). I configured the pipeline with a CSVItemExporter I found on the internet, and tried to adapt it to open the file with UTF-8 encoding (at first I tried without adding UTF-8, but I got the same errors).
My pipeline code:
from scrapy.conf import settings
from scrapy.exporters import CsvItemExporter
import codecs

class CsvPipeline(object):
    def __init__(self):
        self.file = codecs.open("booksdata.csv", 'wb', encoding='UTF-8')
        self.exporter = CsvItemExporter(self.file)
        self.exporter.start_exporting()

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
My settings file:
BOT_NAME = 'bezrealitky'
SPIDER_MODULES = ['bezrealitky.spiders']
NEWSPIDER_MODULE = 'bezrealitky.spiders'
# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'bezrealitky.pipelines.CsvPipeline': 300,
}
My spider code:
class BezrealitkySpider(scrapy.Spider):
    name = 'bezrealitky'
    start_urls = [
        'https://www.bezrealitky.cz/vypis/nabidka-prodej/byt/praha'
    ]

    def parse(self, response):
        item = BezrealitkyItem()
        items = []
        for records in response.xpath('//*[starts-with(@class,"record")]'):
            item['title'] = response.xpath('.//div[@class="details"]/h2/a[@href]/text()').extract()[1].encode('utf-8')
            items.append(item)
        return(items)
Solutions tried so far:
I tried switching the console code page to UTF-8:
chcp 65001
set PYTHONIOENCODING=utf-8
Conclusion:
I cannot write the scraped data to the CSV file. The CSV is created, but there is nothing in it. Even though in the shell I can see that the data is being scraped, it is not decoded/encoded correctly and throws an error before being written to the file.
I am a beginner just picking up Scrapy. Any help I can get is very much appreciated!
Answer 0 (score: 0):
What I use to scrape Czech websites and avoid this error is the unidecode module. What this module does is an ASCII transliteration of Unicode text.
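For example, run on the title from your traceback it produces plain ASCII (a quick illustration; the exact output follows unidecode's transliteration table):

from unidecode import unidecode

# Accented characters are mapped to their closest ASCII equivalents.
print(unidecode('Obětí 6. května, Praha - Krč'))
# Obeti 6. kvetna, Praha - Krc

Plugged into your spider, it looks like this: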
# -*- coding: utf-8 -*-
import scrapy
from unidecode import unidecode
# assuming the default Scrapy project layout for the items module
from bezrealitky.items import BezrealitkyItem


class BezrealitkySpider(scrapy.Spider):
    name = 'bezrealitky'
    start_urls = [
        'https://www.bezrealitky.cz/vypis/nabidka-prodej/byt/praha'
    ]

    def parse(self, response):
        item = BezrealitkyItem()
        items = []
        for records in response.xpath('//*[starts-with(@class,"record")]'):
            # unidecode expects a str on Python 3, so pass the extracted text
            # straight through instead of encoding it to bytes first
            item['title'] = unidecode(response.xpath('.//div[@class="details"]/h2/a[@href]/text()').extract()[1])
            items.append(item)
        return(items)
Since I actually use an ItemLoader, my code ends up looking like this:
# -*- coding: utf-8 -*-
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose
from unidecode import unidecode


class BaseItemLoader(ItemLoader):
    # Input processor: every value fed into the 'title' field is
    # transliterated to ASCII before being stored on the item.
    title_in = MapCompose(unidecode)
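As a side note: if you would rather keep the Czech characters than transliterate them away, another option is to let CsvItemExporter handle the encoding itself, i.e. give it a plain binary file instead of a codecs wrapper, and keep the titles as str (drop the .encode('utf-8') in parse()). A minimal sketch, assuming your Scrapy version supports the documented encoding argument on item exporters (the pipeline name here is hypothetical):

from scrapy.exporters import CsvItemExporter


class Utf8CsvPipeline(object):
    def __init__(self):
        # Plain binary file: the exporter encodes each row to UTF-8 itself,
        # so no codecs.open() text wrapper is needed (or wanted).
        self.file = open("booksdata.csv", 'wb')
        self.exporter = CsvItemExporter(self.file, encoding='utf-8')
        self.exporter.start_exporting()

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

That way the CSV keeps 'Obětí', 'května' etc. intact, and your downstream analysis just needs to read the file as UTF-8.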