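How do I get rid of special characters when extracting data from the web?

Time: 2014-08-26 09:24:57

Tags: python scrapy

I am extracting data from a website, and it has an entry that contains special characters: Comfort Inn And Suites�? Blazing Stump. When I try to extract it, it throws an error: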

Traceback (most recent call last):
  File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 824, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "C:\Python27\lib\site-packages\twisted\internet\task.py", line 638, in _tick
    taskObj._oneWorkUnit()
  File "C:\Python27\lib\site-packages\twisted\internet\task.py", line 484, in _oneWorkUnit
    result = next(self._iterator)
  File "C:\Python27\lib\site-packages\scrapy\utils\defer.py", line 57, in <genexpr>
    work = (callable(elem, *args, **named) for elem in iterable)
--- <exception caught here> ---
  File "C:\Python27\lib\site-packages\scrapy\utils\defer.py", line 96, in iter_errback
    yield it.next()
  File "C:\Python27\lib\site-packages\scrapy\contrib\spidermiddleware\offsite.py", line 24, in process_spider_output
    for x in result:
  File "C:\Python27\lib\site-packages\scrapy\contrib\spidermiddleware\referer.py", line 14, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "C:\Python27\lib\site-packages\scrapy\contrib\spidermiddleware\urllength.py", line 32, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "C:\Python27\lib\site-packages\scrapy\contrib\spidermiddleware\depth.py", line 48, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "E:\Scrapy projects\emedia\emedia\spiders\test_spider.py", line 46, in parse
    print repr(business.select('a[@class="name"]/text()').extract()[0])
  File "C:\Python27\lib\site-packages\scrapy\selector\lxmlsel.py", line 51, in select
    result = self.xpathev(xpath)
  File "xpath.pxi", line 318, in lxml.etree.XPathElementEvaluator.__call__ (src\lxml\lxml.etree.c:145954)

  File "xpath.pxi", line 241, in lxml.etree._XPathEvaluatorBase._handle_result (src\lxml\lxml.etree.c:144987)

  File "extensions.pxi", line 621, in lxml.etree._unwrapXPathObject (src\lxml\lxml.etree.c:139973)

  File "extensions.pxi", line 655, in lxml.etree._createNodeSetResult (src\lxml\lxml.etree.c:140328)

  File "extensions.pxi", line 676, in lxml.etree._unpackNodeSetEntry (src\lxml\lxml.etree.c:140524)

  File "extensions.pxi", line 784, in lxml.etree._buildElementStringResult (src\lxml\lxml.etree.c:141695)

  File "apihelpers.pxi", line 1373, in lxml.etree.funicode (src\lxml\lxml.etree.c:26255)

exceptions.UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 22: invalid continuation byte

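After searching the web I tried a lot of different things, such as decode('utf-8') and unicodedata.normalize('NFC',business.select('a[@class="name"]/text()').extract()[0]), but the problem persists.

The source URL is "http://www.truelocal.com.au/find/hotels/97/"; the entry I am talking about is the fourth one on that page.

2 answers:

Answer 0: (score: 4)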

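The original web page contains a bad Mojibake, probably due to poor Unicode handling somewhere during data entry. Expressed in hexadecimal, the actual bytes in the source are C3 3F C2 A0.

I think this was once a U+00A0 NO-BREAK SPACE. Encoded as UTF-8 that becomes C2 A0; interpreting those bytes as Latin-1 and encoding to UTF-8 again gives C3 82 C2 A0. But 82 is a control character when interpreted as Latin-1 once more, so it was substituted by a ? question mark, hex 3F, when encoded.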
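The chain described above can be reproduced in a few lines. This is a minimal Python 3 sketch (the question itself runs on Python 2), using only the standard library. The final substitution of byte 82 by `?` is simulated with a plain `replace`, which is an assumption: the mechanism that actually did it on the site is unknown.

```python
# Reproduce the suspected corruption chain for U+00A0 NO-BREAK SPACE.
nbsp = "\u00a0"

step1 = nbsp.encode("utf-8")                      # correct UTF-8: C2 A0
step2 = step1.decode("latin-1").encode("utf-8")   # misread as Latin-1, re-encoded: C3 82 C2 A0
step3 = step2.replace(b"\x82", b"?")              # assumption: byte 82 swapped for '?' (3F)

print(step1.hex(), step2.hex(), step3.hex())

# The resulting bytes are no longer valid UTF-8, so decoding the name fails.
name = b"Comfort Inn And Suites" + step3 + b" Blazing Stump"
error = None
try:
    name.decode("utf-8")
except UnicodeDecodeError as exc:
    error = exc
    print(exc)
```

The failure lands on byte 0xc3 at offset 22, right after the 22 characters of "Comfort Inn And Suites", which matches the traceback in the question.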

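When you follow the link to the detail page for that venue, you get a Mojibake for the same name: Comfort Inn And Suites Ã‚Â Blazing Stump, giving us the Unicode characters U+00C3, U+201A, U+00C2 and a &nbsp; HTML entity, or Unicode character U+00A0 again. Encode that as Windows codepage 1252 (a superset of Latin-1) and you get C3 82 C2 A0 again.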
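As a quick check of that claim, the sketch below (Python 3, standard library only) encodes the four characters as Windows-1252 and then undoes both rounds of mis-decoding. The variable names are illustrative.

```python
# The four characters seen on the detail page: U+00C3, U+201A, U+00C2, U+00A0.
detail_mojibake = "\u00c3\u201a\u00c2\u00a0"

# Encoding as Windows-1252 yields the double-encoded bytes C3 82 C2 A0.
raw = detail_mojibake.encode("cp1252")
print(raw.hex())

# Undo the two rounds of mis-decoding to recover the intended character.
once = raw.decode("utf-8")                         # U+00C2 U+00A0
original = once.encode("latin-1").decode("utf-8")  # U+00A0
print(repr(original))
```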

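You can only get rid of it by targeting it directly in the page source:

    pagesource.replace('\xc3?\xc2\xa0', '\xc2\xa0')

This "repairs" the data by replacing the train wreck with the originally intended UTF-8 bytes.

If you have a Scrapy Response object, replace the body in the same way.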
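The body replacement can be sketched as follows for Python 3, where page bodies are `bytes`. The helper name `fix_body` and the sample fragment are hypothetical, introduced only for illustration.

```python
BROKEN = b"\xc3?\xc2\xa0"   # the corrupted sequence found in the page source
INTENDED = b"\xc2\xa0"      # UTF-8 for U+00A0 NO-BREAK SPACE


def fix_body(body: bytes) -> bytes:
    """Replace the corrupted byte sequence with the intended UTF-8 bytes."""
    return body.replace(BROKEN, INTENDED)


# Hypothetical fragment of the listing page.
sample = b'<a class="name">Comfort Inn And Suites\xc3?\xc2\xa0 Blazing Stump</a>'
fixed = fix_body(sample)

# The repaired bytes now decode without raising UnicodeDecodeError.
print(repr(fixed.decode("utf-8")))
```

In a Scrapy spider this would look roughly like `response = response.replace(body=fix_body(response.body))`, applied before any selector is run on the response.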

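Answer 1: (score: 0)

Do not use "replace" to fix Mojibake; fix the database and the code that caused the Mojibake.

But first you need to determine whether it is Mojibake or "double encoding". Use SELECT col, HEX(col) ... to determine whether a single character turned into 2-4 bytes (Mojibake) or 4-6 bytes (double encoding). Examples: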

`é` (as utf8) should come back `C3A9`, but instead shows `C383C2A9`
The Emoji `👽` should come back `F09F91BD`, but comes back `C3B0C5B8E28098C2BD`
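Both hex examples can be reproduced outside the database. The sketch below is Python 3 and assumes the wrong intermediate charset behaves like Windows-1252, which is close to what MySQL calls latin1. The helper name `double_encode` is illustrative.

```python
def double_encode(text: str) -> bytes:
    """UTF-8 bytes misread as cp1252 and then encoded to UTF-8 a second time."""
    return text.encode("utf-8").decode("cp1252").encode("utf-8")


for ch in ("\u00e9", "\U0001F47D"):   # e-acute and the emoji U+1F47D
    good = ch.encode("utf-8").hex().upper()
    bad = double_encode(ch).hex().upper()
    print("expected", good, "double-encoded", bad)
```

Comparing the length of the two hex strings per character is the same test that `HEX(col)` performs on the stored column.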

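See "Mojibake" and "double encoding" here.

The database fixes are then discussed here:

  • CHARACTER SET latin1, but with utf8 bytes in it; leave the bytes alone while fixing the charset:

First, assume you have this declaration for tbl.col: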

col VARCHAR(111) CHARACTER SET latin1 NOT NULL

Then convert the column without changing the bytes via this 2-step ALTER:

ALTER TABLE tbl MODIFY COLUMN col VARBINARY(111) NOT NULL;
ALTER TABLE tbl MODIFY COLUMN col VARCHAR(111) CHARACTER SET utf8mb4 NOT NULL;

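Note: If you start with TEXT, use BLOB as the intermediate definition. (This is the "2-step ALTER", as discussed elsewhere.) (Be sure to keep the other specifications the same: VARCHAR, NOT NULL, etc.)

  • CHARACTER SET utf8mb4 with double encoding: UPDATE tbl SET col = CONVERT(BINARY(CONVERT(col USING latin1)) USING utf8mb4);

  • CHARACTER SET latin1 with double encoding: do the 2-step ALTER, then fix the double encoding.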
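For intuition, the two kinds of fix in the list above have simple analogues in Python 3. This is only a sketch of the byte-level idea, not of how MySQL implements it, and it assumes MySQL's latin1 behaves like Windows-1252 for these characters.

```python
# Case 1: the stored bytes are correct UTF-8 but the column is labelled latin1.
# Like the 2-step ALTER, keep the bytes and change only how they are read.
stored = b"\xc3\xa9"
as_latin1 = stored.decode("cp1252")   # what the latin1 column shows: two characters
as_utf8 = stored.decode("utf-8")      # same bytes read as utf8mb4: one character

# Case 2: double encoding. The text itself is wrong, so the bytes must change,
# mirroring CONVERT(BINARY(CONVERT(col USING latin1)) USING utf8mb4).
double_encoded = "\u00c3\u00a9"       # U+00C3 U+00A9 stored in a utf8mb4 column
repaired = double_encoded.encode("cp1252").decode("utf-8")

print(ascii(as_latin1), ascii(as_utf8), ascii(repaired))
```

The third case in the list combines the two: first reinterpret the bytes, then apply the round trip from case 2.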