Question

Scraping a website that contains Chinese characters. How do I scrape Chinese characters?

from urllib.request import urlopen
from urllib.parse import urljoin

from lxml.html import fromstring

URL = 'http://list.suning.com/0-258003-0.html'
ITEM_PATH = '.clearfix .product .border-out .border-in .wrap .res-info .sell-point'

def parse_items():
    f = urlopen(URL)
    list_html = f.read().decode('utf-8')
    list_doc = fromstring(list_html)

    for elem in list_doc.cssselect(ITEM_PATH):
        a = elem.cssselect('a')[0]
        href = a.get('href')
        title = a.text
        em = elem.cssselect('em')[0]
        title2 = em.text
        print(href, title, title2)


def main():
    parse_items()

if __name__ == '__main__':
    main()

The error looks like this:

http://product.suning.com/0000000000/146422477.html Traceback (most recent call last):
  File "parser.py", line 27, in <module>
    main()
  File "parser.py", line 24, in main
    parse_items()
  File "parser.py", line 20, in parse_items
    print(href, title, title2)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)

Answer 1

From the print syntax and the imports, I assume you are using Python 3, which matters for how Unicode is handled.

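So we can expect href, title and title2 to all be Unicode strings (that is, Python 3 str objects). But the print function tries to convert those strings to an encoding that the output system accepts. For a reason I cannot know, your system defaults to ASCII, hence the error.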
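To make the mechanism concrete, here is a minimal sketch that reproduces the failure without a terminal. It assumes an in-memory stream is a fair stand-in for stdout; the helper name try_print and the sample text are made up for illustration.

```python
import io
import sys

# The encoding print() uses for the real stdout (None if it has been replaced).
print(getattr(sys.stdout, 'encoding', None))

def try_print(text, encoding):
    """Print text to an in-memory stream that uses the given encoding.

    Returns the bytes written, or the exception name if encoding fails.
    """
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding, newline='\n')
    try:
        print(text, file=stream)
        stream.flush()
    except UnicodeEncodeError as exc:
        return type(exc).__name__
    return raw.getvalue()

ascii_result = try_print('苏宁', 'ascii')   # fails, like the traceback in the question
utf8_result = try_print('苏宁', 'utf-8')    # succeeds
print(ascii_result, utf8_result)
```

If the first line printed is an ASCII variant such as ANSI_X3.4-1968 or US-ASCII, the output stream is the problem rather than the scraping code.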

How to fix it:

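The best approach is to make your system accept Unicode. On Linux or another Unix, you can declare a UTF-8 character set in the LANG environment variable (export LANG=en_US.UTF-8). On Windows you can try chcp 65001, but I am less sure about the latter.
If that does not work, or does not meet your needs, you can force an explicit encoding or, more precisely, filter out the offending characters, since Python 3 uses Unicode strings natively.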
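As a rough sketch of the "force an explicit encoding" option, one possibility is to wrap the byte stream underneath stdout in a text wrapper that always encodes as UTF-8. The helper name make_utf8_writer is hypothetical, and an in-memory buffer stands in for the terminal here so the result can be inspected. This only helps if the terminal can actually display UTF-8.

```python
import io

def make_utf8_writer(binary_stream):
    """Wrap a binary stream so text written to it is always encoded as UTF-8."""
    return io.TextIOWrapper(binary_stream, encoding='utf-8', newline='\n')

# On a real run one would wrap the terminal's byte stream instead:
#     out = make_utf8_writer(sys.stdout.buffer)
# An in-memory buffer is used here purely for demonstration.
raw = io.BytesIO()
out = make_utf8_writer(raw)
print('苏宁易购', file=out)
out.flush()
written = raw.getvalue()
```

Setting the PYTHONIOENCODING environment variable to utf-8 before starting Python, or calling sys.stdout.reconfigure(encoding='utf-8') on Python 3.7 and later, should have a similar effect without changing the print calls.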

I would use:

import sys

def u_filter(s, encoding = sys.stdout.encoding):
    return (s.encode(encoding, errors='replace').decode(encoding)
        if isinstance(s, str) else s)

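This means: if s is a Unicode string, encode it in the encoding used for stdout, replacing any character that cannot be converted with a replacement character, then decode it back into a string that is now clean.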
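To see the effect, here is a small sketch that uses the same logic with the encoding passed explicitly instead of taken from sys.stdout. The sample strings are invented for illustration.

```python
def u_filter(s, encoding):
    # Same logic as above, with the target encoding given by the caller.
    return (s.encode(encoding, errors='replace').decode(encoding)
            if isinstance(s, str) else s)

filtered_ascii = u_filter('价格 100', 'ascii')  # each Chinese character becomes '?'
filtered_utf8 = u_filter('价格 100', 'utf-8')   # nothing needs replacing
filtered_int = u_filter(42, 'ascii')            # non-strings pass through untouched

print(filtered_ascii)
print(filtered_utf8)
print(filtered_int)
```

Note that the filter loses information: on an ASCII-only output the Chinese text is shown as question marks, so it avoids the exception but does not display the original characters.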

And next:

def fprint(*args, **kwargs):
    fargs = [ u_filter(arg) for arg in args ]
    print(*fargs, **kwargs)

This means: filter the offending characters out of any Unicode strings, then print what remains.

With this in place, you can safely replace the print call that raises the exception with:

fprint(href, title, title2)
