Question

I have a crawler that parses the HTML of a given site and prints part of the source code. Here is my script:

#!/usr/bin/env python
# -*- encoding: utf-8 -*-
from bs4 import BeautifulSoup
import requests
import urllib.request
import re

class Crawler:

    headers = {'User-Agent' : 'Mozilla/5.0'}
    keyword = 'arroz'

    def extra(self):
        url = "http://buscando.extra.com.br/search?w=" + self.keyword
        r = requests.head(url, allow_redirects=True)    
        print(r.url)
        html = urllib.request.urlopen(urllib.request.Request(url, None, self.headers)).read()
        soup = BeautifulSoup(html, 'html.parser')
        return soup.encode('utf-8')

    def __init__(self):
        extra = self.extra()
        print(extra)

Crawler()

My code runs fine, but it prints the source with an annoying b' in front of the value. I have already tried decode('utf-8'), but it did not work. Any ideas?

Update

If I do not use encode('utf-8'), I get the following error:

Traceback (most recent call last):
  File "crawler.py", line 25, in <module>
    Crawler()
  File "crawler.py", line 23, in __init__
    print(extra)
  File "c:\Python34\lib\encodings\cp850.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2030' in position
13345: character maps to <undefined>

Answer 1

I ran your code as given, except that I replaced return soup.encode('utf-8') with return soup, and it ran fine. My environment:

OS: Ubuntu 15.10
Python: 3.4.3
python3 dist-packages bs4 version: beautifulsoup4==4.3.2

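This makes me suspect that the problem lies in your environment rather than in your code. Your stack trace mentions cp850.py, and your script is hitting a .com.br site, which makes me think your shell's default encoding may not be able to handle some of the Unicode characters on the page. Here is the Wikipedia page for cp850 (Code Page 850).

You can check the default encoding your terminal uses: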

>>> import sys
>>> sys.stdout.encoding

My terminal responds with:

'UTF-8'

I assume yours does not, and that this is the root of the problem you are running into.
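As a rough sketch of that check, the snippet below prints the terminal encoding and tests whether the character from the traceback ('\u2030', the per mille sign) can be represented in a given codec. The helper name can_encode is made up for this illustration and is not part of any library.

```python
import sys


def can_encode(text, encoding):
    """Return True if every character in text is representable in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# The encoding Python uses when print() writes to this terminal.
print(sys.stdout.encoding)

# The character from the traceback, checked against both codecs.
print(can_encode("\u2030", "cp850"))
print(can_encode("\u2030", "utf-8"))
```

If the first line printed is cp850 (or another legacy code page), any page containing characters outside that code page will probably trigger the same error when printed.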

Edit:

In fact, I can reproduce your error exactly with:

>>> print("\u2030".encode("cp850"))

So that is the issue: because of your machine's locale settings, print implicitly encodes the text to the system's default encoding and raises UnicodeEncodeError.
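To illustrate why the b' prefix shows up at all, here is a minimal sketch. The sample string is made up for illustration and is not taken from the site. Printing a bytes object shows its repr, prefix included; decoding gives back a str, but printing that str still requires the console's codec to be able to encode every character.

```python
# Hypothetical sample text containing U+2030 (per mille sign).
text = "10\u2030"

# encode() returns bytes; print() shows the repr of bytes, hence the b'...' prefix.
data = text.encode("utf-8")
print(data)  # b'10\xe2\x80\xb0'

# decode() turns the bytes back into the original str.
assert data.decode("utf-8") == text

# Encoding the same str with cp850 fails, which is what print() runs into
# on a console whose encoding is cp850.
try:
    text.encode("cp850")
except UnicodeEncodeError as exc:
    error_name = type(exc).__name__
    print(error_name)  # UnicodeEncodeError
```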

Getting the Windows command prompt to display Unicode characters is a bit outside my wheelhouse, so beyond pointing you to a relevant question/answer, I can't offer any advice.
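One possible workaround that does not require changing the console itself is to replace unencodable characters before printing. The sketch below is only an illustration under that assumption; safe_print is a hypothetical helper, not something from the question or from BeautifulSoup.

```python
import sys


def safe_print(text, stream=None):
    """Print text, substituting '?' for characters the stream cannot encode."""
    if stream is None:
        stream = sys.stdout
    # Fall back to UTF-8 if the stream does not report an encoding.
    encoding = getattr(stream, "encoding", None) or "utf-8"
    # Round-trip through the stream's codec; errors="replace" swaps each
    # unencodable character for '?', so the final print() cannot fail on it.
    safe = text.encode(encoding, errors="replace").decode(encoding)
    print(safe, file=stream)


# Hypothetical sample text; on a cp850 console this would print "10?".
safe_print("10\u2030")
```

In the crawler this would mean returning str(soup) from extra() and calling safe_print on the result. Another option that may work is setting the PYTHONIOENCODING environment variable to utf-8 before running the script, although whether the console then renders the characters correctly depends on its font and code page.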
