Question

I am using Python, requests, and beautifulsoup4 to parse /admin/state.xml from an Icecast server:

import requests
from bs4 import BeautifulSoup


# Fetch the XML from the Icecast admin interface
r = requests.get('<host>/admin/state.xml', auth=('u', 'p'))
soup = BeautifulSoup(r.text, 'lxml-xml', from_encoding='ISO-8859-1')

mount_point_metadata = []
for mp in soup.find_all('source'):
    meta = {}
    # Drop the first character of the mount attribute
    meta['mount_point'] = mp.get('mount')[1:]

    try:
        meta['server_name'] = mp.find('server_name').text
    except AttributeError:
        # This source has no <server_name> element
        pass
    mount_point_metadata.append(meta)

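The code works fine and retrieves the expected data. However, when I inspect mount_point_metadata (a list of dicts), the strings containing Norwegian characters look wrong, and all of the values are Unicode strings: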

{'mount_point': u'<name redacted>',
 'server_name': u'<redacted> st\xf8rste!'}

(In this case, \xf8 should be the letter ø.)

Why does this happen, even though I give BeautifulSoup the correct encoding with from_encoding='ISO-8859-1'?

Answer 1

Just call .encode('utf-8') on the retrieved data. I guess your code would look like this:

meta['mount_point'] = mp.get('mount')[1:].encode('utf-8')

BeautifulSoup4 returns parsed strings as Unicode, even when from_encoding is set to the correct charset.
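To see why the value in the question is not actually corrupted, here is a minimal sketch that uses only the standard library. The byte string `raw` is a made-up sample standing in for what an ISO-8859-1 server would send; it is not taken from a real Icecast response. As far as I know, BeautifulSoup only applies from_encoding when it is handed bytes (for example r.content), and it ignores the argument for already-decoded text such as r.text. Either way the result is Unicode text, and u'\xf8' is just the escaped repr of ø.

```python
# -*- coding: utf-8 -*-
import unicodedata

# Hypothetical sample: the bytes an ISO-8859-1 server would send for "største!"
raw = b'st\xf8rste!'

# Decoding yields Unicode text, the same kind of value BeautifulSoup returns.
text = raw.decode('iso-8859-1')

# u'\xf8' is only the escaped repr of code point U+00F8, i.e. the letter ø.
letter_name = unicodedata.name(text[2])

# Encode at the output boundary only if a UTF-8 byte string is really needed.
utf8_bytes = text.encode('utf-8')

print(letter_name)
print(utf8_bytes == b'st\xc3\xb8rste!')
```

Printing the Unicode string itself (rather than the dict that contains it) should display ø on a terminal that supports it, so encoding is probably only needed when writing to a file or socket that expects bytes.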
