Strings are parsed as UTF-8 in BeautifulSoup4, even when the correct charset is set with from_encoding

Asked: 2015-11-25 10:25:17

Tags: python encoding beautifulsoup python-requests

I am using Python, requests, and beautifulsoup4 to parse /admin/state.xml from an Icecast server:

import requests
from bs4 import BeautifulSoup


r = requests.get('<host>/admin/state.xml', auth=('u', 'p'))
soup = BeautifulSoup(r.text, 'lxml-xml', from_encoding='ISO-8859-1')

mount_point_metadata = []
for mp in soup.find_all('source'):
    meta = {}
    meta['mount_point'] = mp.get('mount')[1:]

    try:
        meta['server_name'] = (mp.find('server_name').text)
    except AttributeError:
        pass
    mount_point_metadata.append(meta)

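The code works and retrieves the expected data. However, when I inspect the dictionaries in mount_point_metadata, the strings with Norwegian characters look wrong, and all values are utf-8: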

{'mount_point': u'<name redacted>',
 'server_name': u'<redacted> st\xf8rste!'}

(In this case, \xf8 should be the letter ø.)
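A quick way to check this, as a minimal sketch assuming the value is exactly the one printed above: in the repr of a unicode string, \xf8 is only an escape for the code point U+00F8, so the string may already hold the letter ø. The name `value` is a stand-in chosen for illustration.

```python
import unicodedata

# Hypothetical stand-in for the 'server_name' value shown above.
value = u'st\xf8rste!'

# The escape \xf8 in the repr denotes the single code point U+00F8.
char = value[2]
code_point = ord(char)
char_name = unicodedata.name(char)

print(hex(code_point), char_name)
```

Printing the string itself, rather than the dictionary that contains it, shows the character instead of the escape, provided the terminal can display it.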

Why does this happen, even though I give BeautifulSoup the correct encoding with from_encoding='ISO-8859-1'?

1 answer:

Answer 0: (score: 0)

Just call .encode('utf-8') on the retrieved data. I guess your code would look like this:

meta['mount_point'] = mp.get('mount')[1:].encode('utf-8')
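For context, here is a sketch of where the encoding actually matters. As far as I know, BeautifulSoup only consults from_encoding when it is given bytes (such as r.content); r.text has already been decoded by requests, so the argument is ignored in that case. The sample bytes below are made up for illustration and stand in for an ISO-8859-1 response body.

```python
# Hypothetical raw bytes, as an ISO-8859-1 encoded server might send them.
raw = b'<server_name>st\xf8rste!</server_name>'

# Decoding with the right charset yields text containing U+00F8.
text = raw.decode('iso-8859-1')

# Encoding that text as UTF-8 turns U+00F8 into the two bytes C3 B8.
utf8 = text.encode('utf-8')

print(repr(text))
print(repr(utf8))
```

Encoding to UTF-8 is only needed when bytes are required, for example when writing to a file or a socket. For comparing or displaying values, the decoded text is usually the more convenient form.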