Question

So here is the problem. As an example, I want to download and print the list of all possible languages from

https://www.fanfiction.net/game/Pok%C3%A9mon/

(visible under the 'Filters' button).

In the HTML, it is represented as the following series of options:

<option value='17' >Svenska<option value='31' >čeština<option value='10' >Русский
<option value='39' >देवनागरी<option value='38' >ภาษาไทย<option value='5' >中文<option value='6' >日本語

I download it with the urllib.request package:

def getByUrl(self,url):
    response = urllib.request.urlopen(url)
    html = response.read()
    return html

Then I try to display it like this:

@staticmethod
def fromCollection_getPossibleLanguages(self,pageContent):
        parsedHtml = BeautifulSoup(pageContent)
        possibleMatches = parsedHtml.findAll('select',{'name':'languageid','class':'filter_select'})
        possibleMatches = possibleMatches[0].findAll('option')

        for match in possibleMatches:
            print(str(match.text.encode('unicode')) + " - " + str(match.get('value')))

However, every attempt I made with the .encode() function (for example, passing 'utf-8' or 'unicode' as arguments) failed to display the text properly. I get output such as:

b'\xd0\xa0\xd1\x83\xd1\x81\xd1\x81\xd0\xba\xd0\xb8\xd0\xb9' - 10

I am displaying it in the Mac OS X terminal and in Eclipse's console view; the result is the same in both.

Answer 1

You don't need to encode at all. BeautifulSoup has already decoded the response bytes to Unicode values, and print() can take care of the rest.
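To see where the b'\xd0...' output in the question comes from, here is a minimal sketch that uses the exact bytes from that output. The variable names are only illustrative. It shows that calling str() on a bytes object returns its repr, while decoding the same bytes as UTF-8 gives back the readable text:

```python
# Bytes copied from the output shown in the question.
raw = b'\xd0\xa0\xd1\x83\xd1\x81\xd1\x81\xd0\xba\xd0\xb8\xd0\xb9'

# str() on a bytes object returns its repr, hence the b'\x..' form.
as_repr = str(raw)

# Decoding turns the UTF-8 bytes back into text (a str).
text = raw.decode('utf-8')

# Encoding goes the other way: str -> bytes. This is what the
# question's .encode() call did to text that was already decoded.
round_trip = text.encode('utf-8')

print(as_repr)
print(text)
```

In other words, match.text is already a str, so encoding it and wrapping the result in str() is what produces the escaped output.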

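However, the page is malformed, as there are no closing </option> tags. This can confuse the standard HTML parser. Install lxml or the html5lib package, either of which can parse the page correctly: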

parsedHtml = BeautifulSoup(pageContent, 'lxml')

or

parsedHtml = BeautifulSoup(pageContent, 'html5lib')

Next, you can select the <option> tags with a single CSS selector:

possibleMatches = parsedHtml.select('select[name=languageid] option')

for match in possibleMatches:
    print(match.text, "-", match.get('value'))

Demo:

>>> possibleMatches = soup.select('select[name=languageid] option')
>>> for match in possibleMatches:
...     print(match.text, "-", match.get('value'))
... 
Language - 0
Bahasa Indonesia - 32
Català - 34
Deutsch - 4
Eesti - 41
English - 1
Español - 2
Esperanto - 22
Filipino - 21
Français - 3
Italiano - 11
Język polski - 13
LINGUA LATINA - 35
Magyar - 14
Nederlands - 7
Norsk - 18
Português - 8
Română - 27
Suomi - 20
Svenska - 17
čeština - 31
Русский - 10
देवनागरी - 39
ภาษาไทย - 38
中文 - 5
日本語 - 6
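If installing lxml or html5lib is not an option, the same idea can be sketched with only the standard library's html.parser module. This is not the approach recommended above, just a dependency-free illustration. The OptionCollector class and extract_options helper are hypothetical names, and the sample markup is a shortened version of the snippet from the question wrapped in an assumed <select> element. The sketch treats each new <option> start tag as implicitly closing the previous one:

```python
from html.parser import HTMLParser


class OptionCollector(HTMLParser):
    """Collect (text, value) pairs from <option> tags, closed or not."""

    def __init__(self):
        super().__init__()
        self.options = []    # finished (text, value) pairs
        self._open = False   # True while inside an <option>
        self._value = None   # value attribute of the open <option>
        self._chunks = []    # text pieces seen inside the open <option>

    def _flush(self):
        # Store the currently open option, if any, and reset the state.
        if self._open:
            self.options.append((''.join(self._chunks).strip(), self._value))
        self._open, self._value, self._chunks = False, None, []

    def handle_starttag(self, tag, attrs):
        if tag == 'option':
            self._flush()  # a new <option> implicitly ends the previous one
            self._open = True
            self._value = dict(attrs).get('value')

    def handle_endtag(self, tag):
        if tag in ('option', 'select'):
            self._flush()

    def handle_data(self, data):
        if self._open:
            self._chunks.append(data)


def extract_options(html):
    """Return a list of (text, value) pairs for every <option> in html."""
    parser = OptionCollector()
    parser.feed(html)
    parser.close()
    parser._flush()  # store an option still open at end of input
    return parser.options


sample = ("<select name='languageid'>"
          "<option value='17' >Svenska<option value='31' >čeština"
          "<option value='10' >Русский\n<option value='5' >中文</select>")

for text, value in extract_options(sample):
    print(text, "-", value)
```

No .encode() call appears anywhere: the parser hands back str objects, and print() writes them using the console's encoding.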

Python 3: unable to properly encode and print downloaded strings with \xXX literals
