Question

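I'm scraping a page with Beautiful Soup, and the output shows non-standard Latin characters as hexadecimal escapes.

I'm scraping https://www.archchinese.com. It contains pinyin words, which use non-standard Latin characters (e.g. ǎ, ā). I've been trying to loop over a series of links containing pinyin, using BeautifulSoup's .string attribute together with UTF-8 encoding to print the words. Each word comes out with hexadecimal escapes where the non-standard characters should be: "hǎo" appears as "h\xc7\x8eo". I'm sure I'm doing something wrong with the encoding, but I don't know how to fix it. I first tried decoding with UTF-8, but got an error saying the element has no decode method. Trying to print the string without encoding gives me an error about an undefined character, which I assume is because it first needs to be encoded to something.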

from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup

url = "https://www.archchinese.com/"

driver = webdriver.Chrome()  # Set Selenium up to open the page with Chrome.
driver.implicitly_wait(30)
driver.get(url)

driver.find_element(By.ID, 'dictSearch').send_keys('好')  # This character is hǎo.

python_button = driver.find_element(By.ID, 'dictSearchBtn')
python_button.click()  # Look for the submit button and click it.

soup = BeautifulSoup(driver.page_source, 'lxml')

div = soup.find(id='charDef')  # Find the div with the target links.

for a in div.find_all('a', attrs={'class': 'arch-pinyin-font'}):
    print(a.string.encode('utf-8'))  # Loop through all links with pinyin and attempt to encode.

Actual result: b'h\xc7\x8eo' b'h\xc3\xa0o'

Expected result: ǎ

Edit: The problem appears to be related to UnicodeEncodeError on Windows. I tried installing win-unicode-console, but had no luck. Thanks to snakecharmerb for the information.

Answer 1

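You don't need to encode the value when printing; the print function handles that automatically. Right now you are printing the representation of the bytes that make up the encoded value, rather than the string itself.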

>>> s = 'hǎo'
>>> print(s)
hǎo

>>> print(s.encode('utf-8'))
b'h\xc7\x8eo'
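
To make the str/bytes distinction concrete, here is a minimal sketch (the variable names are illustrative, not taken from the question). It also suggests why the earlier decode attempt failed: .decode() is a method of bytes, while a.string in BeautifulSoup is a str subclass, so it is already decoded text.

```python
word = "hǎo"                       # str: a sequence of Unicode code points
encoded = word.encode("utf-8")     # bytes: the UTF-8 encoding of that text
decoded = encoded.decode("utf-8")  # decoding the bytes gives back a str

# The output below is ASCII-only, so it displays the same on any console.
print(type(word).__name__, len(word))        # str 3
print(type(encoded).__name__, len(encoded))  # bytes 4 (ǎ takes two bytes)
print(encoded)                               # b'h\xc7\x8eo'
print(decoded == word)                       # True

# str has no decode method, which matches the error described in the question.
print(hasattr(word, "decode"))               # False
```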

Answer 2

Apply the encoding when calling BeautifulSoup, not afterwards.

soup = BeautifulSoup(driver.page_source.encode('utf-8'), 'lxml')

div = soup.find(id='charDef')  # Find the div with the target links.

for a in div.find_all('a', attrs={'class': 'arch-pinyin-font'}):
    print(a.string)
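
If printing the plain string still raises the UnicodeEncodeError mentioned in the question's edit, the likely cause is the output stream's encoding rather than BeautifulSoup. The sketch below is only an illustration under that assumption: write_through is a hypothetical helper, and cp437 stands in for whatever legacy code page the console happens to use. It reproduces the failure on an in-memory stream and shows two possible ways around it.

```python
import io
import sys

def write_through(text, encoding, errors="strict"):
    """Hypothetical helper: write text through a stream that uses the
    given encoding and return the raw bytes that reached the stream."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding, errors=errors, newline="\n")
    stream.write(text)
    stream.flush()
    return raw.getvalue()

word = "hǎo"

# A legacy code page (cp437 is an assumed example) has no mapping for ǎ.
try:
    write_through(word, "cp437")
except UnicodeEncodeError as exc:
    print("cannot encode:", ascii(exc.object[exc.start:exc.end]))

# A UTF-8 stream accepts the same text without complaint.
print(write_through(word, "utf-8"))

# Alternatively, keep the narrow encoding but escape what it cannot represent.
print(write_through(word, "cp437", errors="backslashreplace"))

# On Python 3.7+ the real stdout can be switched in a similar way.
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8")
```

Whether switching stdout to UTF-8 displays correctly still depends on the terminal and its font, so treat this as one option to try rather than a guaranteed fix.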

How do I encode/decode this BeautifulSoup string in Python so that it outputs non-standard Latin characters?
