Question

I am downloading a website's HTML and prettifying it with Beautiful Soup like this:

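from bs4 import BeautifulSoup as Soup  # assumed import; the alias matches the snippet
# f_driver is presumably a Selenium WebDriver instance
f_page_soup = Soup(f_driver.page_source, "lxml")
with open(f_filename_pretty, 'wb') as f_output:
    # prettify(encoding=...) returns bytes, hence the binary mode
    f_output.write(f_page_soup.prettify(encoding='utf-8'))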

I then open the prettified HTML like this:

with open(filename, 'r', encoding='utf-8') as f_input:
    ecjData = f_input.read()
pageSoup = Soup(ecjData, "lxml")

Inside the HTML are various links that I want to collect with BeautifulSoup. One of them looks like example.com/weiß/3

I iterate over all the links and print each one like this:

print ("https://example.com" + a["href"])

As expected, this raises a UnicodeEncodeError for the link above.

So after catching the error I tried to decode it:

print (("https://example.com" + a["href"]).encode('utf-8').decode('latin-1'))

which results in:

'ascii' codec can't encode characters in position 78-79: ordinal not in range(128)
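A note on this attempt, as a minimal sketch rather than a diagnosis: encoding to UTF-8 and then decoding as Latin-1 replaces 'ß' with two other non-ASCII characters, which would be consistent with the two positions (78-79) reported in the message. The path is the example from the question; the variable names are illustrative.

```python
# Illustrative only: href stands in for a["href"] from the question.
href = "/weiß/3"
url = "https://example.com" + href

# UTF-8 encodes 'ß' as the two bytes C3 9F; Latin-1 maps each byte to one character.
mangled = url.encode("utf-8").decode("latin-1")

# ascii() escapes non-ASCII characters, so these lines print on any console.
print(ascii(url))      # 'https://example.com/wei\xdf/3'
print(ascii(mangled))  # 'https://example.com/wei\xc3\x9f/3'
```

The result is still a str containing non-ASCII characters, so an output stream that only accepts ASCII would reject it just the same.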

Another approach I tried was replacing the bytes within the string:

print (str(("https://example.com" + a["href"]).encode('utf-8')).replace('\\xc3\\x9f','ß'))

which again results in:

'ascii' codec can't encode character '\xdf' in position 80: ordinal not in range(128)
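A minimal sketch of what this second attempt appears to do, using the example path from the question (variable names are illustrative): calling str() on a bytes object yields its repr, including the b'...' wrapper, and the replace then puts the literal 'ß' ('\xdf') back into the text. That would be consistent with the '\xdf' in the message, and with the position moving from 78 to 80 because of the two extra characters b and '.

```python
# Illustrative only: the same URL as in the question, encoded to UTF-8 bytes.
raw = ("https://example.com" + "/weiß/3").encode("utf-8")

# str() on bytes returns the repr, i.e. text that starts with b' and
# contains the escape sequences \xc3\x9f spelled out as characters.
as_text = str(raw)

# Replacing the spelled-out escapes reinserts the real character 'ß'.
restored = as_text.replace("\\xc3\\x9f", "ß")

print(ascii(as_text))
print(ascii(restored))
```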

Basically, what I need to print is:

"https://example.com/weiß/3"

How can I achieve this? I am using Python 3.5.
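One possible direction, sketched under the assumption that the error comes from the output stream using an ASCII encoding (for example a C/POSIX locale or redirected output) and not from the string itself: wrap the underlying binary stream in a text layer that encodes as UTF-8. The helper names make_utf8_writer and print_url are hypothetical, and an in-memory BytesIO stands in for sys.stdout.buffer so the sketch runs anywhere.

```python
import io


def make_utf8_writer(binary_stream):
    """Wrap a binary stream so that text written to it is encoded as UTF-8."""
    return io.TextIOWrapper(binary_stream, encoding="utf-8", line_buffering=True)


def print_url(href, out):
    # Hypothetical helper: the base URL is the one from the question.
    print("https://example.com" + href, file=out)


# Stand-in for sys.stdout.buffer; in a real script pass sys.stdout.buffer instead.
buf = io.BytesIO()
out = make_utf8_writer(buf)
print_url("/weiß/3", out)
out.flush()

# The bytes that would reach the terminal: 'ß' becomes the UTF-8 pair C3 9F.
encoded = buf.getvalue()
print(encoded)
```

If the assumption holds, setting the environment variable PYTHONIOENCODING=utf-8 before starting Python may have the same effect without code changes. Whether the characters then display correctly still depends on the terminal interpreting the bytes as UTF-8.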

Python: encoding '\xdf' / '\xc3\x9f' as 'ß'

0 answers: