Question

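I have a Python script that uses Beautiful Soup to extract text from the HTML files in a directory. However, I cannot get the encoding to work properly. At first I thought the HTML files themselves might be the problem. However, when I look at the source of an HTML file in Notepad.exe, I see: Vi er her for deg, og du må gjerne ta kontakt med oss på 815 32 000 eller på Facebook om du har noen spørsmål.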

But when I view the same HTML file in Internet Explorer, I see: Vi er her for deg, og du mÃ¥ gjerne ta kontakt med oss pÃ¥ 815 32 000 eller pÃ¥ Facebook om du har noen spÃ¸rsmÃ¥l.

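The Internet Explorer text is also exactly what my Python script appends to the text file. So the encoding is clearly detectable, and it is not surprising that IE does not understand it, but I cannot figure out why Python cannot handle it. The encoding should be latin-1, and I do not think that is the problem. Here is my code: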

import os
import glob
from bs4 import BeautifulSoup

path = "c:\\users\\me\\downloads\\"

for infile in glob.glob(os.path.join(path, "*.html")):
    markup = (infile)
    soup = BeautifulSoup(open(markup, "r").read())
    with open("example.txt", "a") as myfile:
        myfile.write(soup.get_text())
        myfile.close()

Since this seems to break the encoding, I thought I could pass in the latin-1 encoding, like this:

soup = BeautifulSoup(open(markup, "r").read())
soup = soup.prettify("latin-1")

But that gives me this error:

Traceback (most recent call last):
  File "bsoup.py", line 12, in <module>
    myfile.write(soup.get_text())
AttributeError: 'bytes' object has no attribute 'get_text'

Answer 1

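When given an encoding, .prettify() already returns bytes, so you can write its result straight to the file, but you must open that file in binary mode (note the 'ab' mode used below):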

soup = BeautifulSoup(open(markup, "r").read())
with open("example.txt", "ab") as myfile:
    myfile.write(soup.prettify('latin-1'))

There is no need to call myfile.close(); the with statement already takes care of that.
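To see why the file mode matters, here is a minimal sketch that uses only the standard library. The bytes value is an assumed stand-in for what soup.prettify('latin-1') would return, and the temporary file path is made up for illustration.

```python
import os
import tempfile

# Assumed sample: stands in for the bytes that soup.prettify("latin-1") returns.
data = "på".encode("latin-1")

# Hypothetical output file inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "example.txt")

# A file opened in text mode rejects bytes objects with a TypeError.
text_mode_error = None
try:
    with open(path, "a") as myfile:
        myfile.write(data)
except TypeError as exc:
    text_mode_error = exc

# A file opened in binary append mode accepts the bytes unchanged.
with open(path, "ab") as myfile:
    myfile.write(data)

with open(path, "rb") as myfile:
    written = myfile.read()
```

Here the text-mode attempt fails before anything is written, so the file ends up containing only the bytes from the 'ab' write.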

To save only the text, open the file in text mode ('a') and specify the encoding to use when saving:

soup = BeautifulSoup(open(markup, "r").read())
with open("example.txt", "a", encoding='latin-1') as myfile:
    myfile.write(soup.get_text())

Python will now automatically encode the Unicode text to latin-1.
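The same idea applies on the reading side. The following is only a sketch with the standard library, and it assumes the input file is actually UTF-8, which the question does not confirm; the file names and sample text are invented for the example.

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "input.html")   # hypothetical input file
dst = os.path.join(workdir, "example.txt")  # hypothetical output file

# Create a small UTF-8 encoded sample so the sketch is self-contained.
with open(src, "wb") as f:
    f.write("du må gjerne spørre".encode("utf-8"))

# Assumption: the source is UTF-8, so say so explicitly instead of
# relying on the platform default encoding.
with open(src, "r", encoding="utf-8") as f:
    text = f.read()

# Python encodes the str to Latin-1 as it is written.
with open(dst, "a", encoding="latin-1") as f:
    f.write(text)

# Read the raw bytes back to inspect what landed on disk.
with open(dst, "rb") as f:
    raw = f.read()
```

Naming the encoding on both open() calls keeps the result independent of the system default, which on Windows is usually not UTF-8.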

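When you see things like Ã¥ instead of å, UTF-8 bytes are being interpreted as Latin-1: each of these characters takes two bytes in UTF-8, and Latin-1 displays every byte as a separate character.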
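The effect is easy to reproduce in a few lines. This is an illustrative sketch, with the sample string taken from the words in the question:

```python
original = "må spørsmål"

# Encode as UTF-8, then (wrongly) decode those same bytes as Latin-1.
garbled = original.encode("utf-8").decode("latin-1")

# Reversing the two steps recovers the text, provided no bytes were lost.
repaired = garbled.encode("latin-1").decode("utf-8")

print(garbled)   # mÃ¥ spÃ¸rsmÃ¥l
print(repaired)  # må spørsmål
```

The garbled form matches what Internet Explorer and the output file show, which suggests the HTML files are UTF-8 rather than latin-1.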

You may want to read up on Python and Unicode:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Python Unicode HOWTO
Pragmatic Unicode
