Question

I am trying to parse a local .htm file with BeautifulSoup.

The file has a .htm extension.

from bs4 import BeautifulSoup
with open('locfile.htm') as fp:
   soup = BeautifulSoup(fp, "html5lib")
print(soup)

I tried three different parsers and got the same result each time. Example output with html5lib:

<html><body><p>t a b l e   i d = " T a b l a D a t a "   c l a s s = " T a b l a    w i d t h = " 9 0 %  &gt; 
 t r &gt;....

.....

and so on. I suspect each "&gt;" is just a ">" that was converted to an HTML escape.

With html.parser and html5lib I get similar results.

How can I keep the tags inside the body?

Am I parsing the file the wrong way?

soup.contents
[<html><head></head><body>&lt;table id=........
..................
</body></html>

But the inner tags are lost, or rather converted to HTML escape characters.

How do I preserve the tags?

Answer 1

I finally found a solution.

The problem was the encoding of the original file, so I pass it explicitly when opening the file:

with open('locfile.htm', encoding="utf-16LE") as fp:
    soup = BeautifulSoup(fp, "html5lib")  # same parse call as in the question
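For anyone hitting the same symptom, here is a rough, stdlib-only sketch of why the output looked like "t a b l e" and how one might guess the encoding before parsing. When UTF-16LE bytes are decoded with a single-byte codec, every ASCII character is followed by a NUL, so the parser most likely no longer recognizes the tags and escapes them as text. The helper name guess_utf16, the 0.8 threshold, and the sample markup are my own assumptions for illustration, not part of BeautifulSoup or of the original file.

```python
import codecs
import os
import tempfile


def guess_utf16(raw):
    """Heuristic (an assumption, not a standard API): return a codec name
    if the bytes look like UTF-16, otherwise None."""
    if raw.startswith(codecs.BOM_UTF16_LE) or raw.startswith(codecs.BOM_UTF16_BE):
        return "utf-16"  # this codec reads the BOM and picks the byte order
    sample = raw[:200]
    if len(sample) < 4:
        return None
    half = len(sample) // 2
    even_nuls = sample[0::2].count(0)  # NUL bytes at positions 0, 2, 4, ...
    odd_nuls = sample[1::2].count(0)   # NUL bytes at positions 1, 3, 5, ...
    # Mostly-ASCII UTF-16LE text has a NUL after almost every character.
    if odd_nuls > 0.8 * half and even_nuls == 0:
        return "utf-16-le"
    if even_nuls > 0.8 * half and odd_nuls == 0:
        return "utf-16-be"
    return None


# Hypothetical sample markup, written as UTF-16LE without a BOM.
html = '<table id="TablaData" class="Tabla" width="90%"><tr><td>1</td></tr></table>'
fd, path = tempfile.mkstemp(suffix=".htm")
with os.fdopen(fd, "wb") as f:
    f.write(html.encode("utf-16-le"))

with open(path, "rb") as f:
    raw = f.read()
os.remove(path)

wrong = raw.decode("latin-1")            # what a single-byte decode produces
detected = guess_utf16(raw)              # heuristic guess from the raw bytes
right = raw.decode(detected or "utf-8")  # fall back to UTF-8 if nothing matched

print(repr(wrong[:12]))
print(detected)
print(right)
```

If the guess comes back as a UTF-16 variant, the same name can be passed as encoding= to open() before handing the file to BeautifulSoup. The heuristic only works for mostly-ASCII content, so treat it as a diagnostic aid rather than a reliable detector.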