I have the following XML document, saved with Notepad++ in ISO-8859-15 encoding:
<?xml version="1.0" encoding="ISO-8859-15"?>
<someTag>
</someTag>
I am trying to parse this file with bs4, but no matter what I do (even when I specify the encoding everywhere I can think of), I get an empty result:
from bs4 import BeautifulSoup

filepath = 'iso-8859-15_example.xml'
with open(filepath, encoding="iso-8859-15") as f:
    soup = BeautifulSoup(f, 'xml', from_encoding="iso-8859-15")
print(soup)
# --> '<?xml version="1.0" encoding="utf-8"?>', otherwise empty
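Removing the encoding hints from the Python code does not help. Strangely, what does work is removing the first line of the XML file, i.e. the <?xml ... ?> declaration (the "prolog", I believe).
What am I doing wrong here? I assumed the prolog would help bs4 "do the right thing" and pick the correct encoding. Is there any option other than removing the prolog or otherwise tampering with the encoding of the XML file?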
Answer 0 (score: 0)
In cases like this, I recommend running BeautifulSoup's diagnose() function:
from bs4 import BeautifulSoup
from bs4.diagnose import diagnose

with open('iso-8859-15_example.xml', encoding="iso-8859-15") as f:
    diagnose(f.read())
On my machine, this prints:
Diagnostic running on Beautiful Soup 4.7.1
Python version 3.6.8 (default, Jan 14 2019, 11:02:34)
[GCC 8.0.1 20180414 (experimental) [trunk revision 259383]]
Found lxml version 4.3.3.0
Found html5lib version 1.0.1
Trying to parse your markup with html.parser
Here's what html.parser did with the markup:
<?xml version="1.0" encoding="ISO-8859-15"?>
<sometag>
</sometag>
--------------------------------------------------------------------------------
Trying to parse your markup with html5lib
Here's what html5lib did with the markup:
<!--?xml version="1.0" encoding="ISO-8859-15"?-->
<html>
<head>
</head>
<body>
<sometag>
</sometag>
</body>
</html>
--------------------------------------------------------------------------------
Trying to parse your markup with lxml
Here's what lxml did with the markup:
<?xml version="1.0" encoding="ISO-8859-15"?>
<html>
<body>
<sometag>
</sometag>
</body>
</html>
--------------------------------------------------------------------------------
Trying to parse your markup with lxml-xml
Here's what lxml-xml did with the markup:
<?xml version="1.0" encoding="utf-8"?>
--------------------------------------------------------------------------------
In this case I would choose html.parser, because it is the parser that returns the document's content here.
So, when you do this:
soup = BeautifulSoup(f.read(), 'html.parser')
print(soup)
it prints:
<?xml version="1.0" encoding="ISO-8859-15"?>
<sometag>
</sometag>
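One thing worth noting in the output above: html.parser lower-cases tag names, so someTag comes back as sometag. Here is a minimal sketch of what that means for lookups. It assumes the document from the question is passed inline as a string, and the variable names are only for illustration:

```python
from bs4 import BeautifulSoup

# Assumption: the question's document, inlined as an already-decoded string.
markup = '<?xml version="1.0" encoding="ISO-8859-15"?>\n<someTag>\n</someTag>\n'

soup = BeautifulSoup(markup, 'html.parser')

# html.parser treats the input as HTML, so tag names are lower-cased.
lower = soup.find('sometag')   # matches the lower-cased tag
mixed = soup.find('someTag')   # the original spelling no longer matches

print(lower is not None, mixed is None)
```

If the original case of the tag names matters, this is probably a reason to prefer an XML-aware parser over html.parser.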
Answer 1 (score: 0)
Combining Andrej's answer with the answer given in the duplicate question, I found that opening the file in raw (binary) mode in the open call solves my problem:
from bs4 import BeautifulSoup
from bs4.diagnose import diagnose

# 'rb' hands the parsers raw bytes instead of an already-decoded str
with open('iso-8859-15_example.xml', 'rb') as f:
    diagnose(f)
This produces the output:
Diagnostic running on Beautiful Soup 4.7.1
Python version 3.6.7 (v3.6.7:6ec5cf24b7, Oct 20 2018, 13:35:33) [MSC v.1900 64 bit (AMD64)]
I noticed that html5lib is not installed. Installing it may help.
Found lxml version 4.3.4.0
Trying to parse your markup with html.parser
Here's what html.parser did with the markup:
<?xml version="1.0" encoding="ISO-8859-15"?>
<sometag>
</sometag>
--------------------------------------------------------------------------------
Trying to parse your markup with lxml
Here's what lxml did with the markup:
<?xml version="1.0" encoding="ISO-8859-15"?>
<html>
<body>
<sometag>
</sometag>
</body>
</html>
--------------------------------------------------------------------------------
Trying to parse your markup with lxml-xml
Here's what lxml-xml did with the markup:
<?xml version="1.0" encoding="utf-8"?>
<someTag>
</someTag>
--------------------------------------------------------------------------------
which shows that lxml in XML mode (lxml-xml) now parses the document correctly, with the original tag case preserved.
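For completeness, here is a minimal sketch of the same idea applied to BeautifulSoup directly rather than to diagnose(). It assumes lxml is installed. To be runnable on its own, it writes a throwaway copy of the file into a temporary directory, and it adds a € sign to the element (not in my real file) so that a wrong encoding would show up in the parsed text:

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Assumption: a stand-in for the real file, with a euro sign added for illustration.
xml_text = '<?xml version="1.0" encoding="ISO-8859-15"?>\n<someTag>€</someTag>\n'

with tempfile.TemporaryDirectory() as tmpdir:
    filepath = os.path.join(tmpdir, 'iso-8859-15_example.xml')
    with open(filepath, 'wb') as f:
        f.write(xml_text.encode('iso-8859-15'))

    # Binary mode: the parser receives raw bytes and can apply the
    # encoding named in the XML declaration itself.
    with open(filepath, 'rb') as f:
        soup = BeautifulSoup(f, 'xml')

tag = soup.find('someTag')
print(tag)
```

No encoding argument is passed to open or to BeautifulSoup here. As far as I can tell, the declaration in the prolog is enough once the input is bytes.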