Can I change BeautifulSoup's behavior regarding converting XML tags to lowercase?

Question

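I'm working on code that parses configuration files written in XML, where the XML tags are mixed case and the case is significant. Beautiful Soup appears to convert XML tags to lowercase by default, and I would like to change this behavior.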

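I'm not the first to ask about this [see here]. However, I did not understand the answer given to that question, and in BeautifulSoup-3.1.0.1, BeautifulSoup.py does not appear to contain any instances of "encodedName" or "Tag.__str__".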

Answer 1

import html5lib
from html5lib import treebuilders

f = open("mydocument.html")
parser = html5lib.XMLParser(tree=treebuilders.getTreeBuilder("beautifulsoup"))
document = parser.parse(f)

'document' is now a BeautifulSoup-like tree that preserves the case of tags. See html5lib for documentation and installation.

Answer 2

According to Leonard Richardson, the creator of Beautiful Soup, you can't.

Answer 3

You are much better off using lxml. It is far faster than BeautifulSoup. It even has a compatibility API for BeautifulSoup if you don't want to learn the lxml API.

Ian Bicking agrees:

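There is no reason to use BeautifulSoup anymore, unless you are on Google App Engine or some other environment where anything that is not pure Python is not allowed.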

It also works much better with XML.

Answer 4

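Background and reason

First, we should know: HTML parsers are case-insensitive, so they convert tags to lowercase.

Also: BeautifulSoup internally calls a parser to parse the input HTML/XML.

-> With the latest bs4 (BeautifulSoup v4), a common choice is Python's built-in html.parser (if no parser is named, bs4 picks the best HTML parser that is installed):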

soup = BeautifulSoup(yourXmlStr, 'html.parser')

However, ALL HTML parsers are case-insensitive, so html.parser and the other two HTML parsers listed in the official doc will convert tags to lowercase:

lxml
- BeautifulSoup(yourHtmlOrXmlStr, "lxml")
html5lib
- BeautifulSoup(yourHtmlOrXmlStr, "html5lib")
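
The lowercasing can be observed without BeautifulSoup at all, since Python's built-in html.parser module is what bs4 calls when 'html.parser' is requested. The following is a minimal sketch using only the standard library; the class name TagNameCollector is a hypothetical helper made up for illustration, and the sample tag is taken from the example in this answer.

```python
from html.parser import HTMLParser


class TagNameCollector(HTMLParser):
    """Hypothetical helper: records tag names and attributes as the parser reports them."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.attributes = []

    def handle_starttag(self, tag, attrs):
        # html.parser passes the tag name already converted to lowercase.
        self.start_tags.append(tag)
        # Attribute values keep their original case.
        self.attributes.extend(attrs)


collector = TagNameCollector()
collector.feed(
    '<XCUIElementTypeWindow type="XCUIElementTypeWindow"></XCUIElementTypeWindow>'
)
collector.close()

print(collector.start_tags)   # ['xcuielementtypewindow']
print(collector.attributes)   # [('type', 'XCUIElementTypeWindow')]
```

The tag name arrives lowercased before BeautifulSoup ever sees it, while the attribute value is left alone, which matches the output shown in the example for the HTML parsers.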

Example

Input:

<?xml version="1.0" encoding="UTF-8"?>
<XCUIElementTypeApplication type="XCUIElementTypeApplication" name="微信"
    label="微信" enabled="true" visible="true" x="0" y="0" width="375" height="667">
    <XCUIElementTypeWindow
            type="XCUIElementTypeWindow" enabled="true" visible="true" x="0" y="0" width="375" height="667">
    </XCUIElementTypeWindow>
</XCUIElementTypeApplication>

Output:

<?xml version="1.0" encoding="UTF-8"?>
    <xcuielementtypeapplication enabled="true" height="667" label="微信" name="微信" type="XCUIElementTypeApplication" visible="true" width="375" x="0" y="0">
    <xcuielementtypewindow enabled="true" height="667" type="XCUIElementTypeWindow" visible="true" width="375" x="0" y="0">
    </xcuielementtypewindow>
    </xcuielementtypeapplication>

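How to `disable` BeautifulSoup's automatic lowercase conversion of tags?

Solution: switch to the xml parser
Reason: the xml parser treats tags as case-sensitive
- -> so it does not automatically convert tags to all lowercase
Code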

soup = BeautifulSoup(yourXmlStr, 'xml')

or:

soup = BeautifulSoup(yourXmlStr, 'lxml-xml')

Output example:

<?xml version="1.0" encoding="utf-8"?>
    <XCUIElementTypeApplication enabled="true" height="667" label="微信" name="微信" type="XCUIElementTypeApplication" visible="true" width="375" x="0" y="0">
    <XCUIElementTypeWindow enabled="true" height="667" type="XCUIElementTypeWindow" visible="true" width="375" x="0" y="0">
    </XCUIElementTypeWindow>
    </XCUIElementTypeApplication>
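
To compare the two behaviors side by side, here is a small sketch that parses the same string with 'html.parser' and with 'xml'. It assumes that both bs4 and lxml are installed (the 'xml' parser is provided by lxml); the sample document and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up mixed-case configuration snippet, used only for this comparison.
xml_text = '<Config><ServerName port="8080">example</ServerName></Config>'

# HTML parser: tag names are converted to lowercase.
html_soup = BeautifulSoup(xml_text, "html.parser")
html_names = [tag.name for tag in html_soup.find_all(True)]

# XML parser (requires lxml): tag names keep their original case.
xml_soup = BeautifulSoup(xml_text, "xml")
xml_names = [tag.name for tag in xml_soup.find_all(True)]

print(html_names)  # ['config', 'servername']
print(xml_names)   # ['Config', 'ServerName']
```

With the xml parser, lookups such as xml_soup.find("ServerName") must use the exact mixed-case name, because tag matching is case-sensitive.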
