Question

I have the following XML:

<?xml version="1.0" encoding="UTF-8" standalone="no"?><author id="user23">
    <document><![CDATA["@username: That boner came at the wrong time ???? http://t.co/5X34233gDyCaCjR" HELP I'M DYING       ]]></document>
    <document><![CDATA[Ugh      ]]></document>
    <document><![CDATA[YES !!!! WE GO FOR IT. http://t.co/fiI23324E83b0Rt       ]]></document>
    <document><![CDATA[@username Shout out to me????        ]]></document>
</author>

What is the most efficient way to parse the <![CDATA[content]]> sections and extract their content into a list? Say, something like:

[@username: That boner came at the wrong time ???? http://t.co/5X34233gDyCaCjR" HELP I'M DYING      Ugh     YES !!!! WE GO FOR IT. http://t.co/fiI23324E83b0Rt      @username Shout out to me????       ]

This is what I tried:

from bs4 import BeautifulSoup
x='/Users/user/PycharmProjects/TratandoDeMejorarPAN/test.xml'
y = BeautifulSoup(open(x), 'xml')
out = [y.author.document]
print out

This is the output:

[<document>"@username: That boner came at the wrong time ???? http://t.co/5XgDyCaCjR" HELP I'M DYING        </document>]

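The problem with this output is that I should not be getting the <document></document> tags. How can I remove the <document></document> tags and get all the elements of this XML into a list?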

Answer 1

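There are several issues here. (Asking for a library recommendation is against the rules, so I'm ignoring that part of the question.)

You need to pass in a file handle, not a file name.

That is: y = BeautifulSoup(open(x))

You need to tell BeautifulSoup that it is dealing with XML.

That is: y = BeautifulSoup(open(x), 'xml')

CDATA sections don't create elements. You can't search for them in the DOM because they don't exist in the DOM; they are just syntactic sugar. Simply look at the text directly under document, and don't try to search for something named CDATA.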

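To say it again, there is no difference: <doc><![CDATA[foo]]></doc> is exactly the same as <doc>foo</doc>. What is different about a CDATA section is that everything inside it is automatically escaped, meaning that <![CDATA[<hello>]]> is interpreted as &lt;hello&gt;. However, you can't tell from the parsed object tree whether your document contained a CDATA section with literal < and >, or a raw text section with &lt; and &gt;. This is by design, and it holds for any compliant XML DOM implementation.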
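As a small illustration of that equivalence, here is a sketch that uses the standard library's xml.etree.ElementTree rather than BeautifulSoup (a swapped-in parser, chosen only so the snippet has no third-party dependencies). The variable names are made up for the example.

```python
import xml.etree.ElementTree as ET

# The same text content written two ways: a CDATA section and escaped entities.
with_cdata = ET.fromstring("<doc><![CDATA[<hello>]]></doc>")
with_entities = ET.fromstring("<doc>&lt;hello&gt;</doc>")

# After parsing, both elements hold the identical string.
print(with_cdata.text == with_entities.text)

# Neither element has children: the CDATA section did not create a node.
print(len(with_cdata), len(with_entities))
```

Once parsed, nothing in either tree records which spelling the source document used.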

Now, how about some code that actually works:

import bs4

doc = """<?xml version="1.0" encoding="UTF-8" standalone="no"?><author id="user23">
    <document><![CDATA["@username: That came at the wrong time ????" HELP I'M DYING ]]></document>
    <document><![CDATA[Ugh ]]></document>
    <document><![CDATA[YES !!!! WE GO FOR IT. ]]></document>
    <document><![CDATA[@username Shout out to me???? ]]></document>
</author>
"""

# Parse as XML, then collect the text of every <document> element.
doc_el = bs4.BeautifulSoup(doc, 'xml')
print([el.text for el in doc_el.find_all('document')])

If you want to read the content from a file instead, replace doc with an open file handle, for example open(x).
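For comparison, reading from a file handle can also be sketched with the standard library alone. This is not the BeautifulSoup approach from the answer but a swapped-in alternative using xml.etree.ElementTree. The helper name document_texts and its strip flag are hypothetical, and io.BytesIO stands in for a real open(path, 'rb') handle so the snippet runs without a file on disk.

```python
import io
import xml.etree.ElementTree as ET

# Shortened sample in the shape of the question's XML (assumed data).
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8" standalone="no"?><author id="user23">
    <document><![CDATA[Ugh      ]]></document>
    <document><![CDATA[@username Shout out to me????        ]]></document>
</author>
"""


def document_texts(file_obj, strip=False):
    """Return the text of every <document> element read from file_obj.

    strip is a hypothetical convenience flag: when True, the padding
    whitespace inside each CDATA section is removed.
    """
    root = ET.parse(file_obj).getroot()
    texts = [el.text or "" for el in root.iter("document")]
    return [t.strip() for t in texts] if strip else texts


# io.BytesIO plays the role of open(path, 'rb') here.
texts = document_texts(io.BytesIO(SAMPLE), strip=True)
print(texts)
```

Whether to strip the trailing whitespace is a judgment call; the desired output in the question appears to keep it, so the flag defaults to False.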

How to efficiently extract <![CDATA[]]> content from XML with Python?

1 answer: