Question

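I've been working with Python lately, and I want to extract information from a given XML file. The problem is that the information is stored very poorly, in the following format: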

<Content>
   <tags>
   ....
   </tags>
<![CDATA["string1"; "string2"; ....
]]>
</Content>

I can't post the entire data here because it is about 20,000 lines long. I just want to get a list containing ["string1", "string2", ...]. This is the code I have used so far:

import xml.etree.ElementTree as ET

tree = ET.parse(xmlfile)
for node in tree.iter('Content'):
    print (node.text)

However, my output is None. How can I get at the comment data? (Again, I am using Python.)

Answer 1

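The problem is that your comments don't appear to be standard. Standard comments look like this: <!-- comment here -->.

Comments of that kind can be parsed with BeautifulSoup, for example: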

from bs4 import BeautifulSoup, Comment

xml = """<Content>
   <tags>
   ...
   </tags>
<!--[CDATA["string1"; "string2"; ....]]-->
</Content>"""
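# Name the parser explicitly so bs4 does not have to guess (and warn)
soup = BeautifulSoup(xml, "html.parser")
# Keep only the text nodes that are Comment instances
comments = soup.find_all(string=lambda text: isinstance(text, Comment))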
print(comments)

This returns ['[CDATA["string1"; "string2"; ....]]'], from which the required strings can easily be parsed.
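One way to do that last parsing step, as a minimal sketch: it assumes the comment text has exactly the shape shown above and that no string contains a semicolon. The helper name split_quoted is made up for illustration.

```python
def split_quoted(comment):
    """Return the double-quoted items of a '[CDATA["a"; "b"; ...]]' string."""
    body = comment.strip()
    # Drop the literal wrapper if it is present
    if body.startswith("[CDATA[") and body.endswith("]]"):
        body = body[len("[CDATA["):-len("]]")]
    # Assumption: items are separated by ';' and never contain one themselves
    items = (part.strip() for part in body.split(";"))
    # Keep only items fully wrapped in double quotes, and strip the quotes
    return [item[1:-1] for item in items
            if len(item) >= 2 and item[0] == item[-1] == '"']


comments = ['[CDATA["string1"; "string2"; ....]]']
strings = [s for c in comments for s in split_quoted(c)]
print(strings)  # ['string1', 'string2']
```

The trailing "...." placeholder is skipped because it is not quoted.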

If you have non-standard comments, I would suggest using a regular expression, such as:

import re
xml = """<Content>
   <tags>
   asd
   </tags>
<![CDATA["string1"; "string2"; ....]]>
</Content>"""
for i in re.findall("<!.+>",xml):
    for j in re.findall('\".+\"', i):
        print(j)

This prints: "string1"; "string2"
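Two caveats with that pattern: `.+` is greedy, so everything between the first and last quote comes back as one match, and `.` does not cross line breaks, while the block in the question spans more than one line. A possible variant, offered as a sketch rather than a robust XML parser, captures the body of each `<![CDATA[ ... ]]>` block with re.DOTALL and then pulls out each quoted item separately:

```python
import re

xml = """<Content>
   <tags>
   asd
   </tags>
<![CDATA["string1"; "string2";
]]>
</Content>"""

strings = []
# Non-greedy match of the text between '<![CDATA[' and ']]>', newlines included
for block in re.findall(r"<!\[CDATA\[(.*?)\]\]>", xml, flags=re.DOTALL):
    # Each quoted item becomes its own list entry, without the quotes
    strings.extend(re.findall(r'"([^"]*)"', block))

print(strings)  # ['string1', 'string2']
```

This assumes the strings themselves never contain a double quote.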

Answer 2

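You need to build a SAX-based parser rather than a DOM-based one, especially with a file as large as yours.

A SAX-based parser requires you to write your own control logic for how the data is stored. That is more complicated than simply loading it into a DOM, but because it reads the document incrementally instead of loading it all at once, it is much faster. It also has the advantage that it can handle comments in simple cases like yours.

When building the handler, you will probably want to use the parser's LexicalHandler to extract these comments.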

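I would give you a working example of how to build one, but it has been a long time since I did it myself. There are plenty of guides online on how to build a SAX-based parser, so I will leave that discussion to another thread.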
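For reference, here is a rough sketch of what such a handler might look like with Python's standard xml.sax module. It assumes the default expat-based reader, which accepts a lexical handler through property_lexical_handler. The class name CDataCollector and the sample document are made up for illustration. The handler simply defines the LexicalHandler methods itself, so it does not rely on the xml.sax.handler.LexicalHandler base class, which is only documented from Python 3.10.

```python
import io
import re
import xml.sax
from xml.sax.handler import ContentHandler, property_lexical_handler


class CDataCollector(ContentHandler):
    """Collects comment text and CDATA section text while streaming."""

    def __init__(self):
        super().__init__()
        self.comments = []
        self.cdata_sections = []
        self._in_cdata = False
        self._buffer = []

    # ContentHandler callback: may fire several times per text run
    def characters(self, content):
        if self._in_cdata:
            self._buffer.append(content)

    # Lexical handler callbacks
    def comment(self, content):
        self.comments.append(content)

    def startCDATA(self):
        self._in_cdata = True
        self._buffer = []

    def endCDATA(self):
        self._in_cdata = False
        self.cdata_sections.append("".join(self._buffer))

    def startDTD(self, name, public_id, system_id):
        pass

    def endDTD(self):
        pass


sample = b"""<Content>
   <tags>
   <!-- a standard comment -->
   </tags>
<![CDATA["string1"; "string2";
]]>
</Content>"""

handler = CDataCollector()
parser = xml.sax.make_parser()
parser.setContentHandler(handler)
parser.setProperty(property_lexical_handler, handler)
parser.parse(io.BytesIO(sample))

strings = [s for section in handler.cdata_sections
           for s in re.findall(r'"([^"]*)"', section)]
print(handler.comments)
print(strings)
```

For a real file, passing the file path to parser.parse() instead of the BytesIO object should work the same way.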

Answer 3

As of Python 3.8, ElementTree can keep comments in the parsed tree.

Sample code for reading attributes, values, tags, and comments from XML:

import xml.etree.ElementTree as ET

infile_path = "input.xml"  # placeholder: path to your XML file

# insert_comments=True (new in Python 3.8) keeps comment nodes in the tree
parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
tree = ET.parse(infile_path, parser)

comment = ""

for node in tree.iter():
    if node.tag is ET.Comment:
        # Comment nodes use the ET.Comment factory function as their tag
        comment = node.text
    else:
        tag = node.tag                  # element tag (string)
        name = node.attrib.get("name")  # "name" attribute, or None if absent
        value = node.text               # element text
        print(tag, name, value, comment)
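One thing worth noting: insert_comments only affects real <!-- ... --> comments. The `<![CDATA[ ... ]]>` block in the question is a CDATA section, which ElementTree treats as ordinary text. Because it comes after the closing </tags>, it ends up in the tail of the tags element rather than in the text of Content, which is why node.text did not show it. A small self-contained sketch, assuming Python 3.8+ and a made-up sample modeled on the question's format:

```python
import re
import xml.etree.ElementTree as ET

sample = """<Content>
   <tags>
   <!-- a standard comment -->
   </tags>
<![CDATA["string1"; "string2";
]]>
</Content>"""

parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
root = ET.fromstring(sample, parser=parser)

# Real comments are kept as nodes whose tag is the ET.Comment function
comments = [node.text for node in root.iter() if node.tag is ET.Comment]

# Text that follows </tags> (including CDATA content) is stored in .tail
cdata_text = root.find("tags").tail
strings = re.findall(r'"([^"]*)"', cdata_text)

print(comments)  # [' a standard comment ']
print(strings)   # ['string1', 'string2']
```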
