Question

My XML is messy, with a tag structure like this:

<textTag>
    <div xmlns="http://www.tei-c.org/ns/1.0"><p> -----some text goes here-----
    </p>
    </div>
</textTag>

I want to extract -----some text goes here-----, make some changes to it, and then put it back into the XML. How can I do that?

Answer 1

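Option 1:

You can use Python's xml module to parse, update, and save the XML file. The catch is that the generated file is not guaranteed to match the original byte for byte: attribute order can change (on Python versions before 3.8) and namespace prefixes may be rewritten. So when you diff the two files, you may see many differences beyond your own edits.

So you might do something like this.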

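from xml.etree import ElementTree as ET

# <p> inherits the default namespace declared on <div>
ns = {'tei': 'http://www.tei-c.org/ns/1.0'}

tree = ET.parse('xmlfilename')
root = tree.getroot()
p_nodes = root.findall('.//tei:p', ns)
for node in p_nodes:
    # process: placeholder edit, replace with your own change
    node.text = (node.text or '').strip()
tree.write('xmlfilename', encoding='utf-8')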
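
Because the <div> in the question declares a default namespace, a plain search for p finds nothing; the element has to be looked up by its namespace-qualified name. A minimal sketch on an in-memory copy of the sample (the tei prefix and the variable names are illustrative assumptions, not part of the original answer):

```python
from xml.etree import ElementTree as ET

sample = """<textTag>
    <div xmlns="http://www.tei-c.org/ns/1.0"><p> -----some text goes here-----
    </p>
    </div>
</textTag>"""

# The prefix 'tei' is arbitrary; only the URI has to match the document.
ns = {"tei": "http://www.tei-c.org/ns/1.0"}

root = ET.fromstring(sample)

# Without the namespace the search comes back empty.
plain_hits = root.findall(".//p")

# With the namespace mapping the <p> element is found.
tei_hits = root.findall(".//tei:p", ns)

for p in tei_hits:
    p.text = p.text.replace("-----some text goes here-----", "---- BAM! ----")

result = ET.tostring(root, encoding="unicode")
print(len(plain_hits), len(tei_hits))
print(result)
```

Note that the serialized result declares the namespace with a generated prefix instead of the original xmlns="..." default declaration, which is one source of the diffs mentioned above.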

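Option 2:

Use regular expressions.

Read the file line by line, look for the pattern you are interested in, update it, and write it back. The obvious advantage is that a diff between the original and the modified file shows only the updates you made.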

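import re

pattern = re.compile(r"...")  # your pattern

# xmlfile, outputfile and update (the replacement) are yours to define
with open(xmlfile) as f, open(outputfile, "w") as fout:
    for line in f:
        # sub() returns the updated line; it does not modify line in place
        fout.write(pattern.sub(update, line))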
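
To make the line-by-line approach concrete, here is a small sketch that works on a list of lines instead of files. The pattern (text wrapped in five dashes on each side) and the upper-casing update are assumptions chosen to fit the sample in the question, not something the answer prescribes:

```python
import re

def rewrite_lines(lines, pattern, update):
    """Return a new list of lines with every match of pattern replaced by update."""
    compiled = re.compile(pattern)
    return [compiled.sub(update, line) for line in lines]

sample_lines = [
    "<textTag>\n",
    '    <div xmlns="http://www.tei-c.org/ns/1.0"><p> -----some text goes here-----\n',
    "    </p>\n",
    "    </div>\n",
    "</textTag>\n",
]

# Hypothetical update: upper-case whatever sits between the dash runs.
def shout(match):
    return "-----" + match.group(1).upper() + "-----"

rewritten = rewrite_lines(sample_lines, r"-{5}(.+?)-{5}", shout)
print("".join(rewritten))
```

Lines that do not match pass through untouched, which is why the diff against the original stays small. The trade-off is that a regex knows nothing about XML structure, so text that spans several lines or contains nested tags needs a more careful pattern.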

Answer 2

You can use lxml (which has better XPath 1.0 support than ElementTree) to find all text() nodes that contain "-----some text goes here-----", modify the text, and then replace the parent's .text (or .tail).

Example...

Python 3.x

from lxml import etree

xml = """
<textTag>
    <div xmlns="http://www.tei-c.org/ns/1.0"><p> <br/>-----some text goes here-----
    </p>
    </div>
</textTag>"""

tree = etree.fromstring(xml)
for text in tree.xpath(".//text()[contains(.,'-----some text goes here-----')]"):
    parent = text.getparent()
    new_text = text.replace("-----some text goes here-----", "---- BAM! ----")
    if text.is_text:
        parent.text = new_text
    elif text.is_tail:
        parent.tail = new_text

etree.dump(tree)

Output (dumped to console)

<textTag>
    <div xmlns="http://www.tei-c.org/ns/1.0"><p> <br/>---- BAM! ----
    </p>
    </div>
</textTag>
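
If lxml is not available, roughly the same effect can be sketched with the standard library. ElementTree has no getparent(), is_text, or is_tail, so this version simply walks every element and checks both its .text and its .tail. The function name is a hypothetical helper, and this is an approximation of the lxml approach, not a drop-in replacement:

```python
from xml.etree import ElementTree as ET

def replace_in_text_and_tail(root, old, new):
    """Replace old with new in every .text and .tail under root; return the hit count."""
    hits = 0
    for elem in root.iter():
        if elem.text and old in elem.text:
            elem.text = elem.text.replace(old, new)
            hits += 1
        if elem.tail and old in elem.tail:
            elem.tail = elem.tail.replace(old, new)
            hits += 1
    return hits

sample = """<textTag>
    <div xmlns="http://www.tei-c.org/ns/1.0"><p> <br/>-----some text goes here-----
    </p>
    </div>
</textTag>"""

root = ET.fromstring(sample)
hits = replace_in_text_and_tail(root, "-----some text goes here-----", "---- BAM! ----")

# The matched text follows <br/>, so it is stored as that element's tail.
br = root.find(".//{http://www.tei-c.org/ns/1.0}br")
print(hits, repr(br.tail))
```

Checking .tail matters here because text that follows a child element such as <br/> belongs to that child's tail, not to the enclosing <p> element's text.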

How do I extract text from XML by tag and then put it back (Python)?
