Question

My XML file (a PDF converted to XML with pdftohtml from poppler-utils) contains text tags like these:

<text top="525" left="170" width="603" height="16" font="1">..part of old large book</text>
<text top="546" left="128" width="645" height="16" font="1">with many many pages and some <i>italics text among 'plain' text</i> and more and more text</text>
<text top="566" left="128" width="642" height="16" font="1">etc...</text>

I can get the text of a text tag with this sample code:

from xml.dom import minidom

xmldoc = minidom.parse('../test/text.xml')
itemlist = xmldoc.getElementsByTagName('text')

# node_index is the position of the <text> element to read
some_tag = itemlist[node_index]
output_text = some_tag.firstChild.nodeValue
# if all of the text is inside <i>, I can get it with
output_text = some_tag.firstChild.firstChild.nodeValue

# but not if <i></i> wraps only one word of the string

But I cannot get the "nodeValue" if the content contains another tag (<i> or <b>...), and I cannot get the object either.
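To make the problem concrete, here is a minimal sketch of what minidom returns for mixed content. The one-line sample string and the variable names are assumptions made for illustration. The text before the nested tag, the tag itself, and the text after it are three separate child nodes, and an element node's nodeValue is None.

```python
from xml.dom import minidom

# Hypothetical one-element sample with an <i> wrapped around a single word.
doc = minidom.parseString("<text>some <i>italic</i> tail</text>")
tag = doc.documentElement

# firstChild is only the text that comes before <i>.
first = tag.firstChild.nodeValue

# Each child as (is it a text node?, its nodeValue).
children = [(c.nodeType == c.TEXT_NODE, c.nodeValue) for c in tag.childNodes]

print(repr(first))
print(children)
```

So firstChild.nodeValue covers only the leading text, and the words inside <i> sit one level deeper.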

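What is the best way to get all of the text as a plain string (like JavaScript's innerHTML), or to recurse into the child tags, even when they wrap only some of the words rather than the entire nodeValue?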

Thanks

Answer 1

Way too late to the party... I had a similar problem, except that I wanted the tags in the resulting string. Here is my solution:


This should return the exact XML, including the tags.
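A minimal sketch of one way to get this with minidom, not necessarily the code this answer originally used: join toxml() over the element's child nodes, much like JavaScript's innerHTML. The helper name inner_xml and the <page> root element are assumptions made for illustration.

```python
from xml.dom import minidom

# Hypothetical sample; the <page> root is an assumption so the fragment parses.
SAMPLE = (
    '<page>'
    '<text top="546" left="128" width="645" height="16" font="1">'
    "with many many pages and some <i>italics text among 'plain' text</i>"
    ' and more and more text</text>'
    '</page>'
)


def inner_xml(node):
    """Serialize every child of node, keeping nested tags such as <i>."""
    return ''.join(child.toxml() for child in node.childNodes)


doc = minidom.parseString(SAMPLE)
text_node = doc.getElementsByTagName('text')[0]
result = inner_xml(text_node)
print(result)
```

Note that toxml() escapes characters such as & and < inside text nodes, so the result is markup, not raw text.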

Answer 2

Question: How to get content as a string using minidom

Here is a recursive solution, for example:

from xml.dom import minidom


def getText(nodelist):
    # Walk all nodes and collect the data of every TEXT_NODE
    rc = []
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            rc.append(node.data)
        else:
            # Not a text node: recurse into its children
            rc.append(getText(node.childNodes))
    return ''.join(rc)


xmldoc = minidom.parse('../test/text.xml')
nodelist = xmldoc.getElementsByTagName('text')

# Iterate <text ..>...</text> Node List
for node in nodelist:
    print(getText(node.childNodes))

Output:

..part of old large book
with many many pages and some italics text among 'plain' text and more and more text
etc...

Tested with Python 3.4.2
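To try the recursive approach without a file on disk, a small self-contained sketch follows. It assumes the three <text> elements sit under a single root element (a hypothetical <page> here), since minidom needs one document root. get_text mirrors the getText function above.

```python
from xml.dom import minidom

# Sample from the question, wrapped in a hypothetical <page> root element.
SAMPLE = """<page>
<text top="525" left="170" width="603" height="16" font="1">..part of old large book</text>
<text top="546" left="128" width="645" height="16" font="1">with many many pages and some <i>italics text among 'plain' text</i> and more and more text</text>
<text top="566" left="128" width="642" height="16" font="1">etc...</text>
</page>"""


def get_text(nodelist):
    """Concatenate the data of all text nodes, descending into child elements."""
    parts = []
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            parts.append(node.data)
        else:
            parts.append(get_text(node.childNodes))
    return ''.join(parts)


doc = minidom.parseString(SAMPLE)
lines = [get_text(t.childNodes) for t in doc.getElementsByTagName('text')]
for line in lines:
    print(line)
```

Because getElementsByTagName('text') selects the elements first, the whitespace between them under the root never reaches the output.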

How do I get content as a string using minidom from xml.dom?
