I have been trying to evaluate the values of similar tags and get the output as a single tag, as shown below.
XML input:
<root>
  <data>
    <slide name="file.xml">
      <subtitle>Text1</subtitle>
      <MainTitle>Text2</MainTitle>
      <MainTitle>text3</MainTitle>
    </slide>
    <slide name="file.xml">
      <Title>String1</Title>
      <Title>String2</Title>
      <Title>String3</Title>
      <Title>String4</Title>
      <Title>String5</Title>
      <Title>String6</Title>
      <Title>String7</Title>
      <Title>String8</Title>
    </slide>
  </data>
</root>
Expected output:
<root>
  <data>
    <slide name="file.xml">
      <subtitle>Text1</subtitle>
      <MainTitle>Text2</MainTitle>
      <MainTitle>text3</MainTitle>
    </slide>
    <slide name="file.xml">
      <Title>String</Title>
    </slide>
  </data>
</root>
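Any help would be greatly appreciated. Thanks!
Answer 0 (score: 0)
You need to group adjacent elements that share a tag, recursively. Here is an implementation that accepts a function deciding how the texts of a group are combined: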
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import itertools
import operator
import os.path
from lxml import etree
text = """
<root>
<data>
<slide name="file.xml">
<subtitle>Text1</subtitle>
<MainTitle>Text2</MainTitle>
<MainTitle>text3</MainTitle>
</slide>
<slide name="file.xml">
<Title>String1</Title>
<Title>String2</Title>
<Title>String3</Title>
<Title>String4</Title>
<Title>String5</Title>
<Title>String6</Title>
<Title>String7</Title>
<Title>String8</Title>
</slide>
</data>
</root>
"""
def combine_elements(elements, combine_text=', '.join):
    """Merge runs of adjacent same-tag leaf elements, in place and recursively."""
    result = []
    # groupby only groups consecutive elements that have the same tag
    for key, group in itertools.groupby(elements, operator.attrgetter('tag')):
        items = list(group)
        first_item = items[0]
        # combine only if the items have no children
        if len(items) > 1 and not len(first_item):
            # treat missing text (None) as an empty string
            combined = combine_text([el.text or '' for el in items])
            # and only if combine_text returned something, e.g. the strings
            # have a common prefix
            if combined:
                first_item.text = combined
                result.append(first_item)
                continue
        result.extend(items)
    # replace the children with the combined list
    elements[:] = result
    # recursively combine the children's own children
    for element in elements:
        combine_elements(element, combine_text)
doc = etree.fromstring(text)
combine_elements(doc, os.path.commonprefix)
print(etree.tostring(doc, encoding='unicode'))
With os.path.commonprefix() as the text combiner, you get the following result:
<root>
<data>
<slide name="file.xml">
<subtitle>Text1</subtitle>
<MainTitle>Text2</MainTitle>
<MainTitle>text3</MainTitle>
</slide>
<slide name="file.xml">
<Title>String</Title>
</slide>
</data>
</root>
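The MainTitle elements are left untouched above because their texts share no prefix. As a minimal sketch of why (the variable names here are only for illustration), os.path.commonprefix() compares the strings character by character and is case-sensitive, so "Text2" and "text3" yield an empty string, which combine_elements treats as "do not combine":

```python
import os.path

# The texts from the second slide: String1 ... String8
titles = ["String%d" % i for i in range(1, 9)]
# The texts from the first slide's MainTitle elements
main_titles = ["Text2", "text3"]

# commonprefix works character by character and is case-sensitive
title_prefix = os.path.commonprefix(titles)
main_prefix = os.path.commonprefix(main_titles)

print(repr(title_prefix))  # 'String'
print(repr(main_prefix))   # '' (empty, because 'T' != 't')
```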
If you would rather join all the texts with a slash (/), for example, you can use the following:
doc = etree.fromstring(text)
combine_elements(doc, ' / '.join)
Result:
<root>
<data>
<slide name="file.xml">
<subtitle>Text1</subtitle>
<MainTitle>Text2 / text3</MainTitle>
</slide>
<slide name="file.xml">
<Title>String1 / String2 / String3 / String4 / String5 / String6 / String7 / String8</Title>
</slide>
</data>
</root>
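Since the combiner is just a callable that takes a list of strings and returns a string, the two behaviours can also be mixed. The following is only a sketch: prefix_or_join is a hypothetical helper, not part of lxml or of the answer's code, and it assumes you want the common prefix when there is one and the joined texts otherwise.

```python
import os.path


def prefix_or_join(texts, separator=" / "):
    """Hypothetical combiner: common prefix if non-empty, else joined texts."""
    # Elements without text have .text == None, so normalise to ''
    cleaned = [t or "" for t in texts]
    prefix = os.path.commonprefix(cleaned)
    return prefix if prefix else separator.join(cleaned)


with_prefix = prefix_or_join(["String1", "String2", "String3"])
without_prefix = prefix_or_join(["Text2", "text3"])

print(with_prefix)     # String
print(without_prefix)  # Text2 / text3
```

It would be passed the same way as the other combiners, i.e. combine_elements(doc, prefix_or_join).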