Combining the values of similar tags into a single tag

Time: 2013-04-30 14:06:37

Tags: xml python-2.7 lxml elementtree minidom

I have been trying to evaluate the values of similar tags and get the output as a single tag, as shown below.

XML input:

<root>
    <data>
        <slide name="file.xml">
             <subtitle>Text1</subtitle> 
             <MainTitle>Text2</MainTitle> 
             <MainTitle>text3</MainTitle> 
         </slide>
        <slide name="file.xml">
             <Title>String1</Title> 
             <Title>String2</Title> 
             <Title>String3</Title> 
             <Title>String4</Title> 
             <Title>String5</Title> 
             <Title>String6</Title> 
             <Title>String7</Title> 
             <Title>String8</Title> 
         </slide>
     </data>
 </root>

Expected output:

<root>
    <data>
        <slide name="file.xml">
             <subtitle>Text1</subtitle> 
             <MainTitle>Text2</MainTitle> 
             <MainTitle>text3</MainTitle> 
         </slide>
        <slide name="file.xml">
             <Title>String</Title>
        </slide>
     </data>
 </root>

Any help would be greatly appreciated. Thanks!

1 answer:

Answer 0: (score: 0)

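You need to group sibling elements that share a tag, and do so recursively. Here is an implementation that lets you pass in a function that decides how the text of the grouped elements is combined: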

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import itertools
import operator
import os.path

from lxml import etree


text = """
<root>
    <data>
        <slide name="file.xml">
             <subtitle>Text1</subtitle> 
             <MainTitle>Text2</MainTitle> 
             <MainTitle>text3</MainTitle> 
         </slide>
        <slide name="file.xml">
             <Title>String1</Title> 
             <Title>String2</Title> 
             <Title>String3</Title> 
             <Title>String4</Title> 
             <Title>String5</Title> 
             <Title>String6</Title> 
             <Title>String7</Title> 
             <Title>String8</Title> 
        </slide>
    </data>
</root>
"""


def combine_elements(elements, combine_text=', '.join):
    result = []
    for key, group in itertools.groupby(elements, operator.attrgetter('tag')):
        items = list(group)
        first_item = items[0]
        # combine only when there are several items and the first one has no children
        if len(items) > 1 and not len(first_item):
            combined = combine_text([el.text for el in items])
            # and only if combine_text returned a non-empty result,
            # e.g. the strings share a common prefix
            if combined:
                first_item.text = combined
                result.append(first_item)
                continue
        result.extend(items)
    elements[:] = result
    # recursively combine others
    for element in elements:
        combine_elements(element, combine_text)


doc = etree.fromstring(text)
combine_elements(doc, os.path.commonprefix)
# decode so the XML prints as text on both Python 2.7 and Python 3
print(etree.tostring(doc).decode('utf-8'))

With os.path.commonprefix() as the text combiner, you get the following result:

<root>
    <data>
        <slide name="file.xml">
             <subtitle>Text1</subtitle> 
             <MainTitle>Text2</MainTitle> 
             <MainTitle>text3</MainTitle> 
         </slide>
        <slide name="file.xml">
             <Title>String</Title> 
             </slide>
    </data>
</root>
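
To see why the Title elements were merged while the two MainTitle elements were left alone, it helps to call os.path.commonprefix() on the same strings directly. The short check below is only an illustration; the variable names are made up for it and are not part of the answer's code. The function compares strings character by character and is case-sensitive, so 'Text2' and 'text3' share no prefix, the combiner returns an empty string, and combine_elements() keeps both elements.

```python
import os.path

# Texts of the eight <Title> elements: String1 .. String8
titles = ['String%d' % i for i in range(1, 9)]
# Texts of the two <MainTitle> elements
main_titles = ['Text2', 'text3']

# commonprefix works character by character and is case-sensitive
title_prefix = os.path.commonprefix(titles)
main_prefix = os.path.commonprefix(main_titles)

print(repr(title_prefix))  # non-empty, so the Title elements get merged
print(repr(main_prefix))   # empty, so the MainTitle elements are kept as-is
```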

If you would rather join all the texts with a slash (/), for example, you can use the following:

doc = etree.fromstring(text)
combine_elements(doc, ' / '.join)

Result:

<root>
    <data>
        <slide name="file.xml">
             <subtitle>Text1</subtitle> 
             <MainTitle>Text2 / text3</MainTitle> 
             </slide>
        <slide name="file.xml">
             <Title>String1 / String2 / String3 / String4 / String5 / String6 / String7 / String8</Title> 
             </slide>
    </data>
</root>
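
If lxml is not available, the same idea can probably be carried over to the standard library. The following is a minimal sketch, assuming Python 3 and xml.etree.ElementTree; it is an adaptation rather than the code from the answer, and the SAMPLE document is shortened for illustration. It differs from the original in two small ways: it requires every element in a run to be childless (not only the first), and it treats missing text as an empty string.

```python
import itertools
import os.path
import xml.etree.ElementTree as ET


def combine_elements(parent, combine_text=', '.join):
    """Merge runs of adjacent, childless siblings that share a tag."""
    result = []
    for tag, group in itertools.groupby(list(parent), key=lambda el: el.tag):
        items = list(group)
        first = items[0]
        # merge only runs of two or more leaf elements
        if len(items) > 1 and all(len(el) == 0 for el in items):
            combined = combine_text([el.text or '' for el in items])
            # an empty result (e.g. no common prefix) means "leave as-is"
            if combined:
                first.text = combined
                # keep the whitespace that followed the last merged element
                first.tail = items[-1].tail
                result.append(first)
                continue
        result.extend(items)
    # replace the children of parent with the merged list
    parent[:] = result
    # recurse into the remaining children
    for child in parent:
        combine_elements(child, combine_text)


# Shortened sample document (illustrative, not the full input above)
SAMPLE = (
    '<slide>'
    '<MainTitle>Text2</MainTitle><MainTitle>text3</MainTitle>'
    '<Title>String1</Title><Title>String2</Title>'
    '</slide>'
)

root = ET.fromstring(SAMPLE)
combine_elements(root, os.path.commonprefix)
merged = ET.tostring(root, encoding='unicode')
print(merged)
```

Note that itertools.groupby() only groups consecutive items, so in both versions elements with the same tag are merged only when they sit next to each other; same-named elements separated by a different tag stay separate.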