How to delete parent and child nodes based on a specific condition using Python

Asked: 2018-11-28 09:53:34

Tags: python

Sample XML file

<ArticleSet>
    <Article>
        <ForeName>a</ForeName>
        <LastName>b</LastName>
        <Affiliation>harvard university of science. abc@gmail.com</Affiliation>
    </Article>
    <Article>
        <ForeName>a</ForeName>
        <LastName>b</LastName>
        <Affiliation>-</Affiliation>
    </Article>
    <Article>
        <ForeName>a</ForeName>
        <LastName>b</LastName>
        <Affiliation>harvard university of science. ghi@yahoo.co.in</Affiliation>
    </Article>
</ArticleSet>

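I want to delete every Article whose Affiliation value is -, that is, every Article whose affiliation looks like <Affiliation>-</Affiliation>. The whole parent <Article> element should go, together with its children.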

Desired output

<ArticleSet>
    <Article>
        <ForeName>a</ForeName>
        <LastName>b</LastName>
        <Affiliation>harvard university of science. abc@gmail.com</Affiliation>
    </Article>
    <Article>
        <ForeName>a</ForeName>
        <LastName>b</LastName>
        <Affiliation>harvard university of science. ghi@yahoo.co.in</Affiliation>
    </Article>
</ArticleSet>

2 answers:

Answer 0 (score: 0)

This reads the XML from input.xml and writes the modified document to output.xml.

import xml.etree.ElementTree as ET

dom = ET.parse('input.xml')
root = dom.getroot()

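# findall() returns a list, so removing from root while looping is safe
for article in root.findall('Article'):
    # findtext() returns None when <Affiliation> is missing, avoiding an AttributeError
    if article.findtext('Affiliation') == '-':
        root.remove(article)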

dom.write('output.xml')

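Edit: with lxml I got noticeably better performance (759 ms to process a file containing 150,000 <Article> entries). I am not sure whether it is fast enough for 15 million entries, though.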

from lxml import etree

dom = etree.parse('input.xml')
root = dom.getroot()

for article in dom.xpath('Article[Affiliation="-"]'):
    root.remove(article)

dom.write('output.xml')
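For a file with millions of entries, loading the whole tree into memory may become the limiting factor. One option is to stream the input with the standard library's iterparse and write the kept articles as they are read. This is only a minimal sketch, not part of the original answer and not benchmarked: the function name filter_articles and the placeholder parameter are hypothetical, and it assumes every <Article> is a direct child of the root element, as in the sample.

```python
import xml.etree.ElementTree as ET


def filter_articles(src, dst, placeholder="-"):
    """Copy src to dst, skipping every <Article> whose Affiliation equals placeholder.

    Returns the number of articles kept. Assumes <Article> elements are
    direct children of the root element (hypothetical helper).
    """
    kept = 0
    context = ET.iterparse(src, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    with open(dst, "w", encoding="utf-8") as out:
        out.write("<%s>\n" % root.tag)
        for event, elem in context:
            if event == "end" and elem.tag == "Article":
                text = (elem.findtext("Affiliation") or "").strip()
                if text != placeholder:
                    # Serialize the kept article right away instead of building a tree
                    out.write("    " + ET.tostring(elem, encoding="unicode").strip() + "\n")
                    kept += 1
                root.clear()  # discard processed children to keep memory use low
        out.write("</%s>\n" % root.tag)
    return kept
```

Calling filter_articles('input.xml', 'output.xml') would mirror the file names used above. The output omits the XML declaration, and attributes on the root element are not copied in this sketch.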

Answer 1 (score: -1)

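Assuming you receive this page as a string named html, you can run your logic with the code below. The idea is to first collect the positions of your Article tags, then check whether the "Affiliation" tag inside each one consists only of "-".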

def removeFromText(html, tag, position):
    # Slice out one <Article>...</Article> block
    article = html[position[0]:position[1]]
    openTag = "<" + tag + ">"
    relBeginning = article.find(openTag)
    relEnd = article.find("</" + tag + ">")
    # Test for -1 before adding any offsets; afterwards the sentinel is lost
    if relBeginning == -1 or relEnd == -1:
        return html
    affiliation = article[relBeginning + len(openTag):relEnd]

    if affiliation == "-":
        # Cut the whole article out of the text
        return html[:position[0]] + html[position[1]:]
    return html

query = "Article"
start = 0
positions = []

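while True:
    foundOpen = html.find("<"+query+">", start)
    if foundOpen == -1:
        break

    # Look for the closing tag from the opening tag onward
    closeAt = html.find("</"+query+">", foundOpen)
    if closeAt == -1:
        break  # unbalanced markup; stop collecting
    foundClose = closeAt + len("</"+query+">")
    positions.append((foundOpen, foundClose))
    start = foundClose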

# Process the articles back to front so earlier offsets stay valid after a removal
for (opening, closing) in reversed(positions):
    html = removeFromText(html, "Affiliation", (opening, closing))

Your html variable now holds the final, filtered text.
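When the document arrives as a string, a parser can still be used in place of manual index arithmetic, which avoids depending on exact tag spelling and whitespace. The following is a minimal sketch under the same assumptions as the sample XML; the function name drop_placeholder_articles and the placeholder parameter are hypothetical and not from the answer.

```python
import xml.etree.ElementTree as ET


def drop_placeholder_articles(xml_text, placeholder="-"):
    """Return xml_text with every <Article> whose Affiliation equals placeholder removed.

    Hypothetical helper; assumes <Article> elements sit directly under the root.
    """
    root = ET.fromstring(xml_text)
    # findall() returns a list, so removing children while looping is safe
    for article in root.findall("Article"):
        if (article.findtext("Affiliation") or "").strip() == placeholder:
            root.remove(article)
    return ET.tostring(root, encoding="unicode")
```

Usage would look like html = drop_placeholder_articles(html). Whitespace around the removed element is not tidied, so the indentation of the result may differ slightly from the desired output shown in the question.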