Sample XML file
<ArticleSet>
<Article>
<ForeName>a</ForeName>
<LastName>b</LastName>
<Affiliation>harvard university of science. abc@gmail.com</Affiliation>
</Article>
<Article>
<ForeName>a</ForeName>
<LastName>b</LastName>
<Affiliation>-</Affiliation>
</Article>
<Article>
<ForeName>a</ForeName>
<LastName>b</LastName>
<Affiliation>harvard university of science. ghi@yahoo.co.in</Affiliation>
</Article>
</ArticleSet>
I want to remove every <Article> whose affiliation is <Affiliation>-</Affiliation>.
Desired output
<ArticleSet>
<Article>
<ForeName>a</ForeName>
<LastName>b</LastName>
<Affiliation>harvard university of science. abc@gmail.com</Affiliation>
</Article>
<Article>
<ForeName>a</ForeName>
<LastName>b</LastName>
<Affiliation>harvard university of science. ghi@yahoo.co.in</Affiliation>
</Article>
</ArticleSet>
Answer 0 (score: 0)
This reads the XML from input.xml and writes the modified document to output.xml:
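import xml.etree.ElementTree as ET

dom = ET.parse('input.xml')
root = dom.getroot()
# findall() returns a list, so removing from root inside the loop is safe.
for article in root.findall('Article'):
    # findtext() returns None when <Affiliation> is missing, so no AttributeError.
    if article.findtext('Affiliation') == '-':
        root.remove(article)
dom.write('output.xml')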
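Edit: with lxml I get noticeably better performance (759 ms to process a file containing 150,000 <Article> entries). I'm not sure whether that is fast enough to handle 15 million entries, though.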
from lxml import etree

dom = etree.parse('input.xml')
root = dom.getroot()
# The relative XPath is evaluated against the root element.
for article in dom.xpath('Article[Affiliation="-"]'):
    root.remove(article)
dom.write('output.xml')
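If 15 million entries turn out to be too many to hold in memory at once, a streaming pass might help. The following is only a minimal sketch using the standard library's iterparse; I have not benchmarked it. The function name filter_articles and its placeholder parameter are made up for illustration, and it assumes the root element has no attributes and that every <Article> is a direct child of the root.

```python
import xml.etree.ElementTree as ET


def filter_articles(src, dst, placeholder="-"):
    """Copy src to dst, skipping each <Article> whose <Affiliation> equals placeholder.

    Hypothetical helper: assumes an attribute-free root with <Article> children.
    Returns the number of articles written.
    """
    kept = 0
    context = ET.iterparse(src, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    with open(dst, "wb") as out:
        out.write(f"<{root.tag}>\n".encode("utf-8"))
        for event, elem in context:
            if event == "end" and elem.tag == "Article":
                # findtext() returns None when <Affiliation> is absent; such
                # articles are kept.
                if elem.findtext("Affiliation") != placeholder:
                    elem.tail = "\n"
                    out.write(ET.tostring(elem))
                    kept += 1
                # Drop the articles already handled so memory use stays flat.
                root.clear()
        out.write(f"</{root.tag}>\n".encode("utf-8"))
    return kept
```

Calling filter_articles('input.xml', 'output.xml') would write the filtered document one article at a time instead of building the whole tree first.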
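Answer 1 (score: -1)
Assuming you receive this page as a string named html, you can run your logic with the code below. The idea is to first collect the positions of your Article tags, then check whether each one's "Affiliation" tag consists only of "-".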
def removeFromText(html, tag, position):
    # position is the (start, end) offset pair of one <Article>...</Article> block.
    article = html[position[0]:position[1]]
    openTag, closeTag = "<" + tag + ">", "</" + tag + ">"
    beginning = article.find(openTag)
    end = article.find(closeTag)
    # find() returns -1 when the tag is absent; leave such articles untouched.
    if beginning == -1 or end == -1:
        return html
    affiliation = article[beginning + len(openTag):end]
    if affiliation == "-":
        # Cut the whole article out of the document.
        return html[:position[0]] + html[position[1]:]
    return html

query = "Article"
start = 0
positions = []
while True:
    foundOpen = html.find("<" + query + ">", start)
    if foundOpen == -1:
        break
    # Search for the closing tag after the opening tag, and stop if it is missing.
    foundClose = html.find("</" + query + ">", foundOpen)
    if foundClose == -1:
        break
    foundClose += len("</" + query + ">")
    positions.append((foundOpen, foundClose))
    start = foundClose
# Work backwards so that earlier offsets stay valid after each removal.
for (opening, closing) in reversed(positions):
    html = removeFromText(html, "Affiliation", (opening, closing))
Your html variable now holds the filtered result.
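Substring searching like this depends on the tags being written exactly as <Article> and <Affiliation>, with no attributes or extra whitespace. If the string is well-formed XML, the same filter can probably be done with a parser instead. This is a minimal sketch, not a drop-in replacement; the function name drop_placeholder_articles and the placeholder parameter are invented for illustration.

```python
import xml.etree.ElementTree as ET


def drop_placeholder_articles(xml_text, placeholder="-"):
    """Return xml_text without the <Article> elements whose <Affiliation> is placeholder.

    Hypothetical helper: assumes xml_text is well-formed XML whose root
    element directly contains the <Article> elements.
    """
    root = ET.fromstring(xml_text)
    # findall() returns a list, so removing from root while looping is safe.
    for article in root.findall("Article"):
        if article.findtext("Affiliation") == placeholder:
            root.remove(article)
    # encoding="unicode" makes tostring() return a str instead of bytes.
    return ET.tostring(root, encoding="unicode")
```

With the string from above, html = drop_placeholder_articles(html) would replace the scanning loop.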