How do I check whether a specific element exists in an XML file?

Asked: 2018-06-06 09:18:03

Tags: xml python-3.x xml-parsing lxml

The ICDAR 2009 dataset ships its groundtruth in XML format:

<?xml version="1.0" encoding="UTF-8"?>
<bs-submission participant-id="0"
  run-id="GROUNDTRUTH"
  task="book-toc" 
  toc-creation="semi-automatic" 
  toc-source="full-content">
<source-files xml="no" pdf="no" />
<description>
This file contains the annotated groundtruth file (ideal ToCs), manually and collaboratively built by the participants of the ICDAR Structure Extraction competition 2009 and used for evaluation.
</description>
<book>
<bookid>049AA21392135223</bookid>
<toc-section page="11" /><toc-entry title="I. Introduction" page="15" />
<toc-entry title="II. List of the skeletal remains" page="20" />
<toc-entry title="III. The New Orleans skeleton" page="21" />
<toc-entry title="IV. The Quebec skeleton" page="22" />
<toc-entry title="V. The Natchez pelvic bone" page="22" />
<toc-entry title="VI. The Lake Monroe (Florida) bones" page="25" />
<toc-entry title="VII. The Soda Creek skeleton" page="26" />
<toc-entry title="VIII. The Charleston bones" page="26" />
<toc-entry title="IX. The Calaveras skull" page="27">
<toc-entry title="History" page="27" />
<toc-entry title="Physical characters." page="28" />
<toc-entry title="Comparisons" page="33" />
</toc-entry>
<toc-entry title="X. The Rock Bluff cranium" page="36" />
<toc-entry title="XI. The Man of Penon" page="42" />
<toc-entry title="XII. The crania of Trenton" page="45">
<toc-entry title="The Burlington County skull" page="46" />
<toc-entry title="The Riverview Cemetery skull" page="46" />
<toc-entry title="Racial affinities of the Burlington County and Riverview Cemetery skulls" page="55" />
</toc-entry>
<toc-entry title="XIII. The Trenton femur" page="60" />
<toc-entry title="XIV. The Lansing skeleton" page="61">
<toc-entry title="Somatological characters" page="62" />
<toc-entry title="Conclusion" page="68" />
</toc-entry>
<toc-entry title="XV. The fossil man of western Florida" page="69">
<toc-entry title="The Osprey skull" page="69" />
<toc-entry title="The North Osprey bones" page="70" />
<toc-entry title="The Hanson Landing remains" page="71" />
<toc-entry title="The South Osprey remains" page="71" />
<toc-entry title="Examination of the specimens" page="72" />
<toc-entry title="Physical characters" page="75" />
<toc-entry title="Resume" page="82">
<toc-entry title="Report of Dr. T. Way land Vaughan" page="86" />
</toc-entry>
</toc-entry>
<toc-entry title="XVI. Mound crania (Florida)" page="90" />
<toc-entry title="XVII. The Nebraska &quot;loess man&quot;" page="90">
<toc-entry title="History of finds" page="91" />
<toc-entry title="Description of the mound" page="98" />
<toc-entry title="Examination of the bones" page="100" />
<toc-entry title="Discussion" page="115" />
</toc-entry>
<toc-entry title="XVIII. General conclusion" page="130" />
<toc-entry title="XIX. Appendix: Recent Indian skulls of low type in the U.S. National Museum" page="147" />
<toc-entry title="Index" page="157" />
</book>
</bs-submission>

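In this large XML file, some <book> elements have a child element named <toc-section>.

I want to iterate over all <book> elements to see whether any of them lack such a child. How can I do this in Python, for example with lxml.html?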

Here is the beginning of my script:

import lxml.html

with open(icdaf_xmlfile) as infile:
    icdar2013_tree_string = infile.read()

root = lxml.html.fromstring(icdar2013_tree_string)

for book in root.iter('book'):
    # check if book contains toc-section
    pass

2 answers:

Answer 0 (score: 1)

  

I want to iterate over all <book> elements to see whether any of them lack such a child.

This is easy with XPath (you are using lxml, so XPath support is not a problem):

for book in root.xpath(".//book[not(toc-section)]"):
    # this book has no <toc-section> children
    pass

Alternatively:

for book in root.xpath(".//book"):
    if not book.xpath("./toc-section"): 
        # this book has no <toc-section> children
        pass
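
As a minimal, self-contained sketch of the predicate form above: the two-book sample document and the helper name `books_without_toc_section` are made up for illustration, and the sketch assumes the file is parsed with `lxml.etree` (an XML parser) rather than `lxml.html` as in the question.

```python
from lxml import etree

# Hypothetical two-book sample in the same shape as the ICDAR groundtruth.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<bs-submission>
  <book>
    <bookid>A</bookid>
    <toc-section page="11" />
    <toc-entry title="I. Introduction" page="15" />
  </book>
  <book>
    <bookid>B</bookid>
    <toc-entry title="Index" page="157" />
  </book>
</bs-submission>
"""


def books_without_toc_section(root):
    """Return the <bookid> text of every <book> that has no <toc-section> child."""
    return [
        book.findtext("bookid")
        for book in root.xpath(".//book[not(toc-section)]")
    ]


# Bytes input, so the encoding declaration in the sample is accepted.
root = etree.fromstring(SAMPLE)
missing = books_without_toc_section(root)
print(missing)
```

The predicate `[not(toc-section)]` only looks at direct children of each <book>, which matches the layout shown in the question.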

Answer 1 (score: 0)

This should help.

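from lxml import etree as ET

# `html` holds the XML document; pass bytes if it starts with an encoding declaration
root = ET.fromstring(html)
for elem in root.findall("book"):      # iterate over the <book> children of the root
    if elem.find("toc-section") is None:   # this <book> has no <toc-section> child
        print("toc-section not found")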
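
The same `find(...) is None` check can also be sketched with the standard library's `xml.etree.ElementTree`, which provides `find`, `findtext` and `iter` under the same names. The sample document and the helper `has_toc_section` below are assumptions for illustration, not part of the dataset or of either answer. The comparison is against `None` rather than truthiness because a childless element such as `<toc-section page="11" />` has length 0.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-book sample in the same shape as the ICDAR groundtruth.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<bs-submission>
  <book>
    <bookid>A</bookid>
    <toc-section page="11" />
    <toc-entry title="I. Introduction" page="15" />
  </book>
  <book>
    <bookid>B</bookid>
    <toc-entry title="Index" page="157" />
  </book>
</bs-submission>
"""


def has_toc_section(book):
    """True if the <book> element has a direct <toc-section> child."""
    return book.find("toc-section") is not None


root = ET.fromstring(SAMPLE)

# iter("book") visits <book> elements at any depth, not only direct children.
report = {book.findtext("bookid"): has_toc_section(book) for book in root.iter("book")}
print(report)
```

Using `root.iter("book")` instead of `root.findall("book")` is a design choice for files where <book> might not sit directly under the root; for the layout in the question both visit the same elements.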