The ICDAR 2009 dataset contains its ground truth in XML format:
<?xml version="1.0" encoding="UTF-8"?>
<bs-submission participant-id="0"
run-id="GROUNDTRUTH"
task="book-toc"
toc-creation="semi-automatic"
toc-source="full-content">
<source-files xml="no" pdf="no" />
<description>
This file contains the annotated groundtruth file (ideal ToCs), manually and collaboratively built by the participants of the ICDAR Structure Extraction competition 2009 and used for evaluation.
</description>
<book>
<bookid>049AA21392135223</bookid>
<toc-section page="11" /><toc-entry title="I. Introduction" page="15" />
<toc-entry title="II. List of the skeletal remains" page="20" />
<toc-entry title="III. The New Orleans skeleton" page="21" />
<toc-entry title="IV. The Quebec skeleton" page="22" />
<toc-entry title="V. The Natchez pelvic bone" page="22" />
<toc-entry title="VI. The Lake Monroe (Florida) bones" page="25" />
<toc-entry title="VII. The Soda Creek skeleton" page="26" />
<toc-entry title="VIII. The Charleston bones" page="26" />
<toc-entry title="IX. The Calaveras skull" page="27">
<toc-entry title="History" page="27" />
<toc-entry title="Physical characters." page="28" />
<toc-entry title="Comparisons" page="33" />
</toc-entry>
<toc-entry title="X. The Rock Bluff cranium" page="36" />
<toc-entry title="XI. The Man of Penon" page="42" />
<toc-entry title="XII. The crania of Trenton" page="45">
<toc-entry title="The Burlington County skull" page="46" />
<toc-entry title="The Riverview Cemetery skull" page="46" />
<toc-entry title="Racial affinities of the Burlington County and Riverview Cemetery skulls" page="55" />
</toc-entry>
<toc-entry title="XIII. The Trenton femur" page="60" />
<toc-entry title="XIV. The Lansing skeleton" page="61">
<toc-entry title="Somatological characters" page="62" />
<toc-entry title="Conclusion" page="68" />
</toc-entry>
<toc-entry title="XV. The fossil man of western Florida" page="69">
<toc-entry title="The Osprey skull" page="69" />
<toc-entry title="The North Osprey bones" page="70" />
<toc-entry title="The Hanson Landing remains" page="71" />
<toc-entry title="The South Osprey remains" page="71" />
<toc-entry title="Examination of the specimens" page="72" />
<toc-entry title="Physical characters" page="75" />
<toc-entry title="Resume" page="82">
<toc-entry title="Report of Dr. T. Way land Vaughan" page="86" />
</toc-entry>
</toc-entry>
<toc-entry title="XVI. Mound crania (Florida)" page="90" />
<toc-entry title="XVII. The Nebraska &quot;loess man&quot;" page="90">
<toc-entry title="History of finds" page="91" />
<toc-entry title="Description of the mound" page="98" />
<toc-entry title="Examination of the bones" page="100" />
<toc-entry title="Discussion" page="115" />
</toc-entry>
<toc-entry title="XVIII. General conclusion" page="130" />
<toc-entry title="XIX. Appendix: Recent Indian skulls of low type in the U.S. National Museum" page="147" />
<toc-entry title="Index" page="157" />
</book>
</bs-submission>
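In this large XML file, some <book> elements have a child element named <toc-section>.
I want to iterate over all <book> elements to see whether some of them do not contain such a child. How can I do this in Python, for example with lxml.html?
Here is the beginning of my script: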
import lxml.html

with open(icdaf_xmlfile) as infile:
    icdar2013_tree_string = infile.read()
root = lxml.html.fromstring(icdar2013_tree_string)
for book in root.iter('book'):
    # check if book contains toc-section
    pass
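Answer 0 (score: 1)
I want to iterate over all <book> elements to see whether some of them do not contain such a child.
This is easy with XPath (you are already using lxml, so XPath is available):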
for book in root.xpath(".//book[not(toc-section)]"):
    # this book has no <toc-section> children
    pass
Alternatively:
for book in root.xpath(".//book"):
    if not book.xpath("./toc-section"):
        # this book has no <toc-section> children
        pass
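To make the first XPath variant concrete, here is a minimal, self-contained sketch. It assumes the file is parsed as XML with lxml.etree rather than lxml.html, and it uses a shortened, made-up sample (the second book and its id are hypothetical, as is the helper name books_without_toc_section):

```python
from lxml import etree

# Hypothetical, shortened sample in the shape of the ground truth file.
# The second <book> and its id are invented for illustration.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<bs-submission participant-id="0" run-id="GROUNDTRUTH">
  <book>
    <bookid>049AA21392135223</bookid>
    <toc-section page="11" />
    <toc-entry title="I. Introduction" page="15" />
  </book>
  <book>
    <bookid>EXAMPLE-WITHOUT-TOC-SECTION</bookid>
    <toc-entry title="Index" page="157" />
  </book>
</bs-submission>
"""


def books_without_toc_section(xml_bytes):
    """Return the <bookid> text of every <book> lacking a <toc-section> child."""
    root = etree.fromstring(xml_bytes)
    # The predicate [not(toc-section)] keeps only books with no such child.
    return [book.findtext("bookid")
            for book in root.xpath(".//book[not(toc-section)]")]


missing = books_without_toc_section(SAMPLE)
print(missing)
```

For the real file, the same function can be fed the result of reading the file in binary mode ("rb"). Bytes are used here because lxml.etree.fromstring refuses a str that carries an encoding declaration such as the one on the first line of the ground truth file.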
Answer 1 (score: 0)
This should help:
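from lxml import etree as ET

# `html` holds the XML document; pass bytes, since lxml rejects a str that carries an encoding declaration
root = ET.fromstring(html)
for elem in root.findall("book"):  # Iterate over the <book> children of the root
    if elem.find("toc-section") is None:  # True when the book has no <toc-section> child
        print("toc-section not found")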
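The same find-and-compare-with-None check can be tried without installing anything. The sketch below swaps in the standard library's xml.etree.ElementTree, whose find, findall, iter and findtext methods lxml.etree also provides. The sample document, the book ids "A" and "B", and the helper name split_books are made up for illustration:

```python
import xml.etree.ElementTree as ET


def split_books(xml_text):
    """Sort book ids by whether the <book> has a <toc-section> child."""
    root = ET.fromstring(xml_text)
    with_section, without_section = [], []
    # iter() visits <book> elements at any depth; findall("book") would
    # only look at direct children of the root.
    for book in root.iter("book"):
        bookid = book.findtext("bookid")
        # Compare with None explicitly instead of testing truthiness.
        if book.find("toc-section") is None:
            without_section.append(bookid)
        else:
            with_section.append(bookid)
    return with_section, without_section


# Hypothetical two-book sample; only book "A" has a <toc-section>.
SAMPLE = """<bs-submission>
  <book><bookid>A</bookid><toc-section page="11" /><toc-entry title="Intro" page="15" /></book>
  <book><bookid>B</bookid><toc-entry title="Index" page="157" /></book>
</bs-submission>"""

with_section, without_section = split_books(SAMPLE)
print("with:", with_section, "without:", without_section)
```

The explicit "is None" comparison matters here: an element with no children, such as <toc-section page="11" />, has historically evaluated as false in a boolean test, so "if not book.find(...)" could report a missing child even when the element is present.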