Parsing an XML document with iterparse()

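Time: 2015-03-16 14:54:59

Tags: python xml

I have a large XML file, "abcd.xml", of almost 800 MB. When the user's input matches an author or a title, I want to retrieve the information for the matching books.

I got this working with a small file. How can I do the same for the large file using iterparse()?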

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp.dtd">
<dblp>
<article mdate="2011-01-11" key="journals/acta/Saxena96">
<author>Sanjeev Saxena</author>
<title>Parallel Integer Sorting and Simulation Amongst CRCW Models.</title>
<pages>607-619</pages>
<year>1996</year>
<volume>33</volume>
<journal>Acta Inf.</journal>
<number>7</number>
<url>db/journals/acta/acta33.html#Saxena96</url>
<ee>http://dx.doi.org/10.1007/BF03036466</ee>
</article>

Code:

import lxml.etree as ET
# ET.parse() already returns a parsed tree; fromstring() expects XML text, not a tree.
tree = ET.parse('abcd.xml')
root = tree.getroot()

title = raw_input('enter the name: ')
article = root.xpath('.//article[starts-with(title, "%s")]' % title)[0]

# 'title' is included so the printed fields match the expected output below.
for prop in ['author', 'title', 'pages', 'year', 'volume', 'journal']:
    print article.findtext(prop)

Expected output structure:

Sanjeev Saxena
Parallel Integer Sorting and Simulation Amongst CRCW Models.
607-619
1996
33
Acta Inf.
........
........
........

1 answer:

Answer 0: (score: 0)

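  1. Import the lxml module.
  2. Parse the input file.
  3. Get the title from the user via raw_input().
  4. Select the article tags whose title starts with the user input from step 3.
  5. Iterate over every article tag found in step 4.
  6. Build a list of lists of tuples that holds each article's child tags and their text.
  7. Print the result.
  8. Code: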

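    import lxml.etree as ET
    
    # Parse the whole document into memory (fine for small files).
    tree = ET.parse('input.xml')
    
    title = raw_input('enter the name: ')
    # Pass the title as an XPath variable so quotes in the input cannot break the query.
    articles = tree.xpath('.//article[starts-with(title, $prefix)]', prefix=title)
    
    # One list of (tag, text) tuples per matching article.
    result = []
    for article in articles:
        result.append([(child.tag, child.text) for child in article])
    
    # Print result:
    for fields in result:
        print "\n"
        for tag, text in fields:
            print "%s:%s" % (tag, text)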
    

    Output:

    vivek@vivek:~/Desktop/stackoverflow/anna$ python 3.py 
    enter the name: Parallel Integer Sorting and Simulation
    
    
    author:Sanjeev Saxena
    title:Parallel Integer Sorting and Simulation Amongst CRCW Models.
    pages:607-619
    year:1996
    volume:33
    journal:Acta Inf.
    number:7
    url:db/journals/acta/acta33.html#Saxena96
    ee:http://dx.doi.org/10.1007/BF03036466
    
    
    author:Sanjeev Saxena
    title:Parallel Integer Sorting and Simulation Amongst CRCW Models.11
    pages:607-619
    year:1996
    volume:33
    journal:Acta Inf.
    number:7
    url:db/journals/acta/acta33.html#Saxena96
    ee:http://dx.doi.org/10.1007/BF03036466
    
    vivek@vivek:~/Desktop/stackoverflow/anna$
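
The code above loads the whole document with ET.parse(), which may not be practical for the ~800 MB file in the question. Below is a minimal streaming sketch using iterparse(). It makes several assumptions: it is written for Python 3, it uses the standard library's xml.etree.ElementTree so that it runs without extra packages (lxml.etree.iterparse offers a similar interface), the article elements are direct children of the root as in the sample, and the function name find_articles and the inline sample data are purely illustrative.

```python
import io
import xml.etree.ElementTree as ET


def find_articles(source, query, tag="article"):
    """Yield one list of (tag, text) tuples for every record whose title
    or any author starts with `query`, reading `source` incrementally."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        # Only act once a complete record has been read.
        if event != "end" or elem.tag != tag:
            continue
        title = elem.findtext("title") or ""
        authors = [a.text or "" for a in elem.findall("author")]
        if title.startswith(query) or any(a.startswith(query) for a in authors):
            yield [(child.tag, child.text) for child in elem]
        # Drop the records collected so far so memory use stays roughly flat.
        # Assumes records are direct children of the root element.
        root.clear()


if __name__ == "__main__":
    # Hypothetical in-memory sample; pass a file name such as 'abcd.xml' instead.
    sample = io.BytesIO(
        b"<dblp>"
        b"<article><author>Sanjeev Saxena</author>"
        b"<title>Parallel Integer Sorting and Simulation Amongst CRCW Models.</title>"
        b"<year>1996</year></article>"
        b"</dblp>"
    )
    for fields in find_articles(sample, "Parallel"):
        for name, text in fields:
            print("%s:%s" % (name, text))
```

Clearing the root after each record is what keeps the parsed tree from growing with the file size. The matching itself is done in plain Python rather than XPath, so it covers both the author and the title case from the question.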