Parsing the dblp XML file

Date: 2018-04-13 13:41:45

Tags: python xml parsing xml-parsing

The data in the dblp.xml file (https://dblp.uni-trier.de/faq/What+do+I+find+in+dblp+xml.html) looks like this:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp.dtd">
<dblp>

[...]

<article key="journals/cacm/Gentry10" mdate="2010-04-26">
<author>Craig Gentry</author>
<title>Computing arbitrary functions of encrypted data.</title>
<pages>97-105</pages>
<year>2010</year>
<volume>53</volume>
<journal>Commun. ACM</journal>
<number>3</number>
<ee>http://doi.acm.org/10.1145/1666420.1666444</ee>
<url>db/journals/cacm/cacm53.html#Gentry10</url>
</article>

[...]

<inproceedings key="conf/focs/Yao82a" mdate="2011-10-19">
<title>Theory and Applications of Trapdoor Functions (Extended Abstract)</title>
<author>Andrew Chi-Chih Yao</author>
<pages>80-91</pages>
<crossref>conf/focs/FOCS23</crossref>
<year>1982</year>
<booktitle>FOCS</booktitle>
<url>db/conf/focs/focs82.html#Yao82a</url>
<ee>http://doi.ieeecomputersociety.org/10.1109/SFCS.1982.45</ee>
</inproceedings>

[...]

<www mdate="2004-03-23" key="homepages/g/OdedGoldreich">
<author>Oded Goldreich</author>
<title>Home Page</title>
<url>http://www.wisdom.weizmann.ac.il/~oded/</url>
</www>

[...]
</dblp>

My code for parsing the XML file is as follows:

#!/usr/bin/env python

import sys

from lxml import etree

CATEGORIES = set(['article', 'inproceedings', 'proceedings', 'book', \
                  'incollection', 'phdthesis', 'mastersthesis', 'www'])
DATA_ITEMS = ['title', 'booktitle', 'year', 'journal', 'ee','url']
TABLE_SCHEMA = ['element', 'mdate', 'dblpkey', 'title', 'booktitle', \
                'year', 'journal', 'ee','url']


def write_output(paper, authors):
    arranged_fields = []
    for field in TABLE_SCHEMA:
        if field in paper and paper[field] is not None:
            arranged_fields.append(paper[field].encode('utf-8'))
        else:
            arranged_fields.append('')
    for author in authors:
        print('\t'.join(arranged_fields) + '\t' + author)


def clear_element(element):
    element.clear()
    while element.getprevious() is not None:
        del element.getparent()[0]


def extract_paper_elements(context):
    for event, element in context:
        if element.tag in CATEGORIES:
            yield element
            clear_element(element)


def fast_iter2(context):
    for element in extract_paper_elements(context):
        authors = []
        for author in element.findall('author'):
            if author is not None and author.text is not None:
                authors.append(author.text.encode('utf-8'))
            paper = {
                'element' : element.tag,
                'mdate' : element.get('mdate'),
                'dblpkey' : element.get('key')
            }
            for data_item in DATA_ITEMS:
                data = element.find(data_item)
                if data is not None:
                    paper[data_item] = data.text
        write_output(paper, authors)


def main():
    # Accept command line arguments
    if len(sys.argv) == 1:
        fin = sys.stdin
    elif len(sys.argv) == 2:
        fin = sys.argv[1]
    else:
        sys.stderr.write('usage: ' + sys.argv[0] + ' <input xml file>\n')
        return
    # Parse xml input file
    context = etree.iterparse(fin, dtd_validation=True, events=('start', 'end'))
    fast_iter2(context)


if __name__=='__main__':
    main()

I am interested in finding the URLs associated with an author, which appear in fragments of the form

<www mdate=" ......"
......
</www>

The code I have tried returns only the first URL found for an author. For example, for the following XML fragment in the file:

<www mdate="2016-06-01" key="homepages/127/6548">
<author>Emanuele D'Osualdo</author>
<title>Home Page</title>
<url>http://emanueledosualdo.com</url>
<url>http://concurrency.informatik.uni-kl.de/group/dosualdo/home.html</url>
<url>http://www.cs.ox.ac.uk/people/emanuele.dosualdo/</url>
<url>https://scholar.google.com/citations?user=xH4XRWIAAAAJ</url>
<url>https://de.linkedin.com/pub/emanuele-d-osualdo/7/a36/440</url>
<url>https://twitter.com/bordaigorl</url>
<note type="affiliation">Techical University of Kaiserslautern, Department of Computer Science</note>
<note type="affiliation">Oxford University, Department of Computer Science</note>
</www>

my code returns only:

['', '2016-06-01', 'homepages/127/6548', 'Home Page', '', '', '', '', 'http://emanueledosualdo.com', "Emanuele D'Osualdo\n"]

What should I change in my code so that I get all the links associated with the author ("Emanuele D'Osualdo\n" in this case)?

1 answer:

Answer 0 (score: 0)

Your code keeps only the first URL because element.find(data_item) returns just the first matching child element. If you simply want to concatenate the URLs, you can replace your fast_iter2 function with a version that uses findall:

def fast_iter2(context):
    for element in extract_paper_elements(context):
        authors = []
        for author in element.findall('author'):
            if author is not None and author.text is not None:
                authors.append(author.text.encode('utf-8'))
            paper = {
                'element' : element.tag,
                'mdate' : element.get('mdate'),
                'dblpkey' : element.get('key')
            }
            for data_item in DATA_ITEMS:
                items_concatenated = ""
                for data in element.findall(data_item):
                    items_concatenated += data.text + ";"
                if items_concatenated != "":
                    paper[data_item] = items_concatenated[0:-1]
        write_output(paper, authors)

Note that this will concatenate the other data items as well, not just the URLs. If you want to concatenate only the URLs, you can modify the code by adding a bit more logic, as sketched below.
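A minimal sketch of that extra logic, assuming the DATA_ITEMS, extract_paper_elements and write_output helpers from the question stay as they are (the function name fast_iter2_urls_only is just illustrative): only the url field collects every matching child, while the remaining items keep the original first-match behaviour.

def fast_iter2_urls_only(context):
    for element in extract_paper_elements(context):
        # Collect every author of the record.
        authors = []
        for author in element.findall('author'):
            if author.text is not None:
                authors.append(author.text.encode('utf-8'))
        # Build the per-record fields once per element.
        paper = {
            'element': element.tag,
            'mdate': element.get('mdate'),
            'dblpkey': element.get('key')
        }
        for data_item in DATA_ITEMS:
            if data_item == 'url':
                # findall returns every <url> child, not just the first one.
                urls = [u.text for u in element.findall('url') if u.text]
                if urls:
                    paper['url'] = ';'.join(urls)
            else:
                # Keep the original behaviour: first matching child only.
                data = element.find(data_item)
                if data is not None:
                    paper[data_item] = data.text
        write_output(paper, authors)

For the www record shown in the question, the url column would then contain all six links joined with ';' while the other columns stay unchanged.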