Parsing an XML file takes too long

Time: 2020-04-29 14:22:52

Tags: python xml database

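I have a large dblp.xml file (2.8 GB), and I want to write every "year" element to a separate txt file. That txt file will contain about 6 million years. I wrote some code, but it takes far too long to get through the whole file. Is there another way to make the process faster, or at least to write out half of the dates? A small excerpt of the XML: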

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp-2019-11-22.dtd">
<dblp>
<phdthesis mdate="2016-05-04" key="phd/dk/Heine2010">
<author>Carmen Heine</author>
<title>Modell zur Produktion von Online-Hilfen.</title>
<year>2010</year>
<school>Aarhus University</school>
<pages>1-315</pages>
<isbn>978-3-86596-263-8</isbn>
<ee>http://d-nb.info/996064095</ee>
</phdthesis><phdthesis mdate="2020-02-12" key="phd/Hoff2002">
<author>Gerd Hoff</author>

Code:

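import sys
from io import TextIOWrapper

from lxml import etree
from nltk.tokenize import RegexpTokenizer

# Force UTF-8 output regardless of the console's default encoding.
sys.stdout = TextIOWrapper(sys.stdout.buffer, encoding='utf-8')

tokenizer = RegexpTokenizer(r'\w+')  # not used in this excerpt
# tags.txt presumably lists the record tags to watch for, one per line.
with open('tags.txt') as f:
    collaborations = f.read().splitlines()

def fast_iter(context):
    year = ''
    for event, elem in context:
        # Remember the most recent year seen.
        if elem.tag == 'year':
            if elem.text:
                year = elem.text
        # When a listed record ends, print its year.
        if elem.tag in collaborations:
            if year:
                year = int(year)
                print('{:d}'.format(year), end='')
                print(flush=True)
                year = ''
        # Free the element and its already-processed siblings.
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context


if __name__ == "__main__":
    context = etree.iterparse('dblp-2020-04-01.xml', load_dtd=True, html=True)
    fast_iter(context)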

1 answer:

Answer 0 (score: 0)

This code writes all the years (the text between the year tags) to a text file.

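    import re

    xml_name = 'dblp.xml'
    txt_file = 'tags.txt'

    # The XML declares ISO-8859-1, so decode it that way.
    # Note: this reads the entire file into memory.
    with open(xml_name, 'r', encoding='ISO-8859-1') as f:
        xml_text = f.read()

    # Capture the text between each pair of year tags.
    years = re.findall('<year>(.*?)</year>', xml_text)

    with open(txt_file, 'w') as f:
        for year in years:
            f.write(year + '\n')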
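
Since `f.read()` loads the whole 2.8 GB file into memory at once, a line-by-line variant of the same regex idea may be gentler on RAM. The following is only a minimal sketch, assuming each year element sits on a single line as in the excerpt from the question. The function name `extract_years` and its parameters are hypothetical and not part of the original post.

```python
import re

# Matches one <year>...</year> element.
# Assumption: opening and closing tags are on the same line.
YEAR_RE = re.compile(r'<year>(.*?)</year>')


def extract_years(xml_path, txt_path, encoding='ISO-8859-1'):
    """Stream xml_path line by line and write each year on its own line.

    Returns the number of years written.
    """
    count = 0
    with open(xml_path, 'r', encoding=encoding) as src, \
            open(txt_path, 'w', encoding='utf-8') as dst:
        # Iterating over the file object reads one line at a time,
        # so memory use stays small regardless of the file size.
        for line in src:
            for year in YEAR_RE.findall(line):
                dst.write(year + '\n')
                count += 1
    return count
```

It could be called as `extract_years('dblp.xml', 'years.txt')`, where `years.txt` is a placeholder output name chosen so that the `tags.txt` file read by the question's code is not overwritten. Like the regex approach above, this does not parse the XML, so it would miss a year element split across lines.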