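I have a very large dblp.xml file (2.8 GB), and I want to write the contents of every "year" element to a separate txt file. That txt file will end up holding about 6 million years. I wrote some code, but it takes far too long to finish. Is there another way to make the process faster, or at least to get half of the years written? A small excerpt of the XML: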
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp-2019-11-22.dtd">
<dblp>
<phdthesis mdate="2016-05-04" key="phd/dk/Heine2010">
<author>Carmen Heine</author>
<title>Modell zur Produktion von Online-Hilfen.</title>
<year>2010</year>
<school>Aarhus University</school>
<pages>1-315</pages>
<isbn>978-3-86596-263-8</isbn>
<ee>http://d-nb.info/996064095</ee>
</phdthesis><phdthesis mdate="2020-02-12" key="phd/Hoff2002">
<author>Gerd Hoff</author>
Code:
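import sys
from io import TextIOWrapper
from lxml import etree
from nltk.tokenize import RegexpTokenizer

sys.stdout = TextIOWrapper(sys.stdout.buffer, encoding='utf-8')
tokenizer = RegexpTokenizer(r'\w+')

# tags.txt lists the record tags to report on, one per line.
with open('tags.txt') as f:
    collaborations = f.read().splitlines()

def fast_iter(context):
    year = ''
    for event, elem in context:
        # Remember the most recent non-empty <year> text.
        if elem.tag == 'year':
            if elem.text:
                year = elem.text
        # When a record element ends, print its year and reset.
        if elem.tag in collaborations:
            if year:
                year = int(year)
                print('{:d}'.format(year), end='')
                print(flush=True)
            year = ''
        # Free the processed element and its earlier siblings.
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context

if __name__ == "__main__":
    context = etree.iterparse('dblp-2020-04-01.xml', load_dtd=True, html=True)
    fast_iter(context)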
Answer 0 (score: 0)
This code writes all the years (the text between the year tags) to a text file.
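import re

xml_name = 'dblp.xml'
txt_file = 'tags.txt'

# The XML declaration says ISO-8859-1, so decode with that encoding.
# Note: this reads the entire file into memory at once.
with open(xml_name, 'r', encoding='ISO-8859-1') as f:
    years = f.read()

# Capture the text between each <year> and </year> tag.
years = re.findall(r'<year>(.*?)</year>', years)

with open(txt_file, 'w') as f:
    for year in years:
        f.write(year + '\n')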
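
Because f.read() loads the whole file at once, the approach above may need several gigabytes of memory for a 2.8 GB input. A minimal sketch of a line-by-line variant follows. It assumes each year element sits entirely on one line, as in the excerpt in the question, and that the file is encoded as ISO-8859-1, as its XML declaration states. The function name extract_years and the output name years.txt are hypothetical, chosen only for illustration.

```python
import re

# Same pattern as the answer above, compiled once for reuse.
YEAR_RE = re.compile(r'<year>(.*?)</year>')


def extract_years(xml_path, txt_path, encoding='ISO-8859-1'):
    """Stream xml_path line by line and write each year on its own line.

    Assumes a <year>...</year> element never spans a line break.
    Returns the number of years written.
    """
    count = 0
    with open(xml_path, 'r', encoding=encoding) as src, \
         open(txt_path, 'w', encoding='utf-8') as dst:
        for line in src:
            # A line may hold zero, one, or several year elements.
            for year in YEAR_RE.findall(line):
                dst.write(year + '\n')
                count += 1
    return count
```

Calling extract_years('dblp.xml', 'years.txt') would keep only one line of the input in memory at a time. Output goes through Python's buffered file object and is not flushed after every record, which is likely to matter for speed when writing millions of short lines.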