Question

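I am reading a 2.5 GB .osm file. The process takes about 15 minutes and about 4 GB of RAM (using the 64-bit version). After all lines have been processed and print count_nodes-count reaches zero, the RAM (and also the HDD) is under heavy load and the PC freezes. It never prints print'last step-closing',("--- %s seconds ---" % (time.time() - start_time))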

What is happening during execution? Any suggestions on how to avoid it?

My code:

import time
import xml.etree.ElementTree as etree

file=('california.osm')    
context=etree.iterparse(file)

start_time = time.time()
localtime = time.asctime( time.localtime(time.time()) )
print "Start time :", localtime

count_nodes=6132755
count=0
list=[]
with open('new_file.txt','w') as f:
    for event, elem in context:
        dict = {}
        if elem.tag == "node":
            count+=1            
            lat=elem.get('lat')
            lon=elem.get('lon') 
            dict['lat']=lat
            dict['lon']=lon     
            for child in elem:          
                key=child.get('k')
                val=child.get('v')
                dict[key]=val           
                child.clear()                   
            elem.clear()                            
            if len(dict)>2:
                i=str(dict)                 
                f.write(i)
                f.write('\n')
            print count_nodes-count

print'last step-closing',("--- %s seconds ---" % (time.time() - start_time))
f.close

Answer 1

I think Python is flushing its buffers and writing the data to the hard drive (into f). Try adding the following line after printing the last step:

sys.stdout.flush()

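Don't forget to import sys. If it is still too slow, switch to a faster language such as C++ or even Java. Those languages also have XML parsers, and they cope better with large data, unless what you are doing depends on Python.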

Or try an existing parser, such as imposm for Python.
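If some progress output is still wanted, one way to combine the flush advice with less console traffic is to print only every so often and flush right away. This is a minimal sketch, not part of the original answer: the helper name report_progress and its every parameter are hypothetical, and the demonstration writes to an in-memory stream rather than the real console.

```python
import io
import sys


def report_progress(done, total, every=100000, stream=sys.stdout):
    """Write a progress line every `every` items; return True if a line was written."""
    if done % every != 0 and done != total:
        return False
    stream.write("%d of %d nodes processed\n" % (done, total))
    stream.flush()  # push the line out now instead of leaving it buffered
    return True


# Small demonstration with an in-memory stream (hypothetical numbers).
demo = io.StringIO()
for i in range(1, 11):
    report_progress(i, 10, every=5, stream=demo)
```

In the question's loop this would stand in for the per-node print, so the console is touched a few dozen times instead of six million times.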

Answer 2

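Why do you write f.close at the bottom, as if it were an attribute? You can remove it; the file is already closed once control leaves the "with open" statement. I agree with asalic here, the data is probably being flushed. Still, this seems like a very doable task for Python.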
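The point about f.close can be checked directly. The sketch below uses a hypothetical scratch file in a temporary directory; it only shows that the with statement has already closed the file, and that f.close without parentheses never calls anything.

```python
import os
import tempfile

# Hypothetical scratch file, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "new_file.txt")

with open(path, "w") as f:
    f.write("{'lat': '1', 'lon': '2', 'name': 'x'}\n")
    closed_inside = f.closed  # still open inside the block

closed_after = f.closed  # the with statement has closed the file by now

f.close    # only looks up the bound method; nothing is called
f.close()  # a real call; harmless on a file that is already closed
```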

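Since you are using iterparse(), I am not sure that clearing the elements once you are done with them really gains you anything in terms of speed. That being said, you should get rid of your intermediate variables and do only one file write per loop iteration, like this: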

dict['lat'] = elem.get('lat')
dict['lon'] = elem.get('lon')
for child in elem:
    dict[child.get('k')] = child.get('v')
# write once per node, after all of its tags have been collected
if len(dict) > 2:
    f.write("%s\n" % str(dict))

Also, you should skip the print statement, since the dataset is rather large.
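Putting these suggestions together, a streaming version might look like the sketch below. It goes one step beyond what the answers say: it also clears the root element after each top-level element, a common iterparse idiom for keeping the in-memory tree from growing, on the assumption that the leftover tree is what consumes the RAM. The function name extract_tagged_nodes and the tiny in-memory sample are hypothetical, and the demonstration part targets Python 3 (io.StringIO expects text there); with a real file opened via open() the function itself does not depend on that.

```python
import io
import xml.etree.ElementTree as etree

# Element names assumed to be direct children of the <osm> root.
TOP_LEVEL = ("node", "way", "relation")


def extract_tagged_nodes(source, out):
    """Write one dict per tagged <node> to `out`; return the number of nodes seen."""
    context = etree.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the "start" of the root element
    count = 0
    for event, elem in context:
        if event != "end":
            continue  # children are only complete at the "end" event
        if elem.tag == "node":
            count += 1
            record = {"lat": elem.get("lat"), "lon": elem.get("lon")}
            for child in elem:
                record[child.get("k")] = child.get("v")
            if len(record) > 2:  # keep only nodes that carry at least one tag
                out.write("%s\n" % record)
        if elem.tag in TOP_LEVEL:
            root.clear()  # drop finished elements so the tree stays small
    return count


# Tiny in-memory stand-in for california.osm, just to exercise the function.
sample = b"""<osm>
<node id="1" lat="1.0" lon="2.0"/>
<node id="2" lat="3.0" lon="4.0"><tag k="name" v="Cafe"/></node>
<way id="3"><nd ref="1"/><tag k="highway" v="path"/></way>
</osm>"""
out = io.StringIO()
seen = extract_tagged_nodes(io.BytesIO(sample), out)
```

For the real file, source would be 'california.osm' and out the file object from the with open(...) block. Whether this removes the freeze at the end is untested here; it is offered as something to try.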

Need advice to optimize Python 2.7 code
