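I'm trying to build a dictionary of location names and information from Geonames, for use in a program that reads documents, extracts location names, and outputs information about them. The keys are location names. The values are lists of tuples holding the latitude and longitude, country code, feature class, and GeoName ID of each place with that name (several places can share one name). Here is a sample excerpt of the dictionary: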
{'xixerella': [(('42.55327', '1.48736'), 'AD', 'PPL', '3038816'), (('42.55294', '1.48764'), 'AD', 'ADMD', '3038817')], 'fonts vives': [(('42.5', '1.56667'), 'AD', 'SPNG', '3038822')], 'roc del xeig': [(('42.56667', '1.48333'), 'AD', 'RK', '3038820')], 'costa de xurius': [(('42.5', '1.48333'), 'AD', 'SLP', '3038814')]}
The final dictionary has 9,088,105 keys. When I try to dump it to a file with pickle so that I can reference it from other programs, it throws this error:
Python(763,0xa03871a8) malloc: *** mach_vm_map(size=50331648) failed (error code=3)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Traceback (most recent call last):
File "/Applications/Wing101.app/Contents/MacOS/src/debug/tserver/_sandbox.py", line 31, in <module>
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 1370, in dump
Pickler(file, protocol).dump(obj)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 224, in dump
self.save(obj)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 649, in save_dict
self._batch_setitems(obj.iteritems())
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 663, in _batch_setitems
save(v)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 600, in save_list
self._batch_appends(iter(obj))
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 615, in _batch_appends
save(x)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 562, in save_tuple
save(element)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 581, in save_tuple
self.memoize(obj)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 247, in memoize
self.memo[id(obj)] = memo_len, obj
MemoryError:
Should I be using a data structure other than a dictionary? What can I do to reduce memory usage?
Here is my program:
import csv
import sys
import pickle

geodict = {}
ignore = ["", " ", " ", " ", "-", " -", "- ", " - "]
csv.field_size_limit(sys.maxsize)
reader = csv.reader(open('allCountries-2.txt', 'rb'), delimiter='\t')
for row in reader:
    loc = []
    loc.append(row[2].lower())
    if row[3] != '':
        altnames = row[3].split(',')
        for entry in altnames:
            entry = "".join(x for x in entry if ord(x)<128)
            entry = entry.lower()
            if entry not in loc:
                if entry not in ignore:
                    loc.append(entry)
    geoid = row[0]
    latlong = (row[4], row[5])
    feature = row[7]
    country = row[8]
    for name in loc:
        if name in geodict:
            geodict[name].append((latlong, country, feature, geoid))
        else:
            geodict[name] = [(latlong, country, feature, geoid)]
with open('dict.txt', 'wb') as handle:
    pickle.dump(geodict, handle)
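In case you are not familiar with the format and contents of the Geonames file: it is a 1.14 GB tab-delimited text file. row[2] is the location name in plain ASCII characters, and row[3] holds the alternate location names (sometimes there are none; I strip non-ASCII characters because the field contains some crazy accented characters and Chinese/Japanese/etc. characters that Python doesn't like). If anything else is unclear, please ask.
Please help! Thanks!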
Answer 0 (score: 0)
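When working with large data structures, you should switch to streaming pickle. It works much like regular pickle, but it loads and saves in a streaming (incremental) fashion, so it uses far less memory.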
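The traceback in the question ends in Pickler.memoize, where a single Pickler records every tuple, list, and dict it has written, so its memo table grows with the whole dictionary. Below is a minimal sketch of the incremental idea using only the standard pickle module. It is not the streaming pickle package itself. The helper names dump_items and load_items are hypothetical, and the in-memory buffer stands in for a real file opened in binary mode.

```python
import io
import pickle


def dump_items(mapping, handle, protocol=pickle.HIGHEST_PROTOCOL):
    """Write each (key, value) pair to handle as its own pickle record."""
    # On Python 2, mapping.iteritems() would avoid building a list of all pairs.
    for item in mapping.items():
        # Each pickle.dump call uses a fresh Pickler, so the memo table
        # only ever covers one pair instead of the whole dictionary.
        pickle.dump(item, handle, protocol)


def load_items(handle):
    """Yield (key, value) pairs one at a time until the stream is exhausted."""
    while True:
        try:
            yield pickle.load(handle)
        except EOFError:
            # pickle.load raises EOFError when no record is left to read.
            return


# Small sample in the same shape as the question's dictionary.
sample = {
    'fonts vives': [(('42.5', '1.56667'), 'AD', 'SPNG', '3038822')],
    'roc del xeig': [(('42.56667', '1.48333'), 'AD', 'RK', '3038820')],
}

buffer = io.BytesIO()  # stand-in for open('dict.pkl', 'wb')
dump_items(sample, buffer)

buffer.seek(0)  # with a real file, reopen it with mode 'rb' instead
restored = dict(load_items(buffer))
```

Rebuilding the full dictionary with dict(load_items(...)) still needs enough memory to hold all of it. A program that only needs some names can instead iterate over load_items and keep just the entries it cares about.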