I have a large XML file (40 MB) and use the following function to parse it into a dict:
import re
import pandas as pd
from pattern import web  # assumption: `web` is the pattern.web module

def get_title_year(xml,low,high):
    """
    Given an XML document extract the title and year of each article.
    Inputs: xml (xml string); low, high (integers) defining beginning and ending year of the record to follow
    """
    dom = web.Element(xml)
    result = {'title':[],'publication year':[]}
    for article in dom.by_tag('article'):
        # The year is the first double-quoted value inside the <cpyrt> element.
        year = int(re.split('"',article.by_tag('cpyrt')[0].content)[1])
        if low < year < high:
            result['title'].append(article.by_tag('title')[0].content)
            result['publication year'].append(year)
    return result
ty_dict = get_title_year(PR_file,1912,1970)
ty_df = pd.DataFrame(ty_dict)
print ty_df.head()
   publication year  title
0  1913              The Velocity of Electrons in the Photo-electri...
1  1913              Announcement of the Transfer of the Review to ...
2  1913              Diffraction and Secondary Radiation with Elect...
3  1913              On the Comparative Absorption of γ and X Rays
4  1913              Study of Resistance of Carbon Contacts
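When I run it, I end up using 2.5 GB of RAM! Two questions:
Where is all this RAM being used? It is not the dictionary or the DataFrame: when I save the DataFrame as a UTF-8 CSV, it is only 3.4 MB.
Also, the RAM is not released after the function finishes. Is this normal? I have never paid attention to Python memory usage before, so I can't tell.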
Answer 0 (score: 0)
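This answers only the part about freeing memory at the end of the function. See Wojciech Walczak's comment and link above! I'm posting the code here because I found that in my case (Ubuntu 12.04), placing the p.join() statement before the ty_dict = q.get() assignment (as in the original link) caused the code to deadlock, see {{3}}.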
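Before the full code, the ordering problem can be reproduced in isolation. The following is a minimal sketch, not the original code: `produce` and `demo` are hypothetical names, the payload size and timeout are arbitrary, and the "fork" start method is assumed to be available (a POSIX system such as the Ubuntu machine mentioned above). The Python multiprocessing documentation notes that a process which has put items on a queue waits before terminating until the buffered items have been flushed to the underlying pipe, so joining it before reading the queue can hang. The sketch uses a join with a timeout so it cannot hang itself.

```python
import multiprocessing as mp


def produce(queue, size):
    # Put a payload much larger than a typical OS pipe buffer on the queue.
    queue.put(b"x" * size)


def demo(size=4 * 1024 * 1024, timeout=1.0):
    # Assumption: a POSIX system where the "fork" start method exists.
    ctx = mp.get_context("fork")
    queue = ctx.Queue()
    proc = ctx.Process(target=produce, args=(queue, size))
    proc.start()

    # Join BEFORE reading the queue, with a timeout so the demo cannot hang.
    proc.join(timeout)
    still_alive_after_early_join = proc.is_alive()

    # Reading the queue lets the child flush its data and exit.
    payload = queue.get()
    proc.join()
    return still_alive_after_early_join, len(payload), proc.is_alive()


if __name__ == "__main__":
    print(demo())
```

Without the timeout, the first join would likely wait forever on a payload of this size, which matches the deadlock described above.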
from multiprocessing import Process, Queue

def get_title_year(xml,low,high,q):
    """
    Given an XML document extract the title and year of each article.
    Inputs: xml (xml string); low, high (integers) defining beginning and ending year of the record to follow
    """
    dom = web.Element(xml)
    result = {'title':[],'publication year':[]}
    for article in dom.by_tag('article'):
        year = int(re.split('"',article.by_tag('cpyrt')[0].content)[1])
        if low < year < high:
            result['title'].append(article.by_tag('title')[0].content)
            result['publication year'].append(year)
    q.put(result)

q = Queue()
p = Process(target=get_title_year, args=(PR_file,1912,1970, q))
p.start()
ty_dict = q.get()  # fetch the result first ...
p.join()           # ... then join; the reverse order deadlocked for me
if p.is_alive():
    p.terminate()
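With this version, the memory is released back to the operating system once the child process has finished, because all of the parsing happens in that separate process.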
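The same pattern can be wrapped in a small reusable helper. This is a sketch under stated assumptions rather than a definitive implementation: `run_in_child`, `_worker`, and `build_big_then_summarise` are hypothetical names, the "fork" start method is assumed (POSIX only), and the code is written for Python 3. It also passes exceptions back through the queue, since otherwise a failure in the child would leave the parent blocked in `queue.get()`.

```python
import multiprocessing as mp


def _worker(func, args, queue):
    # Runs in the child: compute, then send the (small) result to the parent.
    try:
        queue.put((True, func(*args)))
    except Exception as exc:
        # Report failures too, so the parent does not wait forever.
        queue.put((False, exc))


def run_in_child(func, *args):
    """Run func(*args) in a child process and return its result."""
    # Assumption: a POSIX system where the "fork" start method exists.
    ctx = mp.get_context("fork")
    queue = ctx.Queue()
    proc = ctx.Process(target=_worker, args=(func, args, queue))
    proc.start()
    ok, payload = queue.get()  # read the result first ...
    proc.join()                # ... then join, to avoid the deadlock
    if not ok:
        raise payload
    return payload


def build_big_then_summarise(n):
    # Stand-in for the memory-hungry parsing step: the big list lives
    # only in the child process, and only the small summary is returned.
    big = list(range(n))
    return {"count": len(big), "last": big[-1]}


if __name__ == "__main__":
    print(run_in_child(build_big_then_summarise, 1000000))
```

Only the returned summary crosses the process boundary. Everything the child allocated is handed back to the operating system when the child exits.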