Memory usage when creating a Python dict from XML using Pattern

Time: 2014-03-06 08:09:59

Tags: python xml dictionary

I have a large XML file (40 MB) and use the following function to parse it into a dict:

    import re
    import pandas as pd
    from pattern import web  # assumption: `web` is the pattern.web module

    def get_title_year(xml, low, high):
        """
        Given an XML document, extract the title and year of each article.
        Inputs: xml (XML string); low, high (integers), exclusive bounds on the publication year of the records to keep
        """
        dom = web.Element(xml)
        result = {'title': [], 'publication year': []}
        for article in dom.by_tag('article'):
            # The year is the first double-quoted value inside the <cpyrt> element.
            year = int(re.split('"', article.by_tag('cpyrt')[0].content)[1])
            if low < year < high:
                result['title'].append(article.by_tag('title')[0].content)
                result['publication year'].append(year)  # reuse the parsed year
        return result

    ty_dict = get_title_year(PR_file, 1912, 1970)
    ty_df = pd.DataFrame(ty_dict)
    print(ty_df.head())

       publication year                                              title
    0              1913  The Velocity of Electrons in the Photo-electri...
    1              1913  Announcement of the Transfer of the Review to ...
    2              1913  Diffraction and Secondary Radiation with Elect...
    3              1913      On the Comparative Absorption of γ and X Rays
    4              1913             Study of Resistance of Carbon Contacts

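When I run it, I end up using 2.5 GB of RAM! Two questions:

Where is all this RAM going? It is not the dict or the DataFrame: when I save the DataFrame as a UTF-8 CSV, it is only 3.4 MB.

Also, the RAM is not released after the function finishes. Is this normal? I have never paid attention to Python memory usage before, so I can't tell.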
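
For reference, here is a minimal sketch of one way to read the process's peak memory from inside Python. It is not from the original post: it assumes a Unix-like system where the standard `resource` module is available, and `peak_rss_mb` is a hypothetical helper name. Because it reports the peak resident set size, it can show how much memory a step needed but not whether that memory was later returned to the OS.

```python
import resource
import sys


def peak_rss_mb():
    """Return the peak resident set size of this process in megabytes."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024.0 * 1024.0 if sys.platform == "darwin" else 1024.0
    return peak / divisor


if __name__ == "__main__":
    before = peak_rss_mb()
    data = [str(i) for i in range(10 ** 6)]  # stand-in for the parsing step
    after = peak_rss_mb()
    print("peak RSS before: %.1f MB, after: %.1f MB" % (before, after))
```

Calling the helper before and after `get_title_year` would give a rough figure for how much the parsing step adds to the peak.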

1 Answer:

Answer 0: (score: 0)

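This only answers the part about releasing the memory when the function ends. See Wojciech Walczak's comment and link above! I am posting the code here because I found that, in my case (Ubuntu 12.04), placing the p.join() statement before the ty_dict = q.get() assignment (as in the original link) caused the code to deadlock, see {{3 }. A child process that has put data on a Queue does not exit until that data has been flushed, so the parent has to read from the queue before joining.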

    import re
    from multiprocessing import Process, Queue
    from pattern import web  # assumption: `web` is the pattern.web module

    def get_title_year(xml, low, high, q):
        """
        Given an XML document, extract the title and year of each article.
        Inputs: xml (XML string); low, high (integers), exclusive bounds on the publication year of the records to keep;
        q (multiprocessing.Queue) used to send the result back to the parent process
        """
        dom = web.Element(xml)
        result = {'title': [], 'publication year': []}
        for article in dom.by_tag('article'):
            year = int(re.split('"', article.by_tag('cpyrt')[0].content)[1])
            if low < year < high:
                result['title'].append(article.by_tag('title')[0].content)
                result['publication year'].append(year)  # reuse the parsed year
        q.put(result)

    q = Queue()
    p = Process(target=get_title_year, args=(PR_file, 1912, 1970, q))
    p.start()
    ty_dict = q.get()  # read the result first ...
    p.join()           # ... then join; the reverse order deadlocked for me
    if p.is_alive():
        p.terminate()

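With this version, the memory is released back to the operating system once the statements above have run, because the parsing happens in a child process that has exited by then.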
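
The same pattern can be wrapped in a small reusable helper. The following is only a sketch under stated assumptions, not part of the original answer: `run_in_subprocess`, `_worker` and `build_big_list` are hypothetical names, the result must be picklable, and the code asks for the `fork` start method explicitly, which exists on Unix-like systems such as the Ubuntu setup above but not on Windows. It also sends back an error marker if the function raises, since otherwise `q.get()` would wait forever.

```python
import multiprocessing as mp


def _worker(q, func, args):
    # Runs in the child process: send back either the result or the error text.
    try:
        q.put(("ok", func(*args)))
    except Exception as exc:
        q.put(("error", repr(exc)))


def run_in_subprocess(func, *args):
    """Call func(*args) in a child process and return its picklable result."""
    ctx = mp.get_context("fork")  # assumption: Unix-like OS with fork available
    q = ctx.Queue()
    p = ctx.Process(target=_worker, args=(q, func, args))
    p.start()
    status, payload = q.get()  # read from the queue first ...
    p.join()                   # ... then join, to avoid the deadlock
    if status == "error":
        raise RuntimeError(payload)
    return payload


def build_big_list(n):
    # Hypothetical stand-in for the memory-hungry parsing step.
    data = list(range(n))
    return len(data)


if __name__ == "__main__":
    print(run_in_subprocess(build_big_list, 10 ** 6))
```

Whatever the function allocates lives in the child's address space, so it goes away when the child exits; only the returned value is copied into the parent.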