Question

I am currently working with a large data file of dimension 15,883,912 x 105. The dataset is just under 10 GB on disk. That is fairly large, but I am using a box with 24 GB of RAM.
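As a rough back-of-the-envelope check, the size on disk and the size in memory need not match. The sketch below assumes, purely for illustration, that all 105 columns are held as 8-byte float64 values; the real dtypes are not stated here, and the variable names are made up for this example.

```python
# Rough in-memory size estimate for a dense 15,883,912 x 105 table.
# ASSUMPTION: every column is stored as float64 (8 bytes per value).
rows, cols = 15883912, 105
bytes_per_value = 8

dense_bytes = rows * cols * bytes_per_value
dense_gib = dense_bytes / float(2 ** 30)

print("approx. bytes in memory: %d" % dense_bytes)
print("approx. GiB in memory:   %.2f" % dense_gib)
```

Under that assumption the table alone would take roughly 12.4 GiB, so any operation that copies a large part of it would need a comparable amount again on top.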

I am on a 64-bit Windows machine (a Dell Precision T3500, to be exact). Everything is also being run through a 64-bit Python installation (v2.7.6 from Anaconda 2.0.0).

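The file itself reads in fine, but when I subset two variables down to fewer than 500,000 rows, I run into a MemoryError. Specifically, I am trying to calculate a Gini coefficient (trapezoidal approximation) from a vector of data with the following code: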

#Isolate test set
gini_test=gini.ix[1979]['adj_market_income']

#Define Gini calculator
def gini_trap(df,var,wt):
    '''Function takes incomes and weights, and returns the *weighted* Gini Coefficient.'''
    #Sort the df by income
    df.sort(columns=var,inplace=True)
    #Generate non-negative version
    inc=df[var].apply(lambda x: max(x,0.))
    #Calculate total weight
    tot_wt=df[wt].sum()
    #Calculate total weighted income
    tot_inc=(df[wt]*inc).sum()
    #Initialize share and cumulative variables
    wt_share=0
    inc_share=0
    cum_wt=0
    cum_inc=0
    #Initialize Gini value
    gini=0
    #For each record...
    for i,val in enumerate(inc):
        #...calculate the current record's share of income and population...
        wt_share=df[wt].iloc[i]/tot_wt
        inc_share=(df[wt].iloc[i]*val)/tot_inc
        #...and augment the cumulative measures for income and population/update the gini value...
        cum_wt+=wt_share
        gini+=wt_share*(cum_inc+cum_inc+inc_share)
        cum_inc+=inc_share
    print cum_inc,cum_wt
    return 1-gini

print gini_trap(gini.ix[1979],'adj_market_income','wt')
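For comparison, the same trapezoid calculation can be sketched without the per-record loop and without sorting the frame in place. This is only a minimal sketch, not a diagnosis of the MemoryError: the function name `gini_trap_vectorized` and its array-based signature are assumptions made for illustration, and it relies on NumPy alone, so the two columns would be passed in as arrays.

```python
import numpy as np

def gini_trap_vectorized(income, weight):
    """Weighted Gini coefficient via the trapezoid rule, without a Python loop.

    Hypothetical helper: takes two 1-D array-likes instead of a DataFrame.
    """
    income = np.asarray(income, dtype=float)
    weight = np.asarray(weight, dtype=float)

    # Sort by income through an index array, leaving the caller's data untouched
    order = np.argsort(income, kind="mergesort")
    inc = np.clip(income[order], 0.0, None)  # same as max(x, 0.) per record
    wt = weight[order]

    # Each record's share of total weight and of total weighted income
    wt_share = wt / wt.sum()
    inc_share = (wt * inc) / (wt * inc).sum()

    # Cumulative income share after each record and just before it
    cum_after = np.cumsum(inc_share)
    cum_before = cum_after - inc_share

    # Same sum as the loop: wt_share * (cum_inc + cum_inc + inc_share)
    return 1.0 - np.sum(wt_share * (cum_before + cum_after))
```

With the names used above, a call might look like `gini_trap_vectorized(df['adj_market_income'].values, df['wt'].values)`. It allocates a handful of temporary arrays the length of the subset, rather than indexing into the frame once per record.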

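There are plenty of MemoryError questions on Stack, but the common theme I see in them is to make sure you are on a 64-bit machine and have enough RAM (before getting into chunking and so on). Any ideas about what is going on here?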

Python MemoryError w/ lots of RAM (64-bit)

0 answers: