Speeding up agg and join on pandas tables with a billion records

Time: 2019-02-07 03:58:42

Tags: python-3.x pandas

I have the following pandas DataFrames, and these are the steps I am running on them.

[python 3.5.2, pandas 0.24.1, numpy 1.16.1, scipy 1.2.0]

data_pd
    nrows: 1,032,749,584
    cols: ['mem_id':np.uint32, 'offset':np.uint16 , 'ctype':string, 'code':string]

obsmap_pd
    nrows: 10,887,542
    cols: ['mem_id':np.uint32, 'obs_id':np.uint32]    
             (obs_id has consecutive integers between 0 and obsmap_pd nrows)

varmap_pd
    nrows: 4,596
    cols: ['ctype':string, 'code': string, 'var_id':np.uint16]   
             (var_id has consecutive integers between 0 and varmap_pd nrows)
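
As a side note on the layout above: `ctype` and `code` are string columns with only a few thousand distinct combinations spread over a billion rows, so storing them as pandas categoricals might be worth considering. The sketch below only illustrates the memory effect on a made-up toy frame (the sizes, values, and the name `toy` are assumptions, not the real data); whether it helps the runtime of the groupby has not been measured here.

```python
import numpy as np
import pandas as pd

# Toy frame with the same columns and dtypes as data_pd (values are invented).
n = 100000
rng = np.random.RandomState(0)
toy = pd.DataFrame({
    "mem_id": rng.randint(0, 1000, size=n).astype(np.uint32),
    "offset": rng.randint(0, 365, size=n).astype(np.uint16),
    "ctype": rng.choice(["dx", "rx"], size=n),
    "code": rng.choice(["A01", "B02", "C03", "D04"], size=n),
})

# Memory with plain string columns.
as_object = toy.memory_usage(deep=True).sum()

# Same data with the two string columns stored as categoricals.
toy_cat = toy.astype({"ctype": "category", "code": "category"})
as_category = toy_cat.memory_usage(deep=True).sum()

print(as_object, as_category)
```

If categoricals are used as groupby keys, passing `observed=True` to `groupby` is probably needed so that pandas does not enumerate every possible category combination.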

The purpose of the steps below is to create a scipy csc_matrix in the next step:

***
sparse_pd = data_pd.groupby(['mem_id','ctype','code'])['offset'].nunique().reset_index(name='value')
sparse_pd['value'] = sparse_pd['value'].astype(np.uint16)
sparse_pd = pd.merge(pd.merge(sparse_pd, obsmap_pd, on='mem_id', sort=False),
                  varmap_pd, on=['ctype','code'], sort=False)[['obs_id','var_id','value']]
***

Creating the csc_matrix is very fast, but the three lines of pandas code (between the ***) take 25.7 minutes. Any ideas on how to speed this up?
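
One variation that may be worth timing, sketched here on made-up toy data: replace `groupby(...).nunique()` with `drop_duplicates` followed by `groupby(..., sort=False).size()`. The two give the same counts as long as `offset` has no missing values (a uint16 column cannot hold any). This is not benchmarked at a billion rows, and the toy values below are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csc_matrix

# Toy stand-ins for the three frames (values are invented).
data_pd = pd.DataFrame({
    "mem_id": np.array([10, 10, 10, 20, 20, 30], dtype=np.uint32),
    "offset": np.array([1, 1, 2, 5, 5, 7], dtype=np.uint16),
    "ctype": ["dx", "dx", "dx", "dx", "rx", "rx"],
    "code": ["A", "A", "A", "A", "B", "B"],
})
obsmap_pd = pd.DataFrame({
    "mem_id": np.array([10, 20, 30], dtype=np.uint32),
    "obs_id": np.array([0, 1, 2], dtype=np.uint32),
})
varmap_pd = pd.DataFrame({
    "ctype": ["dx", "rx"],
    "code": ["A", "B"],
    "var_id": np.array([0, 1], dtype=np.uint16),
})

keys = ["mem_id", "ctype", "code"]

# Baseline: the aggregation from the question.
baseline = data_pd.groupby(keys)["offset"].nunique().reset_index(name="value")

# Candidate: drop repeated (keys, offset) rows, then count rows per group.
candidate = (
    data_pd.drop_duplicates(keys + ["offset"])
    .groupby(keys, sort=False)
    .size()
    .reset_index(name="value")
)

# Check that both give the same counts once ordered by the keys.
baseline_counts = baseline.sort_values(keys)["value"].tolist()
candidate_counts = candidate.sort_values(keys)["value"].tolist()

# Same merges as in the question, then the sparse matrix.
candidate["value"] = candidate["value"].astype(np.uint16)
sparse_pd = (
    candidate.merge(obsmap_pd, on="mem_id", sort=False)
    .merge(varmap_pd, on=["ctype", "code"], sort=False)
)[["obs_id", "var_id", "value"]]

rows = sparse_pd["obs_id"].values.astype(np.int64)
cols = sparse_pd["var_id"].values.astype(np.int64)
matrix = csc_matrix(
    (sparse_pd["value"].values, (rows, cols)),
    shape=(len(obsmap_pd), len(varmap_pd)),
)
print(matrix.toarray())
```

Timing each of the three original lines separately would show whether the aggregation or the merges dominate the 25.7 minutes, which determines whether this change is relevant at all.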

0 answers:

No answers