Using numpy

Time: 2017-02-28 13:26:58

Tags: python numpy parallel-processing

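I have a large (~100 GB) dataset xs made up of structured numpy arrays x. I want to bin every record by its property p1 and find the mean and standard deviation of property p2 in each bin. My approach is shown below, but it is slow. Is there a faster / more numpythonic way to do this? I can't fit the whole dataset in memory, but I do have a lot of cores, so a good way to parallelize it would also be nice.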

import numpy as np

nbins = 30
bin_edges = np.linspace(0, 1, nbins)

# running per-bin totals: count, sum of p2, sum of p2**2
N, p2_total, p2sq_total = np.zeros((3, nbins))

for x in xs_generator():   # yields one structured array chunk at a time
    p1s = x['p1']
    p2s = x['p2']

    # index 0 collects p1 < bin_edges[0]; indices 1..nbins-1 are the bins
    which_bin = np.digitize(p1s, bins=bin_edges)

    for this_bin in range(nbins):
        these_p2s = p2s[which_bin == this_bin]

        N[this_bin]          += np.size(these_p2s)
        p2_total[this_bin]   += np.sum(these_p2s)
        p2sq_total[this_bin] += np.sum(these_p2s**2)

means_p2 = p2_total / N
# std from the running sums: sqrt(E[p2^2] - E[p2]^2)
stds_p2  = np.sqrt(p2sq_total / N - means_p2**2)

1 answer:

Answer 0: (score: 1)

  • You should use np.histogram:

    # per-bin count, sum of p2, and sum of p2**2 for one chunk
    N, binDump = np.histogram( p1s, bins=bin_edges )
    p2_total, binDump = np.histogram( p1s, bins=bin_edges, weights=p2s )
    p2sq_total, binDump = np.histogram( p1s, bins=bin_edges, weights=p2s**2 )
    
    means_p2 = p2_total/N
    # std from the sums: sqrt(E[p2^2] - E[p2]^2)
    stds_p2  = np.sqrt(p2sq_total/N - means_p2**2)
    
This way you avoid the loop over bins; you just reuse the histogram function :)
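Since the dataset does not fit in memory, the histogram calls have to be applied chunk by chunk and their results accumulated. Here is a minimal sketch of that, not a definitive implementation. The helper names partial_sums and binned_mean_std are made up for illustration, and the synthetic array at the bottom only stands in for the real chunks. The standard deviation is computed as sqrt(p2sq_total/N - mean**2), i.e. from the identity var = E[p2^2] - E[p2]^2.

```python
import numpy as np

def partial_sums(chunk, bin_edges):
    """Per-bin count, sum of p2 and sum of p2**2 for one chunk (hypothetical helper)."""
    p1s = chunk['p1']
    # cast to float64 so the weighted sums are not accumulated in a narrower dtype
    p2s = chunk['p2'].astype(np.float64)
    n, _ = np.histogram(p1s, bins=bin_edges)
    s, _ = np.histogram(p1s, bins=bin_edges, weights=p2s)
    sq, _ = np.histogram(p1s, bins=bin_edges, weights=p2s**2)
    return np.array([n.astype(np.float64), s, sq])

def binned_mean_std(chunks, bin_edges):
    """Accumulate partial sums over an iterable of chunks (hypothetical helper)."""
    totals = np.zeros((3, len(bin_edges) - 1))
    for chunk in chunks:
        totals += partial_sums(chunk, bin_edges)
    N, p2_total, p2sq_total = totals
    # empty bins give 0/0; let them come out as NaN without warnings
    with np.errstate(invalid='ignore', divide='ignore'):
        means = p2_total / N
        var = p2sq_total / N - means**2
    # clip tiny negative variances caused by floating-point rounding
    stds = np.sqrt(np.maximum(var, 0.0))
    return N, means, stds

# Synthetic stand-in for the real data: one structured array split into 4 chunks.
rng = np.random.default_rng(0)
data = np.zeros(10_000, dtype=[('p1', 'f8'), ('p2', 'f8')])
data['p1'] = rng.random(10_000)
data['p2'] = rng.normal(size=10_000)

bin_edges = np.linspace(0, 1, 31)   # 31 edges give 30 bins
N, means_p2, stds_p2 = binned_mean_std(np.array_split(data, 4), bin_edges)
```

Because the three partial sums simply add, each chunk could also be handed to a separate worker (for example with multiprocessing.Pool.map) and the per-chunk results summed afterwards; that part is not shown here. Note that the sum-of-squares formula can lose precision when the mean is large compared to the spread.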