pandas: quickly compute the sum of a column for rows with specific values

Time: 2018-01-26 13:36:40

Tags: python pandas subset

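I have a pandas DataFrame, and I need to compute the sum of one column over the rows that fall inside certain windows. For example, if I have a window of 500 and my initial value is 1000, I want to add up all values between 499 and 999, as well as those between 1001 and 1501.

It is easier to explain with some data: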

    chrom   pos end AFR EUR pi
0   1   10177   10177   0.4909  0.4056  0.495988
1   1   10352   10352   0.4788  0.4264  0.496369
2   1   10617   10617   0.9894  0.9940  0.017083
3   1   11008   11008   0.1346  0.0885  0.203142
4   1   11012   11012   0.1346  0.0885  0.203142
5   1   13110   13110   0.0053  0.0567  0.053532
6   1   13116   13116   0.0295  0.1869  0.176091
7   1   13118   13118   0.0295  0.1869  0.176091
8   1   13273   13273   0.0204  0.1471  0.139066
9   1   13550   13550   0.0008  0.0080  0.007795
10  1   14464   14464   0.0144  0.1859  0.161422
11  1   14599   14599   0.1210  0.1610  0.238427
12  1   14604   14604   0.1210  0.1610  0.238427
13  1   14930   14930   0.4811  0.5209  0.500209
14  1   14933   14933   0.0015  0.0507  0.044505
15  1   15211   15211   0.5371  0.7316  0.470848
16  1   15585   15585   0.0008  0.0020  0.002635
17  1   15644   15644   0.0008  0.0080  0.007795
18  1   15777   15777   0.0159  0.0149  0.030470
19  1   15820   15820   0.4849  0.2714  0.477153
20  1   15903   15903   0.0431  0.4652  0.349452
21  1   16071   16071   0.0091  0.0010  0.011142
22  1   16142   16142   0.0053  0.0020  0.007721
23  1   16949   16949   0.0227  0.0159  0.038759
24  1   18643   18643   0.0023  0.0080  0.009485
25  1   18849   18849   0.8411  0.9911  0.170532
26  2   30923   30923   0.6687  0.9364  0.338400
27  2   20286   46286   0.0053  0.0010  0.006863
28  2   21698   46698   0.0015  0.0010  0.002566
29  2   42159   47159   0.0083  0.0696  0.067187

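So I need to subset based on the first two columns. For example, if my window = 500, my chrom = 1 and my pos = 15500, I need to subset my df to include only the rows with chrom = 1 and 15000 < pos < 16000.

Then I want to sum the AFR column over that subset of the data.

Here is the function I wrote: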
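The subset-and-sum step described above can be sketched with a single boolean mask. This is only an illustration under stated assumptions: the small frame below borrows four rows from the table and adds one made-up chrom 2 row, and the names `mask` and `afr_sum` are hypothetical.

```python
import pandas as pd

# Four rows taken from the table above, plus one made-up chrom 2 row
# (assumption, for illustration) that shares a position inside the window.
df = pd.DataFrame({
    "chrom": [1, 1, 1, 1, 2],
    "pos":   [14933, 15211, 15585, 16071, 15500],
    "AFR":   [0.0015, 0.5371, 0.0008, 0.0091, 0.6687],
})

chrom, pos, window = 1, 15500, 500

# Keep rows on the requested chromosome whose position lies strictly
# between pos - window and pos + window (15000 < pos < 16000 here).
mask = (
    (df["chrom"] == chrom)
    & (df["pos"] > pos - window)
    & (df["pos"] < pos + window)
)

# Sum AFR over the selected rows only.
afr_sum = df.loc[mask, "AFR"].sum()
print(afr_sum)
```

Only the rows at positions 15211 and 15585 pass both tests, so the sum covers those two AFR values.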

# vdf is my main dataframe,
# polyChrom is the chromosome to subset by,
# polyPos is the position to subset by.
# distance is how far the window should be from the polyPos.
# windowSize is the size of the window itself.
# E.g. if distance=20000 and windowSize=500, we are looking at a window
# that is (polyPos-20000)-500 to (polyPos-20000) and a window that is
# (polyPos+20000) to (polyPos+20000)+500.
def mafWindow(vdf, polyChrom, polyPos, distance, windowSize):

    # If start position becomes less than 0, set it to 0
    if polyPos - distance < 0:
        start1 = 0
        end1 = windowSize
    else:
        start1 = polyPos - distance
        end1 = start1 + windowSize

    end2 = polyPos + distance
    start2 = end2 - windowSize

    # Subset df. Both window tests sit inside one pair of parentheses so the
    # chrom filter applies to each of them (& binds more tightly than |).
    df = vdf.loc[(vdf['chrom'] == polyChrom) &
                 (((vdf['pos'] <= end1) & (vdf['pos'] >= start1)) |
                  ((vdf['pos'] <= end2) & (vdf['pos'] >= start2)))].copy()

    return df.AFR.sum()

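This approach of subsetting the DataFrame works, but it is very slow when my DataFrame contains ~55k rows. Is there a faster, more efficient way to do this?

1 Answer:

Answer 0: (score: 1)

The trick is to drop down to numpy arrays. Pandas indexing and slicing are slow.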

import pandas as pd

df = pd.DataFrame([[1, 10177, 0.5], [1, 10178, 0.2], [1, 20178, 0.1],
                   [2, 10180, 0.3], [1, 10180, 0.4]], columns=['chrom', 'pos', 'AFR'])

chrom = df['chrom'].values
pos = df['pos'].values
afr = df['AFR'].values

def filter_sum(chrom_arr, pos_arr, afr_arr, chrom_val, pos_start, pos_end):
    # Sum AFR for rows on chrom_val with pos_start < pos < pos_end
    return sum(k for i, j, k in zip(chrom_arr, pos_arr, afr_arr)
               if pos_start < j < pos_end and i == chrom_val)

filter_sum(chrom, pos, afr, 1, 10150, 10200)

# 1.1
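The same filter can also be written as a vectorized numpy mask instead of a Python-level loop over `zip`. This is a sketch of an alternative, not the answer's original code: the function name `filter_sum_vectorized` is hypothetical, and whether it is faster on the real ~55k-row data has not been measured here.

```python
import pandas as pd

df = pd.DataFrame([[1, 10177, 0.5], [1, 10178, 0.2], [1, 20178, 0.1],
                   [2, 10180, 0.3], [1, 10180, 0.4]], columns=['chrom', 'pos', 'AFR'])

# Pull the columns out once as plain numpy arrays.
chrom = df['chrom'].values
pos = df['pos'].values
afr = df['AFR'].values

def filter_sum_vectorized(chrom_arr, pos_arr, afr_arr, chrom_val, pos_start, pos_end):
    # Element-wise comparisons build one boolean mask over all rows.
    mask = (chrom_arr == chrom_val) & (pos_arr > pos_start) & (pos_arr < pos_end)
    # Boolean indexing selects the matching AFR values, which numpy then sums.
    return afr_arr[mask].sum()

result = filter_sum_vectorized(chrom, pos, afr, 1, 10150, 10200)
print(result)
```

On this sample it selects the same three rows as `filter_sum`, so the result agrees with the loop version up to floating-point rounding.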