Speeding up the calculation of nearby groups?

Time: 2017-07-28 12:26:14

Tags: python performance pandas numpy search

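I have a dataframe that contains a group ID, two distance measures (longitude/latitude type measures), and a value. For a given set of distances, I want to find the number of other groups nearby, and the average value of those nearby groups.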

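I wrote the following code, but it is so inefficient that it does not finish in a reasonable time on very large datasets. Counting the nearby retailers is fast, but computing the average value of the nearby retailers is very slow. Is there a better way to make this more efficient?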


1 answer:

Answer 0 (score: 6)

Clearly the problem is indexing the main dataframe with the isin method. As the dataframe grows in length, a much larger search has to be performed. I suggest running the same search on the smaller df_groups dataframe and then computing an updated mean.

The formula for the combined mean is simply (m1 * n1 + m2 * n2) / (n1 + n2), where m1 and m2 are the group means and n1 and n2 are the group counts.
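The combined-mean formula can be checked on a small example. The sketch below is only an illustration: the helper name combined_mean and the two sample groups are made up for this check and are not part of the answer's code. It pools two per-group means by their counts and compares the result with the mean of all raw values taken together.

```python
import numpy as np

def combined_mean(means, counts):
    """Pool per-group means into one mean, weighting each by its group size."""
    means = np.asarray(means, dtype=float)
    counts = np.asarray(counts, dtype=float)
    return (means * counts).sum() / counts.sum()

# Two hypothetical groups of raw values (illustrative data only)
group_a = np.array([8.0, 10.0, 12.0])   # mean 10.0, count 3
group_b = np.array([30.0])              # mean 30.0, count 1

# Mean rebuilt from per-group summaries: (10*3 + 30*1) / (3 + 1)
pooled = combined_mean([group_a.mean(), group_b.mean()],
                       [group_a.size, group_b.size])

# Mean computed directly from all raw values, for comparison
direct = np.concatenate([group_a, group_b]).mean()
```

Because the pooled result matches the direct one, it should be enough to keep one mean and one count per group rather than going back to the full dataframe.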

New setup

import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

df = pd.DataFrame(np.random.randint(0,100,size=(100000, 4)),
                  columns=['Group','Dist1','Dist2','Value'])
distances = [1,2]
# get means of all values and count, the totals for each sample
df_groups = df.groupby('Group')[['Dist1','Dist2','Value']].agg({'Dist1':'mean','Dist2':'mean',
                                                                  'Value':['mean','count']})
# remove multicolumn index
df_groups.columns = [' '.join(col).strip() for col in df_groups.columns.values]
# rename columns
df_groups.rename(columns={'Dist1 mean':'Dist1','Dist2 mean':'Dist2','Value mean':'Value','Value count':
                          'Count'},inplace=True)


# create KDTree for quick searching
tree = cKDTree(df_groups[['Dist1','Dist2']])

for i in distances:
    closeby = tree.query_ball_tree(tree, r=i)
    # put into density column
    df_groups['groups_within_' + str(i) + 'miles'] = [len(x) for x in closeby]
    #create column to look for subsets
    df_groups['subs'] = [df_groups.index.values[idx] for idx in closeby]
    #set this column to prep updated mean calculation
    df_groups['ComMean'] = df_groups['Value'] * df_groups['Count']

    #perform updated mean
    df_groups[str(i) + '_mean_values'] = [(df_groups.loc[df_groups.index.isin(row), 'ComMean'].sum() /
                                          df_groups.loc[df_groups.index.isin(row), 'Count'].sum()) for row in df_groups['subs']]
    df = pd.merge(df, df_groups[['groups_within_' + str(i) + 'miles',
                                 str(i) + '_mean_values']],
                  left_on='Group',
                  right_index=True)
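A possible further refinement, offered as a sketch rather than as part of the answer above: query_ball_tree returns positional indices, so the per-group sums can be taken straight from NumPy arrays without calling isin at all. The table groups, the radius of 1.0 and the helper neighbour_means are assumptions made up for this illustration; only the column names mirror df_groups.

```python
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

# Hypothetical per-group table using the same column names as df_groups
rng = np.random.default_rng(0)
groups = pd.DataFrame({
    'Dist1': rng.uniform(0, 10, size=200),
    'Dist2': rng.uniform(0, 10, size=200),
    'Value': rng.uniform(0, 100, size=200),
    'Count': rng.integers(1, 50, size=200),
})

tree = cKDTree(groups[['Dist1', 'Dist2']].to_numpy())
# One list of positional neighbour indices per row (each row includes itself)
closeby = tree.query_ball_tree(tree, r=1.0)

# Per-group sum of values (mean * count) and per-group counts as plain arrays
weighted = (groups['Value'] * groups['Count']).to_numpy()
counts = groups['Count'].to_numpy()

def neighbour_means(neighbours, weighted, counts):
    """Combined mean over each row's neighbours, using positional indexing."""
    return np.array([weighted[idx].sum() / counts[idx].sum()
                     for idx in neighbours])

fast = neighbour_means(closeby, weighted, counts)

# The isin-based lookup on the same table, kept only as a cross-check
slow = np.array([
    weighted[groups.index.isin(groups.index.values[idx])].sum() /
    counts[groups.index.isin(groups.index.values[idx])].sum()
    for idx in closeby
])
```

Both routes should give the same means up to floating-point rounding. The positional version avoids scanning the whole index once per row, which may matter when the number of groups is large.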