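I have a dataset of geographic data that I am trying to smooth. To do this, for each row I find all the neighbours within some radius r, select those rows, take their mean, and add it as a new column to the original dataframe. The code for this is: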
import pandas as pd
import numpy as np
import scipy.spatial as spatial
d = {'id': [1,2,3,4,5], 'x': [1,2,3,3,4], 'y': [1,3,2,3,4], 'factor1':[4,5,2,7,4], 'factor2':[6,4,8,3,2]}
df = pd.DataFrame(data=d)
factor = ["factor1", "factor2"]
dist = [2,1.5]
X=np.transpose(np.array([df.x, df.y]))
tree = spatial.cKDTree(X)
for i in dist:
    for j in factor:
        df[j + "_Mean_" + str(i)] = df.apply(lambda row: df[j][tree.query_ball_point([row.x, row.y],i)].mean(), axis=1)
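This currently works, but it is slow because it has to loop over every factor to average it. Since finding the nearest neighbours is the part that takes the time, there may be a way to select all the neighbouring rows once, average all the columns in one go, and add them to the dataset, but I can't figure out how, or whether it is possible. I have tried finding the indices of every row's neighbours and storing them in the dataset inside the i loop, but this uses a lot of memory and crashes.
I feel this could be done better.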
Answer 0 (score: 0)
I see a small improvement (about 20%) from using a list comprehension instead of df.apply.
But check how it scales with your full dataset.
import pandas as pd
import numpy as np
import scipy.spatial as spatial
d = {'id': [1,2,3,4,5], 'x': [1,2,3,3,4], 'y': [1,3,2,3,4], 'factor1':[4,5,2,7,4], 'factor2':[6,4,8,3,2]}
df = pd.DataFrame(data=d)
factor = ["factor1", "factor2"]
dist = [2,1.5]
X=np.transpose(np.array([df.x, df.y]))
tree = spatial.cKDTree(X)
def original(df):
    for i in dist:
        for j in factor:
            df[j + "_Mean_" + str(i)] = df.apply(lambda row: df[j][tree.query_ball_point([row.x, row.y],i)].mean(), axis=1)
    return df
def jp(df):
    calc = tree.query_ball_point
    for i in dist:
        for j in factor:
            df_filter = df[j]
            df[j + "_Mean_" + str(i)] = [df_filter[calc([x, y],i)].mean() for x, y in zip(df['x'], df['y'])]
    return df
%timeit original(df) # 100 loops, best of 3: 13.1 ms per loop
%timeit jp(df) # 100 loops, best of 3: 10.9 ms per loop
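The idea raised in the question, finding the neighbours once per radius and averaging all the factor columns together, can be sketched as follows. This is only a rough sketch and has not been timed here: the function name add_neighbour_means and its parameters are made up for illustration, and it relies on query_ball_point accepting the whole coordinate array in one call and returning one list of neighbour indices per point.

```python
import numpy as np
import pandas as pd
from scipy import spatial


def add_neighbour_means(df, factors, radii, x_col="x", y_col="y"):
    """Hypothetical helper: return a copy of df with one
    '<factor>_Mean_<radius>' column per (factor, radius) pair."""
    coords = df[[x_col, y_col]].to_numpy()
    values = df[factors].to_numpy(dtype=float)
    tree = spatial.cKDTree(coords)
    out = df.copy()
    for r in radii:
        # One tree query per radius, covering every point at once.
        neighbours = tree.query_ball_point(coords, r)
        # Each point is its own neighbour, so idx is never empty.
        # mean(axis=0) averages all factor columns in one step.
        means = np.array([values[idx].mean(axis=0) for idx in neighbours])
        for k, name in enumerate(factors):
            out[f"{name}_Mean_{r}"] = means[:, k]
    return out


d = {'id': [1, 2, 3, 4, 5], 'x': [1, 2, 3, 3, 4], 'y': [1, 3, 2, 3, 4],
     'factor1': [4, 5, 2, 7, 4], 'factor2': [6, 4, 8, 3, 2]}
df = pd.DataFrame(data=d)
result = add_neighbour_means(df, ["factor1", "factor2"], [2, 1.5])
```

The tree is queried len(dist) times rather than len(dist) * len(factor) * len(df) times, and the neighbour lists are discarded after each radius, so they are never all held in memory together. Because it indexes the underlying NumPy array by position, it also does not depend on the dataframe having a default integer index.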