问题:如何为地理数据框中的所有元素选择“ n个满足条件的最近邻居”?
示例: “对于森林中的所有树木,半径100 m以内的两个最高松树的高度是多少?” (请注意,“树”不一定是“松树”。)
如果我只想要每棵树最近的邻居树,我可以使用
libpysal.weights.KNN.from_dataframe(df_g, k=2, radius=100)
(考虑到地理数据框)
我正在寻找一种方法来获取最符合条件的邻居。
可行示例
这段代码定义了一个具有9个点的地理数据框:
import pandas as pd, libpysal, geopandas as gp,matplotlib.pyplot as plt
from shapely.wkt import loads
# 18 points with values and types
points=['POINT (0.1 0.2)','POINT (-1 0)','POINT (1 0)','POINT (0 -1)','POINT (0 1)','POINT (-2 0)','POINT (2 0)','POINT (0 -2)','POINT (0 2)']
values=[9,8,7,6,5,4,3,2,1]
types=[0,0,0,0,0,1,1,1,1]
df=pd.DataFrame({'points':points,'value':values,'types':types})
gdf=gp.GeoDataFrame(df,geometry=[loads(x) for x in df.points])
我想寻找类型2为1的半径范围内的邻居。
因此,对于中心点,我想在橙色点而不是蓝色点中寻找邻居:
如果类型不是问题,我可以遍历最近的邻居,例如:
knn2 = libpysal.weights.KNN.from_dataframe(gdf, k=2,radius=2)
for index,row in gdf.iterrows(): # Looping over all points
knn_neighbors = knn2.neighbors[index] # Get neighbors
knnsubset = gdf.iloc[knn_neighbors] # Get subdataframe
print("Mean: ",knnsubset['value'].mean()) # Calculating mean of 'value'
对于中心点,将选择两个绿色点,如图所示:
但是,我只考虑橙色点。
简单的“修复”:
我当然可以选择“足够”的邻居,然后再过滤它们:
knn2 = libpysal.weights.KNN.from_dataframe(gdf, k=8,radius=2) # Select enough neighbors
for index,row in gdf.iterrows(): # Looping over all points
knn_neighbors = knn2.neighbors[index] # Get neighbors
knnsubset = gdf.iloc[knn_neighbors] # Get subdataframe
knnsubset=knnsubset[knnsubset.types==1].head(2) #Require type 1 and take the two first
print("Mean: ",knnsubset['value'].mean()) # Calculating mean of 'value'
如图所示,它选择了正确的点。但是,有两个问题:
这似乎是某个人会在某个时候解决的问题。有指针吗?我应该在想sklearn吗?
答案 0 :(得分:0)
如评论中所述,您可以使用DistanceBand而不是KNN进行第一次过滤(因为在KNN中,您实际上不知道要选择多少个至少2个松树,您必须选择多少个)。
从上述gdf开始:
W = libpysal.weights.DistanceBand.from_dataframe(gdf, threshold=2)
results = [] # it is faster to save to list than directly to gdf
for index, row in gdf.iterrows():
neighbors = W.neighbors[index]
subset = gdf.loc[neighbors] # get all within threshold
limited = subset.loc[subset.types == 1] # limit to type you want - here you can do more filtering
if len(limited) > 0: # if something of type 1 is close enough
limited['dist'] = limited.geometry.distance(row.geometry) #get distances
limited.nsmallest(n=2, columns='dist')
results.append(list(limited.index)) # save indices of selection
else:
results.append(None)
gdf['result'] = results # save to gdf
这将保存满足条件的点的索引(或无)。
不过,我不确定它在大型数据集中的表现如何,但熊猫过滤应该不是问题,并且可以向量化到有限df的距离。