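Distance matrix: filter the number of nearest neighbors by minimum and maximum distance

Asked: 2019-10-28 20:25:39

Tags: python-3.x cluster-analysis nearest-neighbor euclidean-distance distance-matrix

I have code that generates a distance matrix between the IDs in my dataset: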

id          5141        5578        5141  ...        5822        5170        5680
id                                                                          
5141    0.000000   47.169906    1.000000  ...   77.524190  134.851770  112.178429
5578   47.169906    0.000000   47.265209  ...  111.521298  127.882759  126.479247
5141    1.000000   47.265209    0.000000  ...   76.661594  135.823415  113.159180
5578   48.166378    1.000000   48.259714  ...  112.294256  128.003906  127.027556
5141    8.602325   54.744863    8.062258  ...   69.771054  141.481448  115.974135
5578   49.162994    2.000000   49.254441  ...  113.070774  128.132744  127.581347
5578   49.091751    2.236068   49.162994  ...  112.445542  129.123971  128.413395
5141   13.928388   60.671245   13.601471  ...   67.230945  143.251527  115.351636
5578   51.088159    4.123106   51.156622  ...  114.017543  129.402473  129.529919
5141   16.278821   63.387696   16.124515  ...   68.007353  142.337627  113.159180
5578   51.088159    4.123106   51.156622  ...  114.017543  129.402473  129.529919
5141   16.124515   63.285069   16.031220  ...   68.949257  141.396605  112.160599
5578   50.089919    3.162278   50.159745  ...  113.229855  129.259429  128.968989
5141   14.764823   60.074953   15.264338  ...   78.434686  131.912850  103.392456
5141   16.401219   57.706152   17.204651  ...   85.094066  125.251746   97.739450
5578   50.089919    3.162278   50.159745  ...  113.229855  129.259429  128.968989
5578   50.089919    3.162278   50.159745  ...  113.229855  129.259429  128.968989
5141   17.000000   56.089215   17.888544  ...   87.664132  122.702893   96.026038
5578   50.089919    3.162278   50.159745  ...  113.229855  129.259429  128.968989
5141   17.492856   57.070132   18.357560  ...   87.315520  123.032516   95.885348
5578   50.089919    3.162278   50.159745  ...  113.229855  129.259429  128.968989
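
The code that produces the matrix is not shown here. As a rough sketch only, assuming each observation has an id and two coordinates (the column names `x` and `y` and the sample values are made up for illustration), a labelled Euclidean distance matrix could be built like this:

```python
import numpy as np
import pandas as pd

# Hypothetical input: one row per observation, with an id and x/y coordinates.
df = pd.DataFrame({
    "id": [5141, 5578, 5141, 5822],
    "x": [0.0, 40.0, 1.0, 3.0],
    "y": [0.0, 25.0, 0.0, 4.0],
})

coords = df[["x", "y"]].to_numpy()

# Pairwise differences via broadcasting, shape (n, n, 2).
diff = coords[:, None, :] - coords[None, :, :]
# Euclidean distance between every pair of rows, shape (n, n).
dist = np.sqrt((diff ** 2).sum(axis=-1))

# Label rows and columns with the ids; ids may repeat, as in the matrix above.
labels = pd.Index(df["id"], name="id")
df_dist = pd.DataFrame(dist, index=labels, columns=labels)
```

Because the same id can occur in several rows, row and column labels are not unique, so anything that works on this matrix is safer when it addresses cells by position rather than by label.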

My goal is to find groups of IDs based on these distances. What I do next is:

# Keep the column name where a cell holds the minimum non-zero distance; set every other cell to False.
closest = np.where(df_dist.eq(df_dist[df_dist != 0].min(),0),df_dist.columns,False)

This gives me the name of the closest ID in each cell:

Out[32]: 
array([[   0,    0, 5141, ...,    0,    0,    0],
   [   0,    0,    0, ...,    0,    0,    0],
   [5141,    0,    0, ...,    0,    0,    0],
   ...,
   [   0,    0,    0, ...,    0,    0,    0],
   [   0,    0,    0, ...,    0,    0,    0],
   [   0,    0,    0, ...,    0,    0,    0]], dtype=int64)

# Drop the False entries from each row and keep the remaining column names as a list.
df1['closest'] = [i[i.astype(bool)].tolist() for i in closest]
# Remove duplicate IDs within each list.
df1['closest'] = df1['closest'].agg(pd.unique)

This gives me a new column with the closest IDs.

date

2019-09-17 12:00:00.032000+00:00          [5141]
2019-09-17 12:00:00.032000+00:00    [5578, 5621]
2019-09-17 12:00:00.191000+00:00          [5141]
2019-09-17 12:00:00.191000+00:00          [5578]
2019-09-17 12:00:00.505000+00:00          [5141]
2019-09-17 12:00:00.505000+00:00    [5578, 5621]
2019-09-17 12:00:00.740000+00:00          [5578]
2019-09-17 12:00:00.740000+00:00          [5622]
2019-09-17 12:00:01.034000+00:00    [5578, 5621]
2019-09-17 12:00:01.034000+00:00    [5141, 5622]
2019-09-17 12:00:01.179000+00:00    [5578, 5621]
2019-09-17 12:00:01.179000+00:00          [5141]
2019-09-17 12:00:01.476000+00:00    [5578, 5621]
2019-09-17 12:00:01.476000+00:00          [5141]
2019-09-17 12:00:01.704000+00:00          [5141

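Now, how can I adapt this code to add

  1. a variable n, so that I get not just the single nearest neighbor but the n nearest neighbors (e.g. 3), and
  2. variables for a minimum and a maximum distance, so that I can control how IDs are paired? For example, if the distance between two IDs exceeds the maximum distance, they are treated as independent (sitting alone). If two or more IDs are within some minimum distance of each other, they are treated as sitting together in a group.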
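
To make the two points above concrete, here is a minimal sketch of one possible approach, not a tested solution for the real data. The function name `n_nearest` and its parameters `n`, `min_dist` and `max_dist` are hypothetical. It assumes a square matrix whose rows and columns are in the same order, skips each row's own diagonal cell by position, and removes repeated IDs before cutting the list to n entries:

```python
import numpy as np
import pandas as pd


def n_nearest(df_dist, n=3, min_dist=0.0, max_dist=np.inf):
    """For each row, return up to n distinct column labels, nearest first,
    keeping only distances d with min_dist <= d <= max_dist."""
    dist = df_dist.to_numpy(dtype=float)
    labels = df_dist.columns.tolist()
    result = []
    for i, row in enumerate(dist):
        order = np.argsort(row, kind="stable")  # column positions, nearest first
        candidates = [
            labels[j]
            for j in order
            if j != i and min_dist <= row[j] <= max_dist  # skip the row itself
        ]
        # dict.fromkeys removes repeated ids while preserving the order.
        result.append(list(dict.fromkeys(candidates))[:n])
    return result


# Toy 4x4 matrix with made-up distances.
ids = [1, 2, 3, 4]
toy = pd.DataFrame(
    [[0.0, 1.0, 5.0, 9.0],
     [1.0, 0.0, 4.0, 8.0],
     [5.0, 4.0, 0.0, 2.0],
     [9.0, 8.0, 2.0, 0.0]],
    index=ids, columns=ids,
)
neighbours = n_nearest(toy, n=2, min_dist=0.5, max_dist=6.0)
# neighbours == [[2, 3], [1, 3], [4, 2], [3]]
```

A row whose list comes back empty has no neighbor inside the allowed distance band, which would correspond to an ID that is on its own.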

I hope this makes sense and that someone out there can help me.
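
The grouping idea from point 2 could be sketched roughly as follows. This is only one reading of the requirement: it uses a single threshold (called `max_dist` here, a hypothetical name), links two rows when their distance is at most that threshold, and puts rows that are linked directly or through a chain into the same group. Rows are handled by position because IDs repeat in the matrix.

```python
import pandas as pd


def group_rows(df_dist, max_dist):
    """Return one group number per row. Rows whose distance is <= max_dist
    are linked, and linked rows (directly or via a chain) share a group."""
    dist = df_dist.to_numpy(dtype=float)
    n = len(dist)
    parent = list(range(n))  # union-find forest, one node per row

    def find(i):
        # Follow parent pointers to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i, j] <= max_dist:
                parent[find(j)] = find(i)  # merge the two groups

    # Renumber the roots as 0, 1, 2, ... in order of first appearance.
    numbering = {}
    return [numbering.setdefault(find(i), len(numbering)) for i in range(n)]


# Toy 4x4 matrix with made-up distances.
ids = [1, 2, 3, 4]
toy = pd.DataFrame(
    [[0.0, 1.0, 5.0, 9.0],
     [1.0, 0.0, 4.0, 8.0],
     [5.0, 4.0, 0.0, 2.0],
     [9.0, 8.0, 2.0, 0.0]],
    index=ids, columns=ids,
)
groups = group_rows(toy, max_dist=2.0)
# groups == [0, 0, 1, 1]
```

A group number that occurs only once marks a row that is sitting alone. Chaining means two rows can end up in the same group even if they are farther apart than the threshold, as long as rows in between connect them; if that is not wanted, a stricter rule would be needed.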

0 answers:

No answers