import numpy as np
import pandas as pd
from scipy.spatial import distance

# Create random df
df = pd.DataFrame(np.random.randint(1, 10, size=(100, 23)))
test = df[:50]
for i in range(len(test)):
    query_node = test.iloc[i]
    # Find the distance between this node and everyone else
    euclidean_distances = test.apply(lambda row: distance.euclidean(row, query_node), axis=1)
    # Create a new dataframe with distances.
    distance_frame = pd.DataFrame(data={"dist": euclidean_distances, "idx": euclidean_distances.index})
    distance_frame.sort_values("dist", inplace=True)
    # Skip position 0 (the query row itself) and keep the next three rows
    smallest_dist = [dist["idx"] for idx, dist in distance_frame.iloc[1:4].iterrows()]
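I'm stumped on this one and wondering if anyone can see where I'm going wrong. I'm trying to calculate the Euclidean distance between each row and every other row. I then sort these distances and return the index positions of the "most similar" rows, i.e. those with the smallest distances, in the list smallest_dist.
The problem is that this only ever returns the most similar index positions for the last row: [6.0, 3.0, 4.0]
The output I want looks like this:
Original ID | Matches |
---|---|
1 | 4,5,6 |
2 | 8,2,5 |
I have tried the following, but it gives the same result: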
list_of_mins = []
for i in range(len(test)):
    query_node = test.iloc[i]
    # Find the distance between this node and everyone else
    euclidean_distances = test.apply(lambda row: distance.euclidean(row, query_node), axis=1)
    # Create a new dataframe with distances.
    distance_frame = pd.DataFrame(data={"dist": euclidean_distances, "idx": euclidean_distances.index})
    distance_frame.sort_values("dist", inplace=True)
    smallest_dist = [dist["idx"] for idx, dist in distance_frame.iloc[1:4].iterrows()]
for i in range(len(test)):
    list_of_mins.append(smallest_dist_ixs)
Does anyone know what's causing this problem? Thank you!
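Answer 0 (score: 1)
I don't have the distance library available, so I replaced it with a simple sum, but this should work once you swap the distance call back in.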
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(1, 10, size=(100, 23)))
test = df[:50]
dict_results = {'ids': [],
                'ids_min': []}
n_min = 2
for i in range(len(test)):
    query_node = test.iloc[i]
    # Find the distance between this node and everyone else
    # (placeholder: a plain row sum stands in for distance.euclidean)
    euclidean_distances = test.apply(lambda row: np.sum(row), axis=1)
    # Create a new dataframe with distances.
    # print(euclidean_distances)
    distance_frame = pd.DataFrame(data={"dist": euclidean_distances,
                                        "idx": euclidean_distances.index})
    selected_min = distance_frame.sort_values("dist").head(n_min)
    dict_results['ids'].append(i)
    dict_results['ids_min'].append(', '.join(selected_min['idx'].astype('str')))
print(pd.DataFrame(dict_results))
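Since the sum above is only a stand-in, here is a minimal sketch of the same loop with a Euclidean distance computed using NumPy alone, for the case where SciPy is not installed. The names `n_min` and `dict_results` follow the code above; `rng`, `dists`, `nearest` and `result` are illustrative names, and dropping the query row before sorting is an assumption about the intended output (a row should not be listed as its own match).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible
df = pd.DataFrame(rng.integers(1, 10, size=(100, 23)))
test = df[:50]

n_min = 3
dict_results = {"ids": [], "ids_min": []}

for i in range(len(test)):
    query_node = test.iloc[i]
    # Euclidean distance from the query row to every row, using NumPy only.
    dists = np.sqrt(((test - query_node) ** 2).sum(axis=1))
    # Drop the query row itself (distance 0) before picking the nearest rows.
    nearest = dists.drop(query_node.name).sort_values().head(n_min)
    # Append inside the loop so every row's result is kept.
    dict_results["ids"].append(query_node.name)
    dict_results["ids_min"].append(", ".join(nearest.index.astype(str)))

result = pd.DataFrame(dict_results)
print(result.head())
```

Dropping the row by label, rather than slicing off the first sorted entry, still behaves sensibly if another row happens to be identical to the query row.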
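I made a few changes to your code:
- Added an n_min parameter to define how many elements you want in the second column (the number of closest-row indices).
- Changed the way distance_frame is parsed.
Answer 1 (score: 0)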
What happens if you try returning the results in a dataframe or (for ease of testing) a dictionary? For example:
df = pd.DataFrame(np.random.randint(1, 10, size=(100, 23)))
test = df[:50]
closest_nodes = {}
for i in range(len(test)):
    query_node = test.iloc[i]
    # Find the distance between this node and everyone else
    euclidean_distances = test.apply(lambda row: distance.euclidean(row, query_node), axis=1)
    # Create a new dataframe with distances.
    distance_frame = pd.DataFrame(data={"dist": euclidean_distances, "idx": euclidean_distances.index})
    distance_frame.sort_values("dist", inplace=True)
    # Store this row's three nearest neighbours under its own key
    closest_nodes[i] = [dist["idx"] for idx, dist in distance_frame.iloc[1:4].iterrows()]
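What I don't see in your code is any kind of storage operation that puts the result for each test case into a permanent structure. Because smallest_dist is reassigned on every iteration, only the last row's result is left once the loop finishes.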
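For comparison, here is a sketch of the same store-one-result-per-row idea without an explicit Python loop over rows. It assumes SciPy is installed and uses `scipy.spatial.distance.cdist`, which is a substitution of mine and not something used in the question. The names `dist_matrix`, `nearest_pos`, `matches` and the column names are hypothetical; `closest_nodes` follows the code above.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible
df = pd.DataFrame(rng.integers(1, 10, size=(100, 23)))
test = df[:50]

# Full 50 x 50 matrix of pairwise Euclidean distances.
dist_matrix = cdist(test.to_numpy(), test.to_numpy(), metric="euclidean")
# Put infinity on the diagonal so a row is never matched with itself.
np.fill_diagonal(dist_matrix, np.inf)

# Positions of the three smallest distances in each row, nearest first.
nearest_pos = np.argsort(dist_matrix, axis=1)[:, :3]

# Map positions back to index labels and keep one entry per row.
closest_nodes = {
    label: test.index[pos].tolist()
    for label, pos in zip(test.index, nearest_pos)
}

matches = pd.DataFrame({
    "original_id": list(closest_nodes),
    "matches": [", ".join(map(str, ids)) for ids in closest_nodes.values()],
})
print(matches.head())
```

For 50 rows either approach is fast; the matrix version mainly removes the per-row `apply` call and makes the exclusion of the query row explicit.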