Question

I have a CSV file with only 20 data points, and I want to find the nearest neighbor of a new data point.

My CSV file looks like this:

temp    rain
79        12
81        13 
79        4
61        9
60        15
45        5
34        5
100       9
101       3
59        11
58        16

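So I would like to know the correct way to find the nearest neighbor of the point (65, 7) using Euclidean distance and KNN. Most of the algorithms available online use large datasets, such as iris or German in R, but this one is small and does not need cleaning, so I feel those solutions overcomplicate the problem. I am still very new to R, so I may be overlooking a solution. Thank you for taking the time to read this!

I tried the code below, but it keeps throwing errors. Again, I think I am just overcomplicating the problem.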

df <- read.csv("data.csv", header = FALSE, sep = ',')

head(df) 

ran <- sample(1:nrow(df), 0.9 * nrow(df)) 

nor <-function(x) { (x -min(x))/(max(x)-min(x))   }

df_train <- df[ran,] 

df_test <- df[-ran,] 
##extract 5th column of train dataset because it will be used as 'cl' argument in knn function.
df_target_category <- df[ran,2]
##extract 5th column if test dataset to measure the accuracy
df_test_category <- df[-ran,2]

library(class)

pr <- knn(df_train,df_test,cl=df_target_category,k=13)

##create confusion matrix
tab <- table(pr,df_test_category)

accuracy <- function(x){sum(diag(x)/(sum(rowSums(x)))) * 100}
accuracy(tab)

Answer 1

I'm not sure what your question has to do with KNN. Why not simply compute the Euclidean distance from the new point to every point in df and then pick the closest one? For this we can use R's dist, which returns a distance matrix (Euclidean by default).

Here is a minimal example in two steps, based on the sample data you provided.

# Calculate Euclidean distances of `pt` to all points in `df`
dist_to_pt <- as.matrix(dist(rbind(df, pt)))[nrow(df) + 1, 1:nrow(df)]

# Determine the point in `df` with minimal distance to `pt`
dist_to_pt[which.min(dist_to_pt)]
#       4
#4.472136

So point 4 in df is the nearest neighbor of the new point at (65, 7).

We can visualize the old and new data

library(dplyr)
library(ggplot2)
rbind(df, pt) %>%
    mutate(
        pt_number = row_number(),
        source = ifelse(pt_number > nrow(df), "new", "ref")) %>%
    ggplot(aes(temp, rain, colour = source, label = pt_number)) +
    geom_point() +
    geom_text(position = position_nudge(y = -0.5))

https://stackoverflow.com/a/58155350/3633589

Point 4 is the nearest neighbor of the new point (labelled 12 in the plot) at (65, 7).

Sample data

df <- read.table(text =
    "temp    rain
79        12
81        13
79        4
61        9
60        15
45        5
34        5
100       9
101       3
59        11
58        16", header = T)

# New point
pt <- c(temp = 65, rain = 7)
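If more than one neighbor is wanted, the same distance vector can simply be sorted. The following is a minimal sketch, not part of the original answer: it rebuilds the sample data with `data.frame` so that it runs on its own, and `k` and `nearest_idx` are names introduced here purely for illustration. For row 4 the distance works out as sqrt((65 - 61)^2 + (7 - 9)^2) = sqrt(20), roughly 4.47.

```r
# Sample data from the question, rebuilt so the sketch is self-contained
df <- data.frame(
  temp = c(79, 81, 79, 61, 60, 45, 34, 100, 101, 59, 58),
  rain = c(12, 13, 4, 9, 15, 5, 5, 9, 3, 11, 16)
)
pt <- c(temp = 65, rain = 7)

# Euclidean distances from `pt` to every row of `df` (same idea as above)
dist_to_pt <- as.matrix(dist(rbind(df, pt)))[nrow(df) + 1, 1:nrow(df)]

# Hypothetical choice of k; order() ranks rows from closest to farthest
k <- 3
nearest_idx <- order(dist_to_pt)[1:k]

# The k closest rows together with their distances
cbind(df[nearest_idx, ], distance = dist_to_pt[nearest_idx])
```

With `k <- 1` this reduces to the `which.min` lookup. Note that temp and rain are compared on their raw scales here; whether they should be rescaled first depends on the data and is not something the sketch decides.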

Answer 2

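I think base R is enough to compute the Euclidean distance, i.e.

# `df` is the data from the question; `p` holds the new point (65, 7)
distance <- sqrt(rowSums((df - do.call(rbind, replicate(nrow(df), p, simplify = FALSE)))^2))
nearest <- df[which.min(distance), ]

such that

> nearest
  temp rain
4   61    9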
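Because the answer's definitions of `df` and `p` did not survive, here is a self-contained sketch of the same base R idea. It assumes `p` is a named numeric vector holding the new point, and it swaps the `replicate`/`rbind` step for `sweep`, which subtracts `p` from each row directly; that substitution is a choice made here, not the answer's own code.

```r
# Sample data from the question
df <- data.frame(
  temp = c(79, 81, 79, 61, 60, 45, 34, 100, 101, 59, 58),
  rain = c(12, 13, 4, 9, 15, 5, 5, 9, 3, 11, 16)
)

# Assumption: the new point stored as a named numeric vector
p <- c(temp = 65, rain = 7)

# sweep() subtracts p[j] from column j (its default FUN is "-"),
# then we square, sum across each row, and take the square root
distance <- sqrt(rowSums(sweep(as.matrix(df), 2, p)^2))

# Row of `df` with the smallest distance to `p`
nearest <- df[which.min(distance), ]
nearest
```

This selects row 4 (temp 61, rain 9), matching the output shown above.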

R: find the nearest neighbor of a selected point
