How to return the indices of the nearest neighbours in knngow

Time: 2017-10-30 12:46:17

Tags: r knn nearest-neighbor

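I want to use knngow from the dprep package. Besides returning the predicted label for the test data, I would also like to get the row index of the nearest neighbour in the training data. Does the package provide a function for this? My data look like this: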

df1<-data.frame(c("a","b","c"),c(1,2,3),c("T","F","T"))
df2<-data.frame(c("a","d","f"),c(4,1,3),c("F","F","T"))
mylist1<-list()
mylist1[[1]]<-df1
mylist1[[2]]<-df2
tst1<-data.frame(c("f"),c(2))
library(dprep)
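knn_models <- vector("list", length(mylist1))
for (i in seq_along(mylist1)) {
    # keep one result per training set instead of overwriting it on each pass
    knn_models[[i]] <- knngow(mylist1[[i]], tst1, 1)
}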

Besides the label, I would like the output to tell me, for example, that the nearest neighbour is row 3 of mylist1[[2]].

1 answer:

Answer 0: (score: 1)

Updated based on your comment

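I don't see any function in the dprep package that returns the indices of the nearest neighbours in the training data (hopefully I haven't missed anything). What you can do instead is first compute a distance matrix using the Gower distance (FD package) and then pass that matrix to a k-nearest-neighbour function (the KernelKnn package accepts a distance matrix as input). If you decide to use KernelKnn, first install the latest version with devtools::install_github('mlampros/KernelKnn').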

# train-data    [ "col3" is the response variable; strings are converted to factors ]
df1 <- data.frame(col1 = c("a","d","f"), col2 = c(1,3,2), col3 = c("T","F","T"), stringsAsFactors = TRUE)

# test-data
tst1 <- data.frame(col1 = c("f"), col2 = c(2), stringsAsFactors = TRUE)

# rbind train and test data (remove the response variable from df1)
df_all <- rbind(df1[, -3], tst1)

# calculate distance matrix
dist_gower <- as.matrix(FD::gowdis(df_all))

# use the dist_gower distance matrix as input to the 'distMat.knn.index.dist' function
# additionally, use the 'TEST_indices' parameter to specify which row of the 'df_all' data.frame built above is the test observation
idxs <- KernelKnn::distMat.knn.index.dist(dist_gower, TEST_indices = c(4), k = 2, threads = 1, minimize = TRUE)

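idxs$test_knn_idx holds the row indices, in the training data, of the k nearest neighbours of the test observation, and idxs$test_knn_dist holds the corresponding distances: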

print(idxs)

$test_knn_idx
     [,1] [,2]
[1,]    3    1

$test_knn_dist
     [,1] [,2]
[1,]    0 0.75
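
As a cross-check that does not depend on KernelKnn, the same neighbours can be read straight off a Gower distance matrix with base R's order(). The sketch below rests on assumptions: it swaps in cluster::daisy(metric = "gower") for FD::gowdis (the cluster package normally ships with R), and the names train_rows, test_row, d_test and nn_idx are made up for illustration.

```r
# Same toy data as in the answer
df1  <- data.frame(col1 = c("a", "d", "f"), col2 = c(1, 3, 2),
                   col3 = c("T", "F", "T"), stringsAsFactors = TRUE)
tst1 <- data.frame(col1 = "f", col2 = 2, stringsAsFactors = TRUE)

# Stack predictors of train and test (response column dropped)
df_all <- rbind(df1[, -3], tst1)

# Assumption: cluster::daisy stands in for FD::gowdis here
dist_gower <- as.matrix(cluster::daisy(df_all, metric = "gower"))

train_rows <- seq_len(nrow(df1))   # rows 1..3 are the training data
test_row   <- nrow(df_all)         # row 4 is the test observation
k <- 2

# Distances from the test row to every training row, nearest first
d_test <- dist_gower[test_row, train_rows]
nn_idx <- order(d_test)[seq_len(k)]

nn_idx                  # row indices into df1
unname(d_test[nn_idx])  # the matching distances
df1[nn_idx, ]           # the neighbouring training rows themselves
```

For this toy data, rows 1 and 2 are equally far from the test row, so which of them comes second depends on how ties are broken; order() keeps the lower row index first.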

If you also want class-label probabilities, first convert the labels to numeric and then use the distMat.KernelKnn function:

# factor levels of col3 are "F", "T", so "F" becomes 1 and "T" becomes 2
y_numeric <- as.numeric(df1$col3)

labels <- KernelKnn::distMat.KernelKnn(dist_gower, TEST_indices = c(4), y = y_numeric, k = 2, regression = FALSE, threads = 1, Levels = sort(unique(y_numeric)), minimize = TRUE)

print(labels)

     class_1 class_2
[1,]       0       1

# class_2 corresponds to "T" from col3 (df1 data.frame)
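
To see where these numbers come from, the class shares can be recomputed by hand from the neighbour indices. This is only a sketch assuming a plain, unweighted vote among the k neighbours; the variable names nn_idx, votes and probs are illustrative, and the indices are copied from the $test_knn_idx output shown earlier.

```r
df1 <- data.frame(col1 = c("a", "d", "f"), col2 = c(1, 3, 2),
                  col3 = c("T", "F", "T"), stringsAsFactors = TRUE)

# Neighbour indices as printed in $test_knn_idx
nn_idx <- c(3, 1)

# Unweighted vote: count each class among the k neighbours
# (table() on a factor keeps levels with zero counts)
votes <- table(df1$col3[nn_idx])

# Share of each class among the neighbours
probs <- prop.table(votes)
probs

# Predicted label = most frequent class among the neighbours
names(which.max(votes))
```

Both neighbours carry the label "T", so "T" gets a share of 1 and "F" a share of 0.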

Alternatively, you can look at the source of dprep::knngow, especially the second part of the function, which is what you are actually interested in:

> print(dprep::knngow)

....
    else {
        for (i in 1:ntest) {

            tempo = order(StatMatch::gower.dist(test[i, -p], train[, -p]))[1:k]

            classes[i] = moda(train[tempo, p])[1]
        }
    }
.....
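
The order(...)[1:k] step in that loop already yields the neighbour indices, so one option is to lift it into a small function of your own that returns them together with the label. The following is a rough base-R sketch, not the dprep implementation: gower_to_train and knn_gower_idx are hypothetical helpers, the Gower distance is a simplified hand-written version (numeric columns scaled by the range pooled over train and test, other columns compared for equality, no handling of missing values), and ties in the vote go to the first class.

```r
# Hypothetical helper: Gower distance from one test row to each training row.
# Numeric columns: |x - y| / range; other columns: 0 if equal, 1 otherwise.
gower_to_train <- function(test_row, train_x) {
  per_col <- vapply(seq_along(train_x), function(j) {
    x <- train_x[[j]]
    y <- test_row[[j]]
    if (is.numeric(x)) {
      rng <- diff(range(c(x, y)))
      if (rng == 0) rep(0, length(x)) else abs(x - y) / rng
    } else {
      as.numeric(as.character(x) != as.character(y))
    }
  }, numeric(nrow(train_x)))
  # One row per training observation, one column per variable
  rowMeans(matrix(per_col, nrow = nrow(train_x)))
}

# Hypothetical helper: the class is assumed to be the last column of `train`
# (as in knngow); `test` holds the predictor columns only.
knn_gower_idx <- function(train, test, k = 1) {
  p <- ncol(train)
  lapply(seq_len(nrow(test)), function(i) {
    d     <- gower_to_train(test[i, , drop = FALSE], train[, -p, drop = FALSE])
    idx   <- order(d)[seq_len(k)]          # row indices of the k nearest
    votes <- table(train[idx, p])
    list(index = idx, distance = d[idx], label = names(which.max(votes)))
  })
}

# Data from the question, with column names added for readability
df1 <- data.frame(x1 = c("a", "b", "c"), x2 = c(1, 2, 3), y = c("T", "F", "T"))
df2 <- data.frame(x1 = c("a", "d", "f"), x2 = c(4, 1, 3), y = c("F", "F", "T"))
mylist1 <- list(df1, df2)
tst1 <- data.frame(x1 = "f", x2 = 2)

# One result per training set; each result has one entry per test row
res <- lapply(mylist1, knn_gower_idx, test = tst1, k = 1)
res[[2]][[1]]$index   # nearest training row in mylist1[[2]]
res[[2]][[1]]$label   # its class label
```

With the question's data this reports row 3 of mylist1[[2]] as the nearest neighbour, which is the kind of output the question asks for. Swapping the hand-written distance for StatMatch::gower.dist, as knngow does, should be a small change if you prefer to stay close to the original.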