Question

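When selecting a model in R, I want to check all permutations and combinations of columns. My dataset has 8 columns, and the code below lets me check some models, but not all of them. This loop does not cover models such as columns 1 + 6, columns 1 + 2 + 5, and so on. Is there a better way to achieve this?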

best_model <- rep(0,3) #store the best model in this array
for(i in 1:8){
  for(j in 1:8){
    for(x in k){
      diabetes_prediction <- knn(train = diabetes_training[, i:j], test = diabetes_test[, i:j], cl = diabetes_train_labels, k = x)
      accuracy[x] <- 100 * sum(diabetes_test_labels == diabetes_prediction)/183
      if( best_model[1] < accuracy[x] ){
        best_model[1] = accuracy[x]
        best_model[2] = i
        best_model[3] = j
      }
    }
  }
}

Answer 1

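Well, this answer isn't complete, but maybe it will get you started. You want to be able to subset by every possible subset of the columns. So instead of i:j for some i and j, you want to be able to subset by c(1,6) or c(1,2,5), and so on.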

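Using the sets package, you can get the power set of a set (the set of all its subsets). That part is easy. I'm new to R, so the hard part for me is understanding the difference between sets, lists, vectors, etc. I'm used to Mathematica, where they are all the same thing.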

  library(sets)
  my.set <- 1:8  # you want column indices from 1 to 8
  my.power.set <- set_power(my.set)  # this creates the set of all subsets of those indices
  my.names <- c("a")  #I don't know how to index into sets, so I created names (that are numbers, but of type characters)
  for(i in 1:length(my.power.set)) {my.names[i] <- as.character(i)}
  names(my.power.set) <- my.names
  my.indices <- vector("list",length(my.power.set)-1)
  for(i in 2:length(my.power.set)) {my.indices[i-1] <- as.vector(my.power.set[[my.names[i]]])} #this is the line I couldn't get to work

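I wanted to create a list of lists called my.indices, so that my.indices[[i]] is a subset of {1,2,3,4,5,6,7,8} that can be used wherever you have i:j. Your for loop would then have to run over 1:length(my.indices).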

But alas, I've been spoiled by Mathematica, so I can't decipher the incredibly complicated world of R data types.
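For what it's worth, the list of subsets described above can probably be built without the sets package. Here is a minimal sketch using base R's combn(); it assumes the columns of interest are simply numbered 1 to 8, and the helper has_subset() is a made-up name used only to check the result.

```r
# Enumerate every non-empty subset of the column indices 1..8 with base R.
n_cols <- 8

# combn(n_cols, m, simplify = FALSE) returns a list of all size-m subsets.
# Doing this for m = 1..n_cols and flattening one level gives every subset.
my.indices <- unlist(
  lapply(seq_len(n_cols), function(m) combn(n_cols, m, simplify = FALSE)),
  recursive = FALSE
)

length(my.indices)  # 2^8 - 1 non-empty subsets

# Hypothetical helper: is a given subset present in the list?
has_subset <- function(target) {
  any(vapply(
    my.indices,
    function(s) length(s) == length(target) && all(s == target),
    logical(1)
  ))
}

has_subset(c(1, 6))     # non-contiguous subsets are included
has_subset(c(1, 2, 5))
```

In the loop from the question, diabetes_training[, my.indices[[m]]] would then take the place of diabetes_training[, i:j], with m running over seq_along(my.indices). Note the double brackets: my.indices[m] is a one-element list, while my.indices[[m]] is the vector of column indices.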

Answer 2

Solved it. Here is the code, with explanatory comments:

# find the best model for this data
# knn() comes from the class package and as.binary() from the binaryLogic package;
# the `with = FALSE` indexing assumes diabetes_training and diabetes_test are data.tables.
# k (e.g. 1:27) and accuracy are assumed to be defined already, as in the question.
library(class)
library(binaryLogic)
number_of_columns_to_model <- ncol(diabetes_training) - 1
best_model <- c()
best_model_accuracy <- 0
# i = 0 is skipped because an all-zero combination selects no columns.
# The parentheses matter: 2:2^n-1 parses as (2:2^n) - 1.
for (i in 1:(2^number_of_columns_to_model - 1)) {
  # convert the value of i to binary, e.g. i=5 will give combination = 0 0 0 0 0 1 0 1
  combination <- as.binary(i, n = number_of_columns_to_model)
  model <- c()
  for (col_idx in 1:length(combination)) {
    # choose which columns to consider depending on the combination
    if (combination[col_idx])
      model <- c(model, col_idx)
  }
  for (x in k) {
    # for the columns selected by model, find the accuracy for each value of k (k = 1:27)
    diabetes_prediction <- knn(train = diabetes_training[, model, with = FALSE], test = diabetes_test[, model, with = FALSE], cl = diabetes_train_labels, k = x)
    accuracy[x] <- 100 * sum(diabetes_test_labels == diabetes_prediction) / length(diabetes_test_labels)
    if (best_model_accuracy < accuracy[x]) {
      best_model_accuracy <- accuracy[x]
      best_model <- model
      print(model)
    }
  }
}
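If installing binaryLogic is a problem, the same binary-counter idea can be sketched with base R's intToBits(). This is only an illustration: subset_from_mask() is a made-up helper name, 8 columns are assumed, and the bit order is least significant first, so a given number maps to a different subset than it does with as.binary(), although the full collection of subsets is the same.

```r
n_cols <- 8  # assumed number of candidate columns

# Decode subset number i (1 .. 2^n_cols - 1) into column indices.
# intToBits() returns 32 bits, least significant first; if bit b is set,
# column b belongs to the subset.
subset_from_mask <- function(i, n_cols) {
  bits <- as.integer(intToBits(i))[seq_len(n_cols)]
  which(bits == 1L)
}

subset_from_mask(5L, n_cols)   # 5 = 1 + 4, so bits 1 and 3 are set
subset_from_mask(33L, n_cols)  # 33 = 1 + 32, so bits 1 and 6 are set

# One entry per non-empty subset of the columns.
all_models <- lapply(seq_len(2^n_cols - 1), subset_from_mask, n_cols = n_cols)
length(all_models)
```

Each element of all_models can be used as the column selector in the knn() calls, in the same way as the model vector in the loop above.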

Answer 3

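I trained on Pima.tr and tested on Pima.te. KNN accuracy was 78% with preprocessed predictors and 80% without preprocessing (this is because some variables have a large influence).
The 80% performance is on par with a logistic regression model. You don't need to preprocess the variables for logistic regression. RandomForest and logistic regression both give hints about which variables to drop, so you don't need to try every possible combination. Another approach is to look at a scatterplot matrix.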
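A rough sketch of how such a comparison could be run is below. It assumes the Pima.tr and Pima.te data frames from the MASS package and knn() from the class package; the value of k and the choice of centring and scaling as the preprocessing step are assumptions, since neither is stated above, so the accuracies it prints need not match the figures quoted.

```r
library(MASS)   # Pima.tr (training) and Pima.te (test) data frames
library(class)  # knn()

predictors <- setdiff(names(Pima.tr), "type")
train_x <- Pima.tr[, predictors]
test_x  <- Pima.te[, predictors]

# KNN on the raw predictors.
k_assumed <- 15  # hypothetical choice of k, not taken from the answer
set.seed(1)      # knn() breaks ties at random
pred_raw <- knn(train_x, test_x, cl = Pima.tr$type, k = k_assumed)

# KNN on predictors centred and scaled with the training means and sds.
centre <- colMeans(train_x)
spread <- apply(train_x, 2, sd)
train_s <- scale(train_x, center = centre, scale = spread)
test_s  <- scale(test_x,  center = centre, scale = spread)
pred_scaled <- knn(train_s, test_s, cl = Pima.tr$type, k = k_assumed)

acc_knn_raw    <- mean(pred_raw == Pima.te$type)
acc_knn_scaled <- mean(pred_scaled == Pima.te$type)

# Logistic regression on all predictors; the coefficient table gives a
# rough indication of which variables contribute.
fit <- glm(type ~ ., data = Pima.tr, family = binomial)
coef_table <- summary(fit)$coefficients
prob_yes <- predict(fit, newdata = Pima.te, type = "response")
pred_glm <- ifelse(prob_yes > 0.5, "Yes", "No")
acc_glm <- mean(pred_glm == Pima.te$type)

print(c(knn_raw = acc_knn_raw, knn_scaled = acc_knn_scaled, glm = acc_glm))
print(round(coef_table, 3))
```

Predictors with small p-values in the coefficient table are the natural ones to keep. For the scatterplot matrix, something like pairs(Pima.tr[, predictors], col = Pima.tr$type) should draw it with the two types in different colours.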

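You get a sense that there is a difference between type 0 and type 1 when it comes to npreg, glu, bmi and age. You also notice that ped and age are highly skewed, and that there may be an anomalous data point between skin and the other variables (you may need to remove that observation before going further).

The skin vs. type box plot shows an extreme outlier for type Yes (try removing it). You also notice that most of the boxes for type Yes sit higher than those for type No => the variables may add predictive power to the model (you can confirm this with a Wilcoxon rank-sum test).

The high correlation between skin and bmi means you can use one of them, or an interaction of the two. Another way to reduce the number of predictors is to use PCA.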
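The checks mentioned above can be sketched in a few lines. This again assumes the Pima.tr data frame from the MASS package; skin is used for the rank-sum test only as an example, and no particular result is claimed here.

```r
library(MASS)  # Pima.tr data frame

# Wilcoxon rank-sum test: does skin differ between type "No" and "Yes"?
# exact = FALSE requests the normal approximation, which copes with ties.
wt <- wilcox.test(skin ~ type, data = Pima.tr, exact = FALSE)
wt$p.value

# Correlation between skin and bmi.
skin_bmi_cor <- cor(Pima.tr$skin, Pima.tr$bmi)
skin_bmi_cor

# PCA on the standardized predictors; the importance table shows how much
# variance each principal component explains.
predictors <- setdiff(names(Pima.tr), "type")
pca <- prcomp(Pima.tr[, predictors], center = TRUE, scale. = TRUE)
summary(pca)$importance
```

The first few columns of pca$x could then be passed to knn() in place of the original predictors, with the test set projected through predict(pca, newdata = ...).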
