Applying a function to generate confusion matrices from a nested list of classification tree class probabilities

Date: 2016-04-02 20:58:03

Tags: r list r-caret rpart confusion-matrix

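I apologise in advance for such a lengthy explanation of my problem. Using the three objects shuffle100, my_list and Final_lists (below), I generated 10 nested data frames of classification tree class probabilities (grouping factors: G8 and V4) inside a main list. I am sorry to ask such a simple question, but I cannot figure it out. If anyone can find a solution, many thanks.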

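Aim 1

(1) I would like to insert the function confusionMatrix() from the caret package into the function shuffle100 to produce 10 confusion matrices, one for each subset.

The functions shuffle100, my_list and Final_lists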

library(plyr)
library(caret)
library(e1071)
library(rpart)

set.seed(1235)

 shuffle100 <-lapply(seq(10), function(n){ #Select the production of 10 dataframes
 subset <- normalised_scores[sample(nrow(normalised_scores), 80),] #Shuffle rows
 subset_idx <- sample(1:nrow(subset), replace = FALSE)
 subset <- subset[subset_idx, ] #training subset
 subset1<-subset[-subset_idx, ] #test subset
 subset_resampled_idx <- createDataPartition(subset_idx, times = 1, p = 0.7, list = FALSE) #70 % training set    
 subset_resampled <- subset[subset_resampled_idx, ]
 ct_mod<-rpart(Matriline~., data=subset_resampled, method="class", control=rpart.control(cp=0.005)) #10 ct
 ct_pred<-predict(ct_mod, newdata=subset[, 2:13]) 
 ct_dataframe=as.data.frame(ct_pred)#create new data frame
 confusionMatrix(ct_dataframe, normalised_scores$Family)
 })

  Error in sort.list(y) : 'x' must be atomic for 'sort.list'
  Have you called 'sort' on a list?

 1: lapply(seq(10), function(n) {
subset <- normalised_scores[sample(nrow(normalised_scores
 2: FUN(X[[i]], ...)
 3: confusionMatrix(ct_dataframe, normalised_scores$Family)
 4: confusionMatrix.default(ct_dataframe, normalised_scores$Family)
 5: factor(data)
 6: sort.list(y)
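
The traceback ends in factor(data) because confusionMatrix() was handed a data frame of class probabilities, whereas it expects the predicted classes and the reference classes as factors with the same levels. A minimal sketch of the conversion follows, using made-up probabilities and labels (probs and actual are illustrative stand-ins, not the question's data) and base R's table() in place of caret:

```r
# Toy class-probability table, standing in for the output of predict(ct_mod, ...)
probs <- data.frame(G8 = c(0.18, 0.77, 0.77, 0.30),
                    V4 = c(0.82, 0.23, 0.23, 0.70))

# Made-up reference classes, one per row of probs
actual <- factor(c("V4", "V4", "G8", "V4"), levels = c("G8", "V4"))

# Pick the column with the larger probability in each row
# (ties.method = "first" sends ties to the first column, G8)
best_col  <- max.col(as.matrix(probs), ties.method = "first")
predicted <- factor(colnames(probs)[best_col], levels = levels(actual))

# Base-R confusion matrix built from the two factors
cm <- table(Predicted = predicted, Actual = actual)
print(cm)

# With caret loaded, the same two factors could be passed on instead:
# caret::confusionMatrix(predicted, actual)
```

For an rpart model fitted with method = "class", predict(ct_mod, newdata = ..., type = "class") should return such a factor directly. Either way, the reference vector needs one entry per predicted row, which normalised_scores$Family may not have when only a subset of rows is predicted.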

 #Produce three columns: Predicted, Actual and Binary
 my_list <- lapply(shuffle100, function(df){#Create two new columns Predicted and Actual
                  if (nrow(df) > 0)
                cbind(df, Predicted = c(""), Actual = c(""), Binary = c(""))
         else
                 bind(df, Predicted = character(), Actual = c(""), Binary = c (""))
                 })

#Fill the empty columns with NA's
Final_lists <- lapply(my_list, function(x) mutate(x, Predicted = NA, Actual = NA, Binary = NA)) 

#Create a dataframe from the column normalised_scores$Family to fill the Actual column

Actual_scores<-Final_normalised3$Family
Final_scores<-as.data.frame(Actual_scores)

#Fill in the Predicted, Actual and Binary columns

 Predicted_Lists <- Final_lists %>%
 mutate(Predicted=ifelse(G8 > V4, G8, V4)) %>% # assuming if G8 > V4 then Predicted=G8
 mutate(Actual=Final_scores) %>% # your definition of Actual is not clear
 mutate(Binary=ifelse(Predicted==Actual, 1, 0))

#Error messages

Error in ifelse(G8 > V4, G8, V4) : object 'G8' not found
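
This error is consistent with mutate() being given the list itself rather than the data frames inside it, so the columns G8 and V4 are never in scope. One possible way around it is to wrap the column-filling step in a function and apply it to each element with lapply(). The sketch below uses base R only; fill_columns, toy_lists and toy_actual are hypothetical names and made-up values introduced for illustration:

```r
# Hypothetical helper: fills Predicted, Actual and Binary for ONE data frame
fill_columns <- function(df, actual) {
  df$Predicted <- ifelse(df$G8 > df$V4, "G8", "V4")  # ties go to V4
  df$Actual    <- as.character(actual)
  df$Binary    <- as.integer(df$Predicted == df$Actual)
  df
}

# Toy list of two probability tables, standing in for Final_lists
toy_lists <- list(
  data.frame(G8 = c(0.1764706, 0.7692308), V4 = c(0.8235294, 0.2307692)),
  data.frame(G8 = c(0.7692308, 0.1764706), V4 = c(0.2307692, 0.8235294))
)

# Made-up actual classes, one per row of each data frame
toy_actual <- c("V4", "V4")

# Apply the helper to every data frame in the list
filled <- lapply(toy_lists, fill_columns, actual = toy_actual)
filled[[1]]
```

Because the comparison is vectorised, no row-by-row loop is needed inside the helper. The actual classes passed in must line up row for row with each data frame.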

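Aim 2

Write a function or for loop to fill the Predicted, Actual and Binary columns of each subset, depending on whether the probability in each row is greater in column V4 or in column G8. However, I am confused about the correct syntax for functions and loops.

A for loop that does not work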

  for(i in 1:length(Final_lists)){ #i loops through each dataframe in the list 
   for(j in 2:nrow(Final_lists[[i]])){ #j loops through each row of each dataframe in the list
   if(Final_lists[[i]][j, "G8"] > Final_lists[[i]][j, "V4"]) { #if the probability of G8 > V4 in each row of each dataframe in each list
      Final_lists[[i]][j, [j["Predicted" == "NA"]] ="G8" #G8 will be filled into the same row in the `Predicted' column
      }
    else {
   Final_lists[[i]][j, [Predicted == "NA"]] ="V4" #V4 will be filled into the same row in the `Predicted' column
    }
print(i)
    }
    }

Once the columns are filled, each subset should have this format:

               G8        V4 Predicted Actual Binary
        0.1764706 0.8235294        V4     V4      1
        0.7692308 0.2307692        G8     V4      0
        0.7692308 0.2307692        G8     V4      0
        0.7692308 0.2307692        G8     V4      0
        0.7692308 0.2307692        G8     V4      0
        0.1764706 0.8235294        V4     V4      1

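Filling in Predicted

If the probability of G8 > V4, the empty Predicted row is assigned G8. However, if V4 > G8, the empty Predicted row is assigned V4.

Filling in Actual

These are the actual classes that each subset's classification tree predictions are compared against; they are contained in the data frame normalised_scores.

Filling in Binary

If the Predicted and Actual rows have the same outcome (e.g. G8 and G8), the Binary row is assigned the value 1. However, if the Predicted and Actual rows differ (e.g. G8 and V4), the Binary row is assigned the value 0.

I achieved these aims with the working code below; however, I am not sure how to apply this code to the subsets in the main list.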

Working code for a single subset

Subsets in the main list

      set.seed(1235)

    # Randomly permute the data before subsetting
      mydat_idx <- sample(1:nrow(Final_normalised_scores), replace = FALSE)
      mydat <- Final_normalised3[mydat_idx, ]

      mydat_resampled_idx <- createDataPartition(mydat_idx, times = 1, p = 0.7, list = FALSE)
      mydat_resampled <- mydat[mydat_resampled_idx, ] # Training portion of the data
      mydat_resampled1 <- mydat[-mydat_resampled_idx, ]

      #Classification tree

      ct_mod <- train(x = mydat_resampled[, 2:13], y = as.factor(mydat_resampled[, 1]), 
            method = "rpart", trControl = trainControl(method = "repeatedcv", number=10, repeats=100, classProbs = TRUE))

       #Model predictions
       ct_pred <- predict(ct_mod, newdata = mydat[ , 2:13], type = "prob")
       Final_Predicted<-as.data.frame(ct_pred)

       #Produce three empty columns: Predicted, Actual and Binary

       Final_Predicted$Predicted<-NA
       Final_Predicted$Actual<-NA
       Final_Predicted$Binary<-NA

       #Fill in the Predicted column

      for (i in 1:length(Final_Predicted$G8)){
        if(Final_Predicted$G8[i]>Final_Predicted$V4[i]) {
           Final_Predicted$Predicted[i]<-"G8"
           }
      else {
           Final_Predicted$Predicted[i]<-"V4"
           }
           print(i)
           }

        #Fill in the Actual column using the actual predictions from the dataframe normalised_scores

        Final_Predicted$Actual<-normalised_scores$Family

        #Fill in the Binary column

        for (i in 1:length(Final_Predicted$Binary)){
           if(Final_Predicted$Predicted[i]==Final_Predicted$Actual[i]) {
              Final_Predicted$Binary[i]<-1
              }
         else {
              Final_Predicted$Binary[i]<-0
              }
              print(i)
              }

Reproducible dummy data

SummarySE (Rmisc package) to produce a barplot with error bars (ggplot2)

1 Answer:

Answer 0: (score: 1)

Your description of the problem is rather long, but a possible dplyr solution looks like this:

library(dplyr)

Final_Predicted$Actual <- ... # fill actual values
Final_Predicted <- Final_Predicted %>%
              mutate(Predicted=ifelse(G8 > V4, "G8", "V4")) %>% # quoted labels, not probabilities; if G8==V4 then Predicted=V4
              mutate(Binary=ifelse(Predicted==Actual, 1, 0))

I have not actually run this solution, but it should be something short and simple along these lines. Hope this helps.
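
For Aim 1, one way to get a confusion matrix per resample is to build it inside the lapply() call. This means splitting the rows into training and test sets by index and asking rpart for predicted classes rather than probabilities. The sketch below is an illustration under stated assumptions, not the asker's pipeline: it uses a two-class subset of the built-in iris data because normalised_scores is not available, and base R's table() instead of caret:

```r
library(rpart)  # ships with R as a recommended package

set.seed(1235)

# Stand-in data (an assumption): two-class subset of iris, unused level dropped
dat <- droplevels(subset(iris, Species != "setosa"))

conf_tables <- lapply(seq_len(10), function(n) {
  # 70/30 split by row index, so the test rows are those not used for training
  train_idx <- sample(nrow(dat), size = floor(0.7 * nrow(dat)))
  train <- dat[train_idx, ]
  test  <- dat[-train_idx, ]

  # Classification tree with the same cp as in the question
  fit <- rpart(Species ~ ., data = train, method = "class",
               control = rpart.control(cp = 0.005))

  # type = "class" returns a factor of predicted labels, not probabilities
  pred <- predict(fit, newdata = test, type = "class")

  # One confusion table per resample; with caret loaded,
  # caret::confusionMatrix(pred, test$Species) could be returned instead
  table(Predicted = pred, Actual = test$Species)
})

conf_tables[[1]]
```

Comparing predictions with the actual classes of the same test rows avoids the length mismatch that arises when a subset's predictions are compared against the full normalised_scores$Family column.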