Question

我为这个问题读了一个类似的post related，但我担心这个错误代码是由于别的原因。我有一个包含8个观察值和10个变量的CSV文件：

 > str(rorIn)

'data.frame':   8 obs. of  10 variables:
 $ Acuity             : Factor w/ 3 levels "Elective  ","Emergency ",..: 1 1 2 2 1 2 2 3
 $ AgeInYears         : int  49 56 77 65 51 79 67 63
 $ IsPriority         : int  0 0 1 0 0 1 0 1
 $ AuthorizationStatus: Factor w/ 1 level "APPROVED  ": 1 1 1 1 1 1 1 1
 $ iscasemanagement   : Factor w/ 2 levels "N","Y": 1 1 2 1 1 2 2 2
 $ iseligible         : Factor w/ 1 level "Y": 1 1 1 1 1 1 1 1
 $ referralservicecode: Factor w/ 4 levels "12345","278",..: 4 1 3 1 1 2 3 1
 $ IsHighlight        : Factor w/ 1 level "N": 1 1 1 1 1 1 1 1
 $ RealLengthOfStay   : int  25 1 1 1 2 2 1 3
 $ Readmit            : Factor w/ 2 levels "0","1": 2 1 2 1 2 1 2 1

我调用这样的算法：

library("C50")
rorIn <- read.csv(file = "RoRdataInputData_v1.6.csv", header = TRUE, quote = "\"")
rorIn$Readmit <- factor(rorIn$Readmit)
fit <- C5.0(Readmit~., data= rorIn)

然后我得到：

> source("~/R-workspace/src/RoR/RoR/testing.R")
c50 code called exit with value 1
>

我正在遵循其他建议，例如： - 使用因子作为决策变量 - 避免空数据

对此有什么帮助吗？我读到这是机器学习的最佳算法之一，但我一直都会遇到这个错误。

以下是原始数据集：

Acuity,AgeInYears,IsPriority,AuthorizationStatus,iscasemanagement,iseligible,referralservicecode,IsHighlight,RealLengthOfStay,Readmit
Elective  ,49,0,APPROVED  ,N,Y,SNF            ,N,25,1
Elective  ,56,0,APPROVED  ,N,Y,12345,N,1,0
Emergency ,77,1,APPROVED  ,Y,Y,OBSERVE        ,N,1,1
Emergency ,65,0,APPROVED  ,N,Y,12345,N,1,0
Elective  ,51,0,APPROVED  ,N,Y,12345,N,2,1
Emergency ,79,1,APPROVED  ,Y,Y,278,N,2,0
Emergency ,67,0,APPROVED  ,Y,Y,OBSERVE        ,N,1,1
Urgent    ,63,1,APPROVED  ,Y,Y,12345,N,3,0

提前感谢您的帮助，

大卫

Answer 1

您需要以几种方式清理数据。

仅使用一个级别删除不必要的列。它们不包含任何信息并导致问题。
将目标变量rorIn$Readmit的类转换为因子。
将目标变量与您为培训提供的数据集分开。

这应该有效：

rorIn <- read.csv("RoRdataInputData_v1.6.csv", header=TRUE) 
rorIn$Readmit <- as.factor(rorIn$Readmit)
library(Hmisc)
singleLevelVars <- names(rorIn)[contents(rorIn)$contents$Levels == 1]
trainvars <- setdiff(colnames(rorIn), c("Readmit", singleLevelVars))
library(C50)
RoRmodel <- C5.0(rorIn[,trainvars], rorIn$Readmit,trials = 10)
predict(RoRmodel, rorIn[,trainvars])
#[1] 1 0 1 0 0 0 1 0
#Levels: 0 1

然后，您可以通过将此预测结果与目标变量的实际值进行比较来评估准确性，召回率和其他统计数据：

rorIn$Readmit
#[1] 1 0 1 0 1 0 1 0
#Levels: 0 1

通常的方法是设置一个混淆矩阵来比较二进制分类问题中的实际值和预测值。在这个小数据集的情况下，可以很容易地看出只有一个假阴性结果。所以代码似乎工作得很好，但是这个令人鼓舞的结果可能因为观察次数非常少而具有欺骗性。

library(gmodels)
actual <- rorIn$Readmit
predicted <- predict(RoRmodel,rorIn[,trainvars])     
CrossTable(actual,predicted, prop.chisq=FALSE,prop.r=FALSE)
# Total Observations in Table:  8  
#
# 
#              | predicted 
#       actual |         0 |         1 | Row Total | 
#--------------|-----------|-----------|-----------|
#            0 |         4 |         0 |         4 | 
#              |     0.800 |     0.000 |           | 
#              |     0.500 |     0.000 |           | 
#--------------|-----------|-----------|-----------|
#            1 |         1 |         3 |         4 | 
#              |     0.200 |     1.000 |           | 
#              |     0.125 |     0.375 |           | 
#--------------|-----------|-----------|-----------|
# Column Total |         5 |         3 |         8 | 
#              |     0.625 |     0.375 |           | 
#--------------|-----------|-----------|-----------|

在较大的数据集上，如果没有必要，将该集合分成训练数据和测试数据将是有用的。有很多关于机器学习的好文献可以帮助你微调模型及其预测。

名为exit的C50代码，值为1（使用因子决策变量非空值）

1 个答案: