I am trying to solve a Coursera assignment called best hospital

Time: 2015-07-08 16:41:32

Tags: r

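I am trying to solve an assignment on Coursera: finding the best hospital in the US. I built a small dataset from the original one. When I run my function on the small dataset, it gives the correct answer. But when I run the function on the larger dataset, I get results for some values, while for others I get the following error:

    [1]
    Levels:
    Warning message:
    In best("TX", "heart attack") : NAs introduced by coercion

Here is my code: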

 ##THE best hospital problem
    best <- function(state, outcome) {
      setwd("C:/Users/Praveen/Documents/R/COURSERA/R-programming/Week 3/Programming_Assignment")

###Reading the dataset
      x <- read.csv("outcome-of-care-measures.csv" , header =TRUE)
##vector of unique states in the data set
      statevector <- unique(x$State)
## vector of outcomes
      outcomevector <- c("heart attack" , "heart failure" , "pneumonia")
## checking validity of arguments      
      if(!(state %in% statevector)){
        stop("Invalid State")
      } else if(!(outcome %in% outcomevector)){
        stop("Invalid Outcome")
      } else {
        message("OK")  }
## Sub setting the data and getting relevant data set 
      X <- subset(x, x$State== state)
## if outcome is "heart attack", then calculate minimum value in 11th column in ##the data subset; Again sub setting  the data on the basis of minimum value in ##11th column      
      if(outcome == outcomevector[1]){
      y <- as.numeric(as.character(X[,11]))
      z <- min(y, na.rm=TRUE)
      z
    subsetx <- subset(X, X[,11]==z)
    answer <- subsetx[2]
    answer
## if outcome is "heart failure", then calculate minimum value in 17th column ##in the data subset; Again sub setting  the data on the basis of minimum value ##in 17th column 
    } else if (outcome == outcomevector[2]){
      y <- as.numeric(as.character(X[,17]))
      z <- min(y, na.rm=TRUE)
      z
      subsetx <- subset(X, X[,17]==z)
      answer <- subsetx[2]
      answer
## if outcome is "pneumonia", then calculate minimum value in 23rd column in ##the data subset; Again sub setting  the data on the basis of minimum value in ##23rd column 
    } else {
      y <- as.numeric(as.character(X[,23]))
      z <- min(y, na.rm=TRUE)
      z
      subsetx <- subset(X, X[,23]==z)
      answer <- subsetx[2]
      answer}
##if there are two or more equal minimum values, then sort alphabetically and ##select the hospital which comes first alphabetically
      FA <- answer[with(answer, order(Hospital.Name)), ]
      FFA <- FA[1]
      FFANS <- droplevels(FFA)
      FFANS
    }

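3 Answers:

Answer 0 (score: 3)

There are several problems with the code, but the immediate one is that you have been bitten by the factor bug. Compare these values: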

class(z)
#[1] "numeric"

class(X[,11])
#[1] "factor"

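So when you run the command subsetx <- subset(X, X[,11]==z), you won't get a match, even if one exists. Try this:

subset(X, as.numeric(as.character(X[,11]))==z)

The column is converted with as.character first and then as.numeric, because as.numeric alone on a factor returns the internal level codes rather than the values. This gives the following output: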

best("TX", "heart attack")
#OK
#[1] CYPRESS FAIRBANKS MEDICAL CENTER
#Levels: CYPRESS FAIRBANKS MEDICAL CENTER
#Warning message:
#In best("TX", "heart attack") : NAs introduced by coercion
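
To illustrate the factor problem in isolation, here is a small sketch on made-up values (the vector `rates` is hypothetical, not taken from the assignment data). Comparing a factor with a number compares the level labels as text, which may be why only some inputs fail, and as.numeric applied directly to a factor returns the level codes:

```r
# Hypothetical rate column as read.csv would return it with factors enabled
rates <- factor(c("12.0", "14.2", "Not Available", "9.5"))

# Convert via character; "Not Available" becomes NA (coercion warning suppressed here)
y <- suppressWarnings(as.numeric(as.character(rates)))
z <- min(y, na.rm = TRUE)

# Factor vs. number: the labels are compared as text, so "12.0" does not match 12
rates == 12

# Level codes, not the rates themselves
as.numeric(rates)

# A numeric comparison finds the matching element
which(y == 12)
```

Converting through as.character before as.numeric is the usual way to recover the printed values of a factor.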

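You still get the warning because you did not eliminate the factors from the start. It is hard to tell where to begin fixing the approach, but this may help you finish the assignment.

Update

You can start by adding two arguments to read.csv. We set stringsAsFactors to FALSE so that strings stay character, and na.strings to "Not Available". The latter tells R what to look for in the file to identify missing values.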

x <- read.csv(file , header =TRUE, stringsAsFactors=F, na.strings="Not Available")
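
As a quick check of what these two arguments do, here is a minimal sketch that reads made-up CSV text instead of the real file (the column names and values in `csv_text` are assumptions for illustration only):

```r
# Made-up CSV text standing in for the real file
csv_text <- "Hospital.Name,State,Rate
A HOSPITAL,TX,12.0
B HOSPITAL,TX,Not Available
C HOSPITAL,TX,9.5"

x <- read.csv(text = csv_text, header = TRUE,
              stringsAsFactors = FALSE, na.strings = "Not Available")

class(x$Rate)              # numeric column, "Not Available" read as NA
class(x$Hospital.Name)     # character, not factor
min(x$Rate, na.rm = TRUE)  # works without any conversion
```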

After adding this correction, you can take out all the as.numeric and as.character parts. Here is what I did to the heart attack section:

if(outcome == outcomevector[1]){
      y <- X[,11]
      z <- min(y, na.rm=TRUE)

    subsetx <- subset(X, X[,11] %in% z)
    answer <- subsetx[2]
    answer

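Now y can take the values of X[,11] directly, and subsetx needs no special handling either.

At the bottom, you can now take out the last lines that drop the factor levels. I changed the ending to: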

FA <- answer[with(answer, order(Hospital.Name)), ]
FA[1]

Now the code runs without warnings:

best("TX", "heart attack")
#OK
#[1] "CYPRESS FAIRBANKS MEDICAL CENTER"

Update 2

Here is a shortened version of the code:

best2 <- function(state, outcome) {
      setwd("C:/Users/Praveen/Documents/R/COURSERA/R-programming/Week 3/Programming_Assignment")    
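      ## na.strings turns "Not Available" into NA so the rate columns are read as numeric
      x <- read.csv("outcome-of-care-measures.csv" , header =TRUE, stringsAsFactors=FALSE, na.strings="Not Available")          
      outcomevector <- c("heart attack" , "heart failure" , "pneumonia")         
      if(!(state %in% unique(x$State))) stop("Invalid State")
      if(!(outcome %in% outcomevector)) stop("Invalid Outcome")

      X <- x[x$State== state,]
      names(X)[c(11, 17, 23)] <- outcomevector
      ## na.rm ignores missing rates; which() drops rows where the comparison is NA
      answer <- X[which(X[,outcome] == min(X[,outcome], na.rm=TRUE)), ][2]    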
      FA <- answer[with(answer, order(Hospital.Name)), ]
      FA[1]   
    }

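Answer 1 (score: 0)

Make sure you handle the Not Available entries (your assignment asks you to).

First convert every Not Available to NA. You can do this simply by adding na.strings = "Not Available" to the read statement:

x <- read.csv("outcome-of-care-measures.csv" , header =TRUE, na.strings = "Not Available")

Then remove them with complete.cases or na.omit: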

completedata <- (na.omit(x))
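
Note that na.omit on the whole data frame drops a row if any of its columns is NA. The sketch below, on a made-up data frame (the column names and values are assumptions), compares that with filtering only on the outcome column of interest using complete.cases:

```r
# Made-up data: hospital B lacks a heart attack rate, hospital C lacks a pneumonia rate
x <- data.frame(Hospital.Name = c("A", "B", "C"),
                heart.attack  = c(12.0, NA, 9.5),
                pneumonia     = c(10.1, 11.3, NA),
                stringsAsFactors = FALSE)

# na.omit removes every row with an NA in any column
nrow(na.omit(x))

# complete.cases on one column keeps rows that are complete for that outcome only
kept <- x[complete.cases(x$heart.attack), ]
kept$Hospital.Name
```

With data like this, filtering per outcome keeps hospital C in the heart attack ranking even though its pneumonia rate is missing.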

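Answer 2 (score: 0)

I also struggled with this problem but eventually figured it out. Here is my solution, with comments on why I used certain code. I am not sure it is the most efficient approach, since I am a beginner in R, particularly where I save some data frame variables, so feel free to improve my code if you get the chance: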

best <- function(state, outcome) {
  ## Read outcome data
  outcome_read <- read.csv("outcome-of-care-measures.csv", colClasses = "character")
  ## subset out only the columns you need.
  subsetOutcome <- outcome_read[, c(7, 11, 17, 23, 2)]
  # convert to numerics as some of the items within the columns were read in as chars. 
  # you can see it if you run str(outcome_read)
  subsetOutcome$Hospital.30.Day.Death..Mortality..Rates.from.Heart.Failure <- as.numeric(subsetOutcome$Hospital.30.Day.Death..Mortality..Rates.from.Heart.Failure)
  subsetOutcome$Hospital.30.Day.Death..Mortality..Rates.from.Pneumonia <- as.numeric(subsetOutcome$Hospital.30.Day.Death..Mortality..Rates.from.Pneumonia)
  subsetOutcome$Hospital.30.Day.Death..Mortality..Rates.from.Heart.Attack <- as.numeric(subsetOutcome$Hospital.30.Day.Death..Mortality..Rates.from.Heart.Attack)
  ## Check that state is valid by using the is.element function. It will check whether the element exist within your column.
  if (!is.element(state, subsetOutcome[,1])) {
    stop("invalid state")
  }
  # checking whether the outcome argument is a valid health condition.
  condition <- factor(c("heart attack", "heart failure", "pneumonia"))
  if (!any(c("heart attack", "heart failure", "pneumonia") == outcome)) {
    stop("invalid outcome")
  } 
  # I determine the col number here by determining the level of the condition factor. For eg, if the outcome was
  # heart failure, the level will be 2. This will be the col number.
  colNumber <- as.numeric(condition[levels(condition) == outcome])
  # self check - printing to see if the col number make sense.
  print(colNumber)
  ## removing all the NAs within the outcome column (eg, heart failure column) you are looking at.
  resultState <- subsetOutcome[complete.cases(subsetOutcome[,colNumber+1]),]
  # subsetting your dataframe to only include the state you are looking at.
  resultState <- subset(resultState, resultState$State == state)
  # order your results based on the condition (eg, heart failure), then the hospital name.
  result <- resultState[order(resultState[,colNumber+1], resultState[,5]), ]
  # returning the first row and just the hospital.name column
  result[1,"Hospital.Name"]
}
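
The tie-breaking step, ordering by rate and then by hospital name, can be tried on its own with a made-up data frame (the hospital names and rates below are invented for illustration):

```r
# Made-up hospitals; two of them share the lowest rate
d <- data.frame(Hospital.Name = c("ZETA CLINIC", "ALPHA MEDICAL", "MID HOSPITAL"),
                rate = c(9.5, 9.5, 11.0),
                stringsAsFactors = FALSE)

# Sort by rate first, then by name so ties are broken alphabetically
ranked <- d[order(d$rate, d$Hospital.Name), ]

# First row is the lowest rate, with the alphabetically first name among ties
ranked[1, "Hospital.Name"]
```

Passing both keys to a single order call avoids a separate pass to find the minimum and then sort the ties.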