Creating variables from row contents in R

Time: 2012-11-18 21:17:29

Tags: r

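My hospital visit data contains records of gender, age, primary diagnosis, and hospital identifier. I want to create separate variables for these entries. The data follow a pattern: most observations begin with the gender code (M or F), followed by the age, then the diagnosis, and finally the hospital identifier. There are exceptions, however. In some cases the gender is coded as 01 or 02, and in those cases the gender identifier appears at the end. I searched the archives and found some examples using grep, but I have not managed to apply them effectively to my data. For example, the code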

ndiag<-dat[grep("copd", dat[,1], fixed = TRUE),] 

extracts each diagnosis separately, but not all diagnoses at once. How can I accomplish this task?

Sample data showing the current situation (column 1) and what I intend to have are shown below:

diagnosis hospital  diag    age   gender
m3034CVDA   A   cvd 30-34   M
m3034cardvA A   cardv   30-34   M
f3034aceB   B   ace 30-34   F
m3034hfC    C   hf  30-34   M
m3034cereC  C   cere    30-34   M
m3034resPC  C   resp    30-34   M
3034copd_Z_01   Z   copd    30-34   M
3034copd_Z_01   Z   copd    30-34   M
fcereZ          Z   cere    NA      F
f3034respC  C   resp    30-34   F
3034copd_Z_02   Z   copd    30-34   F

1 Answer:

Answer 0: (score: 2)

There seem to be two key parts to this question.

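  1. Handling the fact that the strings are encoded in two different ways
  2. Splitting each string into the appropriate data columns

Note: to apply a function to many values at once, keep in mind that many functions already handle vectors, for example str_locate and substr.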
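
As a small illustration of the note on vectorised functions, the sketch below uses only base R and a few strings copied from the question. The variable names (`x`, `first_char`, `has_trailing_code`) are made up for this example.

```r
# A few strings taken from the question's sample data
x <- c("m3034CVDA", "f3034aceB", "3034copd_Z_01", "fcereZ")

# substr() is vectorised: one call returns the first character of every string
first_char <- substr(x, 1, 1)
print(first_char)

# grepl() is vectorised as well: one logical value per string, TRUE where
# the string ends in the trailing gender code "_01" or "_02"
has_trailing_code <- grepl("_0[12]$", x)
print(has_trailing_code)

# nchar() returns the length of every string at once
print(nchar(x))
```

No explicit loop or one-pattern-at-a-time `grep` call is needed; each function walks the whole vector itself.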

    Part 1 - Cleaning up the strings for the m/f vs. 01/02 coding

    # We will be using this library later for str_detect, str_replace, etc
    library(stringr)
    
    # first, make sure diagnosis is character (strings) and not factor (category)
    diagnosis <- as.character(diagnosis)
    
    # We will use a temporary vector, to preserve the original, but this is not a necessary step.
    diagnosisTmp <- diagnosis
    
    males <- str_locate(diagnosisTmp, "_01")
    females <- str_locate(diagnosisTmp, "_02")
    
    # NOTE: All of this will work fine as long as '_01'/'_02' appears *__only__* as gender code.
    #  Therefore, we put in the next two lines to check for errors, make sure we didn't accidentally grab a "_01" from the middle of the string
    #-------------------------
      if (any(str_length(diagnosisTmp) != males[,2], na.rm=TRUE))    stop("Error in coding for males")
      if (any(str_length(diagnosisTmp) != females[,2], na.rm=TRUE))  stop("Error in coding for females")
    #------------------------
    
    # remove all the '_01'/'_02'  (replacing with "")
    diagnosisTmp <- str_replace(diagnosisTmp, "_01", "")
    diagnosisTmp <- str_replace(diagnosisTmp, "_02", "")
    
    # append to front of string appropriate m/f code 
    diagnosisTmp[!is.na(males[,1])] <- paste0("m", diagnosisTmp[!is.na(males[,1])])
    diagnosisTmp[!is.na(females[,1])] <- paste0("f", diagnosisTmp[!is.na(females[,1])])
    
    # remove superfluous underscores
    diagnosisTmp <- str_replace_all(diagnosisTmp, "_", "")
    
    # display the original next to modified, for visual spot check
    cbind(diagnosis, diagnosisTmp)
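
The Part 1 clean-up can also be sketched in base R, which may help when stringr is not available. This is an illustrative variant, not the code from the answer: `normalise_gender` is a hypothetical helper name, and it assumes that `_01` means male and `_02` means female, as in the question's sample data.

```r
# Hypothetical helper: move a trailing "_01"/"_02" gender code to the front
# of the string as "m"/"f", then drop any remaining underscores.
normalise_gender <- function(x) {
  x <- as.character(x)
  # "$" anchors the match at the end of the string, so a "_01" that
  # happens to occur elsewhere is not treated as a gender code
  is_male   <- grepl("_01$", x)
  is_female <- grepl("_02$", x)
  # remove the trailing code, then prepend the letter code
  x <- sub("_0[12]$", "", x)
  x[is_male]   <- paste0("m", x[is_male])
  x[is_female] <- paste0("f", x[is_female])
  # remove every remaining underscore
  gsub("_", "", x, fixed = TRUE)
}

raw <- c("m3034CVDA", "3034copd_Z_01", "3034copd_Z_02", "fcereZ")
cleaned <- normalise_gender(raw)

# original next to cleaned, for a visual spot check
print(cbind(raw, cleaned))
```

Anchoring the pattern with `$` makes a separate sanity check for codes in the middle of the string unnecessary in this variant.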
    

    Part 2 - Splitting the strings

    # gender is the first char, hospital is the last. 
    gender <- toupper(str_sub(diagnosisTmp, 1,1))    
    hosp  <- str_sub(diagnosisTmp, -1,-1) 
    
    # age, if present, is chars 2-5. Keep it as text so that leading zeros (e.g. "0509") survive,
    # and test for four digits instead of relying on as.numeric() warnings
    ageRaw <- str_sub(diagnosisTmp, 2, 5)
    age    <- ifelse(str_detect(ageRaw, "^[0-9]{4}$"),
                     paste(str_sub(ageRaw, 1, 2), str_sub(ageRaw, 3, 4), sep="-"), NA_character_)
    
    # diagnosis is variable length, so we have to find where to start
    diagStart <- 2 + 4*(!is.na(age))
    diag  <- tolower(str_sub(diagnosisTmp, diagStart, -2))   # lower case, as in the desired output
    
    # Put it all together into a data frame
    dat <- data.frame(diagnosis, hosp, diag, age, gender)
        ## OR WITHOUT ORIGINAL DIAGNOSIS STRING (uncomment to use instead of the line above) ##
    # dat <- data.frame(hosp, diag, age, gender)
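
As an alternative to the two-step approach above, the whole string can be parsed with a single regular expression. The sketch below is an illustration in base R only, not part of the original answer. `parse_diagnosis` is a hypothetical name, and the pattern assumes what the sample data suggests: an optional leading m/f, an optional four-digit age, a diagnosis made of letters, a single-letter hospital code, and an optional trailing `_01` (male) or `_02` (female).

```r
# Hypothetical helper: parse every string with one regular expression.
parse_diagnosis <- function(x) {
  x <- as.character(x)
  lower <- tolower(x)
  # groups: 1 = leading m/f (may be empty), 2 = four age digits (optional),
  #         3 = diagnosis, 4 = hospital letter, 5 = trailing _01/_02 (optional)
  pattern <- "^([mf]?)([0-9]{4})?([a-z]+)_?([a-z])(_0[12])?$"
  parts <- regmatches(lower, regexec(pattern, lower))

  # one row per input string; an all-NA row where the pattern did not match
  m <- t(vapply(parts,
                function(p) if (length(p) == 6) p else rep(NA_character_, 6),
                character(6)))
  ok     <- !is.na(m[, 1])
  letter <- m[, 2]
  digits <- m[, 3]
  code   <- m[, 6]

  # gender comes from the leading letter if present, else from the trailing code
  gender <- ifelse(ok & letter != "", toupper(letter),
            ifelse(code == "_01", "M",
            ifelse(code == "_02", "F", NA_character_)))
  # "3034" becomes "30-34"; a missing age stays NA
  age <- ifelse(is.na(digits) | digits == "", NA_character_,
                paste(substr(digits, 1, 2), substr(digits, 3, 4), sep = "-"))
  hospital <- rep(NA_character_, length(x))
  hospital[ok] <- toupper(m[ok, 5])

  data.frame(diagnosis = x, hospital = hospital, diag = m[, 4],
             age = age, gender = gender, stringsAsFactors = FALSE)
}

# the sample strings from the question
raw <- c("m3034CVDA", "m3034cardvA", "f3034aceB", "m3034hfC", "m3034cereC",
         "m3034resPC", "3034copd_Z_01", "3034copd_Z_01", "fcereZ",
         "f3034respC", "3034copd_Z_02")
parsed <- parse_diagnosis(raw)
print(parsed)
```

One limitation of this sketch: a string with no age and a trailing gender code whose diagnosis starts with "m" or "f" would be misread, because the first letter would be taken as the gender. The sample data contains no such case.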