Question

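I have a large data frame (about 4.5 million rows), where each row corresponds to a single hospital admission.

Each admission has up to 20 diagnosis codes, stored in columns #7 to #26. In addition, I have a field designated as the "main diagnosis". My assumption was that the "main diagnosis" corresponds to the first of the 20 diagnosis codes. That turns out to be incorrect: sometimes it is the first, sometimes the second, the third, and so on. I am interested in this distribution.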

ID        MainDiagCode  Diag_1  Diag_2  Diag_3 ...
Patient1  J123          J123    R343    S753
Patient2  G456          F119    E159    G456
Patient3  T789          L292    T789    W474

I would like to add a column to my data frame that tells me which of the 20 diagnosis codes matches the "main" one.

ID        MainDiagCode  Diag_1  Diag_2  Diag_3 ...  NewColumn
Patient1  J123          J123    R343    S753        1
Patient2  G456          F119    E159    G456        3
Patient3  T789          L292    T789    W474        2

I have been able to do this with a loop:

for (i in seq_len(nrow(df))) {
  df$NewColumn[i] <-
    unname(which(apply(df[i, 7:26], 2, function(x)
      any(grepl(df$MainDiagCode[i], x)))))
}

I would like to know whether there is a better way to do this without a loop, since it is really slow.

Thanks in advance.

Answer 1

df$NewColumn = apply(df, 1, function(x) match(x["MainDiagCode"], x[-c(1,2)]))

df

        ID MainDiagCode Diag_1 Diag_2 Diag_3 NewColumn
1 Patient1         J123   J123   R343   S753         1
2 Patient2         G456   F119   E159   G456         3
3 Patient3         T789   L292   T789   W474         2

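It is safer to return the actual column name rather than rely on the match position being equal to the diagnosis number. For example: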

# Get the names of the diagnosis columns
diag.cols = names(df)[grep("^Diag", names(df))]

Extract the name of the matching column:

apply(df, 1, function(x) {
      names(df[,diag.cols])[match(x["MainDiagCode"], x[diag.cols])]
})
[1] "Diag_1" "Diag_3" "Diag_2"

Extract the number at the end of the matching column name:

library(stringr)

apply(df, 1, function(x) {
  as.numeric(
    str_extract(
      names(df[,diag.cols])[match(x["MainDiagCode"], x[diag.cols])], "[0-9]{1,2}$")
    )
  })

[1] 1 3 2

Answer 2

With 20 diagnosis columns and 4.5 million patients, a simple loop over the columns that searches for matches may be more efficient:

ff = function(main, diags)
{
    ans = rep_len(NA_integer_, length(main))
    for(i in seq_along(diags)) ans[main == diags[[i]]] = i      
    return(ans)
}
ff(as.character(dat$MainDiagCode), lapply(dat[-(1:2)], as.character))
#[1] 1 3 2

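If more than one diagnosis matches the main diagnosis, this may need to be adjusted to return the first match rather than the last one (as it does above). It might also be more efficient to reduce the number of rows checked in each iteration, skipping rows for which a match has already been found.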

dat = structure(list(PatientID = structure(1:3, .Label = c("Patient1", 
"Patient2", "Patient3"), class = "factor"), MainDiagCode = structure(c(2L, 
1L, 3L), .Label = c("G456", "J123", "T789"), class = "factor"), 
    Diag_1 = structure(c(2L, 1L, 3L), .Label = c("F119", "J123", 
    "L292"), class = "factor"), Diag_2 = structure(c(2L, 1L, 
    3L), .Label = c("E159", "R343", "T789"), class = "factor"), 
    Diag_3 = structure(c(2L, 1L, 3L), .Label = c("G456", "S753", 
    "W474"), class = "factor")), .Names = c("PatientID", "MainDiagCode", 
"Diag_1", "Diag_2", "Diag_3"), row.names = c(NA, -3L), class = "data.frame")
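
The two adjustments mentioned above (keep the first match, and stop re-checking rows that already have one) could be sketched roughly as follows. The name ff_first and the toy vectors are hypothetical and only for illustration; whether this is actually faster on 4.5 million rows would need to be measured.

```r
# Hypothetical variant of ff(): keeps the FIRST matching column and
# only re-checks rows that do not have a match yet.
ff_first = function(main, diags)
{
    ans = rep_len(NA_integer_, length(main))
    todo = seq_along(main)                  # rows still without a match
    for(i in seq_along(diags)) {
        if(!length(todo)) break             # every row is matched already
        # which() silently drops comparisons that evaluate to NA
        hit = which(main[todo] == diags[[i]][todo])
        ans[todo[hit]] = i
        if(length(hit)) todo = todo[-hit]   # shrink the set of rows to check
    }
    return(ans)
}

# Toy data (assumed): row 1 has its main code in both Diag_1 and Diag_3,
# row 4 has no match at all.
main  = c("J123", "G456", "T789", "Z000")
diags = list(Diag_1 = c("J123", "F119", "L292", "A111"),
             Diag_2 = c("R343", "E159", "T789", NA),
             Diag_3 = c("J123", "G456", "W474", "B222"))
res = ff_first(main, diags)
```

On the toy data, row 1 gets position 1 rather than 3, and the unmatched row stays NA.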

Answer 3

This does a row-wise comparison of the three columns against 'MainDiagCode':

apply( dat[-1], 1, function(x) which( x[-1] == x['MainDiagCode'] )  )
[1] 1 3 2

So:

dat$NewColumn <- apply( dat[-1], 1, function(x) which( x[-1] == x['MainDiagCode'] )  )
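
One thing to watch with which(): it returns every matching position, so a row with no match or with several matches gives a result of a different length, and apply() then returns a list instead of a vector. A minimal sketch of one way to guard against that, assuming the first match is the one wanted, is to take element [1] of each result (the extra rows in this toy data are made up for illustration):

```r
# Toy data (assumed): Patient1 matches twice, Patient4 has no match
dat = data.frame(PatientID = c("Patient1", "Patient2", "Patient3", "Patient4"),
                 MainDiagCode = c("J123", "G456", "T789", "Z000"),
                 Diag_1 = c("J123", "F119", "L292", "A111"),
                 Diag_2 = c("R343", "E159", "T789", "B222"),
                 Diag_3 = c("J123", "G456", "W474", "C333"),
                 stringsAsFactors = FALSE)

# [1] keeps the first match; indexing an empty result with [1] yields NA
pos = apply(dat[-1], 1, function(x) which(x[-1] == x["MainDiagCode"])[1])
pos = unname(pos)   # drop any names carried over from the columns
```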

Answer 4

Since you have a lot of rows, using data.table may improve performance:

library(data.table)
DT <- data.table(PatientID = paste0("Patient", 1:3), 
                 MainDiagCode = c("J123",  "G456", "T789"),
                 Diag_1 = c("J123", "F119", "L292"),
                 Diag_2 = c("R343", "E159", "T789"),
                 Diag_3 = c("S753", "G456", "W474")
)

DT[, NewColumn := match(MainDiagCode, .SD[, -1, with = FALSE]), by = PatientID]
DT
#>    PatientID MainDiagCode Diag_1 Diag_2 Diag_3 NewColumn
#> 1:  Patient1         J123   J123   R343   S753         1
#> 2:  Patient2         G456   F119   E159   G456         3
#> 3:  Patient3         T789   L292   T789   W474         2

Find the column index of an element for each row of a data frame
