If other variables in R

Time: 2019-04-05 22:28:16

Tags: r merge duplicates unique na

I have the following data frame, which holds drug codes by route of administration:

code <- data.frame(inn = c("ibuprofen", "ibuprofen", "ibuprofen", "fusidic acid", "fusidic acid"),
                   route = c("unknown", "unknown", "unknown", "oral", "topical"),
                   atc = c("R02AX02", "G02CC01", "M01AE01", "J01XC01", "D06AX01"))

           inn   route     atc
1    ibuprofen unknown R02AX02
2    ibuprofen unknown G02CC01
3    ibuprofen unknown M01AE01
4 fusidic acid    oral J01XC01
5 fusidic acid topical D06AX01

A second data frame holds patient treatments and events:

event <- data.frame(id = c(1, 1, 2),
                    inn = c("ibuprofen", "fusidic acid", "fusidic acid"),
                    route = c("unknown", "oral", "topical"),
                    event = c(TRUE, FALSE, TRUE))

  id          inn   route event
1  1    ibuprofen unknown  TRUE
2  1 fusidic acid    oral FALSE
3  2 fusidic acid topical  TRUE

I need to merge these data frames to get the following result:

           inn   route id event     atc
1 fusidic acid    oral  1 FALSE J01XC01
2 fusidic acid topical  2  TRUE D06AX01
3    ibuprofen unknown  1  TRUE NA

A simple merge does not give me this result:

merge(x = event,
      y = code)

           inn   route id event     atc
1 fusidic acid    oral  1 FALSE J01XC01
2 fusidic acid topical  2  TRUE D06AX01
3    ibuprofen unknown  1  TRUE R02AX02
4    ibuprofen unknown  1  TRUE G02CC01
5    ibuprofen unknown  1  TRUE M01AE01

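I thought of two possible solutions, but I have not managed to implement either of them:

  • Modify the code data frame before the merge so that atc is set to NA whenever a group of inn and route has differing atc values (this seems more appropriate)
  • Modify the result of the merge so that atc is set to NA whenever a group of inn, route and id has differing atc values

How can I do this in base R? Is there another, better approach? I work in a restricted environment where only base R is available.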

3 answers:

Answer 0 (score: 2)

Code for option 1 (modifying code before the merge):

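# Build a single key per inn/route combination
code$inn_route <- paste0(code$inn, "_", code$route)
# Number of rows in code that share each key
code$count <- as.vector(table(code$inn_route)[code$inn_route])
# Set atc to NA wherever a key maps to more than one row
code[code$count > 1, "atc"] <- NA
# Drop the helper columns and collapse the now-identical rows
code$inn_route <- NULL
code$count <- NULL
code <- unique(code)
merge(event, code)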


           inn   route id event     atc
1 fusidic acid    oral  1 FALSE J01XC01
2 fusidic acid topical  2  TRUE D06AX01
3    ibuprofen unknown  1  TRUE    <NA>
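
Note that the count > 1 test above flags every inn/route pair that appears on more than one row, even if those rows all carry the same atc. If that matters for the real data, a possible variant counts distinct atc values instead. The following is only a sketch on the toy data from the question; the names key, n_atc, code_unique and result are made up for illustration, and stringsAsFactors = FALSE is an assumption so that atc stays a character column.

```r
# Toy data copied from the question (character columns, not factors)
code <- data.frame(inn = c("ibuprofen", "ibuprofen", "ibuprofen", "fusidic acid", "fusidic acid"),
                   route = c("unknown", "unknown", "unknown", "oral", "topical"),
                   atc = c("R02AX02", "G02CC01", "M01AE01", "J01XC01", "D06AX01"),
                   stringsAsFactors = FALSE)
event <- data.frame(id = c(1, 1, 2),
                    inn = c("ibuprofen", "fusidic acid", "fusidic acid"),
                    route = c("unknown", "oral", "topical"),
                    event = c(TRUE, FALSE, TRUE),
                    stringsAsFactors = FALSE)

# One key per inn/route pair (hypothetical helper vector)
key <- paste(code$inn, code$route, sep = "_")

# Number of distinct atc values per key, expanded back to one value per row
n_atc <- as.vector(tapply(code$atc, key, function(x) length(unique(x)))[key])

# Set atc to NA only where the key maps to several different codes
code$atc[n_atc > 1] <- NA

# Collapse rows that are now identical, then merge as usual
code_unique <- unique(code)
result <- merge(event, code_unique)
result
```

On the toy data this gives the same three rows as the answer above, because the three ibuprofen codes are all different.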

Answer 1 (score: 1)

Here is a straightforward way to carry out option 2. Start from the result of the simple merge:

mrg <- merge(x = event,
             y = code)

           inn   route id event     atc
1 fusidic acid    oral  1 FALSE J01XC01
2 fusidic acid topical  2  TRUE D06AX01
3    ibuprofen unknown  1  TRUE R02AX02
4    ibuprofen unknown  1  TRUE G02CC01
5    ibuprofen unknown  1  TRUE M01AE01

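Next we check which rows are duplicated once the atc variable is dropped. We need to call duplicated twice, because it flags only rows that repeat an earlier row, not every row that has a duplicate. On its own it would catch rows 4 and 5 but not row 3; to catch row 3 as well, we run duplicated again from the opposite direction. Read more here: Finding ALL duplicate rows, including “elements with smaller subscripts”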
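
As a small illustration of why both directions are needed, here is duplicated on a plain vector. The vector x and the variable names are invented for this example only.

```r
x <- c("a", "b", "a", "a")

# Forward pass: flags only the later copies of "a"
first_pass <- duplicated(x)                  # FALSE FALSE  TRUE  TRUE

# Backward pass: flags only the earlier copies of "a"
last_pass <- duplicated(x, fromLast = TRUE)  #  TRUE FALSE  TRUE FALSE

# Either pass: flags every element that has a duplicate anywhere
all_dups <- first_pass | last_pass           #  TRUE FALSE  TRUE  TRUE
```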

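# Key columns only (everything except atc)
keys <- mrg[names(mrg) != "atc"]
# TRUE for every row whose key occurs more than once, in either direction
dup <- duplicated(keys) | duplicated(keys, fromLast = TRUE)
# Direct assignment works whether atc is a factor or a character vector
# (ifelse() would return the integer codes of a factor)
mrg$atc[dup] <- NA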
mrg

           inn   route id event     atc
1 fusidic acid    oral  1 FALSE J01XC01
2 fusidic acid topical  2  TRUE D06AX01
3    ibuprofen unknown  1  TRUE    <NA>
4    ibuprofen unknown  1  TRUE    <NA>
5    ibuprofen unknown  1  TRUE    <NA>

To get rid of the duplicated rows 4 and 5, just run duplicated once more to drop them:

mrg[!duplicated(mrg),]

           inn   route id event     atc
1 fusidic acid    oral  1 FALSE J01XC01
2 fusidic acid topical  2  TRUE D06AX01
3    ibuprofen unknown  1  TRUE    <NA>

Answer 2 (score: 0)

Grzegorz Sionkowski's answer led me to the following solution:

ave

However, since ave is rather slow on my real data, I wonder whether there is a faster base R approach.
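
The code for this answer is not shown above, so the following is only a guess at what an ave-based approach could look like. It is not the author's original code. It is followed by a possible alternative built on duplicated that avoids calling a function once per group. All object names (n_atc, code_ave, code_dup, keys, ambiguous) are hypothetical, and whether the second version is actually faster on the real data has not been measured here.

```r
# Toy data copied from the question (character columns, not factors)
code <- data.frame(inn = c("ibuprofen", "ibuprofen", "ibuprofen", "fusidic acid", "fusidic acid"),
                   route = c("unknown", "unknown", "unknown", "oral", "topical"),
                   atc = c("R02AX02", "G02CC01", "M01AE01", "J01XC01", "D06AX01"),
                   stringsAsFactors = FALSE)
event <- data.frame(id = c(1, 1, 2),
                    inn = c("ibuprofen", "fusidic acid", "fusidic acid"),
                    route = c("unknown", "oral", "topical"),
                    event = c(TRUE, FALSE, TRUE),
                    stringsAsFactors = FALSE)

# Version 1: ave() over row indices, counting distinct atc per inn/route group
n_atc <- ave(seq_len(nrow(code)), code$inn, code$route,
             FUN = function(i) length(unique(code$atc[i])))
code_ave <- code
code_ave$atc[n_atc > 1] <- NA
code_ave <- unique(code_ave)
rownames(code_ave) <- NULL

# Version 2: duplicated() only, no per-group function calls
code_dup <- unique(code)                # drop exact repeats first
keys <- code_dup[c("inn", "route")]
ambiguous <- duplicated(keys) | duplicated(keys, fromLast = TRUE)
code_dup$atc[ambiguous] <- NA           # keys with several distinct atc values
code_dup <- unique(code_dup)
rownames(code_dup) <- NULL

merge(event, code_dup)
```

Both versions yield the same lookup table on the toy data, so either can be passed to merge.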