How to create a new column from pasted values of select colnames in a lookup table

Time: 2017-01-07 19:25:23

Tags: r

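I'm trying to generate a new column in a data frame (named "input") whose value is the colname(s) taken from another data frame (named "LookUp"), based on the values in the relevant columns of that lookup table. Here is some fake data representing the two tables:

Create the fake lookup table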

Drugs <- c("amitriptyline", "aripiprazole", "asenapine", "bupropion", "carbamazepine", "citalopram","clomipramine", "clozapine", "desipramine")
CYP1A1 <- c(NA,NA,NA,NA,NA,NA,NA,"Ind",NA)
CYP1A2 <- c("S_Inh",NA,NA,"S","S_Inh_Ind","Inh","S","Ind",NA)
CYP1B1 <- c(NA,NA,NA,NA,NA,NA,NA,"Ind",NA)
CYP2A6 <- c(NA,NA,NA,"S","Ind",NA,NA,"S","Inh")
CYP2A13 <- c(NA,NA,NA,NA,NA,NA,NA,NA,NA)
CYP2B6 <- c("S",NA,NA,"S_Inh", "S_Ind","Inh",NA,NA,"Ind")
CYP2C8 <- c("S_Inh",NA,NA,"S","S_Ind",NA,NA,"S",NA)
CYP2C9 <- c("S",NA,NA,"S","Ind",NA,NA,"S_Inh",NA)
LookUp <- data.frame(Drugs, CYP1A1,CYP1A2, CYP1B1, CYP2A6,CYP2A13,CYP2B6,CYP2C8,CYP2C9)

LookUp
#           Drugs CYP1A1    CYP1A2 CYP1B1 CYP2A6 CYP2A13 CYP2B6 CYP2C8 CYP2C9
# 1 amitriptyline   <NA>     S_Inh   <NA>   <NA>      NA      S  S_Inh      S
# 2  aripiprazole   <NA>      <NA>   <NA>   <NA>      NA   <NA>   <NA>   <NA>
# 3     asenapine   <NA>      <NA>   <NA>   <NA>      NA   <NA>   <NA>   <NA>
# 4     bupropion   <NA>         S   <NA>      S      NA  S_Inh      S      S
# 5 carbamazepine   <NA> S_Inh_Ind   <NA>    Ind      NA  S_Ind  S_Ind    Ind
# 6    citalopram   <NA>       Inh   <NA>   <NA>      NA    Inh   <NA>   <NA>
# 7  clomipramine   <NA>         S   <NA>   <NA>      NA   <NA>   <NA>   <NA>
# 8     clozapine    Ind       Ind    Ind      S      NA   <NA>      S  S_Inh
# 9   desipramine   <NA>      <NA>   <NA>    Inh      NA    Ind   <NA>   <NA>

Create the fake input table

input <- data.frame(rowID=c(1:4), Drug=Drugs[c(1,3,4,9)])
input
#   rowID          Drug
# 1     1 amitriptyline
# 2     2     asenapine
# 3     3     bupropion
# 4     4  desipramine

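I would like to create a new column in input, input$metabCYPs, that is a comma-separated string of all the column names in the lookup table whose value contains "S" for that particular drug.

I think one component might be to identify the set of all values, in any column, that contain 'S':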

subsVals <- c("S_Inh", "S", "S_Ind", "S_Inh_Ind")

However, I can't figure out how to use this to generate the desired output:

output
    #   rowID          Drug   metabCYPs
    # 1     1 amitriptyline   CYP1A2, CYP2B6, CYP2C8, CYP2C9
    # 2     2     asenapine   
    # 3     3     bupropion   CYP1A2, CYP2A6, CYP2B6, CYP2C8, CYP2C9
    # 4     4   desipramine   

Any suggestions would be greatly appreciated!

3 Answers:

Answer 0: (score: 1)

Here is an idea using the dplyr and reshape2 packages:

#First you add stringsAsFactors = FALSE in your dataframes,

LookUp <- data.frame(Drugs, CYP1A1,CYP1A2, CYP1B1, CYP2A6,CYP2A13,CYP2B6,CYP2C8,CYP2C9, stringsAsFactors = FALSE)
input <- data.frame(rowID=c(1:4), Drug=Drugs[c(1,3,4,9)], stringsAsFactors = FALSE)

library(dplyr)
library(reshape2)

melt(LookUp, id.vars = 'Drugs', na.rm = TRUE) %>% 
  group_by(Drugs) %>% 
  summarise(metabCYPs = toString(variable[grepl('S', value)])) %>%   
  left_join(input, ., by = c('Drug' = 'Drugs'))

#  rowID          Drug                              metabCYPs
#1     1 amitriptyline         CYP1A2, CYP2B6, CYP2C8, CYP2C9
#2     2     asenapine                                   <NA>
#3     3     bupropion CYP1A2, CYP2A6, CYP2B6, CYP2C8, CYP2C9
#4     4   desipramine                                       
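Note that the two drugs without an "S" come back differently in the output above: asenapine is NA because na.rm = TRUE drops all of its rows before grouping, so the join finds no match, while desipramine still has a group and toString() of an empty selection gives "". If an empty string is wanted in both cases, as in the question's desired output, the NA can be replaced afterwards. A minimal sketch, where res is a hypothetical hand-built stand-in for the joined result:

```r
# Hypothetical stand-in for the joined result shown above
res <- data.frame(
  rowID = 1:4,
  Drug = c("amitriptyline", "asenapine", "bupropion", "desipramine"),
  metabCYPs = c("CYP1A2, CYP2B6, CYP2C8, CYP2C9", NA,
                "CYP1A2, CYP2A6, CYP2B6, CYP2C8, CYP2C9", ""),
  stringsAsFactors = FALSE
)

toString(character(0))  # an empty selection collapses to ""

# Replace the NA left behind by the join with an empty string
res$metabCYPs[is.na(res$metabCYPs)] <- ""
```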

To create the remaining columns, just add them to summarise, i.e.

melt(LookUp, id.vars = 'Drugs', na.rm = TRUE) %>% 
   group_by(Drugs) %>% 
   summarise(metabCYPs = toString(variable[grepl('S', value)]), 
             with_Inh = toString(variable[grepl('Inh', value)]), 
             with_Ind = toString(variable[grepl('Ind', value)])) %>% 
   left_join(input, ., by = c('Drug' = 'Drugs'))
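The grepl() calls above match substrings, which works here because no code in the fake data contains another code as part of its name. If exact matching on the underscore-separated parts is preferred, something along these lines could be used instead. This is only a sketch, and has_code() is a hypothetical helper name, not part of dplyr or reshape2:

```r
# has_code() is a hypothetical helper: TRUE where `value` contains `code`
# as a whole underscore-separated token, FALSE otherwise (including NA)
has_code <- function(value, code) {
  vapply(strsplit(as.character(value), "_", fixed = TRUE),
         function(parts) code %in% parts,
         logical(1))
}

vals <- c("S_Inh", "Inh", "S_Inh_Ind", NA)
has_code(vals, "S")    # TRUE FALSE TRUE FALSE
has_code(vals, "Ind")  # FALSE FALSE TRUE FALSE
```

Inside summarise, grepl('S', value) would then become has_code(value, 'S'), and likewise for 'Inh' and 'Ind'.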

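Answer 1: (score: 0)

First, since the variable values in the data frames LookUp and input are the same and, in addition, LookUp$Drugs and input$Drug do not seem to contain duplicates, it makes sense to join them. Before that, you need the packages data.table and dplyr: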

install.packages(c("data.table", "dplyr"))
library(data.table)
library(dplyr)

Let's join the tables:

output <- merge(input, LookUp, by.x = "Drug", by.y = "Drugs", all.x = TRUE)

           Drug rowID CYP1A1 CYP1A2 CYP1B1 CYP2A6 CYP2A13 CYP2B6 CYP2C8 CYP2C9
1 amitriptyline     1   <NA>  S_Inh   <NA>   <NA>      NA      S  S_Inh      S
2     asenapine     2   <NA>   <NA>   <NA>   <NA>      NA   <NA>   <NA>   <NA>
3     bupropion     3   <NA>      S   <NA>      S      NA  S_Inh      S      S
4   desipramine     4   <NA>   <NA>   <NA>    Inh      NA    Ind   <NA>   <NA>
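One thing to keep in mind is that merge() sorts the result on the key column by default, so the rows come back ordered by Drug. In the table above that happens to coincide with the rowID order, but in general it will not. A small sketch with stand-in data (deliberately not in alphabetical order) showing how the original order could be restored:

```r
# Small stand-ins for the question's tables; input is deliberately not in
# alphabetical order so that the re-sorting done by merge() is visible
input <- data.frame(rowID = 1:3,
                    Drug = c("bupropion", "amitriptyline", "asenapine"),
                    stringsAsFactors = FALSE)
LookUp <- data.frame(Drugs = c("amitriptyline", "asenapine", "bupropion"),
                     CYP1A2 = c("S_Inh", NA, "S"),
                     stringsAsFactors = FALSE)

output <- merge(input, LookUp, by.x = "Drug", by.y = "Drugs", all.x = TRUE)
output$rowID  # 2 3 1: rows are sorted by Drug, not by rowID

# Restore the order of the original input table
output <- output[order(output$rowID), ]
```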

Now you have all the required columns in one table. As for the variable itself:

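cyp_cols <- names(output)[3:10]
# for each row, keep the names of the columns whose value contains "S"
output$metabCYPs <- output[, cyp_cols] %>%
  apply(1, function(vals) paste(cyp_cols[grepl("S", vals)], collapse = ", "))

The first line stores the names of columns 3 through 10 of the output data frame. The pipeline then selects those columns and goes through them row by row: grepl() flags the values that contain "S" (NA never matches), and paste() joins the names of the flagged columns with commas. You can then remove the redundant columns 3-10 with: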

output[3:10] <- NULL
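Working by position depends on the CYP columns sitting exactly in positions 3 to 10. As an alternative sketch, the wanted columns can be kept by name, which also restores the rowID-first layout from the question. The output data frame below is a hypothetical, cut-down stand-in for the merged table after metabCYPs has been added:

```r
# Hypothetical, cut-down stand-in for the merged table with metabCYPs added
output <- data.frame(Drug = c("amitriptyline", "asenapine"),
                     rowID = 1:2,
                     CYP1A2 = c("S_Inh", NA),
                     CYP2B6 = c("S", NA),
                     metabCYPs = c("CYP1A2, CYP2B6", ""),
                     stringsAsFactors = FALSE)

# Keep only the columns of interest, selected by name
output <- output[, c("rowID", "Drug", "metabCYPs")]
```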

Voilà!

Answer 2: (score: 0)

dplyr and reshape annoy me... here is another idea that uses an implicit loop over the Drugs variable:

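cyp_cols <- setdiff(names(LookUp), "Drugs")  # every column except the drug names
# as.character() makes sapply() name the result by drug, even if Drugs is a factor
metabCYPs <- sapply(as.character(LookUp$Drugs), function(x) {
  vals <- as.matrix(LookUp[LookUp$Drugs == x, cyp_cols])  # CYP values for drug x
  paste0(cyp_cols[grepl("S", vals)], collapse = ", ")
})
# the input column is called Drug, not Drugs
output <- data.frame(input, metabCYPs = metabCYPs[match(input$Drug, names(metabCYPs))])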
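The same result can also be sketched in base R without looping over the drugs, using the subsVals vector from the question for exact matching rather than grepl(). The tables below are trimmed-down copies of the question's data (three drugs, three CYP columns), and the names rows, vals and is_sub are only illustrative:

```r
# Trimmed-down copies of the question's tables
LookUp <- data.frame(
  Drugs  = c("amitriptyline", "asenapine", "bupropion"),
  CYP1A2 = c("S_Inh", NA, "S"),
  CYP2A6 = c(NA, NA, "S"),
  CYP2B6 = c("S", NA, "S_Inh"),
  stringsAsFactors = FALSE
)
input <- data.frame(rowID = 1:2,
                    Drug = c("amitriptyline", "asenapine"),
                    stringsAsFactors = FALSE)
subsVals <- c("S_Inh", "S", "S_Ind", "S_Inh_Ind")

cyp_cols <- setdiff(names(LookUp), "Drugs")

# Pull the lookup row for each input drug, as a character matrix
rows <- match(input$Drug, LookUp$Drugs)
vals <- as.matrix(LookUp[rows, cyp_cols, drop = FALSE])

# TRUE where the cell is one of the substrate codes; NA counts as FALSE
is_sub <- matrix(vals %in% subsVals, nrow = nrow(vals))

# Collapse the names of the flagged columns, one string per input row
input$metabCYPs <- apply(is_sub, 1, function(hit) {
  paste(cyp_cols[hit], collapse = ", ")
})
```

Drugs with no substrate entry end up with an empty string, which matches the desired output in the question.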