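I am trying to generate a new column in a data frame (called "input") whose values are column name(s) taken from another data frame (called "LookUp"), chosen according to the values in the relevant columns of the lookup table. Here is some fake data representing the two tables: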
Drugs <- c("amitriptyline", "aripiprazole", "asenapine", "bupropion", "carbamazepine", "citalopram","clomipramine", "clozapine", "desipramine")
CYP1A1 <- c(NA,NA,NA,NA,NA,NA,NA,"Ind",NA)
CYP1A2 <- c("S_Inh",NA,NA,"S","S_Inh_Ind","Inh","S","Ind",NA)
CYP1B1 <- c(NA,NA,NA,NA,NA,NA,NA,"Ind",NA)
CYP2A6 <- c(NA,NA,NA,"S","Ind",NA,NA,"S","Inh")
CYP2A13 <- c(NA,NA,NA,NA,NA,NA,NA,NA,NA)
CYP2B6 <- c("S",NA,NA,"S_Inh", "S_Ind","Inh",NA,NA,"Ind")
CYP2C8 <- c("S_Inh",NA,NA,"S","S_Ind",NA,NA,"S",NA)
CYP2C9 <- c("S",NA,NA,"S","Ind",NA,NA,"S_Inh",NA)
LookUp <- data.frame(Drugs, CYP1A1,CYP1A2, CYP1B1, CYP2A6,CYP2A13,CYP2B6,CYP2C8,CYP2C9)
LookUp
# Drugs CYP1A1 CYP1A2 CYP1B1 CYP2A6 CYP2A13 CYP2B6 CYP2C8 CYP2C9
# 1 amitriptyline <NA> S_Inh <NA> <NA> NA S S_Inh S
# 2 aripiprazole <NA> <NA> <NA> <NA> NA <NA> <NA> <NA>
# 3 asenapine <NA> <NA> <NA> <NA> NA <NA> <NA> <NA>
# 4 bupropion <NA> S <NA> S NA S_Inh S S
# 5 carbamazepine <NA> S_Inh_Ind <NA> Ind NA S_Ind S_Ind Ind
# 6 citalopram <NA> Inh <NA> <NA> NA Inh <NA> <NA>
# 7 clomipramine <NA> S <NA> <NA> NA <NA> <NA> <NA>
# 8 clozapine Ind Ind Ind S NA <NA> S S_Inh
# 9 desipramine <NA> <NA> <NA> Inh NA Ind <NA> <NA>
input <- data.frame(rowID=c(1:4), Drug=Drugs[c(1,3,4,9)])
input
# rowID Drug
# 1 1 amitriptyline
# 2 2 asenapine
# 3 3 bupropion
# 4 4 desipramine
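I would like to create a new column in input, input$metabCYPs, that is a comma-separated string of all the column names in the lookup table whose value for that particular drug contains "S".
I think one component might be identifying the set of all values containing 'S' in any column: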
subsVals <- c("S_Inh", "S", "S_Ind", "S_Inh_Ind")
However, I can't figure out how to use it to generate the desired output:
output
# rowID Drug metabCYPs
# 1 1 amitriptyline CYP1A2, CYP2B6, CYP2C8, CYP2C9
# 2 2 asenapine
# 3 3 bupropion CYP1A2, CYP2A6, CYP2B6, CYP2C8, CYP2C9
# 4 4 desipramine
Any suggestions would be greatly appreciated!
Answer 0 (score: 1)
Here is an idea using the dplyr and reshape2 packages:
#First you add stringsAsFactors = FALSE in your dataframes,
LookUp <- data.frame(Drugs, CYP1A1,CYP1A2, CYP1B1, CYP2A6,CYP2A13,CYP2B6,CYP2C8,CYP2C9, stringsAsFactors = FALSE)
input <- data.frame(rowID=c(1:4), Drug=Drugs[c(1,3,4,9)], stringsAsFactors = FALSE)
library(dplyr)
library(reshape2)
melt(LookUp, id.vars = 'Drugs', na.rm = TRUE) %>%
group_by(Drugs) %>%
summarise(metabCYPs = toString(variable[grepl('S', value)])) %>%
left_join(input, ., by = c('Drug' = 'Drugs'))
# rowID Drug metabCYPs
#1 1 amitriptyline CYP1A2, CYP2B6, CYP2C8, CYP2C9
#2 2 asenapine <NA>
#3 3 bupropion CYP1A2, CYP2A6, CYP2B6, CYP2C8, CYP2C9
#4 4 desipramine
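Note that asenapine comes back as NA rather than as an empty string, because na.rm = TRUE drops its all-NA rows before the grouping, so the join finds no match for it. If a blank is preferred, as in the desired output, the NA can be replaced after the join. A minimal sketch, assuming the joined result has been stored in a variable (the name res is hypothetical, and the data frame below is only a stand-in for the output shown above):

```r
# Hypothetical stand-in for the joined result shown above
res <- data.frame(
  rowID = 1:4,
  Drug = c("amitriptyline", "asenapine", "bupropion", "desipramine"),
  metabCYPs = c("CYP1A2, CYP2B6, CYP2C8, CYP2C9", NA,
                "CYP1A2, CYP2A6, CYP2B6, CYP2C8, CYP2C9", ""),
  stringsAsFactors = FALSE
)

# Drugs that had no non-missing entries return from the join as NA;
# replace those with an empty string
res$metabCYPs[is.na(res$metabCYPs)] <- ""
res
```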
To create the remaining columns, just add them to summarise, i.e.
melt(LookUp, id.vars = 'Drugs', na.rm = TRUE) %>%
group_by(Drugs) %>%
summarise(metabCYPs = toString(variable[grepl('S', value)]),
with_Inh = toString(variable[grepl('Inh', value)]),
with_Ind = toString(variable[grepl('Ind', value)])) %>%
left_join(input, ., by = c('Drug' = 'Drugs'))
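Answer 1 (score: 0)
First, since the drug names in the data frames LookUp and input are the same and, moreover, LookUp$Drugs and input$Drug appear to contain no duplicates, it makes sense to join them. Before that, you need the packages data.table and dplyr: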
install.packages(c("data.table", "dplyr"))
library(data.table)
library(dplyr)
Let's join the tables:
output <- merge(input, LookUp, by.x = "Drug", by.y = "Drugs", all.x = TRUE)
# Drug rowID CYP1A1 CYP1A2 CYP1B1 CYP2A6 CYP2A13 CYP2B6 CYP2C8 CYP2C9
# 1 amitriptyline 1 <NA> S_Inh <NA> <NA> NA S S_Inh S
# 2 asenapine 2 <NA> <NA> <NA> <NA> NA <NA> <NA> <NA>
# 3 bupropion 3 <NA> S <NA> S NA S_Inh S S
# 4 desipramine 4 <NA> <NA> <NA> Inh NA Ind <NA> <NA>
Now you have all the necessary columns in one table. As for the variable itself:
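cypCols <- names(output)[3:10]
output$metabCYPs <- output[, cypCols] %>%
  apply(1, function(r) toString(cypCols[grepl("S", r)]))
The first line stores the names of columns 3 to 10 of the output data frame. The pipe then walks through those columns row by row and keeps the names of the columns whose value contains "S" (NA values never match), and toString joins the kept names with ", ". You can remove the now redundant variables 3-10 with: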
output[3:10] <- NULL
Voilà!
Answer 2 (score: 0)
dplyr and reshape annoy me... here is another idea using an implicit loop over the Drugs variable:
cypCols <- setdiff(names(LookUp), "Drugs")
# For each drug, keep the names of the CYP columns whose value contains "S"
metabCYPs <- sapply(as.character(LookUp$Drugs), function(x) paste0(cypCols[grepl("S", sapply(LookUp[LookUp$Drugs == x, cypCols], as.character))], collapse = ", "))
output <- data.frame(input, metabCYPs = metabCYPs[match(input$Drug, names(metabCYPs))])
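The asker's subsVals set can also be used directly with %in% instead of pattern matching. The following is only a sketch under stated assumptions: it rebuilds a trimmed copy of the question's data (four drugs, three CYP columns) so that it runs on its own, and the helper names cypCols, isSubs and rows are made up for illustration.

```r
# Trimmed copy of the question's data so the block is self-contained
LookUp <- data.frame(
  Drugs  = c("amitriptyline", "asenapine", "bupropion", "desipramine"),
  CYP1A2 = c("S_Inh", NA, "S", NA),
  CYP2A6 = c(NA, NA, "S", "Inh"),
  CYP2B6 = c("S", NA, "S_Inh", "Ind"),
  stringsAsFactors = FALSE
)
input <- data.frame(rowID = 1:4, Drug = LookUp$Drugs, stringsAsFactors = FALSE)

subsVals <- c("S_Inh", "S", "S_Ind", "S_Inh_Ind")
cypCols  <- setdiff(names(LookUp), "Drugs")

# Logical matrix: TRUE where a cell holds one of the substrate codes
# (NA %in% subsVals is FALSE, so missing values need no special handling)
isSubs <- sapply(LookUp[cypCols], function(col) col %in% subsVals)

# Row of LookUp that corresponds to each drug in input
rows <- match(input$Drug, LookUp$Drugs)

# For each requested drug, join the names of the TRUE columns
input$metabCYPs <- apply(isSubs[rows, , drop = FALSE], 1,
                         function(r) toString(cypCols[r]))
input
```

Matching against an explicit set is stricter than grepl("S", ...): a code is counted only if it is listed in subsVals, so any new code would have to be added to that vector.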