Suppose one has defined the following data.table in R:
Drug1 Dose1 Freq1 Drug2 Dose2 Freq2 Drug3 Dose3 Freq3
1234567890 2 1 1548768954 23 2 2222132435 2 2
4356678344 2 2 6547894356 3 1 2123456789 2 2
5673452976 4 1 1234567890 4 0.5 4568789076 33 4
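How can one search for a specific integer value only in the columns "Drug1" through "Drug[x]" and, where it is found, create a new variable holding the product of the values in the next two columns of the same row (all other values of this new variable should be NA)?
Note: the integer values in the "Drug[x]" columns all have length 10 (e.g. 1234567890, 4593480033), while the search terms of interest have length 5 and correspond to the first 5 digits of those integers (e.g. 12345, 45934).
So if my search term is the integer 12345, the result would look like this: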
Drug1 Dose1 Freq1 Newvar1 Drug2 Dose2 Freq2 Newvar2 Drug3 Dose3 Freq3
1234567890 2 1 2 1548768954 23 2 NA 2222132435 2 2
4356678344 2 2 NA 6547894356 3 1 NA 2123456789 2 2
5673452976 4 1 NA 1234567890 4 0.5 2 4568789076 33 4
Thanks.
Answer 0 (score: 3)
You can try Map:
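# positions of the Drug columns (1, 4, 7 in the example)
v1 <- grep("Drug", colnames(df))
# each column of m1 holds the Dose/Freq positions that follow one Drug column
m1 <- matrix(sort(v1+rep(1:2,each=3)),ncol=3)
df[paste0('NewVar',1:3)] <- Map(function(x,y) {
  # TRUE where the first five digits of the drug ID are 12345
  x1 <- substr(df[,x],1,5)==12345
  # NA^!x1 is 1 for matching rows and NA otherwise
  Reduce(`*`,df[y]*(NA^!x1))}, v1, split(m1, col(m1)))
df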
# Drug1 Dose1 Freq1 Drug2 Dose2 Freq2 Drug3 Dose3 Freq3 NewVar1
#1 1234567890 2 1 1548768954 23 2.0 2222132435 2 2 2
#2 4356678344 2 2 6547894356 3 1.0 2123456789 2 2 NA
#3 5673452976 4 1 1234567890 4 0.5 4568789076 33 4 NA
# NewVar2 NewVar3
#1 NA NA
#2 NA NA
#3 2 NA
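The `NA^!x1` expression in the code above can look cryptic. The following is a minimal sketch of how it behaves, using made-up vectors (`hit`, `dose`, and `freq` are hypothetical names, not part of the answer): since `NA^0` is 1 and `NA^1` is NA in R, raising NA to the negated match vector gives a mask of 1 for matches and NA for everything else.

```r
# Hypothetical logical vector: TRUE where the drug ID matched
hit <- c(TRUE, FALSE, TRUE)

# !hit is FALSE/TRUE/FALSE, i.e. 0/1/0 when used as an exponent,
# so the mask is 1 for matches and NA for non-matches
mask <- NA^!hit

# Hypothetical dose and frequency values for the same three rows
dose <- c(2, 3, 4)
freq <- c(1, 1, 0.5)

# Multiplying by the mask keeps matched products and turns the rest into NA
newvar <- dose * freq * mask
newvar
```

An `ifelse(hit, dose * freq, NA)` call would give the same values; the mask form is simply more compact when multiplying whole columns.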
Or you can use a for loop with data.table:
library(data.table)
DT <- as.data.table(df)
nm1 <- grep('Drug', colnames(DT))
# Dose/Freq positions that follow each Drug column
nm2 <- lapply(nm1, function(x) c(x+1,x+2))
nm3 <- paste0('NewVar', seq_along(nm1))
for(j in seq_along(nm1)){
  # mask is 1 where the drug ID starts with 12345 and NA otherwise
  mask <- NA^!(substr(DT[[nm1[j]]],1,5)==12345)
  DT[, (nm3[j]):= Reduce(`*`, DT[, nm2[[j]], with=FALSE]*mask)]
}
DT
# Drug1 Dose1 Freq1 Drug2 Dose2 Freq2 Drug3 Dose3 Freq3 NewVar1
#1: 1234567890 2 1 1548768954 23 2.0 2222132435 2 2 2
#2: 4356678344 2 2 6547894356 3 1.0 2123456789 2 2 NA
#3: 5673452976 4 1 1234567890 4 0.5 4568789076 33 4 NA
# NewVar2 NewVar3
#1: NA NA
#2: NA NA
#3: 2 NA
Or, a slightly modified alternative based on @nicola's approach, using index numbers:
DT <- as.data.table(df)
indx <- 1:3
for(j in indx){
  # mask is 1 where Drug<j> starts with 12345 and NA otherwise
  mask <- NA^!(substr(DT[[paste0("Drug",j)]],1,5)==12345)
  DT[, (paste0('NewVar', j)):= DT[[paste0("Dose",j)]]*
       DT[[paste0("Freq",j)]]*mask]
}
DT
# Drug1 Dose1 Freq1 Drug2 Dose2 Freq2 Drug3 Dose3 Freq3 NewVar1
#1: 1234567890 2 1 1548768954 23 2.0 2222132435 2 2 2
#2: 4356678344 2 2 6547894356 3 1.0 2123456789 2 2 NA
#3: 5673452976 4 1 1234567890 4 0.5 4568789076 33 4 NA
# NewVar2 NewVar3
#1: NA NA
#2: NA NA
#3: 2 NA
The data used above:
df <- structure(list(Drug1 = c(1234567890, 4356678344, 5673452976),
Dose1 = c(2L, 2L, 4L), Freq1 = c(1L, 2L, 1L), Drug2 = c(1548768954,
6547894356, 1234567890), Dose2 = c(23L, 3L, 4L), Freq2 = c(2,
1, 0.5), Drug3 = c(2222132435, 2123456789, 4568789076), Dose3 = c(2L,
2L, 33L), Freq3 = c(2L, 2L, 4L)), .Names = c("Drug1", "Dose1",
"Freq1", "Drug2", "Dose2", "Freq2", "Drug3", "Dose3", "Freq3"
), class = "data.frame", row.names = c(NA, -3L))
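One caveat with `substr()` on numeric drug IDs: `substr()` converts its input with `as.character()`, which may render round values in scientific notation, so the prefix comparison could miss them. A minimal sketch of two more defensive ways to take the five-digit prefix follows; the `ids` vector is made up for illustration, and the integer-division variant assumes every ID has exactly 10 digits, as stated in the question.

```r
# Hypothetical IDs; the second one is a round number
ids <- c(1234567890, 5000000000)

# Inspect how the IDs are converted to text by default
as.character(ids)

# Assumption: exactly 10 digits per ID, so integer division by 1e5
# leaves the first five digits as a number
prefix_num <- ids %/% 1e5

# Alternative: print the full digits explicitly, then take the prefix as text
prefix_chr <- substr(sprintf("%.0f", ids), 1, 5)

prefix_num
prefix_chr
```

Either result can then be compared against the search term, e.g. `prefix_num == 12345`.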
Answer 1 (score: 0)
If you really have only 3 drugs, you can manually repeat the creation of Newvar three times and then rearrange the columns:
drug.id <- 12345
# integer division by 100000 leaves the first five digits of a 10-digit ID
df[, 'Newvar1'] <- ifelse(df[, 'Drug1'] %/% 100000 == drug.id,
                          df[, 'Dose1'] * df[, 'Freq1'],
                          NA)
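However, if this is only an example and your real data has more drugs, it is easier to first reshape the data to long format and do the calculation there. If you must, you can always convert back to wide format afterwards.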
# read data
df <- read.table(text='Drug1 Dose1 Freq1 Drug2 Dose2 Freq2 Drug3 Dose3 Freq3
1234567890 2 1 1548768954 23 2 2222132435 2 2
4356678344 2 2 6547894356 3 1 2123456789 2 2
5673452976 4 1 1234567890 4 0.5 4568789076 33 4', header=TRUE)
# reshape to long format
long.df <- reshape(df,
direction = 'long',
varying = list(paste0('Drug', 1:3),
paste0('Dose', 1:3),
paste0('Freq', 1:3)),
v.names = c('Drug', 'Dose', 'Freq'),
sep = '')
# calculation of Newvar
drug.id <- 12345
# integer division by 100000 leaves the first five digits of a 10-digit ID
long.df[, 'Newvar'] <- ifelse(long.df[, 'Drug'] %/% 100000 == drug.id,
                              long.df[, 'Dose'] * long.df[, 'Freq'],
                              NA)
# back to wide format
wide.df <- reshape(long.df,
direction = 'wide',
timevar = 'time',
idvar = 'id',
sep = '')
wide.df
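Since the question mentions data.table, the same long-format idea can also be sketched with data.table's `melt()` and `dcast()`. This is only an illustration under assumptions: the `id` column and the names `long` and `wide` are introduced here, and every drug ID is assumed to have exactly 10 digits so that integer division by 1e5 yields the five-digit prefix.

```r
library(data.table)

DT <- data.table(
  Drug1 = c(1234567890, 4356678344, 5673452976),
  Dose1 = c(2, 2, 4), Freq1 = c(1, 2, 1),
  Drug2 = c(1548768954, 6547894356, 1234567890),
  Dose2 = c(23, 3, 4), Freq2 = c(2, 1, 0.5),
  Drug3 = c(2222132435, 2123456789, 4568789076),
  Dose3 = c(2, 2, 33), Freq3 = c(2, 2, 4)
)

# Row identifier (an added helper column) so the rows can be rebuilt later
DT[, id := .I]

# Melt the three column groups at once; 'variable' holds the drug index
long <- melt(DT, id.vars = "id",
             measure.vars = patterns("^Drug", "^Dose", "^Freq"),
             value.name = c("Drug", "Dose", "Freq"))

# Product of Dose and Freq where the ID starts with 12345, NA otherwise
long[, Newvar := ifelse(Drug %/% 1e5 == 12345, Dose * Freq, NA_real_)]

# Back to wide format; sep = "" gives names such as Drug1 and Newvar1
wide <- dcast(long, id ~ variable,
              value.var = c("Drug", "Dose", "Freq", "Newvar"), sep = "")
wide
```

Note that `dcast()` groups the result by value column (all Drug columns, then all Dose columns, and so on), so `setcolorder()` would be needed to interleave them as in the question.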