R data.table - How to look up an integer value and multiply the values of the following columns?

Date: 2014-12-18 08:58:18

Tags: r data.table

Suppose one has the following data.table defined in R:

Drug1   Dose1   Freq1   Drug2   Dose2   Freq2   Drug3   Dose3   Freq3
1234567890  2   1   1548768954  23  2   2222132435  2   2
4356678344  2   2   6547894356  3   1   2123456789  2   2
5673452976  4   1   1234567890  4   0.5 4568789076  33  4

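How can one search only the columns "Drug1" through "Drug[x]" for a specific integer value and, where it is found, create a new variable that is the product of the values in the following two columns of the same row (all other values of this new variable should be NA)?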

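Note: the integer values in the "Drug[x]" columns all have length 10 (e.g. 1234567890, 4593480033, etc.), while the search term of interest has length 5 and is matched against the first 5 digits of the integer (e.g. 12345, 45934, etc.).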

So if my search term is the integer value 12345, the result would look like this:

Drug1   Dose1   Freq1   Newvar1 Drug2   Dose2   Freq2   Newvar2 Drug3   Dose3   Freq3
1234567890  2   1   2   1548768954  23  2   NA  2222132435  2   2
4356678344  2   2   NA  6547894356  3   1   NA  2123456789  2   2
5673452976  4   1   NA  1234567890  4   0.5 2   4568789076  33  4

Thanks.

2 answers:

Answer 0: (score: 3)

You could try Map:

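# positions of the Drug columns, and of the Dose/Freq pair that follows each one
v1 <- grep("Drug", colnames(df))
m1 <- matrix(sort(v1+rep(1:2,each=3)),ncol=3)
df[paste0('NewVar',1:3)] <- Map(function(x,y) {
      # TRUE where the first five digits of the drug code are 12345
      x1 <- substr(df[,x],1,5)==12345
      # NA^!x1 is 1 where x1 is TRUE and NA elsewhere
      Reduce(`*`,df[y]*(NA^!x1))}, v1, split(m1, col(m1)))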
df
#       Drug1 Dose1 Freq1      Drug2 Dose2 Freq2      Drug3 Dose3 Freq3 NewVar1
#1 1234567890     2     1 1548768954    23   2.0 2222132435     2     2       2
#2 4356678344     2     2 6547894356     3   1.0 2123456789     2     2      NA
#3 5673452976     4     1 1234567890     4   0.5 4568789076    33     4      NA
#  NewVar2 NewVar3
#1      NA      NA
#2      NA      NA
#3       2      NA
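
The NA^!x1 expression in the code above works as a mask. As a minimal sketch of why (the vectors match, dose and freq are made up for illustration and are not part of the original answer): in R, NA^0 evaluates to 1 while NA^1 stays NA, so raising NA to the negated logical keeps matching rows and blanks out the rest.

```r
# Hypothetical inputs, for illustration only
match <- c(TRUE, FALSE, TRUE)   # TRUE where the drug prefix matched
dose  <- c(2, 3, 4)
freq  <- c(1, 2, 0.5)

# !match is FALSE (0) on matches and TRUE (1) elsewhere,
# so NA^!match is 1 on matches and NA elsewhere
mask <- NA^!match

# multiplying by the mask keeps the product on matches and gives NA otherwise
newvar <- dose * freq * mask

# a more explicit spelling of the same idea
newvar_ifelse <- ifelse(match, dose * freq, NA_real_)
```

The ifelse form is longer but arguably easier to read; both should give the same vector.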

Update

You could also use a for loop with data.table:
 library(data.table)
 DT <- as.data.table(df)
 nm1 <- grep('Drug', colnames(DT))
 nm2 <- lapply(nm1, function(x) c(x+1,x+2))
 nm3 <- paste0('NewVar', seq_along(nm1))

 for(j in seq_along(nm1)){
     # explicit parentheses: the mask is NA^!(prefix == 12345)
     DT[, (nm3[j]):= Reduce(`*`,DT[,nm2[[j]],with=FALSE
         ]*(NA^!(substr(DT[[nm1[j]]],1,5)==12345)))]
  }

 DT
 #        Drug1 Dose1 Freq1      Drug2 Dose2 Freq2      Drug3 Dose3 Freq3 NewVar1
 #1: 1234567890     2     1 1548768954    23   2.0 2222132435     2     2       2
 #2: 4356678344     2     2 6547894356     3   1.0 2123456789     2     2      NA
 #3: 5673452976     4     1 1234567890     4   0.5 4568789076    33     4      NA
  #   NewVar2 NewVar3
  #1:      NA      NA
  #2:      NA      NA
  #3:       2      NA

Or a slightly modified alternative based on @nicola's approach, which uses the index numbers:

 DT <- as.data.table(df)
 indx <- 1:3
  for(j in indx){
    DT[, (paste0('NewVar', j)):=  DT[[paste0("Dose",j)]]*
    DT[[paste0("Freq",j)]]*(NA^!substr(DT[[paste0("Drug",j)]],1,5)==12345)]
   }
 DT
 #        Drug1 Dose1 Freq1      Drug2 Dose2 Freq2      Drug3 Dose3 Freq3 NewVar1
 #1: 1234567890     2     1 1548768954    23   2.0 2222132435     2     2       2
 #2: 4356678344     2     2 6547894356     3   1.0 2123456789     2     2      NA
 #3: 5673452976     4     1 1234567890     4   0.5 4568789076    33     4      NA
 #   NewVar2 NewVar3
 #1:      NA      NA
 #2:      NA      NA
 #3:       2      NA
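
All three snippets above test the prefix by converting the drug code to text with substr. As a sketch of a purely numeric alternative, assuming every code has exactly 10 digits as stated in the question (the vector drug and its third value are made up for illustration), integer division by 1e5 drops the last five digits:

```r
# Hypothetical drug codes; the third one was chosen to sit just below the 12345 prefix
drug <- c(1234567890, 4356678344, 1234499999)

# %/% is integer division: dividing by 1e5 removes the last five digits
prefix <- drug %/% 1e5

# TRUE only where the leading five digits are exactly 12345
is_match <- prefix == 12345
```

This avoids the implicit number-to-character conversion, which may render some round values in scientific notation and then break a substr-based comparison.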

Data

df <- structure(list(Drug1 = c(1234567890, 4356678344, 5673452976), 
Dose1 = c(2L, 2L, 4L), Freq1 = c(1L, 2L, 1L), Drug2 = c(1548768954, 
6547894356, 1234567890), Dose2 = c(23L, 3L, 4L), Freq2 = c(2, 
1, 0.5), Drug3 = c(2222132435, 2123456789, 4568789076), Dose3 = c(2L, 
2L, 33L), Freq3 = c(2L, 2L, 4L)), .Names = c("Drug1", "Dose1", 
"Freq1", "Drug2", "Dose2", "Freq2", "Drug3", "Dose3", "Freq3"
), class = "data.frame", row.names = c(NA, -3L))

Answer 1: (score: 0)

If you really only have 3 drugs, you can simply create Newvar by hand three times and then rearrange the columns:

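drug.id <- 12345
# integer-divide by 100000 to drop the last five digits and compare the 5-digit prefix
# (a window test such as abs(x - drug.id*100000) < 100000 would also match prefix 12344)
df[, 'Newvar1'] <- ifelse(df[, 'Drug1'] %/% 100000 == drug.id, 
                          df[, 'Dose1'] * df[, 'Freq1'], 
                          NA)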

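However, if this is just an example and your real data has more drugs, it is easier to first reshape the data to long format and do the calculation there. If you must, you can always go back to wide format afterwards.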

# read data
df <- read.table(text='Drug1   Dose1   Freq1   Drug2   Dose2   Freq2   Drug3   Dose3   Freq3
1234567890  2   1   1548768954  23  2   2222132435  2   2
4356678344  2   2   6547894356  3   1   2123456789  2   2
5673452976  4   1   1234567890  4   0.5 4568789076  33  4', header=TRUE)
# reshape to long format
long.df <- reshape(df, 
                   direction = 'long', 
                   varying = list(paste0('Drug', 1:3), 
                                  paste0('Dose', 1:3), 
                                  paste0('Freq', 1:3)), 
                   v.names = c('Drug', 'Dose', 'Freq'),
                   sep = '')
# calculation of Newvar: compare the 5-digit prefix (code with its last five digits dropped)
drug.id <- 12345
long.df[, 'Newvar'] <- ifelse(long.df[, 'Drug'] %/% 100000 == drug.id, 
                              long.df[, 'Dose'] * long.df[, 'Freq'], 
                              NA)
# back to wide format
wide.df <- reshape(long.df, 
                   direction = 'wide', 
                   timevar = 'time', 
                   idvar = 'id', 
                   sep = '')
wide.df
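
Since the question is tagged data.table, the same long-then-wide idea can be sketched with melt and dcast from that package. This is an illustrative variant, not part of the original answer: the column names id and slot are assumptions introduced here, and the Dose/Freq values are entered as doubles so that melt does not have to reconcile mixed column types.

```r
library(data.table)

# Same values as the question's table, all numeric columns stored as doubles
DT <- data.table(
  Drug1 = c(1234567890, 4356678344, 5673452976), Dose1 = c(2, 2, 4),  Freq1 = c(1, 2, 1),
  Drug2 = c(1548768954, 6547894356, 1234567890), Dose2 = c(23, 3, 4), Freq2 = c(2, 1, 0.5),
  Drug3 = c(2222132435, 2123456789, 4568789076), Dose3 = c(2, 2, 33), Freq3 = c(2, 2, 4)
)
DT[, id := .I]  # row identifier (assumed name) needed to reshape back

# wide -> long: one row per (id, slot), with Drug/Dose/Freq side by side
long <- melt(DT, id.vars = "id",
             measure.vars = patterns("^Drug", "^Dose", "^Freq"),
             value.name = c("Drug", "Dose", "Freq"),
             variable.name = "slot")

# product of Dose and Freq where the 5-digit prefix matches, NA otherwise
drug.id <- 12345
long[, Newvar := ifelse(Drug %/% 1e5 == drug.id, Dose * Freq, NA_real_)]

# long -> wide: columns come back as Drug1..3, Dose1..3, Freq1..3, Newvar1..3
wide <- dcast(long, id ~ slot,
              value.var = c("Drug", "Dose", "Freq", "Newvar"), sep = "")
wide
```

With this layout the number of drug slots does not need to be hard-coded; patterns() picks up however many Drug/Dose/Freq triples exist, provided each group has the same number of columns.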