Divide a column of a data.table by an integer that depends on another column in R

Date: 2014-06-04 08:04:25

Tags: r data.table

I have a data.table in R in the following format:

     COHORT      VARTYPE  SUM
  1:     RA          CDS   25
  2:     RA       INTRON 1152
  3:     RA        DONOR    0
  4:     RA     ACCEPTOR    1
  5:     RA TSS-UPSTREAM   98
 ---                         
101:    YRI      DISRUPT    0
102:    YRI  UNKNOWN-INC  979
103:    YRI         MIRB    0
104:    YRI         PFAM    8
105:    YRI     CGA_MIRB    0

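The COHORT column has 5 values: RA, Lupus, CEU, YRI and ASW.

I would like to divide the DT$SUM column by a different integer depending on the value of DT$COHORT.

Specifically,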

If DT[COHORT=="RA"]   then  DT$SUM<-(DT$SUM/62)
If DT[COHORT=="Lupus"]   then  DT$SUM<-(DT$SUM/62)
If DT[COHORT=="YRI"]   then  DT$SUM<-(DT$SUM/80)
If DT[COHORT=="CEU"]   then  DT$SUM<-(DT$SUM/96)
If DT[COHORT=="ASW"]   then  DT$SUM<-(DT$SUM/5)

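So far, however, my syntax only manages to divide the whole column by the given integer, whereas only the part of DT$SUM with the desired DT$COHORT value should be divided...

Thanks

3 Answers:

Answer 0: (score: 6)

In data.table, you can create another (lookup) table, as in @alexis_laz's answer (+1), then perform a join and recompute SUM, as follows:

First we'll generate some data (borrowed from @alexis_laz and modified slightly):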

require(data.table)
set.seed(101)
dat = data.table(COHORT = sample(c("RA", "Lupus", "YRI", "CEU", "ASW"), 1e5, TRUE), 
                 SUM = sample(100, 1e5, TRUE))

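Since the division will turn SUM into numeric (it is currently integer), we convert it explicitly here to avoid a warning (from data.table). Then we set the key for the join.

dat[, SUM := as.numeric(SUM)]
setkey(dat, COHORT)

Then we create another data.table, ii (the lookup), with the values to divide by:

ii = data.table(COHORT=c("RA", "Lupus", "YRI", "CEU", "ASW"), 
                val = as.integer(c(62, 62, 80, 96, 5)))

Now we perform a join of dat with ii and recompute SUM; the exact syntax depends on whether you use the current CRAN version or the upcoming version of data.table.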
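A minimal sketch of such an update join, assuming a recent data.table release that supports the on= argument. This is not necessarily the syntax the answer originally showed, and the small table and its values are made up for illustration:

```r
library(data.table)

# Small illustrative table (hypothetical values); SUM is already numeric
dat <- data.table(COHORT = c("RA", "Lupus", "YRI", "CEU", "ASW", "RA"),
                  SUM    = c(62, 124, 80, 96, 5, 31))

# Lookup table holding the divisor for each cohort
ii <- data.table(COHORT = c("RA", "Lupus", "YRI", "CEU", "ASW"),
                 val    = c(62, 62, 80, 96, 5))

# Update join: for every row of dat that matches a row of ii on COHORT,
# divide SUM by the matching divisor; the i. prefix refers to columns of ii
dat[ii, on = "COHORT", SUM := SUM / i.val]

print(dat)
```

Because := updates by reference, dat is modified in place and keeps its original row order; rows whose COHORT is absent from ii would be left unchanged.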

Answer 1: (score: 1)

Another approach is to use a lookup vector:

#some sample data
set.seed(101)
DF = data.frame(COHORT = sample(c("RA", "Lupus", "YRI", "CEU", "ASW"), 1e5, T), 
                SUM = 1)
#> head(DF)
#COHORT SUM
#1  Lupus   1
#2     RA   1
#3    CEU   1
#4    CEU   1
#5  Lupus   1
#6  Lupus   1

lookup = c(62, 62, 80, 96, 5)
names(lookup) = c("RA", "Lupus", "YRI", "CEU", "ASW")
lookup
# RA Lupus   YRI   CEU   ASW 
# 62    62    80    96     5

Then match your "COHORT" column against it:

ans1 = DF$SUM / unname(lookup[match(DF$COHORT, names(lookup))])
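To make the role of match() concrete, here is a small sketch with made-up values (cohort, sums, idx and result are hypothetical names). It assumes COHORT may be stored as a factor, in which case indexing lookup directly by the factor would use its integer codes, whereas match() compares the labels against the names:

```r
# Named lookup vector: cohort -> divisor
lookup <- c(RA = 62, Lupus = 62, YRI = 80, CEU = 96, ASW = 5)

# Hypothetical data; cohort is deliberately a factor
cohort <- factor(c("YRI", "RA", "ASW", "CEU", "Lupus"))
sums   <- c(160, 62, 10, 96, 31)

# match() converts the factor to character and returns, for each element,
# the position of its label within names(lookup)
idx <- match(cohort, names(lookup))

# Pick the divisor for each row and divide element-wise
result <- sums / unname(lookup[idx])
print(result)
```

A cohort that is missing from the lookup yields NA in idx and therefore NA in the result, which makes unmatched labels easy to spot.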

Compare it with your approach:

ans2 = with(DF, 
     ifelse(COHORT == "RA", SUM / 62,
            ifelse(COHORT == "Lupus", SUM / 62,
                   ifelse(COHORT == "CEU", SUM / 96,
                          ifelse(COHORT == "YRI", SUM / 80,
                                 ifelse(COHORT == "ASW", SUM / 5, NA))))))
identical(ans1, ans2)
#[1] TRUE

And some benchmarks:

library(microbenchmark)
microbenchmark(ans1 = {lookup = c(62, 62, 80, 96, 5);
                       names(lookup) = c("RA", "Lupus", "YRI", "CEU", "ASW");
                       DF$SUM / unname(lookup[match(DF$COHORT, names(lookup))])},
               ans2 = with(DF, 
                           ifelse(COHORT == "RA", SUM / 62,
                           ifelse(COHORT == "Lupus", SUM / 62,
                           ifelse(COHORT == "CEU", SUM / 96,
                           ifelse(COHORT == "YRI", SUM / 80,
                           ifelse(COHORT == "ASW", SUM / 5, NA)))))),
               times = 10)
#Unit: milliseconds
# expr        min         lq     median         uq        max neval
# ans1   6.398761   6.604084   6.646192   6.984801   8.790249    10
# ans2 126.283224 129.819299 164.598707 167.435119 167.830104    10

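Answer 2: (score: 0)

Based on agstudy's comment and some more searching:

with(ITGAMnovelvarsDTSUM,
     ifelse(COHORT == "RA", SUM / 62,
     ifelse(COHORT == "Lupus", SUM / 62,
     ifelse(COHORT == "CEU", SUM / 96,
     ifelse(COHORT == "YRI", SUM / 80,   # divisor for YRI is 80, as stated in the question
     ifelse(COHORT == "ASW", SUM / 5, NA))))))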
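The with()/ifelse() expression returns a new vector and leaves the table itself untouched. If the goal is to overwrite SUM in the data.table, one possible sketch is shown below; DT and divisor are hypothetical names with made-up values, and SUM is assumed to be numeric already and COHORT to be character:

```r
library(data.table)

# Hypothetical table; SUM is stored as numeric so the division needs no type change
DT <- data.table(COHORT = c("RA", "YRI", "ASW"),
                 SUM    = c(124, 40, 10))

# Divisors from the question, as a named vector
divisor <- c(RA = 62, Lupus = 62, YRI = 80, CEU = 96, ASW = 5)

# Index the named vector by the character COHORT column and
# overwrite SUM by reference
DT[, SUM := SUM / unname(divisor[COHORT])]

print(DT)
```

This keeps the divisors in one place, so adding a cohort means adding one element to divisor rather than another nested ifelse().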