I have a data.table in R in the following format:
COHORT VARTYPE SUM
1: RA CDS 25
2: RA INTRON 1152
3: RA DONOR 0
4: RA ACCEPTOR 1
5: RA TSS-UPSTREAM 98
---
101: YRI DISRUPT 0
102: YRI UNKNOWN-INC 979
103: YRI MIRB 0
104: YRI PFAM 8
105: YRI CGA_MIRB 0
The COHORT column has 5 values: RA, Lupus, CEU, YRI, and ASW.
I want to divide the DT$SUM column by a different integer depending on the value of DT$COHORT.
Specifically:
If DT[COHORT=="RA"] then DT$SUM<-(DT$SUM/62)
If DT[COHORT=="Lupus"] then DT$SUM<-(DT$SUM/62)
If DT[COHORT=="YRI"] then DT$SUM<-(DT$SUM/80)
If DT[COHORT=="CEU"] then DT$SUM<-(DT$SUM/96)
If DT[COHORT=="ASW"] then DT$SUM<-(DT$SUM/5)
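So far, however, my syntax only manages to divide the entire column by the given integer, whereas only the part of DT$SUM with the matching DT$COHORT value should be divided.
Thanks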
Answer 0 (score: 6)
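In data.table, you can create another (lookup) table, as in @alexis_laz's answer (+1), then perform a join and recompute SUM, as follows:
First we generate some data (borrowed from @alexis_laz and slightly modified):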
require(data.table)
set.seed(101)
dat = data.table(COHORT = sample(c("RA", "Lupus", "YRI", "CEU", "ASW"), 1e5, TRUE),
SUM = sample(100, 1e5, TRUE))
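Since the division will turn SUM into numeric (it is currently integer), we convert it explicitly here to avoid a warning from data.table. Then we set the join key.
dat[, SUM := as.numeric(SUM)]
setkey(dat, COHORT)
Then we create another data.table (the lookup) holding the values to divide by:
ii = data.table(COHORT=c("RA", "Lupus", "YRI", "CEU", "ASW"),
                val = as.integer(c(62, 62, 80, 96, 5)))
Now we perform the join as follows (shown here for both the current CRAN version and a future data.table version):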
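The join step is not reproduced here, so the following is only a minimal sketch of what an update join could look like. It assumes a data.table version that supports the `on=` argument (1.9.6 or later), in which case no key needs to be set. The small `dat` below is illustrative data, not the sample generated above, and `i.val` refers to the `val` column of the lookup table `ii`.

```r
library(data.table)

# Small illustrative data (an assumption, not the 1e5-row sample above)
dat <- data.table(COHORT = c("RA", "Lupus", "YRI", "CEU", "ASW", "RA"),
                  SUM    = c(62, 124, 80, 96, 5, 31))

# Lookup table: one divisor per cohort
ii <- data.table(COHORT = c("RA", "Lupus", "YRI", "CEU", "ASW"),
                 val    = c(62, 62, 80, 96, 5))

# Update join: for every row of dat that matches ii on COHORT,
# divide SUM by the matching divisor (i.val) and assign by reference
dat[ii, SUM := SUM / i.val, on = "COHORT"]

print(dat)
```

Because `:=` modifies `dat` by reference, no copy of the table is made, and adding a cohort only requires adding a row to `ii`.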
Answer 1 (score: 1)
Another approach is to use a lookup vector:
#some sample data
set.seed(101)
DF = data.frame(COHORT = sample(c("RA", "Lupus", "YRI", "CEU", "ASW"), 1e5, T),
SUM = 1)
#> head(DF)
#COHORT SUM
#1 Lupus 1
#2 RA 1
#3 CEU 1
#4 CEU 1
#5 Lupus 1
#6 Lupus 1
lookup = c(62, 62, 80, 96, 5)
names(lookup) = c("RA", "Lupus", "YRI", "CEU", "ASW")
lookup
# RA Lupus YRI CEU ASW
# 62 62 80 96 5
Then match your "COHORT" column against it:
ans1 = DF$SUM / unname(lookup[match(DF$COHORT, names(lookup))])
Compare it with your nested ifelse approach:
ans2 = with(DF,
ifelse(COHORT == "RA", SUM / 62,
ifelse(COHORT == "Lupus", SUM / 62,
ifelse(COHORT == "CEU", SUM / 96,
ifelse(COHORT == "YRI", SUM / 80,
ifelse(COHORT == "ASW", SUM / 5, NA))))))
identical(ans1, ans2)
#[1] TRUE
And some benchmarks:
library(microbenchmark)
microbenchmark(ans1 = {lookup = c(62, 62, 80, 96, 5);
names(lookup) = c("RA", "Lupus", "YRI", "CEU", "ASW");
DF$SUM / unname(lookup[match(DF$COHORT, names(lookup))])},
ans2 = with(DF,
ifelse(COHORT == "RA", SUM / 62,
ifelse(COHORT == "Lupus", SUM / 62,
ifelse(COHORT == "CEU", SUM / 96,
ifelse(COHORT == "YRI", SUM / 80,
ifelse(COHORT == "ASW", SUM / 5, NA)))))),
times = 10)
#Unit: milliseconds
# expr min lq median uq max neval
# ans1 6.398761 6.604084 6.646192 6.984801 8.790249 10
# ans2 126.283224 129.819299 164.598707 167.435119 167.830104 10
Answer 2 (score: 0)
Based on Agstudy's comment and some more searching:
with(ITGAMnovelvarsDTSUM,
     ifelse(COHORT == "RA", SUM / 62,
     ifelse(COHORT == "Lupus", SUM / 62,
     ifelse(COHORT == "CEU", SUM / 96,
     ifelse(COHORT == "YRI", SUM / 80,  # the question specifies 80 for YRI
     ifelse(COHORT == "ASW", SUM / 5, NA))))))