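I have a data frame of logistic regression summary statistics with the column names "CHR" "SNP" "BP" "A1" "TEST" "NMISS" "OR" "STAT" "P".
I want to create a new data frame with three columns: "SNP", "A1" and "logOR".
The obvious approach is to create a new column logOR and then simply subset on these three columns.
However, I was wondering whether it is possible to compute log(OR) during the subsetting step?
I tried: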
raw<-c("SNP","A1","log(OR)")
data.raw<-data[,raw]
R was not too impressed with this.
Thanks in advance!
Answer 0 (score: 3)
Using with is a simple way to do it (naming the argument so that the new column is called logOR):
dat.raw <- with(data, data.frame(SNP, A1, logOR = log(OR)))
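As a minimal sketch of why the attempt in the question fails and how with gets around it: a character vector passed to `[` is only matched against column names, so "log(OR)" is looked up as a literal name and never evaluated. The toy data frame and the helper variable `failed` below are made up for illustration.

```r
# Toy data (hypothetical values, for illustration only)
data <- data.frame(SNP = c("rs1", "rs2", "rs3"),
                   A1  = c("A", "G", "T"),
                   OR  = c(1, 2, 0.5))

# Character indexing matches column names; "log(OR)" is not evaluated,
# so this subset raises an error, which try() captures here
raw <- c("SNP", "A1", "log(OR)")
failed <- inherits(try(data[, raw], silent = TRUE), "try-error")

# with() evaluates the expressions inside the data frame;
# naming the argument gives the new column the name logOR
dat.raw <- with(data, data.frame(SNP, A1, logOR = log(OR)))
```

Without the `logOR =` name, data.frame would derive the column name from the expression itself, giving `log.OR.` under the default check.names = TRUE.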
Answer 1 (score: 2)
The fastest and cleanest way (imo) is to use the base function transform:
transform(data,logOR = log(OR))[c("SNP","A1","logOR")]
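A small sketch of the same one-liner on toy data may help show what it returns. The values below are invented for illustration; the point is that transform returns a modified copy, so the input data frame keeps its original columns.

```r
# Toy data (hypothetical values, for illustration only)
data <- data.frame(CHR = c(1, 1),
                   SNP = c("rs1", "rs2"),
                   A1  = c("A", "G"),
                   OR  = c(2, 0.5))

# transform() adds logOR to a copy; [ then keeps the three wanted columns
res <- transform(data, logOR = log(OR))[c("SNP", "A1", "logOR")]

# data itself is unchanged and has no logOR column
names(data)
```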
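BONUS
There are other ways to do this. I benchmarked several of them against each other and report results for a small and a larger dataset (1,000 and 100,000 rows).
In both cases, transform is the fastest. It is a base function and behaves exactly like mutate in this case.
with makes less sense to me "philosophically" here, but it is the shortest line after data.table, and it is almost on par with transform as the size goes up.
Small data.frames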
library(microbenchmark)
library(data.table) # as.data.table() and :=
library(dplyr)      # mutate(), select() and %>%
n <- 1000
data <- data.frame("CHR"=sample(1:n),"SNP"=sample(1:n),"BP"=sample(1:n),"A1"=sample(1:n),"TEST"=sample(1:n),
"NMISS"=sample(1:n),"OR"=sample(1:n),"STAT"=sample(1:n),"P"=sample(1:n))
data2 <- as.data.table(data)
microbenchmark(
transform = transform(data,logOR = log(OR))[c("SNP","A1","logOR")],
within = within (data,logOR <- log(OR))[c("SNP","A1","logOR")],
with = with (data, data.frame(SNP,A1,logOR=log(OR))), # jkt's solution
mutate = mutate(data,logOR = log(OR))[c("SNP","A1","logOR")], # mutate will behave exactly the same as transform in this case
mutate_p = data %>% mutate(logOR = log(OR)) %>% select(SNP, A1, logOR), # same computation with the pipe syntax, as suggested by Craig in the comments
data.table = as.data.table(data)[,logOR := log(OR)][,.(SNP,A1,logOR)], # data.table with conversion
data.table2 = data2[,logOR := log(OR)][,.(SNP,A1,logOR)], # data.table without conversion, this adds logOR to data2 however
times = 1000)
# Unit: microseconds
#         expr      min        lq      mean    median        uq       max neval
#    transform  202.086  243.4945  281.1694  263.3140  286.6725  6781.367  1000
#       within  290.919  353.2080  395.3183  373.5580  397.4480  7039.017  1000
#         with  279.948  337.8130  406.2508  361.8790  392.1390  7601.388  1000
#       mutate  912.040 1056.2610 1215.2035 1107.4010 1185.4395  8148.541  1000
#     mutate_p 1283.297 1516.7040 1741.8224 1584.3020 1710.2950 33254.564  1000
#   data.table  938.584 1058.5610 1175.6758 1116.7795 1214.4605  5079.035  1000
#  data.table2  819.314  935.5755 1086.9992  993.6175 1084.0425 27160.856  1000
Larger data.frames
n <- 100000
...
# Unit: milliseconds
#         expr      min       lq     mean   median        uq       max neval
#    transform 3.005094 3.320254 3.978661 3.548707  3.815381  14.87116  1000
#       within 3.252126 3.618074 4.542457 3.929165  4.275118  99.77254  1000
#         with 3.102066 3.413511 4.229389 3.653466  3.937482  89.80346  1000
#       mutate 3.803171 4.221853 4.931597 4.474195  4.815546  26.43214  1000
#     mutate_p 4.283788 4.754672 5.622917 4.996396  5.366238  92.74237  1000
#   data.table 4.831649 6.336141 9.911754 8.212245 12.283330 102.13386  1000
#  data.table2 3.997825 4.749894 6.677897 5.456840  6.125562 116.99369  1000
Edit: added the data.table solutions
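As noted in the benchmark comments, the := form adds logOR to the data.table it is called on. A possible alternative, sketched here on made-up toy data, is to compute the column directly inside the j expression, which selects and transforms in one step and leaves the input untouched. This variant was not part of the benchmark above, so no claim is made about its speed.

```r
library(data.table)

# Toy data (hypothetical values, for illustration only)
dt <- data.table(SNP = c("rs1", "rs2"),
                 A1  = c("A", "G"),
                 OR  = c(2, 0.5))

# .() builds a new data.table; logOR is computed while selecting,
# so dt is not modified by reference
res <- dt[, .(SNP, A1, logOR = log(OR))]
```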