Can I manipulate a column while subsetting in R?

Time: 2017-07-12 11:27:21

Tags: r subset data-manipulation

I have a data frame of logistic regression summary statistics with the column names

"CHR" "SNP" "BP" "A1" "TEST" "NMISS" "OR" "STAT" "P"

I want to create a new data frame with three columns:

"SNP" "A1" "logOR"

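The obvious approach is to create a new column logOR and then simply subset on those three columns.

However, I was wondering whether it is possible to compute log(OR) during the subsetting step itself?

I tried: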

raw<-c("SNP","A1","log(OR)")

data.raw<-data[,raw]

R wasn't too impressed with that.

Thanks in advance!

2 answers:

Answer 0: (score: 3)

Using with is a simple way to do this:

dat.raw <- with(data, data.frame(SNP,A1,log(OR)))
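To illustrate why the original attempt fails and what with changes, here is a minimal sketch on a made-up three-row data frame (the SNP names and OR values are hypothetical, not from the question). Character indexing looks column names up literally, so "log(OR)" is treated as a name rather than evaluated, whereas with evaluates the expression with the columns of data available as variables. Note that an unnamed log(OR) inside data.frame gets an automatically generated column name (log.OR. with the default settings), so naming it explicitly is probably what you want.

```r
# Toy stand-in for the summary-statistics data frame (values are made up)
data <- data.frame(SNP = c("rs1", "rs2", "rs3"),
                   A1  = c("A", "G", "T"),
                   OR  = c(1.0, 2.0, 0.5))

# Character indexing matches column names literally, so "log(OR)" is not found;
# the error is caught here and replaced by a marker string
attempt <- tryCatch(data[, c("SNP", "A1", "log(OR)")],
                    error = function(e) "error")

# with() evaluates the expression using the columns of data as variables;
# naming the argument gives the new column the name logOR
dat.raw <- with(data, data.frame(SNP, A1, logOR = log(OR)))

# Without an explicit name, data.frame() derives one from the expression
auto.names <- names(with(data, data.frame(SNP, A1, log(OR))))
```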

Answer 1: (score: 2)

The fastest and cleanest way (imo) is to use the base function transform

transform(data,logOR =  log(OR))[c("SNP","A1","logOR")]
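As a quick check of what that line returns, here is a small sketch on a hypothetical two-row data frame (the values are invented for illustration). transform returns a modified copy with the extra column, and the trailing [c(...)] keeps only the three columns of interest; the original data frame is left untouched.

```r
# Hypothetical miniature version of the data (values are made up)
data <- data.frame(CHR = c(1, 1),
                   SNP = c("rs1", "rs2"),
                   A1  = c("A", "G"),
                   OR  = c(1, exp(1)))

# Add logOR to a copy of data, then keep only the wanted columns
res <- transform(data, logOR = log(OR))[c("SNP", "A1", "logOR")]

names(res)    # columns of the result
names(data)   # the original still has no logOR column
```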

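Bonus

There are other ways to do this. I benchmarked several of them against each other and report results for a smaller and a larger dataset (1,000 rows and 100,000 rows).

In any case:

transform is the fastest. It is a base function and, in this case, behaves exactly like mutate.

with makes less sense to me "philosophically" here, but it is the shortest line after data.table, and it is almost on par with transform as the size goes up.

Small data.frames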

library(microbenchmark)
library(dplyr)       # mutate, select, %>%
library(data.table)  # as.data.table, :=
n <- 1000
data <- data.frame("CHR"=sample(1:n),"SNP"=sample(1:n),"BP"=sample(1:n),"A1"=sample(1:n),"TEST"=sample(1:n),
                   "NMISS"=sample(1:n),"OR"=sample(1:n),"STAT"=sample(1:n),"P"=sample(1:n))
data2 <- as.data.table(data)

microbenchmark(
  transform   = transform(data,logOR =  log(OR))[c("SNP","A1","logOR")],
  within      = within   (data,logOR <- log(OR))[c("SNP","A1","logOR")],
  with        = with     (data, data.frame(SNP,A1,logOR=log(OR))),               # jkt's solution
  mutate      = mutate(data,logOR = log(OR))[c("SNP","A1","logOR")],             # mutate will behave exactly the same as transform in this case
  mutate_p    = data %>% mutate(logOR = log(OR)) %>% select(SNP, A1, logOR),     # same function but with the pipe syntax, as formulated by Craig in the comments
  data.table  = as.data.table(data)[,logOR :=  log(OR)][,.(SNP,A1,logOR)],       # data.table with conversion
  data.table2 = data2[,logOR :=  log(OR)][,.(SNP,A1,logOR)],                     # data.table without conversion, this adds logOR to data2 however
  times = 1000)

# Unit: microseconds
#       expr      min        lq      mean    median        uq       max neval
#   transform  202.086  243.4945  281.1694  263.3140  286.6725  6781.367  1000
#      within  290.919  353.2080  395.3183  373.5580  397.4480  7039.017  1000
#        with  279.948  337.8130  406.2508  361.8790  392.1390  7601.388  1000
#      mutate  912.040 1056.2610 1215.2035 1107.4010 1185.4395  8148.541  1000
#    mutate_p 1283.297 1516.7040 1741.8224 1584.3020 1710.2950 33254.564  1000
#  data.table  938.584 1058.5610 1175.6758 1116.7795 1214.4605  5079.035  1000
# data.table2  819.314  935.5755 1086.9992  993.6175 1084.0425 27160.856  1000

Larger data.frames

n <- 100000
...
# Unit: milliseconds
#        expr      min       lq     mean   median        uq       max neval
#   transform 3.005094 3.320254 3.978661 3.548707  3.815381  14.87116  1000
#      within 3.252126 3.618074 4.542457 3.929165  4.275118  99.77254  1000
#        with 3.102066 3.413511 4.229389 3.653466  3.937482  89.80346  1000
#      mutate 3.803171 4.221853 4.931597 4.474195  4.815546  26.43214  1000
#    mutate_p 4.283788 4.754672 5.622917 4.996396  5.366238  92.74237  1000
#  data.table 4.831649 6.336141 9.911754 8.212245 12.283330 102.13386  1000
# data.table2 3.997825 4.749894 6.677897 5.456840  6.125562 116.99369  1000

Edit: added the data.table solutions