Question

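I often have data frames that come out of some computation and that I want to clean up, rename, and rearrange by column before output. All of the versions below work; the plain data.frame version comes closest.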

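Is there a way to combine the in-data-frame computation of within and mutate with the column-order preservation of data.frame(), without the extra, redundant [, ...] at the end?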

library(plyr) 

# Given this chaotically named data.frame
d = expand.grid(VISIT=as.factor(1:2),Biochem=letters[1:2],time=1:5,
                subj=as.factor(1:3))
d$Value1 =round(rnorm(nrow(d)),2)
d$val2 = round(rnorm(nrow(d)),2)

# I would like to cleanup, compute and rearrange columns

# Simple and almost perfect
dDataframe = with(d, data.frame(
  biochem = Biochem,
  subj = subj,
  visit = VISIT,
  value1 = Value1*3 
))
# This simple solution is almost perfect, 
# but requires one more line
dDataframe$value2 = dDataframe$value1*d$val2

# For the following methods I have to reorder 
# and select in a second step

# use mutate from plyr to allow computation on computed values,
# which transform cannot do.
dMutate =   mutate(d,
  biochem = Biochem,
  subj = subj,
  visit = VISIT,
  value1 = Value1*3, #assume this is a time consuming function
  value2 = value1*val2
  # Could set fields = NULL here to remove,
  # but this does not help getting column order
)[,c("biochem","subj","visit","value1","value2")]

# use within. Same problem, order not preserved
dWithin = within(d, {
  biochem = Biochem
  subj = subj
  visit = VISIT
  value1 = Value1*3
  value2 = value1*val2       
})[,c("biochem","subj","visit","value1","value2")]


all.equal(dDataframe,dWithin)
all.equal(dDataframe,dMutate)

Answer 1

You can use summarise (or summarize) from the plyr package. From the documentation:

Summarise works in an analogous way to transform, except instead of adding columns to an existing data frame, it creates a new data frame. [...]

For your example:

library(plyr)
summarize(d,
  biochem = Biochem,
  subj    = subj,
  visit   = VISIT,
  value1  = Value1 * 3,
  value2  = value1 * val2       
)
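
To check that the call above does what the question asks, here is a minimal sketch on a tiny stand-in data frame. The name d_small and its values are made up for illustration and are not part of the original question. It assumes plyr's summarise evaluates its arguments in order, so value2 can reuse the freshly computed value1, and that the result keeps the columns in the order they were written.

```r
library(plyr)

# Hypothetical two-row stand-in for the question's data frame `d`
d_small <- data.frame(
  VISIT   = factor(c(1, 2)),
  Biochem = c("a", "b"),
  time    = c(1L, 2L),
  subj    = factor(c(1, 1)),
  Value1  = c(1, 2),
  val2    = c(10, 20)
)

# plyr::summarise builds a new data frame from the named expressions only
res <- plyr::summarise(d_small,
  biochem = Biochem,
  subj    = subj,
  visit   = VISIT,
  value1  = Value1 * 3,     # computed first
  value2  = value1 * val2   # reuses the value1 computed on the previous line
)

names(res)   # column order follows the argument order
res$value2
```

Calling plyr::summarise with an explicit namespace avoids surprises if another package that also exports summarise (dplyr, for example) is attached later.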

Answer 2

If you are willing to move to data.table, you can perform (most of) these operations by reference and avoid the copying associated with [<-.data.frame and $<-.data.frame.

setnames renames the columns of a data.table, setcolorder reorders them, and := assigns by reference.

library(data.table)
DT <- data.table(d)
# rename to lowercase only
setnames(DT, old = names(DT), new = tolower(names(DT)))
# reassign using `:=`
# note the use of `value1<-value1` to allow later use. 
# This will not be necessary once FR1492 has been implemented
# setting to NULL removes these columns
DT[, `:=`(value1 =value1<- value1*3, 
         value2  = value1 * val2, 
         val2 = NULL, time = NULL )]
setcolorder(DT, c("biochem","subj","visit","value1","value2"))
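
As an alternative sketch that avoids the inline `value1 <- ...` trick, the same by-reference workflow can be split into separate := steps, so each step sees the columns updated by the previous one. The name DT_small and its values are hypothetical stand-ins, not part of the original answer.

```r
library(data.table)

# Hypothetical two-row stand-in for the question's data
DT_small <- data.table(
  VISIT   = factor(c(1, 2)),
  Biochem = c("a", "b"),
  time    = c(1L, 2L),
  subj    = factor(c(1, 1)),
  Value1  = c(1, 2),
  val2    = c(10, 20)
)

# Rename all columns to lowercase, in place
setnames(DT_small, old = names(DT_small), new = tolower(names(DT_small)))

DT_small[, value1 := value1 * 3]        # step 1: overwrite value1 by reference
DT_small[, value2 := value1 * val2]     # step 2: uses the updated value1
DT_small[, c("val2", "time") := NULL]   # drop the columns that are no longer needed

# Put the remaining columns into the desired order, in place
setcolorder(DT_small, c("biochem", "subj", "visit", "value1", "value2"))
```

This costs a few more lines than a single `:=`() call, but every step still modifies DT_small by reference, so no copy of the table is made.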

If you are less concerned about memory efficiency and simply want to use data.table syntax, then

DT <- data.table(d)
DT[,list(  biochem = Biochem,   
    subj    = subj,
   visit   = VISIT,
   value1 = value1  <- Value1 * 3,
   value2  = value1 * val2       
   )]

will work.
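
A variant of the same idea, sketched under the assumption that a braced block in j is acceptable style: compute the intermediate value as a local variable first, then return the list of columns in the desired order. DT_small and its values are hypothetical and only serve as illustration.

```r
library(data.table)

# Hypothetical two-row stand-in for the question's data
DT_small <- data.table(
  VISIT   = factor(c(1, 2)),
  Biochem = c("a", "b"),
  time    = c(1L, 2L),
  subj    = factor(c(1, 1)),
  Value1  = c(1, 2),
  val2    = c(10, 20)
)

res <- DT_small[, {
  value1 <- Value1 * 3          # local variable, visible only inside this j block
  list(
    biochem = Biochem,
    subj    = subj,
    visit   = VISIT,
    value1  = value1,
    value2  = value1 * val2     # reuses the local value1
  )
}]

names(res)
```

Because nothing is assigned with :=, DT_small itself is left unchanged and res is a new data.table containing only the listed columns.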

Compute, arrange columns, and select `within` a data frame
