Question

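This code gets the names of the 7 highest-valued columns in the first row and pastes them into a new variable. I want to do this for every row of a dataset with 1M rows, and I can't run a loop over it in a reasonable amount of time. What is the most efficient way to do this in R?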

Thanks

Answer 1

Not sure how efficient this is memory-wise, but it is reasonably fast and all base R:

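maxnrow <- function(data, n) {
  rowidx <- seq_len(nrow(data))
  out <- vector(mode="list", n)
  for (i in seq_len(n)) {
    # column index of the current maximum in every row (first one wins on ties)
    out[[i]] <- max.col(data, "first")
    # overwrite that maximum with -Inf so the next pass finds the next-largest value
    data[ cbind(rowidx, out[[i]]) ] <- -Inf
  }
  # map the indices back to column names and paste them together row by row
  do.call(paste, lapply( out, function(x) names(data)[x] ))
}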


mtcars2 <- mtcars[sample(1:nrow(mtcars),1e6,replace=TRUE),]

system.time( maxnrow(mtcars2, 7) )
#   user  system elapsed 
#  10.02    0.58   10.62
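
To make the masking trick easier to follow, here is a minimal sketch of the same idea on a tiny data frame. The helper name `top_n_colnames` and the `toy` data are made up for illustration, and the conversion to a matrix is an assumption (it requires all columns to be numeric); it is not part of the answer above.

```r
# Hypothetical helper: the repeated max.col idea, applied to a numeric matrix.
top_n_colnames <- function(data, n) {
  m <- as.matrix(data)                  # assumption: every column is numeric
  rowidx <- seq_len(nrow(m))
  picks <- vector("list", n)
  for (i in seq_len(n)) {
    # column index of each row's current maximum (first one wins on ties)
    picks[[i]] <- max.col(m, ties.method = "first")
    # mask that cell so the next pass finds the next-largest value
    m[cbind(rowidx, picks[[i]])] <- -Inf
  }
  # turn the indices into column names and paste them together per row
  do.call(paste, lapply(picks, function(idx) colnames(m)[idx]))
}

toy <- data.frame(a = c(1, 9), b = c(5, 2), c = c(3, 7))
top_n_colnames(toy, 2)
# [1] "b c" "a c"
```

The loop runs only `n` times regardless of the number of rows, which is why this approach scales to a million rows.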

Answer 2

sapply(1:nrow(mtcars), function(i) paste(names(sort(mtcars[i,1:11]))[5:11],collapse = " "))
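
A possible variant of the one-liner above, sketched under the assumption that all columns are numeric so the data can be treated as a matrix. It uses `order(..., decreasing = TRUE)` so the names come out highest value first; the variable name `top7` is only illustrative.

```r
# Sketch: order each row of the matrix and keep the names of the 7 largest values.
top7 <- apply(as.matrix(mtcars), 1, function(row) {
  paste(names(row)[order(row, decreasing = TRUE)[1:7]], collapse = " ")
})
head(top7, 3)
```

This is still a per-row loop in R, so it may remain slow on a million rows.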

Answer 3

Using data.table may be a good way to improve memory efficiency.

The idea here is to reshape the data into long format, sort the values for each car, and then select the top 7 for each group.

You can then use the result however you need, including pasting it together to make a new variable.

library(data.table)

dt_mtcars <- as.data.table(mtcars, keep.rownames = TRUE)

## melt the data into long form so we can sort it by one column
dt_mtcars <- melt(dt_mtcars, id.vars = "rn")

## order by group (rowname), and pick the top 7
setorder(dt_mtcars, rn, -value)
dt <- dt_mtcars[ dt_mtcars[, .I[c(1:7)], by = rn ]$V1 ]

## create a new column holding the original column names ('variable') of those top 7 values
dt[, paste0(variable, collapse = " "), by = rn]

                     rn                             V1
 1:         AMC Javelin   disp hp qsec mpg cyl wt drat
 2:  Cadillac Fleetwood   disp hp qsec mpg cyl wt carb
 3:          Camaro Z28   disp hp qsec mpg cyl carb wt
 4:   Chrysler Imperial   disp hp qsec mpg cyl wt carb
 5:          Datsun 710 disp hp mpg qsec cyl gear drat
 ... etc
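
If the goal is a new variable on the original wide table, one way to attach the pasted names is an update join. This is a rough sketch rather than part of the answer: the names `wide`, `long`, `top7_by_car` and the column `top7` are invented for illustration.

```r
library(data.table)

wide <- as.data.table(mtcars, keep.rownames = TRUE)   # row names land in column "rn"

# long format: one row per (car, column) pair; keep 'variable' as character
long <- melt(wide, id.vars = "rn", variable.factor = FALSE)

# sort within each car by descending value, then paste the first 7 column names
setorder(long, rn, -value)
top7_by_car <- long[, .(top7 = paste(head(variable, 7), collapse = " ")), by = rn]

# update join: add the pasted names to the wide table by reference
wide[top7_by_car, top7 := i.top7, on = "rn"]

wide[rn == "AMC Javelin", top7]
# [1] "disp hp qsec mpg cyl wt drat"
```

The join keeps the wide table's original row order, so the new column lines up with the rows it describes.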

Efficient way to concatenate the names of the 7 highest columns for each row?