R - Efficient selective sums over a high-dimensional sparse data frame

Asked: 2014-02-25 13:21:29

标签: r dataframe sparse-matrix

The following computation seems far too slow on my machine:

library(plyr)

# Function for creating n random pseudowords of predefined length, needed for colnames.
# Adapted from: http://ryouready.wordpress.com/2008/12/18/generate-random-string-name/
colnamesString <- function(n=10000, length=12)
{
  randomString <- character(n)            # initialize vector
  for (i in 1:n)
  {
    randomString[i] <- paste(sample(c(0:9, letters, LETTERS),
                                    length, replace=TRUE),
                             collapse="")
  }
  return(randomString)
}

set.seed(1)
myColnames <- colnamesString(10000, 8) # vector with 10000 random colnames of length 8 (no strsplit() needed; the names contain no spaces, and strsplit() would return a list rather than a vector)
datfra <- data.frame(matrix(data = sample(c(0,1), (10000*1500), replace= TRUE), nrow= 1500, ncol= 10000, dimnames= list(NULL, myColnames))) # creates a random dataframe with the colnames created before and binary values (not essential, just for readability).
datfra <- cbind(datfra, colID=(sample(c(1:150), 1500, replace= TRUE))) # creates the IDs vector
datfra[1:5, c(1:3, 10001)] # small section of the created dataframe, with corresponding IDs

coldatfra <- ddply(datfra[1:50,c(1:5,10001)], .(colID), numcolwise(sum)) # The solution for a small subset of the big dataframe.
#It works fine! But if applied to the whole dataframe it never ends computing.

# Therefore the challenge is: how to compute this efficiently with an ALTERNATIVE approach?
coldatfra <- ddply(datfra, .(colID), numcolwise(sum)) # stopped after 15 min of computing

EDIT start

The aim is to sum up all entries of every column for each unique colID. The result to check against is:

coldatfra[1:10, c(1:5, 10001)] # Small subset of rows, only five columns + the colID column:
   gnzUcTWE D3caGnLu IZnMVdE7 gn0nRltB ubPFN6Ip colID
1         3        4        5        5        6    12
2        10        8        7        4        7    24
3         4        8        4        5        5    36
4         2        4        6        5        5    36
5         5        6        6        6        7    55
6         5        2        4        3        4    42
7         5        3        6        5        4    63
8         8       12        8        8       10   160
9         7        3        5        3        3    90
10        2        3        1        2        2    60

EDIT end

1 Answer:

Answer 0 (score: 2):

EDIT: I think I misunderstood the OP; here is a new version based on my new understanding, which keeps the columns:

library(data.table)
res <- data.table(datfra)[, lapply(.SD, sum), by=colID]
# user  system elapsed 
# 8.32    0.05    8.38     

This is about 4.5x faster than the ddply version. Unfortunately, that is still somewhat slow.
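Base R's rowsum() computes exactly this kind of grouped column sum in compiled code and may be worth benchmarking against the data.table version. A sketch with a tiny hypothetical stand-in for datfra (the variable names here are assumptions, not from the original post):

```r
# Tiny stand-in for datfra: 2 data columns instead of 10000 (hypothetical values)
small <- data.frame(a = c(1, 0, 1, 1), b = c(0, 1, 1, 0),
                    colID = c(1, 2, 1, 2))

# rowsum() sums the rows of the numeric columns, grouped by colID, in C code
out <- rowsum(small[, c("a", "b")], group = small$colID)
out
#   a b
# 1 2 1
# 2 1 1
```

The group labels become the row names of the result, so each row holds the per-column sums for one colID.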


OLD STUFF:

If I understand correctly what you are trying to do, you can do this much faster by first computing the row sums across all columns and then aggregating by group:

datfrasum <- 
  data.frame(
    sums=rowSums(datfra[, names(datfra) != "colID"]), 
    colID=datfra$colID
  )
ddply(datfrasum, .(colID), colSums)

# user  system elapsed 
# 0.37    0.02    0.39 

The very slow step in this case is generating all the groups for so many columns, so this approach is much faster. In general, you want to use data.table or dplyr instead of plyr, since plyr is now a generation behind the other two in performance, but even with those you should consider collapsing the columns first.
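For reference, the column-preserving grouped sum might look like the following in dplyr. This is a sketch only: across() exists in dplyr versions far newer than this 2014 answer, and small is a hypothetical stand-in for datfra:

```r
library(dplyr)

# Hypothetical stand-in for datfra with two data columns
small <- data.frame(a = c(1, 0, 1), b = c(1, 1, 0), colID = c(1, 2, 1))

# Sum every non-grouping column within each colID group
res <- small %>%
  group_by(colID) %>%
  summarise(across(everything(), sum))
```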

Here is a data.table alternative, but because it does not do the rowSums first, it is actually slower than the method above:

library(data.table)
dattab <- data.table(datfra)
dattab[, sum(unlist(.SD)), by=colID]

If you do the rowSums first and use data.table, it is even faster.
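That combination might look like this (a sketch under assumed names; small stands in for the full datfra):

```r
library(data.table)

# Hypothetical stand-in for datfra with two data columns
small <- data.frame(a = c(1, 0, 1), b = c(1, 1, 0), colID = c(1, 2, 1))

# Collapse the columns with rowSums first, then aggregate the single
# column by colID; grouping one column is far cheaper than grouping 10000
dt <- data.table(sums = rowSums(small[, c("a", "b")]), colID = small$colID)
res <- dt[, .(total = sum(sums)), by = colID]
```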