Speeding up an acast() call to create a matrix

Asked: 2014-09-30 23:47:54

Tags: r plyr reshape2

I'm using the acast function from Hadley's reshape2 package to convert a flattened dataset (from a SQL Server query) into a term-document matrix, like so:

## Load packages
require("reshape2")
require("plyr")
require("RODBC")
require("lsa")

## Get flattened term-frequency data:
Terms <- read.csv(url("https://dl.dropboxusercontent.com/u/263772/flat_dtm.csv"), header = TRUE)
names(Terms) <- c("id", "Term", "Frequency")

system.time(terms.mtrx <- acast(Terms, id ~ Term, sum, value.var = 'Frequency')) # re-cast to a term-document matrix

The problem I'm running into is that terms.mtrx is very large: 40,000 rows by 17,000 columns, and the matrix is very sparse.

> head(Terms)
                      id                      Term Frequency
1 resume-108008-34530496           enterprise data         2
2 resume-108008-34530496 enterprise data warehouse         2
3 resume-108008-34530496                       etl         2
4 resume-108008-34530496                  facility         1
5 resume-108008-34530496                   faculty         1
6 resume-108008-34530496                 financial         1
>
> dim(Terms)
[1] 6139039       3

Is there a faster (and less memory-intensive) way to generate this matrix?

1 answer:

Answer 0 (score: 2)

I'm on a system where base R doesn't support https, so to access the data I used

library(httr)
Terms <- content(GET("http://dl.dropboxusercontent.com/u/263772/flat_dtm.csv"))
names(Terms) <- c("id", "Term", "Frequency")

Then I compared acast with xtabs(..., sparse=TRUE):

system.time(terms.mtrx <- acast(Terms, id ~ Term, sum, value.var = 'Frequency'))
#    user  system elapsed 
#   9.253   0.199   9.662 

system.time(terms.mtrx2 <- xtabs(Frequency~id+Term, Terms, sparse=TRUE))
#    user  system elapsed 
#   0.083   0.009   0.092 

We can see that

all(terms.mtrx == terms.mtrx2)
# [1] TRUE

so the results are the same.
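As a rough sketch of why the sparse route helps with memory as well as speed, the same kind of table can also be built directly with sparseMatrix() from the Matrix package, using the factor codes of id and Term as row and column indices. The data below is synthetic: the toy data frame, its sizes, and the names sp, xt and dense are made up for illustration and are not the data from the question.

```r
library(Matrix)  # sparseMatrix(); also needed by xtabs(..., sparse = TRUE)

# Synthetic stand-in for the flattened term-frequency data (hypothetical values)
set.seed(1)
n <- 20000
toy <- data.frame(
  id        = factor(sample(sprintf("doc%04d", 1:2000), n, replace = TRUE)),
  Term      = factor(sample(sprintf("term%04d", 1:1000), n, replace = TRUE)),
  Frequency = as.numeric(sample(1:3, n, replace = TRUE))
)

# Build the sparse matrix from the factor codes.
# Values for duplicated (id, Term) pairs are summed, which matches
# the sum aggregation used in the acast() call.
sp <- sparseMatrix(
  i = as.integer(toy$id),
  j = as.integer(toy$Term),
  x = toy$Frequency,
  dims = c(nlevels(toy$id), nlevels(toy$Term)),
  dimnames = list(levels(toy$id), levels(toy$Term))
)

# The same table via xtabs(), for comparison
xt <- xtabs(Frequency ~ id + Term, toy, sparse = TRUE)

# Dense copy, only to compare memory footprints on this small example
dense <- as.matrix(sp)
print(object.size(sp), units = "Kb")
print(object.size(dense), units = "Kb")
```

On data shaped like this, the sparse object stores only the non-zero cells, so it should be much smaller than the dense copy. I haven't measured this on the full 40,000 x 17,000 matrix from the question.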