I am using the acast function from Hadley's reshape2 package to convert a flattened dataset (from a SQL Server query) into a term-document matrix, as follows:
## Load packages
require("reshape2")
require("plyr")
require("RODBC")
require("lsa")
## Get flattened term-frequency data:
Terms <- read.csv(url("https://dl.dropboxusercontent.com/u/263772/flat_dtm.csv"), header = TRUE)
names(Terms) <- c("id", "Term", "Frequency")
system.time(terms.mtrx <- acast(Terms, id ~ Term, sum, value.var = 'Frequency')) # re-cast to a term-document matrix
The problem I am running into is that terms.mtrx is very large: roughly 40,000 rows x 17,000 columns, and the matrix is very sparse.
> head(Terms)
id Term Frequency
1 resume-108008-34530496 enterprise data 2
2 resume-108008-34530496 enterprise data warehouse 2
3 resume-108008-34530496 etl 2
4 resume-108008-34530496 facility 1
5 resume-108008-34530496 faculty 1
6 resume-108008-34530496 financial 1
>
> dim(Terms)
[1] 6139039 3
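To put rough numbers on "very large" and "very sparse", here is a back-of-the-envelope estimate. It is only a sketch: the dimensions are the approximate figures quoted above, 8 bytes per cell assumes the matrix is stored as doubles (an integer matrix would need half that), and the row count of Terms is used as an upper bound on the number of non-zero cells.

```r
# Approximate dimensions of terms.mtrx, as quoted in the question
n_rows <- 40000
n_cols <- 17000

# Each row of Terms contributes to at most one cell, so nrow(Terms)
# is an upper bound on the number of non-zero cells
n_nonzero_max <- 6139039

# Dense storage, assuming 8-byte doubles (an assumption, see above)
dense_gib <- n_rows * n_cols * 8 / 2^30

# Fraction of cells that can be non-zero
density <- n_nonzero_max / (n_rows * n_cols)

dense_gib  # about 5.07
density    # about 0.009, i.e. under 1% of the cells
```

So a dense result needs on the order of gigabytes, almost all of it spent storing zeros.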
Is there a faster (and less memory-intensive) way to generate this matrix?
Answer 0: (score: 2)
I am on a system where base R does not support https, so to access the data I used
library(httr)
Terms <-content(GET("http://dl.dropboxusercontent.com/u/263772/flat_dtm.csv"))
names(Terms) <- c("id", "Term", "Frequency")
Then I compared acast with xtabs(..., sparse=TRUE), which returns a sparse matrix from the Matrix package:
system.time(terms.mtrx <- acast(Terms, id ~ Term, sum, value.var = 'Frequency'))
# user system elapsed
# 9.253 0.199 9.662
system.time(terms.mtrx2 <- xtabs(Frequency~id+Term, Terms, sparse=TRUE))
# user system elapsed
# 0.083 0.009 0.092
We can see that
all(terms.mtrx == terms.mtrx2)
# [1] TRUE
so the results are the same, while xtabs took about 0.09 s elapsed against about 9.7 s for acast.
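Another option that may be worth trying is to build the sparse matrix directly with Matrix::sparseMatrix from the integer codes of the two factors. The following is a minimal sketch on a small made-up data frame (toy and its values are hypothetical stand-ins for Terms), and I have not timed it on the full dataset. It relies on sparseMatrix adding up the x values of repeated (i, j) pairs, which mirrors the sum aggregation in the acast call.

```r
library(Matrix)

# Hypothetical toy data with the same three columns as Terms
toy <- data.frame(
  id        = c("doc1", "doc1", "doc2", "doc3", "doc3"),
  Term      = c("etl", "faculty", "etl", "financial", "etl"),
  Frequency = c(2, 1, 3, 1, 1),
  stringsAsFactors = FALSE
)

# Factor codes give the row and column index of every record
ids   <- factor(toy$id)
terms <- factor(toy$Term)

# Repeated (i, j) pairs have their x values summed by sparseMatrix()
m_sparse <- sparseMatrix(
  i = as.integer(ids),
  j = as.integer(terms),
  x = toy$Frequency,
  dims = c(nlevels(ids), nlevels(terms)),
  dimnames = list(levels(ids), levels(terms))
)

# Same table via xtabs, for comparison
m_xtabs <- xtabs(Frequency ~ id + Term, toy, sparse = TRUE)

all(m_sparse == m_xtabs)
```

Either way the result stays sparse, so only the non-zero cells are stored instead of the full 40,000 x 17,000 grid.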