I have a large file, x, with the following format:
userid,productid,freq
293994,8,3
293994,5,3
949859,2,1
949859,1,1
123234,1,1
123234,3,1
123234,4,1
...
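It lists the products each user bought and how often. I am trying to turn it into a matrix with all products as columns, user IDs as rows, and the frequency values as entries. So the expected output is: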
       1 2 3 4 5 8
293994 0 0 0 0 3 3
949859 1 1 0 0 0 0
123234 1 0 1 1 0 0
This is a sparse matrix. I tried table(x[[1]], x[[2]]), which works for small files, but beyond a certain size table fails with:
Error in table(x[[1]], x[[2]]) :
attempt to make a table with >= 2^31 elements
Execution halted
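Is there a way to make this work? I am on R-3.1.0, which is supposed to support vectors of size 2^51, so I am confused about why it cannot handle a file of this size. I have 40MM lines, and the total file size is 741M. Thanks in advance.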
Answer 0 (score: 3)
One data.table way is:
library(data.table)
library(reshape2)
# adjust fun.aggregate as necessary - not very clear what you want from OP
dcast.data.table(your_data_table, userid ~ productid, value.var = "freq", fill = 0L)
You can check whether this works for your data.
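For reference, here is a minimal sketch of the same reshape on the question's seven sample rows. It assumes a recent data.table release in which dcast() is provided by data.table itself, so reshape2 is not loaded. The name purchases is made up for illustration, and fun.aggregate = sum is only an assumption about how repeated user/product pairs should be combined.

```r
library(data.table)

# Sample rows from the question; `purchases` is an illustrative name
purchases <- data.table(
  userid    = c(293994, 293994, 949859, 949859, 123234, 123234, 123234),
  productid = c(8, 5, 2, 1, 1, 3, 4),
  freq      = c(3, 3, 1, 1, 1, 1, 1)
)

# One row per user, one column per product; absent pairs are filled with 0.
# fun.aggregate = sum is an assumption: it adds up repeated user/product rows.
wide <- dcast(purchases, userid ~ productid,
              value.var = "freq", fun.aggregate = sum, fill = 0)
print(wide)
```

The result is still a dense table with one cell per user/product combination, so memory remains the limiting factor on very large inputs.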
Answer 1 (score: 0)
Here is a tidyr approach:
library(tidyverse)
library(magrittr)
# Replicate your example data
example_data <- matrix(
  c(293994, 8, 3,
    293994, 5, 3,
    949859, 2, 1,
    949859, 1, 1,
    123234, 1, 1,
    123234, 3, 1,
    123234, 4, 1),
  ncol = 3,
  byrow = TRUE) %>%
  as.data.frame %>%
  set_colnames(c('userid', 'productid', 'freq'))
# Convert data into wide format
spread(example_data, key = productid, value = freq, fill = 0)
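spread will be much faster than the base R table operation, but at scale data.table will in turn easily outperform tidyr/dplyr. However, as noted in the previous answer, the data.table equivalent, dcast, does not work properly. This appears to be a known issue which, unfortunately, is still unresolved.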
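I tried the tidyr approach at scale (2 million records) and could not get it to run on my local machine. So you will have to either split the data into chunks (and then combine the pieces with rbind) or move the job to a cluster (using rhadoop or sparklyr).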
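The chunk-and-rbind idea can be sketched in base R as follows. This is only an illustration on the question's sample rows: widen_chunk is a hypothetical helper, the chunk size of two users is arbitrary, and the code assumes each user/product pair occurs at most once. The set of product columns is fixed up front so that every chunk has identical columns and the pieces stack cleanly.

```r
# Sample rows from the question; `x` mirrors the question's variable name
x <- data.frame(
  userid    = c(293994, 293994, 949859, 949859, 123234, 123234, 123234),
  productid = c(8, 5, 2, 1, 1, 3, 4),
  freq      = c(3, 3, 1, 1, 1, 1, 1)
)

# Fix the column set once so every chunk produces the same columns
products <- sort(unique(x$productid))

# Hypothetical helper: widen the rows belonging to one group of users
widen_chunk <- function(part) {
  users <- unique(part$userid)
  m <- matrix(0, nrow = length(users), ncol = length(products),
              dimnames = list(as.character(users), as.character(products)))
  # Fill by (row, column) index pairs; assumes each pair occurs at most once
  m[cbind(match(part$userid, users), match(part$productid, products))] <- part$freq
  m
}

# Assign every user to a chunk (2 users per chunk, purely for illustration)
users_all <- unique(x$userid)
chunk_of_user <- ceiling(seq_along(users_all) / 2)
chunk_of_row <- chunk_of_user[match(x$userid, users_all)]

# Widen chunk by chunk, then stack the pieces with rbind
pieces <- unname(lapply(split(x, chunk_of_row), widen_chunk))
wide <- do.call(rbind, pieces)
print(wide)
```

All rows of a given user must land in the same chunk, which is why the chunks are defined over users rather than over raw row ranges. The stacked result is still dense, so this only helps when the final matrix fits in memory.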
Nevertheless, in case someone wants to add something, the code for a reproducible "big data" example is below.
# Make some random IDs
randomkey <- function(digits) {
  paste(sample(LETTERS, digits, replace = TRUE), collapse = '')
}
products <- replicate(10, randomkey(20)) %>% unique
customers <- replicate(500000, randomkey(50)) %>% unique
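big_example_data <- data.frame(
  # Four purchase rows per customer
  useruid = rep(customers, each = 4),
  # Four distinct products per customer, so each (useruid, productid) pair is
  # unique; spread() fails on duplicated pairs
  productid = as.vector(replicate(length(customers), sample(products, 4))),
  # One frequency between 1 and 5 per row
  freq = sample(1:5, 4 * length(customers), replace = TRUE)
)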
# 2 mio rows of purchases
dim(big_example_data)
# With useruid, productid, freq
head(big_example_data)
# Test tidyr approach
system.time(
  big_matrix <- spread(big_example_data, key = productid, value = freq, fill = 0)
)
Answer 2 (score: 0)
# This is old, but worth noting: sparseMatrix() from the Matrix package builds the object directly, without reshaping.
userid <- c(293994,293994,949859,949859,123234,123234,123234)
productid <- c(8,5,2,1,1,3,4)
freq <- c(3,3,1,1,1,1,1)
library(Matrix)
# The dgCMatrix produced by sparseMatrix() is a fraction of the size and builds much faster than reshaping once the data gets large
x <- sparseMatrix(i = as.integer(as.factor(userid)),
                  j = as.integer(as.factor(productid)),
                  dimnames = list(as.character(levels(as.factor(userid))),
                                  as.character(levels(as.factor(productid)))),
                  x = freq)
#Easily converted to a matrix.
x <- as.matrix(x)
#Learned this the hard way using recommenderlab (package built on top of Matrix) to build a binary matrix, so in case it helps someone else.
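As a rough illustration of why the sparse route avoids the question's error: table() allocates one cell per user/product combination and stops once that count reaches 2^31, whereas a sparse matrix stores only the pairs that actually occur. The sketch below reuses the sample data, builds each factor once, and works on the sparse object directly instead of converting it with as.matrix(). The names u, p, m, dense_cells and stored are illustrative.

```r
library(Matrix)

userid    <- c(293994, 293994, 949859, 949859, 123234, 123234, 123234)
productid <- c(8, 5, 2, 1, 1, 3, 4)
freq      <- c(3, 3, 1, 1, 1, 1, 1)

# Build each factor once and reuse it for both the indices and the dimnames
u <- factor(userid)
p <- factor(productid)
m <- sparseMatrix(i = as.integer(u), j = as.integer(p), x = freq,
                  dims = c(nlevels(u), nlevels(p)),
                  dimnames = list(levels(u), levels(p)))

# Cells a dense table would allocate versus entries actually stored
dense_cells <- prod(dim(m))
stored <- nnzero(m)
cat("dense cells:", dense_cells, " stored entries:", stored, "\n")

# Query and summarise the sparse object without densifying it
print(m["293994", "8"])
print(rowSums(m))
```

On the real file the same comparison applies: the stored count is bounded by the number of lines, while the dense cell count is the number of distinct users times the number of distinct products. Note that sparseMatrix() adds up the values of repeated (i, j) pairs by default, which may or may not be what is wanted for duplicated purchases.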