Question

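I have a dataset about microblogs (600 MB, 5,038,720 observations), and I am trying to find out how many tweets (tweets with the same mid) a user posted within one hour. Here is what the dataset looks like: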

head(mydata)

       uid              mid    year month date hour min sec
1738914174 3342412291119279 2011     8    3   21   4  12
1738914174 3342413045470746 2011     8    3   21   7  12
1738914174 3342823219232783 2011     8    5    0  17   5
1738914174 3343095924467484 2011     8    5   18  20  43
1738914174 3343131303394795 2011     8    5   20  41  18
1738914174 3343386263030889 2011     8    6   13  34  25

Here is my code:

count <- function(x) {
length(unique(na.omit(x)))
}
attach(mydata)
hourPost <- aggregate(mid, by=list(uid, hour), FUN=count)

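It hung for about half an hour, and I found that all of the physical memory (24 GB) had been used up and it had started using virtual memory. Any idea why this small task consumes so much time and memory, and how I can improve it? Thanks in advance!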

Answer 1

Use the data.table package:

mydata <- read.table(text="       uid              mid    year month date hour min sec
1738914174 3342412291119279 2011     8    3   21   4  12
1738914174 3342413045470746 2011     8    3   21   7  12
1738914174 3342823219232783 2011     8    5    0  17   5
1738914174 3343095924467484 2011     8    5   18  20  43
1738914174 3343131303394795 2011     8    5   20  41  18
1738914174 3343386263030889 2011     8    6   13  34  25", 
header=TRUE, colClasses = c(rep("character",2),rep("numeric",6)), 
stringsAsFactors = FALSE)

library(data.table)
DT <- data.table(mydata)
DT[, length(unique(na.omit(mid))), by=list(uid,hour)]

aggregate coerces the grouping variables to factors, which may be what eats up your memory (I assume uid has a lot of levels).
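As a minimal sketch of the same grouped count, data.table's `uniqueN()` can stand in for `length(unique(na.omit(x)))`. The toy rows and the column name `n_posts` below are made up for illustration, and the `na.rm` argument assumes a reasonably recent data.table version:

```r
library(data.table)

# Toy data standing in for the real table (values are invented).
DT <- data.table(
  uid  = c("u1", "u1", "u1", "u2", "u2"),
  mid  = c("m1", "m1", "m2", "m3", NA),
  hour = c(21, 21, 21, 0, 0)
)

# uniqueN() counts distinct values; na.rm = TRUE drops NA, as na.omit() does.
hourPost <- DT[, .(n_posts = uniqueN(mid, na.rm = TRUE)), by = .(uid, hour)]
print(hourPost)
```

Grouping in data.table works on the columns directly, so no factor conversion of uid is needed.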

There may be more potential for optimization, but you did not provide a representative test case.
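One point the grouping leaves open: `hour` is the hour of day, so grouping by uid and hour pools posts made at the same hour on different days. If the goal is a count per calendar hour (an assumption, since the question does not say), the date columns can be added to `by`. A rough sketch on invented rows, with `perHour` and `n_posts` as illustrative names:

```r
library(data.table)

# Invented rows: same user, same hour of day, two different days.
DT <- data.table(
  uid   = c("u1", "u1", "u1"),
  mid   = c("m1", "m2", "m3"),
  year  = c(2011, 2011, 2011),
  month = c(8, 8, 8),
  date  = c(3, 3, 5),
  hour  = c(21, 21, 21)
)

# One group per user and calendar hour instead of per user and hour of day.
perHour <- DT[, .(n_posts = uniqueN(mid)), by = .(uid, year, month, date, hour)]
print(perHour)
```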

R 'aggregate' runs out of memory
