Question

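I am working with a financial data set. It has 3M rows covering 30k companies, with a mix of column types: integers, floats, categorical variables, strings, etc.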

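My algorithm needs to perform a calculation over all rows for a given company. To do that, I use the subset function to pick out the relevant rows.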

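As you can imagine, this is very inefficient: to identify the subset, R has to scan all 3M rows. If I repeat this for all 30k companies, R scans the data set 30k times, which is terrible.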

A better approach would be to group the data set by company in some way, so that only the required rows are accessed.

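I know I can do this very efficiently in Python with a dictionary, where the keys are the company names and each value is a list of all the rows for that company, so I can get at the relevant rows in one step.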

But I don't know how to do similarly efficient storage/retrieval in R. Any help/pointers would be appreciated.

Answer 1

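P Lapointe's comment about data.table is correct; I don't think you'll find anything better. For comparison, the best base R method I know of is to build a key by splitting the row indices and then subsetting with them. This is much faster than calling subset for each company or splitting the whole data frame. plyr is about as fast as splitting the row indices. data.table is orders of magnitude faster. The timings are from my system; I didn't bother to benchmark properly.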

## simulated data: 30k companies x 100 rows each = 3M rows
d <- data.frame(company=factor(rep(1:3e4,100)),
                other=round(sample(runif(3e6)),2))

## using subset individually
## 1.8 sec for 10 companies, so ~540 sec total
out0 <- sapply(levels(d$company)[1:10], function(companyi) {
    di <- subset(d, company==companyi)
    mean(di$other)
})

## ## "standard" way: split the data frame, then apply over the pieces;
## ## the split is prohibitively slow, probably too memory intensive
## ds <- split(d, d$company)
## sapply(ds, function(di) mean(di$other))

## not too bad, but still slow, possibly the best base R method?
## 2.6 sec to do only first 1000 companies, so ~78 sec total
idx <- split(seq_len(nrow(d)), d$company)
out1 <- sapply(idx[1:1000], function(i) mean(d[i,]$other))

## plyr, about the same timing as above (run here on the first 1e4 rows only)
library(plyr)
out2 <- ddply(d[1:1e4,], ~company, summarize, m=mean(other))

## data table is the clear speed demon
## 0.07 sec to do all companies
library(data.table)
DT <- as.data.table(d)
out3 <- DT[, mean(other), keyby=company]
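
For the dictionary-style lookup the question asks about (fetching one company's rows directly), a keyed data.table may be worth trying. Below is a minimal sketch on a small made-up table, not a benchmark. The table size, the integer company ids, and the names rows_42, rows_some and res are assumptions for illustration only.

```r
library(data.table)

## small stand-in table (hypothetical): 300 companies x 100 rows each
set.seed(1)
DT <- data.table(company = rep(1:300, 100),
                 other   = round(runif(3e4), 2))

## sort by company and mark it as the key;
## keyed lookups use binary search instead of scanning every row
setkey(DT, company)

## all rows for one company, similar to dict[key] in Python
rows_42 <- DT[.(42L)]

## rows for several companies at once
rows_some <- DT[.(c(7L, 42L, 300L))]

## several per-company summaries in one grouped pass
res <- DT[, .(m = mean(other), n = .N), keyby = company]
```

If the real computation needs the whole per-company table rather than a single column, .SD inside the j expression should give access to each group's rows. That has not been timed here.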
