Question

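My data has a single column, and I am trying to create additional columns from whatever follows each "/" in a row. I found an answer to my earlier related question here. These are the first 5 rows of the data: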

> dput(mydata)
structure(list(ALL = structure(c(1L, 4L, 4L, 3L, 2L), .Label = c(
"/ca/put/sent_1/fe.gr/eq2_on/eq2_off",
"/ca/put/sent_1/fe.gr/eq2_on/eq2_off/cbr_LBL",
"/ca/put/sent_1/fe.gr/eq2_on/eq2_off/cni_at.p3x.4",
"/ca/put/sent_1/fe.gr/eq2_on/eq2_off/hi.on/hi.ov"), class = "factor")),
.Names = "ALL", class = "data.frame", row.names = c(NA, -5L))

The following works on the 5-row sample:

res <- strsplit(as.character(mydata$ALL), "/", fixed = TRUE)
res.df <- as.data.frame(do.call(rbind,
    lapply(lapply(res, factor, levels = unique(unlist(res))), table)))

But with millions of rows it is slow... system.time(replicate(75000000, res.df)) returned an error, with the timing stopped at 563.04 21.28 644.77 (Error: cannot allocate vector of size 2.8 Gb...).

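The original data has more than 400M rows, and the strings between the "/" characters produce about 100 columns. Is there a way to speed up the above operation in R?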

Answer 1

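Two things might help speed up the creation of res.df. First, you don't want to execute unique(unlist(res)) during every iteration. Second, you should combine the functions used in the lapply calls so that you only make a single pass over the data. You could use Compose from the functional package, but it is just as easy to write your own function.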

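# Compute the full set of levels once, up front.
lvls <- unique(unlist(res))
# Convert to a factor and tabulate in a single step.
helper <- function(x) {
    table(factor(x, levels = lvls))
}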

res.df <- as.data.frame(do.call(rbind, lapply(res, helper)))

With a data set this large, this may not solve your problem, but it is a place to start.
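As a quick sanity check, the single-pass helper can be compared against the original two-pass code on the 5-row sample. This is only a sketch: the sample is rebuilt by hand from the dput output in the question (as plain strings rather than a factor), and the names old and new are made up for the comparison.

```r
# Sample rows rebuilt from the question's dput output (an assumption:
# stored as character instead of factor, which strsplit needs anyway).
mydata <- data.frame(ALL = c(
  "/ca/put/sent_1/fe.gr/eq2_on/eq2_off",
  "/ca/put/sent_1/fe.gr/eq2_on/eq2_off/hi.on/hi.ov",
  "/ca/put/sent_1/fe.gr/eq2_on/eq2_off/hi.on/hi.ov",
  "/ca/put/sent_1/fe.gr/eq2_on/eq2_off/cni_at.p3x.4",
  "/ca/put/sent_1/fe.gr/eq2_on/eq2_off/cbr_LBL"
), stringsAsFactors = FALSE)

res <- strsplit(mydata$ALL, "/", fixed = TRUE)

# Original approach: two nested lapply passes.
old <- as.data.frame(do.call(rbind,
    lapply(lapply(res, factor, levels = unique(unlist(res))), table)))

# Suggested approach: levels computed once, one pass with a helper.
lvls <- unique(unlist(res))
helper <- function(x) {
    table(factor(x, levels = lvls))
}
new <- as.data.frame(do.call(rbind, lapply(res, helper)))

# Both versions should give the same data frame.
stopifnot(identical(old, new))
```

On the real data, wrapping each version in system.time() on a sample of a few thousand rows would show whether the change is worth it before committing to the full 400M rows.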

Answer 2

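If those are all too slow, you do have another option: write the column to a file, then read it back in as a delimited file using sep = "/". Then cbind the two data.frames.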

It isn't particularly elegant, though.
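A minimal sketch of that file round trip might look like the following. The temporary file, the variable names (tmp, n_cols, parts, out) and the generated column names V1, V2, ... are all assumptions for illustration. Note that this gives one column per position in the path, not the per-token count table produced by the table() code in the question.

```r
# Hand-built sample standing in for the real column (an assumption).
mydata <- data.frame(ALL = c(
  "/ca/put/sent_1/fe.gr/eq2_on/eq2_off",
  "/ca/put/sent_1/fe.gr/eq2_on/eq2_off/hi.on/hi.ov",
  "/ca/put/sent_1/fe.gr/eq2_on/eq2_off/hi.on/hi.ov",
  "/ca/put/sent_1/fe.gr/eq2_on/eq2_off/cni_at.p3x.4",
  "/ca/put/sent_1/fe.gr/eq2_on/eq2_off/cbr_LBL"
), stringsAsFactors = FALSE)

# Write the column out as plain text, one row per line.
tmp <- tempfile(fileext = ".txt")
writeLines(as.character(mydata$ALL), tmp)

# Find the widest row so read.table gets the column count up front
# instead of guessing it from the first few lines.
n_cols <- max(count.fields(tmp, sep = "/", quote = "", comment.char = ""))

# Read the file back as "/"-delimited; fill = TRUE pads shorter rows.
parts <- read.table(tmp, sep = "/", fill = TRUE, quote = "",
                    comment.char = "", stringsAsFactors = FALSE,
                    col.names = paste0("V", seq_len(n_cols)))

# The leading "/" produces an empty first field, so drop that column.
parts <- parts[, -1, drop = FALSE]

# Bind the split columns back onto the original data.
out <- cbind(mydata, parts)
unlink(tmp)
```

For the full data set, a faster reader would probably be needed, but the shape of the approach stays the same.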

How to speed up the operation in R
