我通过搜索其他人如何在Stackoverflow上做事来学习R,因此,我已经熟悉了plyr语法。我有以下四个使用ddply的plyr调用,这是我的代码的速率限制步骤。我的数据接近数十万到数百万条记录,而且我的代码主要由于data.table而过时,并且只受到四个速率限制但是关键的plyr步骤的限制。我想用dplyr或data.table替换它们,但我一直在努力复制语法,并希望得到任何帮助。
1. mergeddf3 <- ddply(mergeddf2, .(df.activ.id, channel), summarize, spotsids = paste(mainID, collapse = ","), spotsdt = paste(DateTime, collapse = ","), spotsinfos = paste(cat, collapse = ","), effrespflags = paste(effrespflag, collapse = ","))
2. webuniq_test <- ddply(webuniq, c("df.activ.id"),summarise, strRM = paste(replicate(RMCount, "RM"), collapse = ","))
3. webactiv2 <- ddply(webactiv, .(VisitorID), summarize, VisitorPath = paste(Path, collapse = ","), RMpath = paste(strRM, collapse = ","), ConvTot=sum(Conv), Conv2Tot=sum(Conv2), Cov3Tot=sum(Conv3)) #check that nrow dec
4. MeltForSO3 <- ddply(MeltForSO2, c("VisitorID","ID"),summarise, SplitThis = paste(value, collapse = ","))
对于(1)这里是基准:
#user system elapsed
#378.463 3.136 383.786
这是我在这些步骤中想要完成的事情(它们是相似的):
stops_path = paste(stops, collapse = ",")
将这些步骤汇总到一行作为&#34; a,b,c&#34; ConvTot=sum(Conv)
我试图用dplyr或data.table复制它是不成功的。
对于这些类型的聚合,使用一个优于另一个是否有优势?我已经看过这个了,看起来data.table对于我非常简单的用例可能更好,因为语法更清晰: data.table vs dplyr: can one do something well the other can't or does poorly?
这是我尝试使用data.table复制上述(1)的失败:
setkey(setDT(mergeddf2),df.activ.id, MarketingChannel)
mergeddf3test <- mergeddf2[, list(spotsids = paste(mainID, collapse = ","), spotsdt = paste(DateTime, collapse = ","), spotsinfos = paste(tvcat, collapse = ","), effrespflags = paste(effrespflag, collapse = ",")), by=list(df.activ.id,Channel)]
这引发了一个错误:unused argument (by = list(df.activ.id, Channel))
我从代码中写道,我在SO上研究了如何将粘贴到data.table中。我拿出了by参数只是为了看看会发生什么,并在下面的行中得到另一个错误:
mergeddf3test <- mergeddf2[, list(spotsids = paste(spotID, collapse = ","), spotsdt = paste(DateTime, collapse = ","), spotsinfos = paste(tvcat, collapse = ","), effrespflags = paste(effrespflag, collapse = ","))]
错误是"Error in paste(spotID, collapse = ",") : object 'spotID' not found"
这很奇怪,因为该字段肯定在数据中。我认为这个data.table行将使用by字段(df.activ.id和Channel)正确地聚合数据,并将字符字段与上面的(a,b,c)示例结合起来。
显然,我需要正确学习dplyr或data.table的语法,因为我正在使用的数据规模,所以我已经注册了两个数据包的数据阵营类。尽管如此,我还是对如何在短期内解决这个问题表示感谢。
谢谢!
答案 0 :(得分:1)
data.table
的复制适用于我(除非该频道已大写)。以下是我尝试使用dplyr
和data.table
复制列表的第一步。
# required packages
require(plyr)
require(dplyr)
require(data.table)
mergeddf2 <- data.frame(df.activ.id = 1:5,
channel = 1:8,
mainID = 1:40,
DateTime = Sys.Date() - 80:1,
cat = letters[1:6],
effrespflag = rnorm(240),
othervar = 1,
MarketingChannel = 2)
mergeddf3 <- ddply(mergeddf2, .(df.activ.id, channel), summarize,
spotsids = paste(mainID, collapse = ","),
spotsdt = paste(DateTime, collapse = ","),
spotsinfos = paste(cat, collapse = ","),
effrespflags = paste(effrespflag, collapse = ","))
mergeddf3.dplyr <-
mergeddf2 %>%
group_by(df.activ.id, channel) %>%
summarise_each(funs = funs(paste(., collapse = ",")), mainID, DateTime, cat, effrespflag) %>%
magrittr::set_colnames(c("df.activ.id", "channel", "spotsids", "spotsdt", "spotsinfos", "effrespflags"))
# check for equality
all.equal(mergeddf3, as.data.frame(mergeddf3.dplyr))
## [1] TRUE
setDT(mergeddf2)
mergeddf3test <- mergeddf2[, list(spotsids = paste(mainID, collapse = ","),
spotsdt = paste(DateTime, collapse = ","),
spotsinfos = paste(cat, collapse = ","),
effrespflags = paste(effrespflag, collapse = ",")),
by=list(df.activ.id,channel)]
# check for equality
all.equal(mergeddf3, setDF(setkeyv(mergeddf3test, c("df.activ.id", "channel"))))
## [1] TRUE