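I have two data frames: URLSMMG, with 374 observations, and pagesVisited, with 99120 observations. I use the following loop to sum all the pagesVisited values that satisfy two conditions and put the result in a new column of URLSMMG: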
# Calculate pageviews from MMG
for (i in 1:nrow(URLSMMG)) {
  URLSMMG$pageviewsMMGClick[i] <- sum(pagesVisited[
    which(pagesVisited[,11] == URLSMMG$URLWithoutParameters[i] &
          grepl(paste0("ic=", URLSMMG$Code[i]), pagesVisited$evar3) == TRUE), 3])
}
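Measuring the execution time shows that the loop takes about 4 minutes to finish. I am happy with the result, since the output is as expected, but I am not sure I am using the fastest way to do the calculation. Does anyone know another way to do this in less time?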
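For reference, a run like this can be timed with base R's system.time(). The following is only a minimal sketch on made-up data: toy_urls and toy_visits are hypothetical stand-ins, not the real data frames, and the loop uses a single equality condition.

```r
# Hypothetical toy data standing in for URLSMMG and pagesVisited
set.seed(1)
toy_urls <- data.frame(url = as.character(1:50), stringsAsFactors = FALSE)
toy_visits <- data.frame(url = sample(toy_urls$url, 5000, TRUE),
                         clicks = 1,
                         stringsAsFactors = FALSE)
toy_urls$views <- 0  # create the result column up front

# system.time() reports user, system and elapsed seconds for the expression
timing <- system.time({
  for (i in seq_len(nrow(toy_urls))) {
    toy_urls$views[i] <- sum(toy_visits$clicks[toy_visits$url == toy_urls$url[i]])
  }
})
print(timing["elapsed"])
```

For timings this short, repeating the measurement several times gives a more stable picture than a single run.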
Answer 0 (score: 1)
The following should be faster:
## temporary vectors
pagesVisited11 <- pagesVisited[, 11]
URLWithoutParameters <- URLSMMG$URLWithoutParameters
Code <- URLSMMG$Code
evar3 <- gsub("ic=", "", pagesVisited$evar3)
pagesVisited3 <- pagesVisited[, 3]
pageviewsMMGClick <- numeric(nrow(URLSMMG))
## only touch vector inside loop
for (i in 1:nrow(URLSMMG)) {
  cond1 <- pagesVisited11 == URLWithoutParameters[i]
  cond2 <- grepl(Code[i], evar3)
  pageviewsMMGClick[i] <- sum(pagesVisited3[cond1 & cond2])
}
## append new column to URLSMMG in the end
URLSMMG$pageviewsMMGClick <- pageviewsMMGClick
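Comments:
I removed == TRUE and which, since they are not necessary;
I also dropped paste0 from the loop; instead, I strip "ic=" from evar3 once, outside the loop. This way you avoid the expensive paste0 call on every iteration. Note that this matches Code[i] anywhere in evar3 rather than only directly after "ic=", so results can differ if a code also occurs elsewhere in the string.
Answer 1 (score: 1)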
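Here are some variables, mostly for clarity, but in the case of pv_code also to hoist the sub() call out of the iteration so that it is performed once rather than hundreds of times.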
pv_url <- pagesVisited[, 11]
pv_code <- sub("ic=", "", pagesVisited$evar3)
pv_click <- pagesVisited[, 3]
Each page visited belongs to a group
grp <- match(pv_url, URLSMMG$URLWithoutParameters)
We make this a factor, with every URLWithoutParameters included as a level. This makes the code robust to URLs that do not appear in pv_url
grp <- factor(grp, levels=seq_len(nrow(URLSMMG)))
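To see what the explicit levels buy, compare split() on a plain grouping vector with split() on a factor that lists every group. A small sketch with made-up values (clicks and grp_raw are illustrative names only):

```r
clicks  <- c(5, 6, 8)
grp_raw <- c(1, 1, 3)   # group 2 never occurs

# Plain vector: only the groups that actually occur get an entry
sums_raw <- sapply(split(clicks, grp_raw), sum)

# Factor with all levels: group 2 is kept as an empty group, and sum() of
# an empty numeric vector is 0, so the result has one entry per level
grp_all  <- factor(grp_raw, levels = 1:3)
sums_all <- sapply(split(clicks, grp_all), sum)

print(sums_raw)
print(sums_all)
```

With the full set of levels, the result always has one element per row of the lookup table, so it can be assigned straight into a new column.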
We are only interested in some of the rows
keep <- pv_code == URLSMMG$Code[grp]
We now want to filter pv_click and sum it by group
URLSMMG$pageviewsMMGClick <-
    sapply(split(pv_click[keep], grp[keep]), sum)
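(The corresponding line in the original code, URLSMMG$pageviewsMMGClick[i] <- ..., copies the whole data frame every time a row element is updated, which is very inefficient. It is better to pre-allocate a temporary variable, click = integer(nrow(URLSMMG)), fill it inside the loop with click[i] <- ..., and update URLSMMG once at the end; or simply use sapply() and not worry about pre-allocating and filling.)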
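The two alternatives from the note above might look roughly like this. It is a sketch on hypothetical toy data (df and visits are invented for illustration), and vapply() is swapped in for sapply() as a type-checked variant, which is an assumption of this sketch rather than something the answer prescribes.

```r
# Hypothetical toy data
df     <- data.frame(url = c("A", "B", "C"), stringsAsFactors = FALSE)
visits <- data.frame(url = c("A", "A", "C"), clicks = c(5, 6, 8),
                     stringsAsFactors = FALSE)

# Option 1: pre-allocate a plain vector, fill it in the loop,
# and touch the data frame only once at the end
click <- numeric(nrow(df))
for (i in seq_len(nrow(df))) {
  click[i] <- sum(visits$clicks[visits$url == df$url[i]])
}
df$views_loop <- click

# Option 2: let vapply() build the result vector, one number per URL
df$views_apply <- vapply(df$url,
                         function(u) sum(visits$clicks[visits$url == u]),
                         numeric(1), USE.NAMES = FALSE)
print(df)
```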
As a function, we have
fun <- function(url, url_code, pv_url, pv_code, pv_click) {
    stopifnot(!any(duplicated(url)))
    grp <- factor(match(pv_url, url), levels=seq_along(url))
    keep <- pv_code == url_code[grp]
    unname(sapply(split(pv_click[keep], grp[keep]), sum))
}
Here is a short test of correctness
url <- c("A", "B", "C")
url_code <- c( 1, 1, 1)
pv_url <- c("A", "A", "A", "C")
pv_code <- c( 1, 1, 2, 1)
pv_click <- c( 5, 6, 7, 8)
with output
> fun(url, url_code, pv_url, pv_code, pv_click)
[1] 11 0 8
For performance, here is some data of the same size as in the original question
url <- as.character(1:374)
url_code <- sample(3, 374, TRUE)
pv_url <- sample(url, 99120, TRUE)
pv_code <- sample(url_code, 99120, TRUE)
pv_click <- rep(1, 99120)
and the timing
> system.time(xx <- fun(url, url_code, pv_url, pv_code, pv_click))
user system elapsed
0.036 0.000 0.035
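Compared with the original version, this looks like a speed-up on the order of 10,000x (about 4 minutes versus 0.035 seconds).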
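Before relying on the timing, it is worth confirming that the vectorised function and a plain loop agree. The sketch below repeats fun() so that the block runs on its own; naive() is a hypothetical reference implementation that uses exact code matching (as fun() does) rather than grepl().

```r
fun <- function(url, url_code, pv_url, pv_code, pv_click) {
  stopifnot(!any(duplicated(url)))
  grp  <- factor(match(pv_url, url), levels = seq_along(url))
  keep <- pv_code == url_code[grp]
  unname(sapply(split(pv_click[keep], grp[keep]), sum))
}

# Hypothetical reference implementation: one pass over the visits per URL
naive <- function(url, url_code, pv_url, pv_code, pv_click) {
  vapply(seq_along(url), function(i) {
    sum(pv_click[pv_url == url[i] & pv_code == url_code[i]])
  }, numeric(1))
}

# Small random data set, same shape as the performance test
set.seed(42)
url      <- as.character(1:20)
url_code <- sample(3, 20, TRUE)
pv_url   <- sample(url, 500, TRUE)
pv_code  <- sample(3, 500, TRUE)
pv_click <- rep(1, 500)

agree <- all.equal(fun(url, url_code, pv_url, pv_code, pv_click),
                   naive(url, url_code, pv_url, pv_code, pv_click))
print(agree)
```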
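Answer 2 (score: 1)
Here is an approach based on data manipulation operations rather than loops. When working with large data, the data.table package provides a significant speed-up.
Note: in the example code, I assume that columns 3 and 11 of pagesVisited are named clicks and url, respectively.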
library(data.table)
library(stringi)
library(dplyr)
# use data.table for speed
dt1 <- data.table(URLSMMG, key = "URLWithoutParameters")
dt2 <- data.table(pagesVisited, key = "url")
# generate the values used for the grepl-equivalent stri_detect_fixed
dt1[, ic_code := paste0("ic=", Code)]
viewsums <- dt2[dt1                                     # join the page data to the matching urls
              ][stri_detect_fixed(evar3, ic_code),      # keep rows where ic_code is found in evar3
                list(views = sum(clicks)), by = "url"]  # sum the clicks for each url
# join the summed views to the url data
URLSMMG <- left_join(URLSMMG, viewsums, by = c("URLWithoutParameters" = "url")) %>%
  mutate(views = ifelse(is.na(views), 0, views))
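Using the same test data as Martin Morgan, here is how this approach performs. I included two different scenarios: one with a grepl-like search of evar3, and one where that search is not needed.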
# preparing the testing data (succinctly written by Martin Morgan)
urls <- as.character(1:374)
url_code <- sample(1:3, 374, TRUE)
pv_url <- sample(urls, 99120, TRUE)
pv_code <- sample(url_code, 99120, TRUE)
pv_click <- rep(1, 99120)
# and the corresponding data.frames
URLSMGG <- data.frame(URLWithoutParameters = urls, ic_code = url_code)
pagesVisited <- data.frame(url = pv_url, evar3 = pv_code, clicks = pv_click)
The first implementation, which performs the string search:
f1 <- function()
{
  # use data.table for speed
  dt1 <- data.table(URLSMGG, key = "URLWithoutParameters")
  dt2 <- data.table(pagesVisited, key = "url")
  viewsums <- dt2[dt1                                     # join the page data to the matching urls
                ][stri_detect_fixed(evar3, ic_code),      # keep rows where ic_code is found in evar3
                  list(views = sum(clicks)), by = "url"]  # sum the clicks for each url
  # join the summed views to the url data
  left_join(URLSMGG, viewsums, by = c("URLWithoutParameters" = "url")) %>%
    mutate(views = ifelse(is.na(views), 0, views))
}
The second case, where we can join directly on url and code:
f2 <- function()
{
  # use data.table for speed
  dt1 <- data.table(URLSMGG, key = c("URLWithoutParameters", "ic_code"))
  dt2 <- data.table(pagesVisited, key = c("url", "evar3"))
  # join the page data, matching urls and codes, and then sum clicks by url
  viewsums <- dt2[dt1, list(views = sum(clicks)), by = "url"]
  # join the summed views to the url data
  left_join(URLSMGG, viewsums, by = c("URLWithoutParameters" = "url")) %>%
    mutate(views = ifelse(is.na(views), 0, views))
}
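The direct join used in f2 can also be expressed with data.table alone, without the dplyr step. This is a hedged sketch rather than a drop-in replacement: the tables urls and visits and their column names are hypothetical, and it assumes a data.table version that supports the on= argument.

```r
library(data.table)

# Hypothetical toy tables; column names are chosen for this sketch only
urls   <- data.table(url = c("A", "B", "C"), code = c(1, 1, 1))
visits <- data.table(url = c("A", "A", "A", "C"),
                     code = c(1, 1, 2, 1),
                     clicks = c(5, 6, 7, 8))

# For each row of urls, sum the clicks of the visits that match on url and code
viewsums <- visits[urls, on = .(url, code),
                   .(views = sum(clicks)), by = .EACHI]

# URLs without any matching visit come back as NA; report them as 0
viewsums[is.na(views), views := 0]
print(viewsums)
```

Using by = .EACHI keeps one result row per row of urls, in the same order, which avoids the separate left join back onto the URL table.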
And finally the performance:
library(microbenchmark)
microbenchmark(f1(), f2())
# Unit: milliseconds
# expr min lq mean median uq max neval
# f1() 61.148200 62.919882 64.68540 64.396362 66.160684 70.65989 100
# f2() 7.532806 7.784006 10.40422 7.979846 8.579847 175.83275 100
(These timings were taken on an Intel Core i5-4460; results on other machines may differ.)