R - Better performance for sums with multiple conditions

Date: 2016-06-18 21:59:52

Tags: r

I have two data frames:

  1. URLSMMG, with 374 observations
  2. pagesVisited, with 99120 observations

I use the following loop to sum all values in pagesVisited that satisfy two conditions, placing the result in a new column of URLSMMG:

    # Calculate pageviews from MMG
    for (i in 1:nrow(URLSMMG)) {
        URLSMMG$pageviewsMMGClick[i] <- sum(pagesVisited[
            which(pagesVisited[, 11] == URLSMMG$URLWithoutParameters[i] &
                  grepl(paste0("ic=", URLSMMG$Code[i]), pagesVisited$evar3) == TRUE), 3])
    }
    

Measuring the execution time shows the loop takes about 4 minutes to finish. I am happy with the result, since the output is as expected, but I am not sure I am using the fastest method for this calculation. Does anyone know another way to do this in less time?

3 answers:

Answer 0 (score: 1)

The following should be faster:

## temporary vectors
pagesVisited11 <- pagesVisited[, 11]
URLWithoutParameters <- URLSMMG$URLWithoutParameters
Code <- URLSMMG$Code
evar3 <- gsub("ic=", "", pagesVisited$evar3)
pagesVisited3 <- pagesVisited[, 3]
pageviewsMMGClick <- numeric(nrow(URLSMMG))

## only touch vector inside loop
for (i in 1:nrow(URLSMMG)) {
  cond1 <- pagesVisited11 == URLWithoutParameters[i]
  cond2 <- grepl(Code[i], evar3)
  pageviewsMMGClick[i] <- sum(pagesVisited3[cond1 & cond2])
}

## append new column to URLSMMG in the end
URLSMMG$pageviewsMMGClick <- pageviewsMMGClick

Comments:

  1. For memory efficiency, do not touch the data frame inside the loop. That is why I extract all relevant vectors before the loop and only work with vectors inside it;
  2. I removed == TRUE and which, since they are not necessary;
  3. I also removed paste0 from the loop; instead, I stripped "ic=" from evar3 outside the loop. This way you avoid an expensive paste0 call during each iteration.
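A minimal, self-contained sketch of the same hoisting pattern, using made-up toy data rather than the asker's actual frames (the column names and values here are hypothetical, and exact string equality stands in for the grepl test):

```r
## Toy stand-ins for the asker's data (names and values are made up)
pagesVisited <- data.frame(
  url   = c("a", "a", "b"),
  evar3 = c("ic=1", "ic=2", "ic=1"),
  views = c(10, 20, 30),
  stringsAsFactors = FALSE
)
URLSMMG <- data.frame(
  URLWithoutParameters = c("a", "b"),
  Code                 = c("1", "1"),
  stringsAsFactors = FALSE
)

## Hoist everything out of the loop: extract plain vectors once
url   <- pagesVisited$url
code  <- gsub("ic=", "", pagesVisited$evar3)  # strip the prefix once, not per iteration
views <- pagesVisited$views
out   <- numeric(nrow(URLSMMG))               # pre-allocated result

for (i in seq_len(nrow(URLSMMG))) {
  out[i] <- sum(views[url == URLSMMG$URLWithoutParameters[i] &
                      code == URLSMMG$Code[i]])
}

URLSMMG$pageviewsMMGClick <- out              # touch the data frame only once
print(out)  # 10 30
```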

Answer 1 (score: 1)

Below, some variables are created, mostly for clarity, but in the case of pv_code also to hoist the sub() call out of the iteration so that it executes once rather than hundreds of times.

pv_url <- pagesVisited[, 11]
pv_code <- sub("ic=", "", pagesVisited$evar3)
pv_click <- pagesVisited[, 3]

Each page visited belongs to a group

grp <- match(pv_url, URLSMMG$URLWithoutParameters)

We make this a factor, with all of URLWithoutParameters included as levels. This makes the code robust to URLs that do not appear in pv_url.

grp <- factor(grp, levels=seq_len(nrow(URLSMMG)))

We are only interested in some of the rows

keep <- pv_code == URLSMMG$Code[grp]

We now want to filter pv_click and sum by group

URLSMMG$pageviewsMMGClick <-
    sapply(split(pv_click[keep], grp[keep]), sum)

(The corresponding line in the original code, URLSMMG$pageviewsMMGClick[i] <- ..., copies the entire data frame each time a single element is updated, which is very inefficient; it would be better to pre-allocate a temporary variable click <- integer(nrow(URLSMMG)), fill it in during the loop with click[i] <- ..., and update URLSMMG once at the end, or to just use sapply() and not worry about pre-allocate-and-fill at all.)
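The pre-allocate-and-fill pattern described above can be sketched with trivial made-up data (the column names x and doubled are hypothetical, and the per-row computation is just a stand-in):

```r
df <- data.frame(x = 1:5)

## Pre-allocate a plain vector instead of writing into df inside the loop
click <- integer(nrow(df))
for (i in seq_len(nrow(df))) {
  click[i] <- df$x[i] * 2L  # stand-in for the real per-row computation
}

## Update the data frame once, at the end
df$doubled <- click
print(df$doubled)  # 2 4 6 8 10
```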

As a function, we have

fun <- function(url, url_code, pv_url, pv_code, pv_click) {
    stopifnot(!any(duplicated(url)))
    grp <- factor(match(pv_url, url), levels=seq_along(url))
    keep <- pv_code == url_code[grp]
    unname(sapply(split(pv_click[keep], grp[keep]), sum))
}

Here is a short test of correctness

url <-     c("A", "B", "C")
url_code <- c( 1,   1,   1)

pv_url <-   c("A", "A", "A", "C")
pv_code <-  c( 1,   1,   2,   1)
pv_click <- c( 5,   6,   7,   8)

with output

> fun(url, url_code, pv_url, pv_code, pv_click)
[1] 11  0  8

For performance, here is data of the same size as in the original question

url  <-     as.character(1:374)
url_code <- sample(3, 374, TRUE)

pv_url <-   sample(url, 99120, TRUE)
pv_code <-  sample(url_code, 99120, TRUE)
pv_click <- rep(1, 99120)

and the timing

> system.time(xx <- fun(url, url_code, pv_url, pv_code, pv_click))
   user  system elapsed 
  0.036   0.000   0.035 

Compared with the original, this appears to be roughly a 10,000-fold speed-up.

Answer 2 (score: 1)

Here is an approach based on data-manipulation operations rather than a loop. The data.table package offers significant speed-ups when working with large data.

Note: in the example code, I assume that columns 3 and 11 of pagesVisited are named clicks and url, respectively.

library(data.table)
library(stringi)
library(dplyr)

# use data.table for speed
dt1 <- data.table(URLSMGG, key = "URLWithoutParameters")
dt2 <- data.table(pagesVisited, key = "url")

# generate the values used for the grepl-equivalent stri_detect_fixed
dt1[, ic_code := paste0("ic=", Code)]

viewsums <- dt2[dt1  # join the page data to the matching urls
    ][stri_detect_fixed(evar3, ic_code),  # keep rows where ic_code is found in evar3
      list(views = sum(clicks)), by = "url"]  # sum the clicks for each url

# join the summed views to the url data
URLSMGG <- left_join(URLSMGG, viewsums, by = c("URLWithoutParameters" = "url")) %>%
    mutate(views = ifelse(is.na(views), 0, views))

Using the same test data as Martin Morgan, here is how this approach performs. I include two different scenarios: one with the grepl-like search in evar3, and one where that search is not needed.

# preparing the testing data (succinctly written by Martin Morgan)
urls <-     as.character(1:374)
url_code <- sample(1:3, 374, TRUE)

pv_url <-   sample(urls, 99120, TRUE)
pv_code <-  sample(url_code, 99120, TRUE)
pv_click <- rep(1, 99120)

# and the corresponding data.frames
URLSMGG <- data.frame(URLWithoutParameters = urls, ic_code = url_code)
pagesVisited <- data.frame(url = pv_url, evar3 = pv_code, clicks = pv_click)

The first implementation, which performs the string search:

f1 <- function()
{
    # use data.table for speed
    dt1 <- data.table(URLSMGG, key = "URLWithoutParameters")
    dt2 <- data.table(pagesVisited, key = "url")

    viewsums <- dt2[dt1  # join the page data to the matching urls
        ][stri_detect_fixed(evar3, ic_code),  # keep rows where ic_code is found in evar3
          list(views = sum(clicks)), by = "url"]  # sum the clicks for each url

    # join the summed views to the url data
    left_join(URLSMGG, viewsums, by = c("URLWithoutParameters" = "url")) %>%
        mutate(views = ifelse(is.na(views), 0, views))
}

The second case, where we can join directly on urls and codes:

f2 <- function()
{
    # use data.table for speed
    dt1 <- data.table(URLSMGG, key = c("URLWithoutParameters", "ic_code"))
    dt2 <- data.table(pagesVisited, key = c("url", "evar3"))

    # join the page data, matching urls and codes, and then sum clicks by url
    viewsums <- dt2[dt1, list(views = sum(clicks)), by = "url"]

    # join the summed views to the url data
    left_join(URLSMGG, viewsums, by = c("URLWithoutParameters" = "url")) %>%
        mutate(views = ifelse(is.na(views), 0, views))
}

And finally, the performance:

library(microbenchmark)
microbenchmark(f1(), f2())
#     Unit: milliseconds
#      expr       min        lq     mean    median        uq       max neval
#      f1() 61.148200 62.919882 64.68540 64.396362 66.160684  70.65989   100
#      f2()  7.532806  7.784006 10.40422  7.979846  8.579847 175.83275   100

(These timings were taken on an Intel Core i5-4460; results on other hardware may differ.)