Improving the efficiency of a lookup algorithm in R

Posted: 2017-05-19 14:58:06

Tags: r

I think this makes for an interesting exercise in optimizing a piece of R code.

I have a data frame df_red with details of orders from an online shop. For each product (ean), I want to find the 12 other products most likely to be in the same basket.

Here is sample code to generate such a dataset:

library(tidyverse)

# create a vector with 1400 products (characterized by their EANs)
eans <- sample(1e5:1e6, 1400, replace = FALSE)
# create a vector with 200k orders 
basket_nr <- 1:2e5

# a basket can have up to 4 items; 3 items is the most likely
n_prod_per_basket <- sample(x = 1:4, length(basket_nr), prob = c(0.2, 0.2, 0.5, 0.1), replace = TRUE)

# create df_red, each line of which corresponds to a product with its respective basket number
df <- data_frame(basket_nr, n_prod_per_basket)

df_red <- data_frame(basket_nr = rep(basket_nr, n_prod_per_basket))
df_red$ean <- sample(x = eans, nrow(df_red), replace = TRUE)
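
One small aside, not part of the task itself: the generator calls sample() without a seed, so every run produces a different dataset. Seeding the RNG before the sampling calls makes the data and the timings below reproducible:

set.seed(1) # arbitrary seed, purely for reproducibility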

The code I am using to accomplish this task is given below, but I don't believe it is efficient. How can I improve the speed of the program?

ean <- unique(df_red$ean)

out <- list()

for (i in 1:length(ean)){

  ean1 <- ean[i]
  # get all basket_nr that contain the ean in question
  basket_nr <- df_red[df_red$ean == ean1, ]$basket_nr

  # get products that were together in the same basket with the ean in question
  boo <- (df_red$ean != ean1) & (df_red$basket_nr %in% basket_nr)
  prod <- df_red[boo, ]

  # get top most frequent
  top12 <- prod %>% 
    group_by(ean) %>% 
    summarise(n = n()) %>% 
    arrange(desc(n)) %>% 
    filter(row_number() %in% 1:12)

  # skip products that weren't together in a basket with at least 12 different other products
  if(nrow(top12) == 12) out[[i]] <- data_frame(ean = ean1, recom = top12$ean, freq = top12$n)

  if(i %% 100 == 0) print(paste0(round(i/length(ean)*100, 2), '% is complete'))

}
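
The loop collects its results in a list. If a single data frame is wanted at the end, the list can be collapsed afterwards (a small addition, not shown in the code above):

# bind_rows() ignores the NULL entries left by skipped products
recommendations <- bind_rows(out)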

3 Answers:

Answer 0 (score: 2)

Performance improvement is, of course, a matter of degree, and how far to go before it is improved "enough" is hard to say. However, by functionalizing the code and cleaning up the subsetting logic, we can cut the runtime by about 25%. Starting from your code:

#added a timer
start.time <- Sys.time()
for (i in 1:length(ean)){

  ean1 <- ean[i]
  # get all basket_nr that contain the ean in question
  basket_nr <- df_red[df_red$ean == ean1, ]$basket_nr

  # get products that were together in the same basket with the ean in question
  boo <- (df_red$ean != ean1) & (df_red$basket_nr %in% basket_nr)
  prod <- df_red[boo, ]

  # get top most frequent
  top12 <- prod %>% 
    group_by(ean) %>% 
    summarise(n = n()) %>% 
    arrange(desc(n)) %>% 
    filter(row_number() %in% 1:12)

  # skip products that weren't together in a basket with at least 12 different other products
  if(nrow(top12) == 12) out[[i]] <- data_frame(ean = ean1, recom = top12$ean, freq = top12$n)

  if(i %% 100 == 0) print(paste0(round(i/length(ean)*100, 2), '% is complete'))

}
Sys.time() - start.time

This takes 30-34 seconds on my machine. But we can rewrite it as a function like this:

my.top12.func <- function(id, df_red) {
  #improved subsetting logic - using which is faster and we can remove some code by
  #removing the ean that is being iterated in the filter step below
  prod <- df_red[df_red$basket_nr %in% df_red$basket_nr[which(df_red$ean == id)], ]

  # set cutoff from 12 to 13 since the specific ean will always be one of the top 12
  top12 <- prod %>% 
    group_by(ean) %>% 
    summarise(n = n()) %>% 
    arrange(desc(n)) %>% 
    filter(row_number() %in% 1:13 & ean != id) #additional filter required

  # skip products that weren't together in a basket with at least 12 different other products
  if(nrow(top12) == 12) return(data_frame(ean = id, recom = top12$ean, freq = top12$n))
}

Now we can test this approach for speed and accuracy with:

start.time <- Sys.time()
my.out <- lapply(ean, my.top12.func, df_red = df_red)
Sys.time() - start.time

#test for equality
all.equal(out, my.out)

This takes about 24-26 seconds, an improvement of more than 25%.
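
Since each ean is processed independently, a further option would be to parallelize the lapply. The following is only a sketch I have not benchmarked, and it assumes there is enough memory for each worker's copy of df_red:

library(parallel)

cl <- makeCluster(detectCores() - 1)
clusterEvalQ(cl, library(dplyr))   # workers need dplyr for the pipeline inside the function
my.out <- parLapply(cl, ean, my.top12.func, df_red = df_red)
stopCluster(cl)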

Answer 1 (score: 2)

Using data.table, I produced the output in less than 7 seconds (which I think is about an 80% improvement).
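
A sketch of what such a data.table approach might look like (a self-join on basket_nr over the df_red from the question; this reconstruction is not the original answer's code):

library(data.table)

dt <- as.data.table(df_red)

# self-join on basket_nr to pair each product with the other products in the
# same basket; allow.cartesian is required for this many-to-many join
pairs <- dt[dt, on = "basket_nr", allow.cartesian = TRUE][ean != i.ean]

# count co-occurrences per product pair, then keep the 12 most frequent partners
counts <- pairs[, .N, by = .(ean, recom = i.ean)][order(ean, -N)]
top12  <- counts[, head(.SD, 12), by = ean]

# drop products that co-occurred with fewer than 12 distinct other products
res <- top12[, if (.N == 12) .SD, by = ean]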

Answer 2 (score: 0)

I would consider doing this without a loop.

df_red$k <- 1
# cross-join the dataset with itself, then keep only pairs of distinct
# products that appeared in the same basket
df_s     <- left_join(df_red, df_red, by = "k") %>%
            filter(ean.x != ean.y & basket_nr.x == basket_nr.y) %>%
            # count co-occurrences per product pair; summarise() leaves the
            # result grouped by ean.x, so the filter below works per product
            group_by(ean.x, ean.y) %>%
            summarise(n = n()) %>%
            arrange(desc(n)) %>%
            filter(row_number() %in% 1:12) 

# keep only products that co-occurred with at least 12 distinct other products
df_s.ct  <- df_s %>% filter(row_number() == 12)
df_s.fin <- df_s[df_s$ean.x %in% df_s.ct$ean.x, ]

The rate-limiting step here is the left_join, which merges the dataset with itself and creates a quadratically larger dataset (so if you have 50,000 rows, you end up creating a new dataset of 2.5 billion rows). This suggests that the best way to store and manipulate the data is with data.table, which will improve the speed of this process, especially in combination with dplyr.
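
One refinement worth noting (a sketch, not part of the code above): the quadratic blow-up can be avoided within dplyr itself by joining on basket_nr instead of the constant key k, so that only within-basket pairs are ever materialized:

# pairs rows only within the same basket, avoiding the full cross join
df_pairs <- inner_join(df_red, df_red, by = "basket_nr") %>%
            filter(ean.x != ean.y)

The counting and top-12 steps then proceed as above on df_pairs.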