Question

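I need to count each customer's future visits within the next 7 days. I solved this with purrr::map2, but the performance is very slow. I think I must be missing something basic about how to use purrr. How can I speed this up? Thanks.

This toy example takes 2.3 seconds with 100 rows, but 3.3 minutes on my machine with 1000 rows. My actual data has 400K rows!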

library(tidyverse)
set.seed(123)
rows <- 1000
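df <- data.frame(cust_num = sample(c("123", "124", "128"), rows, replace = TRUE),
                 date = sample(seq(as.Date('2017/01/01'), as.Date('2017/01/31'), by = "day"), rows, replace = TRUE))

# For each row, compare against every row of df and count the same customer's
# visits strictly after `date` and before `date + 7`
df <- df %>%
  rowwise() %>%
  mutate(visits.next.7.days = map2_lgl(df$cust_num, df$date, ~ .x == cust_num & .y > date & .y < (date + 7)) %>% sum())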

Answer 1

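Here is a solution using the zoo package. The idea is to group the data by cust_num and date and count the rows in each group, then use lead to shift the counts by one and rollapply to sum the counts for the next six days (excluding the start date). Finally, use left_join to merge the result back into the original data frame. This should be much faster than the original approach. df3 is the final output.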

library(dplyr)
library(zoo)
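df2 <- df %>%
  count(cust_num, date) %>%     # visits per customer per day
  group_by(cust_num) %>%        # keep each window within one customer's rows
  mutate(n2 = lead(n)) %>%      # shift counts by one to exclude the start date
  # sum the next 6 rows; this assumes each customer has a row for every date
  mutate(visits.next.7.days = rollapply(n2, width = 6, FUN = sum, na.rm = TRUE, 
                                        align = "left", partial = TRUE)) %>%
  ungroup() %>%
  select(cust_num, date, visits.next.7.days)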


df3 <- df %>% left_join(df2, by = c("cust_num", "date"))

head(df3)
#   cust_num       date visits.next.7.days
# 1      123 2017-01-09                 70
# 2      128 2017-01-19                 54
# 3      124 2017-01-05                 58
# 4      128 2017-01-27                 37
# 5      128 2017-01-27                 37
# 6      123 2017-01-15                 68
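
To make the lead and rollapply step easier to follow, here is a minimal sketch on a toy vector. The vector n and the window width of 2 are made up for illustration and are not part of the answer above; the calls themselves are the same ones the answer uses.

```r
library(dplyr)
library(zoo)

n  <- c(1, 2, 3, 4)   # hypothetical daily counts for one customer
n2 <- lead(n)         # shift by one so the current day is excluded: 2 3 4 NA

# Left-aligned window of 2 rows; partial = TRUE lets the window shrink at the end,
# and na.rm = TRUE (passed on to sum) drops the trailing NA produced by lead()
next_two <- rollapply(n2, width = 2, FUN = sum, na.rm = TRUE,
                      align = "left", partial = TRUE)
next_two
# 5 7 4 0
```

Each element is the sum of the following two days' counts, with the last positions falling back to whatever rows remain.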

Answer 2

Here is an option that uses purrr::reduce to sum the list of vectors returned by data.table::shift (a vectorized version of lead/lag). pmap_int with sum would do the same as reduce with +, but is slightly slower. You could similarly use map(1:7, ~lead(n, .x, default = 0L)) instead of data.table::shift, but that takes more code and is slower.

library(tidyverse)
set.seed(123)
rows <- 1000
df = data.frame(cust_num = sample(c("123","124","128"), rows, replace = TRUE), 
                date = sample(seq(as.Date('2017/01/01'), as.Date('2017/01/31'), by = "day"), rows, replace = TRUE))

df2 <- df %>% 
  count(cust_num, date) %>% 
  group_by(cust_num) %>% 
  # add dates with no occurrences; none in sample data, but quite possible in real data
  complete(date = seq(min(date), max(date), by = 'day'), fill = list(n = 0L)) %>% 
  mutate(visits_next_7 = reduce(data.table::shift(n, 1:7, type = 'lead', fill = 0L), `+`)) %>% 
  right_join(df)

df2
#> # A tibble: 1,000 x 4
#> # Groups:   cust_num [?]
#>    cust_num       date     n visits_next_7
#>      <fctr>     <date> <int>         <int>
#>  1      123 2017-01-09    10            78
#>  2      128 2017-01-19    12            70
#>  3      124 2017-01-05    15            73
#>  4      128 2017-01-27    14            37
#>  5      128 2017-01-27    14            37
#>  6      123 2017-01-15    19            74
#>  7      124 2017-01-24    12            59
#>  8      128 2017-01-10    10            78
#>  9      124 2017-01-03    19            77
#> 10      124 2017-01-14     8            84
#> # ... with 990 more rows

This may not be the most efficient algorithm, because depending on how your data is spaced, complete could expand it considerably.

Also, for data of this size you may find data.table more practical, unless you would rather put the data in a database and access it with dplyr.
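
As a rough illustration of both points, the sketch below runs the same complete, shift and reduce steps on a tiny made-up table. The data frame daily, the column next2 and the two-day window are assumptions chosen so the result can be checked by hand.

```r
library(dplyr)
library(tidyr)

# Hypothetical per-day counts for one customer, with a gap on Jan 3 and Jan 4
daily <- tibble(
  cust_num = "123",
  date     = as.Date(c("2017-01-01", "2017-01-02", "2017-01-05")),
  n        = c(1L, 2L, 4L)
)

filled <- daily %>%
  group_by(cust_num) %>%
  # insert the missing dates with a count of zero (3 rows become 5)
  complete(date = seq(min(date), max(date), by = "day"), fill = list(n = 0L)) %>%
  # shift() returns one lead vector per offset; reduce() adds them elementwise
  mutate(next2 = purrr::reduce(data.table::shift(n, 1:2, type = "lead", fill = 0L), `+`)) %>%
  ungroup()

filled$n
# 1 2 0 0 4
filled$next2
# 2 0 4 4 0
```

The three input rows grow to five after complete, which is the expansion the answer warns about; with sparse dates over a long range the growth can be much larger.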
