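In the latest release of tidyr, spread() and gather() are marked as lifecycle:retired. The replacements are pivot_wider() and pivot_longer().
My problem is that the new functions require more typing and also seem to run more slowly. I would like to know whether I am doing something wrong.
Example data: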
library(tidyverse)
dates <- seq(from = as.Date("1975-01-01"), to = as.Date("2019-10-31"), by = "months")
returndata <- tibble(stock = sort(rep(letters, length(dates))),
                     month = rep(dates, length(letters)),
                     ret = runif(length(dates) * length(letters)) - 0.5)
Previously, I spread the data as follows:
returndata_spread <- returndata %>%
  spread(stock, ret)
With pivot_wider, I would do it like this:
returndata_wider <- returndata %>%
  pivot_wider(names_from = stock, values_from = ret)
The result is exactly the same.
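One way to check that claim is to compare the two results directly. The following is only a minimal sketch on a small, made-up data set (`small` and its values are hypothetical, chosen to have the same shape as returndata), using base R's all.equal():

```r
library(tidyr)
library(tibble)

# Hypothetical miniature version of returndata (values are made up)
small <- tibble(
  stock = rep(c("a", "b"), each = 3),
  month = rep(as.Date(c("2019-01-01", "2019-02-01", "2019-03-01")), times = 2),
  ret   = c(0.1, -0.2, 0.3, 0.0, 0.25, -0.1)
)

# Reshape to wide format with the old and the new function
small_spread <- spread(small, stock, ret)
small_wider  <- pivot_wider(small, names_from = stock, values_from = ret)

# Compare column names and values of the two wide tables
all.equal(as.data.frame(small_spread), as.data.frame(small_wider))
```

The same comparison can be run on returndata_spread and returndata_wider from the question.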
Previously, to gather:
returndata_gather <- returndata_wider %>%
  gather(stock, ret, -month)
Now, with pivot_longer:
returndata_longer <- returndata_wider %>%
  pivot_longer(-month, names_to = "stock", values_to = "ret") %>%
  arrange(stock, month)
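The extra arrange() is there because the two functions return rows in a different order: gather() stacks the wide table column by column, while pivot_longer() goes through it row by row. A small sketch on a made-up two-stock table (`wide` and its values are hypothetical) shows the difference:

```r
library(tidyr)
library(dplyr)
library(tibble)

# Hypothetical wide table with two months and two stocks
wide <- tibble(
  month = as.Date(c("2019-01-01", "2019-02-01")),
  a = c(0.1, -0.2),
  b = c(0.3, 0.0)
)

# gather() stacks column by column: all of "a", then all of "b"
g <- gather(wide, stock, ret, -month)

# pivot_longer() works row by row: "a", "b" for the first month, then the second
l <- pivot_longer(wide, -month, names_to = "stock", values_to = "ret")

# Sorting by stock and month restores the ordering that gather() produced
l_sorted <- arrange(l, stock, month)

g$stock
l$stock
l_sorted$stock
```

If the row order does not matter for the downstream analysis, the arrange() step can be dropped.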
I measured the execution times and got the following:
> t_spread
Time difference of 0.01287794 secs
> t_wider
Time difference of 0.4083362 secs
> t_gather
Time difference of 0.002280474 secs
> t_longer
Time difference of 0.01168776 secs
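The new functions are much slower: in these timings pivot_wider() took roughly 30 times as long as spread(), and pivot_longer() roughly 5 times as long as gather().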
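The timing code itself is not shown above. The "Time difference of ... secs" output is what R prints for the difference of two Sys.time() calls, so the measurements were presumably taken along the following lines (this is an assumption, and `start` is a name introduced here for illustration):

```r
library(tidyr)
library(tibble)

# Same example data as in the question
dates <- seq(from = as.Date("1975-01-01"), to = as.Date("2019-10-31"), by = "months")
returndata <- tibble(stock = sort(rep(letters, length(dates))),
                     month = rep(dates, length(letters)),
                     ret = runif(length(dates) * length(letters)) - 0.5)

# Assumed timing pattern: wall-clock difference around a single call
start <- Sys.time()
returndata_spread <- spread(returndata, stock, ret)
t_spread <- Sys.time() - start

start <- Sys.time()
returndata_wider <- pivot_wider(returndata, names_from = stock, values_from = ret)
t_wider <- Sys.time() - start

t_spread
t_wider
```

A single wall-clock measurement like this is noisy, so repeated runs give a more reliable comparison.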
Answer 0 (score: 1)
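This appears to be another instance of this issue on GitHub, which should be fixed in the development version of tidyr. After updating tidyr (i.e. devtools::install_github("tidyverse/tidyr")), your example gives comparable performance: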
library(tidyverse)
dates <- seq(from = as.Date("1975-01-01"), to = as.Date("2019-10-31"), by = "months")
returndata <- tibble(stock = sort(rep(letters, length(dates))),
                     month = rep(dates, length(letters)),
                     ret = runif(length(dates) * length(letters)) - 0.5)
bench::mark(
  spread = returndata %>% spread(stock, ret),
  pivot_wider = returndata %>% pivot_wider(names_from = stock, values_from = ret)
)
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 spread 8.83ms 9.57ms 100. 0B 6.39
#> 2 pivot_wider 10.96ms 11.37ms 86.1 0B 4.42
Created on 2019-11-25 by the reprex package (v0.3.0)