Consider this simple example:
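library(lubridate)
library(dplyr)
library(purrr)           # provides map_dfr(), used below
library(microbenchmark)  # provides microbenchmark(), used below

# df1 keeps the timestamps as POSIXct date-times
df1 <- tibble(timestamp = c(ymd_hms('2019-01-01 10:00:00.123'),
                            ymd_hms('2019-01-01 10:00:00.123'),
                            ymd_hms('2019-01-01 10:00:00.123'),
                            ymd_hms('2019-01-01 10:00:00.123')))

# df2 holds the same timestamps converted to plain numerics
df2 <- tibble(timestamp = c(ymd_hms('2019-01-01 10:00:00.123'),
                            ymd_hms('2019-01-01 10:00:00.123'),
                            ymd_hms('2019-01-01 10:00:00.123'),
                            ymd_hms('2019-01-01 10:00:00.123'))) %>%
  mutate(timestamp = as.numeric(timestamp))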
As you can see, the only difference between df1 and df2 is the representation of the timestamp.
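To make that difference in representation concrete, here is a minimal base-R sketch. The names ts_posix and ts_num are made up for illustration and are not part of the example above; the sketch only shows what the two representations look like and does not measure anything.

```r
# Illustrative only: the same instant stored two ways.
ts_posix <- as.POSIXct("2019-01-01 10:00:00", tz = "UTC")  # date-time object
ts_num   <- as.numeric(ts_posix)                           # seconds since the epoch

class(ts_posix)  # "POSIXct" "POSIXt"
class(ts_num)    # "numeric"

# Subtracting date-times returns a difftime object,
# while subtracting numerics returns a bare number.
class(ts_posix - ts_posix)  # "difftime"
class(ts_num - ts_num)      # "numeric"
```

So the subtraction in the benchmark is operating on classed objects in df1 and on bare doubles in df2.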
Now look at the crazy difference in timings.
# First, let's make them bigger; 400k rows is enough.
df1 <- map_dfr(seq_len(100000), ~df1)
df2 <- map_dfr(seq_len(100000), ~df2)
Now a simple computation:
> microbenchmark(
+   df2 %>% mutate(diff = timestamp - min(timestamp)),
+   times = 1000)
Unit: milliseconds
                                              expr      min       lq     mean   median       uq     max neval
 df2 %>% mutate(diff = timestamp - min(timestamp)) 1.541533 2.182028 3.961685 2.327694 2.567314 290.823  1000
while
> microbenchmark(
+   df1 %>% mutate(diff = timestamp - min(timestamp)),
+   times = 1000)
Unit: milliseconds
                                              expr      min       lq    mean   median       uq      max neval
 df1 %>% mutate(diff = timestamp - min(timestamp)) 4.111016 8.182359 13.1351 8.513956 9.065631 378.1961  1000
Boom! More than 3x slower. Why is that? Thanks!