Question

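I'm trying to run the following simple query on a dataset with 20 million rows and 10 columns, but it takes a very long time (about 30 minutes) to compute the final output. Is there a better way to do this?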

(t<-Sys.time())

rd_1 <- as.data.frame(rd_1 %>%
  group_by(customer, location_name, Location_Date, Location_Hour) %>%
  filter(created_time == max(created_time)) %>%
  ungroup())

(t<-Sys.time())

Below are the timestamps printed when running the script.

[1] "2018-12-19 09:15:47 GMT"

> rd_1<-as.data.frame(rd_1 %>%
+ group_by(customer,location_name,Location_Date,Location_Hour) %>%
+ filter(created_time==max(created_time))%>%
+ ungroup())

> (t<-Sys.time())

[1] "2018-12-19 09:45:25 GMT"
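
For reference, the gap between the two timestamps above can be computed directly rather than read off by eye. This is only a small sketch using base R's `difftime()`; the variable names `start`, `end` and `elapsed` are illustrative and not part of the original script.

```r
# Timestamps copied from the console output above
start <- as.POSIXct("2018-12-19 09:15:47", tz = "GMT")
end   <- as.POSIXct("2018-12-19 09:45:25", tz = "GMT")

# Elapsed time in minutes (29 minutes 38 seconds)
elapsed <- difftime(end, start, units = "mins")
print(elapsed)
```

In a script, wrapping the query in `system.time({ ... })` would report the elapsed time without having to print `Sys.time()` twice.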

Answer 1

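Try computing each group's maximum created_time once with summarise(), then joining the result back to the original data: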

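# One row per group holding that group's latest created_time
temp <- rd_1 %>%
  group_by(customer, location_name, Location_Date, Location_Hour) %>%
  summarise(created_time = max(created_time)) %>%
  ungroup()

# Keep only the rows of rd_1 that match a group's latest created_time.
# The join columns are spelled out; they are the columns the two tables share.
rd_1 <- rd_1 %>%
  inner_join(temp, by = c("customer", "location_name", "Location_Date",
                          "Location_Hour", "created_time")) %>%
  as.data.frame()

rm(temp)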
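
Before running this on the full 20-million-row table, it may be worth checking on a small sample that the summarise-and-join approach returns the same rows as the original grouped filter. The sketch below does that on made-up data: the `toy` data frame and its values are hypothetical, and only the column names come from the question.

```r
library(dplyr)

# Hypothetical toy data standing in for rd_1 (values are invented)
toy <- data.frame(
  customer      = c("a", "a", "a", "b", "b"),
  location_name = c("x", "x", "x", "y", "y"),
  Location_Date = as.Date("2018-12-19"),
  Location_Hour = c(9, 9, 10, 9, 9),
  created_time  = as.POSIXct("2018-12-19 09:00:00", tz = "GMT") +
    c(0, 60, 0, 30, 10),
  stringsAsFactors = FALSE
)

# Approach from the question: grouped filter on the maximum
via_filter <- toy %>%
  group_by(customer, location_name, Location_Date, Location_Hour) %>%
  filter(created_time == max(created_time)) %>%
  ungroup() %>%
  as.data.frame()

# Approach from the answer: summarise once per group, then join back
latest <- toy %>%
  group_by(customer, location_name, Location_Date, Location_Hour) %>%
  summarise(created_time = max(created_time)) %>%
  ungroup()

via_join <- toy %>%
  inner_join(latest, by = c("customer", "location_name", "Location_Date",
                            "Location_Hour", "created_time")) %>%
  as.data.frame()

# Both results should contain the same three rows (one per group)
print(via_filter)
print(via_join)
```

If several rows in a group share the maximum created_time, both approaches keep all of them, so the join does not change how ties are handled.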
