Question

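Hoping someone has dealt with something similar before. Looking for some advice!

I have a dataset of 7 million customers and 900 locations, and I want to find the closest location to each customer.

In a SQL database this is easy, but because of the table sizes I'm finding it tricky in R!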

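My standard SQL logic is to cross join the two tables, compute the distance for every customer/store pair using abs() and Pythagoras, and then pick the closest store for each customer. Not the most efficient, but it works.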
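For illustration, here is roughly what that cross-join logic looks like in base R on a toy dataset. This is only a sketch: the column names (`x`, `y`, `sx`, `sy`) and the data are made up, and it treats the coordinates as planar, which is an assumption and not exact for longitude/latitude.

```r
# Toy data (hypothetical): planar coordinates for customers and stores
customers <- data.frame(cust_id = 1:3, x = c(0, 5, 9), y = c(0, 5, 9))
stores    <- data.frame(location_code = c("A", "B"), sx = c(1, 8), sy = c(1, 8))

# by = NULL makes merge() return the Cartesian product (a cross join)
pairs <- merge(customers, stores, by = NULL)

# Pythagoras on every customer/store pair
pairs$distance <- sqrt((pairs$x - pairs$sx)^2 + (pairs$y - pairs$sy)^2)

# Keep the smallest distance per customer: sort, then take the first row of each
pairs   <- pairs[order(pairs$cust_id, pairs$distance), ]
closest <- pairs[!duplicated(pairs$cust_id),
                 c("cust_id", "location_code", "distance")]
print(closest)
```

The intermediate `pairs` table has one row per customer/store pair, which is what makes this approach so memory-hungry at full scale (7 million x 900 rows).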

In R on my desktop (64-bit Windows, 8 GB RAM), however, the datasets are too big to cross join! I tried the loop below, which works...

library(data.table)
library(geosphere)  # provides distm()

now <- Sys.time()
customers_closest <- data.table()  # start empty; one row is appended per customer

for (i in 1:nrow(customers)) {
  between <- Sys.time()

  # distances from every store to customer i
  distance <- distm(x = stores[, .(longitude, latitude)],
                    y = customers[i, .(longitude, latitude)])

  # which.min() picks a single closest store, even when there are ties
  customers_closest_temp <- data.table(cust_id = customers[i]$cust_id,
                                       location_code = stores[which.min(distance)]$location_code,
                                       distance = min(distance))

  customers_closest <- rbind(customers_closest, customers_closest_temp)

  print(paste0("Iteration: ", i, ", ",
               round(difftime(Sys.time(), between, units = "secs"), 3),
               " seconds, duration: ",
               round(difftime(Sys.time(), now, units = "secs")), " seconds"))
}

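...but it takes 0.2 seconds per customer, which means it would take about 16 days to run (!). I'm guessing that's because it slowly appends one row at a time? I have also tried to restrict my cross join...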
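One way to avoid both the row-by-row `rbind()` and a full cross join might be to process customers in chunks, build a chunk-by-store distance matrix in one vectorised step, and write the results into preallocated vectors. The sketch below is only that, a sketch: `haversine_matrix`, `nearest_store` and `chunk_size` are hypothetical names, and it swaps `distm()` for a hand-written haversine formula (spherical Earth, radius 6371 km assumed) so that it runs in base R without any packages.

```r
# Haversine distance in metres between every row of a (n x 2) and b (m x 2).
# Columns are longitude, latitude in degrees. Returns an n x m matrix.
haversine_matrix <- function(a, b, r = 6371000) {
  to_rad <- pi / 180
  lon1 <- a[, 1] * to_rad; lat1 <- a[, 2] * to_rad
  lon2 <- b[, 1] * to_rad; lat2 <- b[, 2] * to_rad
  dlat <- outer(lat1, lat2, "-")
  dlon <- outer(lon1, lon2, "-")
  h <- sin(dlat / 2)^2 + outer(cos(lat1), cos(lat2)) * sin(dlon / 2)^2
  2 * r * asin(pmin(sqrt(h), 1))  # pmin guards against rounding above 1
}

# Closest store per customer, computed chunk by chunk to bound memory use.
nearest_store <- function(customers, stores, chunk_size = 20000L) {
  store_xy <- as.matrix(stores[, c("longitude", "latitude")])
  n <- nrow(customers)
  idx  <- integer(n)   # preallocated: index of the closest store
  dist <- numeric(n)   # preallocated: distance to that store
  for (start in seq(1L, n, by = chunk_size)) {
    rows <- start:min(start + chunk_size - 1L, n)
    cust_xy <- as.matrix(customers[rows, c("longitude", "latitude")])
    d <- haversine_matrix(cust_xy, store_xy)       # length(rows) x n_stores
    best <- max.col(-d, ties.method = "first")     # row-wise argmin
    idx[rows]  <- best
    dist[rows] <- d[cbind(seq_along(rows), best)]
  }
  data.frame(cust_id = customers$cust_id,
             location_code = stores$location_code[idx],
             distance = dist)
}

# Toy data (hypothetical) to show the call
stores <- data.frame(location_code = c("A", "B"),
                     longitude = c(0, 10), latitude = c(0, 10))
customers <- data.frame(cust_id = 1:3,
                        longitude = c(1, 9, 0), latitude = c(1, 9, 0))
result <- nearest_store(customers, stores, chunk_size = 2L)
print(result)
```

With 900 stores, a chunk of 20,000 customers gives an 18-million-cell matrix (about 144 MB of doubles, plus a few temporaries of the same size), so the chunk size would need tuning to fit in 8 GB of RAM.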

x <- unique(merge(customers, stores, by = "key", allow.cartesian = TRUE)[
  (sqrt((longitude.x - longitude.y)^2 + (latitude.x - latitude.y)^2) / 1000) <= 10])

...but no luck. (I think I'm doing something wrong.) Any help would be much appreciated :) Thanks!

R

0 answers: