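I want to create a distance matrix between companies based on their geographic location.
I have a square distance matrix containing the distances between 98 Italian provinces. I also have a data frame with two columns: one holds the ID numbers of 8376 companies, and the other shows which of the 98 provinces each company is located in.
I would like to create an 8376 by 8376 distance matrix containing the distances between all companies. The code I wrote (below) is very inefficient. Is there a faster way to do this? I ask because I need to do it for several datasets.
This is what the data frame looks like: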
cid province
1 61 TO
2 102 TO
3 123 AT
4 127 TO
5 158 TO
6 225 NO
7 232 TO
8 388 TO
This is what the square distance matrix looks like:
CH AQ PE TE
1 0 64.39 41.74 81.18
2 64.39 0 40.38 61.05
3 41.74 40.38 0 40.79
4 81.18 61.05 40.79 0
outcome = matrix(NA,8376,8376) # empty matrix
for(i in 1:8376){
for(j in (i+1):8376){
x=which(dist.codes[,1]==companyID_Province[i,2]) # Find the row index in the distance matrix
y=which(dist.codes[1,]==companyID_Province[j,2]) # Find the column index in the distance matrix
outcome[i,j] = dist.codes[x,y] # Specify the distance to the corresponding element in outcome matrix
}
}
Answer 0 (score: 1)
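If dist.codes is the distance matrix of the provinces and province[i] is the province of the company with ID i, then dist.codes[province,province] is the distance matrix of the companies.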
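The indexing idea above can be tried on a toy case. This is only a sketch: D holds the four provinces shown in the question, with the province codes assumed to be stored as row and column names, and the five companies in province are made up for illustration. It also assumes province is a character vector; a factor would be treated as integer positions when used as an index, so it would need as.character() first.

```r
# Province distance matrix for the four provinces shown in the question.
# Assumption: the province codes are stored as dimnames.
provinces <- c("CH", "AQ", "PE", "TE")
D <- matrix(c( 0.00, 64.39, 41.74, 81.18,
              64.39,  0.00, 40.38, 61.05,
              41.74, 40.38,  0.00, 40.79,
              81.18, 61.05, 40.79,  0.00),
            nrow = 4, byrow = TRUE,
            dimnames = list(provinces, provinces))

# Hypothetical companies: company k is located in province[k].
province <- c("PE", "CH", "PE", "TE", "AQ")

# Index rows and columns with the same vector of province codes.
# Element [i, j] is the distance between the provinces of companies i and j.
company.dist <- D[province, province]
```

Companies in the same province (here companies 1 and 3) get distance 0, and the result is symmetric whenever D is.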
If company is a data frame with the company IDs in company$ID and the provinces in company$province, then company$province[order(company$ID)] is the vector province from above, sorted by company ID.
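A small sketch of the ordering step, under stated assumptions: the data frame company uses a few IDs and provinces from the question in shuffled order, the distances in D are invented for illustration, and labelling the result with the company IDs is an optional extra that is not part of the proposal above.

```r
# Hypothetical data frame in the layout described above (IDs deliberately unsorted).
company <- data.frame(ID = c(123, 61, 225, 102),
                      province = c("AT", "TO", "NO", "TO"),
                      stringsAsFactors = FALSE)

# Made-up distances between the three provinces, for illustration only.
codes <- c("TO", "AT", "NO")
D <- matrix(c( 0, 50, 90,
              50,  0, 70,
              90, 70,  0),
            nrow = 3, byrow = TRUE,
            dimnames = list(codes, codes))

# Sort the provinces by company ID, then index the province matrix with them.
ord <- order(company$ID)
province <- company$province[ord]
company.dist <- D[province, province]

# Optional: label rows and columns with the sorted company IDs.
ids <- as.character(company$ID[ord])
dimnames(company.dist) <- list(ids, ids)
```

With the labels in place, a single distance can be looked up by ID, for example company.dist["61", "123"].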
I compared your code with my proposal:
SpeedComparison <- function(N,M)
{
set.seed(1)
dist.codes <- matrix(sample(1:1000,N*N,rep=TRUE),N,N) / 100
dist.codes <- dist.codes * t(dist.codes)
diag(dist.codes) <- 0
dist.codes <- cbind(0:N,rbind(1:N,dist.codes)) # Add an additional row and an additional column with province numbers.
companyID_Province <- data.frame( ID = 1:M, province = sample(1:N,M,replace=TRUE) )
#---------------------------------------------------------------------
tm.1 <- 0.01 * system.time(
for ( k in 1:100) # Repeat 100 times to get a stable timing; k avoids reusing the inner loop index i.
{
outcome.1 = matrix(0,M,M) # empty matrix
for(i in 1:(M-1)){
x=which(dist.codes[,1]==companyID_Province[i,2]) # Find the row index in the distance matrix
for(j in (i+1):M){
y=which(dist.codes[1,]==companyID_Province[j,2]) # Find the column index in the distance matrix
outcome.1[i,j] = dist.codes[x,y] # Specify the distance to the corresponding element in outcome matrix
}
}
}
)
tm.2 <- 0.01 * system.time(
for ( i in 1:100)
{
D <- dist.codes[-1,][,-1] # The additional row/column is not used here.
outcome.2 <- D[companyID_Province[,2],companyID_Province[,2]]
}
)
list( outcome = list( outcome.1+t(outcome.1), outcome.2 ),
time = list( tm.1, tm.2 ) )
}
#======================================================================
N <- 50
Comparison <- as.data.frame(matrix(NA,0,4))
for ( M in c(100,150,200,250,300) )
{
Test <- SpeedComparison(N,M)
Comparison <- rbind( Comparison,
c( M,
Test$time[[1]][3],
Test$time[[2]][3],
identical(Test$outcome[[1]],Test$outcome[[2]])))
}
names(Comparison) <- c("M","time.1","time.2","outcomes.identical")
The outcomes are identical ("1" stands for TRUE), but the times are not:
> Comparison
M time.1 time.2 outcomes.identical
1 100 0.2568 2e-04 1
2 150 0.5661 5e-04 1
3 200 1.1845 7e-04 1
4 250 1.9568 1e-03 1
5 300 2.8602 4e-03 1
>