Update (again): interim results, see below!
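I have a dataset of people (to put it simply), and every combination (pair) has a value associated with it (say, the number of years they have known each other). For example, Samwell has known Jon for 2 years and Bran for 0 years, and so on: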
        Jon Samwell Bran
Jon      NA       2   10
Samwell   2      NA    0
Bran     10       0   NA
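The real data is roughly a 1000x1000 matrix that has to be updated on every iteration of the simulation it is used in. A run may have 100,000 iterations, and there are hundreds of runs, so speed matters.
I need to look these values up, and the lookup can come from either member of the pair (a non-computational simulation would ask, "Hey Jon, how long have you known Bran?"). I also need to update the values more often than not. I could of course use a matrix like this: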
# R code
dat = matrix(c(NA,2,10,2,NA,0,10,0,NA), ncol=3)
row.names(dat) = c("Jon", "Samwell", "Bran")
colnames(dat) = c("Jon", "Samwell", "Bran")
# Jon -> Bran?
dat["Jon", "Bran"]
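This works, but every time I update the information I need to update two values in the matrix, ["Jon", "Bran"] and ["Bran", "Jon"]. Besides, it is the same value, so keeping it in two places seems redundant. I could use only one triangle of the matrix (e.g., the lower one), but then the lookup would have to be ordered somehow, because: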
        Jon Samwell Bran
Jon      NA      NA   NA
Samwell   2      NA   NA
Bran     10       0   NA
dat["Jon", "Bran"] # returns NA...
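I could number the people (or sort them alphabetically) and then do something like
dat[max(1,3), min(1,3)] # correctly returns 10
...but this also feels roundabout, and computing the min and max every time might be slow (remember, roughly 100 x 100k times). Another option is a list holding triplets of the data (dat2 = list(c("Jon", "Bran", 10), c("Jon", "Samwell", 2))), looked up with grep or something similar, but that does not seem like a viable idea either.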
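A related option that stores each value only once is a packed vector holding just the lower triangle, indexed the way the documentation of stats::dist describes. The following is only a sketch under assumptions: the names people, vals and pair_index are made up for illustration, the set of people is assumed to be fixed, and a pair of a person with themselves is not supported.

```r
# Sketch: store each unordered pair once in a packed vector
# (same layout as the lower triangle used by stats::dist).
people <- c("Jon", "Samwell", "Bran")     # assumed fixed ordering of people
n <- length(people)
vals <- rep(NA_real_, n * (n - 1) / 2)    # one slot per unordered pair

# Hypothetical helper: position of the pair (a, b) in the packed vector.
# Assumes a != b; the formula is not valid on the diagonal.
pair_index <- function(a, b) {
  i <- match(a, people)
  j <- match(b, people)
  hi <- pmax(i, j)
  lo <- pmin(i, j)
  n * (lo - 1) - lo * (lo - 1) / 2 + hi - lo
}

vals[pair_index("Jon", "Samwell")] <- 2
vals[pair_index("Jon", "Bran")] <- 10
vals[pair_index("Samwell", "Bran")] <- 0

vals[pair_index("Bran", "Jon")]           # same slot as ("Jon", "Bran")
```

This still needs the pmin/pmax step, so it mainly saves memory; whether it is any faster than the matrix would have to be measured.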
What I need is a structure that would let me call:
# pseudo R code
dataset[ "Jon", "Bran" ] # returns 10
dataset[ "Bran", "Jon" ] # returns the same, 10
dataset[ "Bran", "Jon" ] = 11 # changes value
dataset[ "Jon", "Bran" ] # now returns 11
What would be a good data structure (in R) for representing such relations, so that I can quickly look up the value for any pair and modify it when needed?
### Lower-triangle matrix approach
# simulate some data - using numbers instead of names for simplicity:
dataset = matrix(1:1e+06, 1000,1000)
dataset[upper.tri(dataset)] = NA
## My initial suggestion with min, max
f0 = function(r, c){
  dataset[max(r, c), min(r, c)]
}
## Approach with ifelse, is.na
f1 <- function(rowname, colname) {
  ifelse(is.na(dataset[rowname,colname]), dataset[colname, rowname],
         dataset[rowname, colname])
}
## Approach with flipping the triangles
f1.1 <- function(m) {
  m[upper.tri(m)] <- t(m)[upper.tri(m)]
  return(m)
}
### Named vector approach
# simulate some data first:
names=character(); length(names)=1e+06
dat = 1:1e+06
ix = 1
for(i in 1:1000){
  for(j in 1:1000){
    names[ix] = paste(i,j,sep=":")
    ix = ix+1
  }
}
names(dat) = names
# The Indexing Function
f2<-function(rowname,colname){
  ifelse(rowname>colname,
         paste(colname,rowname,sep=":"),
         paste(rowname,colname,sep=":"))
}
### Testing time! ###
r = sample(1:100,100)
c = sample(1:100,100)
library(microbenchmark)
microbenchmark(
  f0 = for(i in 1:100){f0(r[i],c[i])}, # min, max
  f1 = for(i in 1:100){f1(r[i],c[i])}, # ifelse, is.na
  ft = for(i in 1:100){dataset = f1.1(dataset); dataset[r[i], c[i]]}, # triangles
  f2 = for(i in 1:100){dat[f2(r[i],c[i])]}, # named vector
  times=10 # only 10, but otherwise took too long
)
Results:
Unit: microseconds
 expr          min           lq        mean       median           uq          max
   f0      225.355      229.818    263.1525      253.246      268.641      386.897
   f1      365.923      377.971     437.055      414.341      451.156      677.851
   ft     15252785     15316552    15566444     15448666     15589053     16657835
   f2 24486482.495 25188285.201    26106770 25428894.256 26610528.584 31256624.432
Update 29.03, using @alexis_laz's suggestion of writing to both locations
# Suggestion to write to two locations in the matrix, so it could be accessed either way
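f2x = function(r, c){
  # note: these assignments modify a copy of dataset local to the function;
  # the global dataset is unchanged (a global update would need <<-)
  dataset[r,c] = 100
  dataset[c,r] = 100
}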
# Testing, this time a writing operation is included for comparison
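# note: dataset[r,c] below indexes with the full vectors r and c (a 100x100 block), not with r[i], c[i]
microbenchmark(
  f0 = for(i in 1:100){dataset[r,c] = 100; f0(r[i],c[i])}, # original min, max
  f2x = for(i in 1:100){f2x(r[i],c[i]) ; dataset[r,c]}, # write to two slots
  times=10, unit = "us")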
Unit: microseconds
 expr       min        lq      mean    median        uq        max
   f0  25843.98  25922.07  30032.01  26515.36  33039.72   49811.93
  f2x 590426.03 680729.11 778541.39 706079.99 785109.53 1178319.16
# nope, this does not improve over the writing/accessing the lower triangle
...so my naive approach, a lower-triangle matrix with min-max indexing, still wins. Is this really the best answer?
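If several pairs have to be read or written per iteration, the min-max idea can also be applied to whole vectors at once by indexing the matrix with a two-column (row, col) matrix. This is a minimal sketch on toy data, not something benchmarked in this thread; the names m, rr, cc, idx and looked_up are made up for illustration.

```r
# Sketch: vectorised min-max lookup in a lower-triangle matrix.
n <- 5
m <- matrix(NA_real_, n, n)
m[lower.tri(m)] <- seq_len(n * (n - 1) / 2)   # toy values below the diagonal

rr <- c(1, 4, 2)                        # first member of each pair
cc <- c(3, 2, 5)                        # second member of each pair
idx <- cbind(pmax(rr, cc), pmin(rr, cc))  # (row, col), always below the diagonal

looked_up <- m[idx]                     # one value per pair, in a single call
m[idx] <- 0                             # the same index matrix works for assignment
```

pmax and pmin are the element-wise counterparts of max and min, so the loop over pairs disappears; for a single pair this should behave like f0 above.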
Answer 0: (score: 0)
For this approach, see Copy upper triangle to lower triangle for several matrices in a list. If you only want to set the lower triangle, you can use a function that copies those values to the upper triangle.
dat = matrix(c(NA,2,10,NA,NA,0,NA,NA,NA),3,3)
dat
#      [,1] [,2] [,3]
# [1,]   NA   NA   NA
# [2,]    2   NA   NA
# [3,]   10    0   NA
f <- function(m) {
  m[upper.tri(m)] <- t(m)[upper.tri(m)]
  m
}
f(dat)
#      [,1] [,2] [,3]
# [1,]   NA    2   10
# [2,]    2   NA    0
# [3,]   10    0   NA
Another idea is
f <- function(rowname, colname) {
  ifelse(is.na(dataset[rowname,colname]), dataset[colname, rowname],
         dataset[rowname, colname])
}
f("Jon", "Bran") # same as f("Bran", "Jon")
This may be the best approach, but it assumes that if dataset["Jon", "Bran"] is NA, then dataset["Bran", "Jon"] is not.
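One way to avoid relying on that assumption is to normalise the order of the two indices on both reads and writes, so that only the lower triangle is ever touched. The sketch below wraps this in S3 methods for `[` and `[<-`; the class name sympair and the helper pair_pos are hypothetical, and the set of people is assumed to be fixed.

```r
# Hypothetical class "sympair": a lower-triangle matrix with
# order-insensitive indexing by name.
sympair <- function(people) {
  n <- length(people)
  m <- matrix(NA_real_, n, n, dimnames = list(people, people))
  structure(list(m = m, people = people), class = "sympair")
}

# Map a pair of names to a (row, col) position in the lower triangle.
pair_pos <- function(x, a, b) {
  i <- match(a, x$people)
  j <- match(b, x$people)
  cbind(pmax(i, j), pmin(i, j))
}

`[.sympair` <- function(x, a, b) x$m[pair_pos(x, a, b)]

`[<-.sympair` <- function(x, a, b, value) {
  x$m[pair_pos(x, a, b)] <- value
  x
}

dataset <- sympair(c("Jon", "Samwell", "Bran"))
dataset["Jon", "Bran"] <- 10
dataset["Bran", "Jon"]          # reads the same cell as ["Jon", "Bran"]
dataset["Bran", "Jon"] <- 11
dataset["Jon", "Bran"]          # sees the updated value
```

This gives the calling syntax from the question, but the method dispatch and the possible copying of the matrix on assignment add overhead, so it is probably slower than indexing the plain matrix directly and should be benchmarked before use in a tight loop.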
Answer 1: (score: 0)
Instead of a matrix, you could store the values in a named numeric vector:
dat<-c(2,10,0)
names(dat)<-c("Jon:Samwell","Bran:Jon","Bran:Samwell")
> dat
 Jon:Samwell     Bran:Jon Bran:Samwell
           2           10            0
### The Indexing Function
f<-function(rowname,colname){
  ifelse(rowname>colname,
         paste(colname,rowname,sep=":"),
         paste(rowname,colname,sep=":"))
}
Test:
> dat[f("Jon","Bran")]
Bran:Jon
      10
> dat[f("Bran","Jon")]
Bran:Jon
      10
> dat[f("Bran","Jon")] = 11
> dat[f("Jon","Bran")]
Bran:Jon
      11
## Also accepts vectors of names:
> dat[f(c("Jon","Samwell"),c("Bran","Jon"))]
   Bran:Jon Jon:Samwell
         11           2
Checking speed:
> set.seed(1)
> a<-sample(c("Jon","Samwell","Bran"),1000000,replace=T)
> b<-sample(c("Jon","Samwell","Bran"),1000000,replace=T)
> system.time(out<-dat[f(a,b)])
   user  system elapsed
  1.212   0.028   1.244
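For one-at-a-time lookups and updates inside a loop, the same key idea could also be tried with an environment, which R can use as a hash table and which is modified in place rather than copied. This is only a sketch and has not been timed here; the names pairs and pair_key are made up for illustration, and pair_key handles a single pair per call.

```r
# Sketch: an environment used as a hash table keyed by the ordered pair of names.
pairs <- new.env(hash = TRUE)

# Hypothetical helper: builds the same key regardless of argument order
# (scalar inputs only).
pair_key <- function(a, b) {
  if (a > b) paste(b, a, sep = ":") else paste(a, b, sep = ":")
}

assign(pair_key("Jon", "Samwell"), 2, envir = pairs)
assign(pair_key("Jon", "Bran"), 10, envir = pairs)

get(pair_key("Bran", "Jon"), envir = pairs)          # same entry as ("Jon", "Bran")
assign(pair_key("Bran", "Jon"), 11, envir = pairs)   # update in place
get(pair_key("Jon", "Bran"), envir = pairs)          # sees the updated value
```

The string pasting still costs time on every call, so this may well lose to integer indexing into a matrix; it is mainly useful when the set of people is not known in advance.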