A better structure for faster lookups of pairwise values (in R)

Time: 2016-03-25 15:50:19

Tags: r database matrix lookup

Updated (again) with intermediate results, see below!

Background

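I have a dataset of, put simply, people, and every combination (pair) of them has a value associated with it (say, the number of years they have known each other). For example, Samwell has known Jon for 2 years and Bran for 0 years, and so on: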

        Jon Samwell Bran
Jon      NA       2   10
Samwell   2      NA    0
Bran     10       0   NA

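The real data is roughly a 1000x1000 matrix that needs to be updated in every iteration of the simulation it is used in, with possibly 100,000 iterations per run and hundreds of runs. So speed matters.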

Problemo

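I need to look these values up, and the lookup can come from either member of the pair (a non-computational simulation would ask, "Hey Jon, how long have you known Bran?"). I also need to update the values more often than not. I could of course use a matrix like this: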

# R code
dat = matrix(c(NA,2,10,2,NA,0,10,0,NA), ncol=3)
row.names(dat) = c("Jon", "Samwell", "Bran")
colnames(dat) = c("Jon", "Samwell", "Bran")
# Jon -> Bran?
dat["Jon", "Bran"] 

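This works, but every time the information changes I need to update two values in the matrix, ["Jon", "Bran"] and ["Bran", "Jon"]. Besides, it is the same value, so storing it in two places seems redundant. I could use only one triangle of the matrix (say, the lower one), but then the lookup indices would have to be ordered somehow, because: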

        Jon Samwell Bran
Jon      NA      NA   NA
Samwell   2      NA   NA
Bran     10       0   NA

dat["Jon", "Bran"] # returns NA...

I could number the people (or sort them alphabetically) and then do something like

dat[max(1,3), min(1,3)] # correctly returns 10

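...but this also feels roundabout, and computing the min and max every time might be slow (remember, roughly 100 x 100k times). Another option would be a list holding triplets of this data (dat2 = list(c("Jon", "Bran", 10), c("Jon", "Samwell", 2))), searched with grep or the like, but that does not seem like a viable idea either.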
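When many pairs have to be looked up at once, the min/max idea can be vectorised. The following is only a minimal sketch under assumptions: the `people` vector and the `lookup` helper are hypothetical names, and the speed of this variant has not been measured here. It maps names to integer positions with `match()` and then indexes the lower triangle with a two-column index matrix built from `pmax()` and `pmin()`.

```r
# Hypothetical example data: only the lower triangle holds values.
people <- c("Jon", "Samwell", "Bran")
dat <- matrix(NA_real_, 3, 3, dimnames = list(people, people))
dat["Samwell", "Jon"]  <- 2
dat["Bran", "Jon"]     <- 10
dat["Bran", "Samwell"] <- 0

# Look up several pairs in one call; argument order within a pair is irrelevant.
lookup <- function(a, b) {
  i <- match(a, people)            # names -> integer positions
  j <- match(b, people)
  dat[cbind(pmax(i, j), pmin(i, j))]  # (larger, smaller) always hits the lower triangle
}

lookup(c("Jon", "Bran"), c("Bran", "Samwell"))  # 10 0
```

Indexing a matrix with a two-column matrix picks one cell per row of the index, so no explicit loop is needed.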

Question

What I need is a structure that lets me call:

# pseudo R code
dataset[ "Jon", "Bran" ]       # returns 10
dataset[ "Bran", "Jon" ]       # returns the same, 10
dataset[ "Bran", "Jon" ] = 11  # changes value
dataset[ "Jon", "Bran" ]       # now returns 11

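What would be a good data structure (in R) for representing such relationships, so that I can quickly look up the value for any pair and modify it when needed?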
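To make the desired syntax concrete, here is a rough sketch of how it could be obtained with S3 methods for `[` and `[<-`. The class name `sympair` and the helper `ordered_pair` are made up for illustration, the sketch only handles a single pair of names per call, and it is about the interface rather than speed: method dispatch and `unclass()` may well add overhead that matters at this scale.

```r
# Hypothetical constructor: an n x n matrix tagged with a class.
sympair <- function(people) {
  n <- length(people)
  m <- matrix(NA_real_, n, n, dimnames = list(people, people))
  structure(m, class = "sympair")
}

# Order the two positions so that only the lower triangle is ever touched.
ordered_pair <- function(x, i, j) {
  ids <- match(c(i, j), rownames(x))
  c(max(ids), min(ids))
}

# Read: dataset["Jon", "Bran"] and dataset["Bran", "Jon"] hit the same cell.
`[.sympair` <- function(x, i, j, ...) {
  p <- ordered_pair(x, i, j)
  unclass(x)[p[1], p[2]]
}

# Write: store the value once, in the lower triangle.
`[<-.sympair` <- function(x, i, j, ..., value) {
  p <- ordered_pair(x, i, j)
  cl <- oldClass(x)
  x <- unclass(x)
  x[p[1], p[2]] <- value
  class(x) <- cl
  x
}

dataset <- sympair(c("Jon", "Samwell", "Bran"))
dataset["Jon", "Bran"] <- 10
dataset["Bran", "Jon"]        # 10
dataset["Bran", "Jon"] <- 11
dataset["Jon", "Bran"]        # 11
```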

Update: an intermediate comparison of the approaches:

### Lower-triangle matrix approach
# simulate some data - using numbers instead of names for simplicity:
dataset = matrix(1:1e+06, 1000,1000)
dataset[upper.tri(dataset)] = NA

## My initial suggestion with min, max
f0 = function(r, c){ 
    dataset[max(r, c), min(r, c)]
}

## Approach with ifelse, is.na
f1 <- function(rowname, colname) {
    ifelse(is.na(dataset[rowname,colname]), dataset[colname, rowname], 
           dataset[rowname, colname])
}

## Approach with flipping the triangles
f1.1 <- function(m) {
    m[upper.tri(m)] <- t(m)[upper.tri(m)]
    return(m)
}

### Named vector approach 
# simulate some data first:
names=character(); length(names)=1e+06
dat = 1:1e+06
ix = 1
for(i in 1:1000){
    for(j in 1:1000){
        names[ix] = paste(i,j,sep=":")
        ix = ix+1
    }
}
names(dat) = names

# The Indexing Function
f2<-function(rowname,colname){
  ifelse(rowname>colname,
         paste(colname,rowname,sep=":"),
         paste(rowname,colname,sep=":"))
}  


### Testing time! ###

r = sample(1:100,100)
c = sample(1:100,100)
library(microbenchmark)

microbenchmark(f0 = for(i in 1:100){f0(r[i],c[i])},       # min, max
               f1 = for(i in 1:100){f1(r[i],c[i])},       # ifelse, is.na
               ft = for(i in 1:100){dataset = f1.1(dataset); dataset[r[i], c[i]]}, # triangles
               f2 = for(i in 1:100){dat[f2(r[i],c[i])]},  # named vector
               times=10 # only 10, but otherwise took too long
               )

Results:

Unit: microseconds
 expr          min           lq         mean       median           uq          max
   f0      225.355      229.818     263.1525      253.246      268.641      386.897
   f1      365.923      377.971      437.055      414.341      451.156      677.851
   ft     15252785     15316552     15566444     15448666     15589053     16657835
   f2 24486482.495 25188285.201     26106770 25428894.256 26610528.584 31256624.432

Update 29.03, using @alexis_laz's suggestion of writing twice

# Suggestion to write to two locations in the matrix, so it could be accessed either way
f2x = function(r, c){ 
    dataset[r,c] = 100
    dataset[c,r] = 100
}

# Testing, this time a writing operation is included for comparison
microbenchmark(f0  = for(i in 1:100){dataset[r,c] = 100; f0(r[i],c[i])}, #original min,max
               f2x = for(i in 1:100){f2x(r[i],c[i])    ; dataset[r,c]},  # write to two slots
               times=10, unit = "us")

Unit: microseconds
 expr       min        lq      mean    median        uq        max 
   f0  25843.98  25922.07  30032.01  26515.36  33039.72   49811.93 
  f2x 590426.03 680729.11 778541.39 706079.99 785109.53 1178319.16 
# nope, this does not improve over the writing/accessing the lower triangle 

...so my naive lower-triangle matrix approach with min-max indexing still wins. Is this really the best answer?
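One caveat about the double-write timing may be worth keeping in mind. R functions work on copy-on-modify semantics, so an assignment such as `dataset[r,c] = 100` inside a function changes a local copy and leaves the global matrix untouched; with a large matrix, that copy could account for much of the measured time. The small sketch below only illustrates the behaviour; `write_local` and `write_both` are hypothetical names, and no new timings are claimed.

```r
m <- matrix(0, 3, 3)

# Assigning inside the function modifies a local copy of m only.
write_local <- function(r, c) {
  m[r, c] <- 100
  m[c, r] <- 100
}
write_local(1, 3)
m[1, 3]   # still 0: the global matrix was not changed

# Returning the modified matrix and reassigning makes the write persist.
write_both <- function(mat, r, c) {
  mat[r, c] <- 100
  mat[c, r] <- 100
  mat
}
m <- write_both(m, 1, 3)
m[3, 1]   # 100
```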

2 Answers:

Answer 0 (score: 0)

For this approach, see "Copy upper triangle to lower triangle for several matrices in a list". If you only want to set the lower triangle, you can use a function to copy those values to the upper triangle.

dat = matrix(c(NA,2,10,NA,NA,0,NA,NA,NA),3,3)
dat

#      [,1] [,2] [,3]
# [1,]   NA   NA   NA
# [2,]    2   NA   NA
# [3,]   10    0   NA

f <- function(m) {
    m[upper.tri(m)] <- t(m)[upper.tri(m)]
    m
}

f(dat)

#      [,1] [,2] [,3]
# [1,]   NA    2   10
# [2,]    2   NA    0
# [3,]   10    0   NA

Another idea is

f <- function(rowname, colname) {
    ifelse(is.na(dataset[rowname,colname]), dataset[colname, rowname], 
           dataset[rowname, colname])
}

f("Jon", "Bran") # same as f("Bran", "Jon")

This is probably the best approach, but it assumes that if dataset["Jon", "Bran"] is NA, then dataset["Bran", "Jon"] will not be.

Answer 1 (score: 0)

Instead of a matrix, you could store the values in a named numeric vector:

dat<-c(2,10,0)
names(dat)<-c("Jon:Samwell","Bran:Jon","Bran:Samwell")

> dat
 Jon:Samwell     Bran:Jon Bran:Samwell
           2           10            0

### The Indexing Function
f<-function(rowname,colname){
  ifelse(rowname>colname,
         paste(colname,rowname,sep=":"),
         paste(rowname,colname,sep=":"))
}   

Testing:

> dat[f("Jon","Bran")]
Bran:Jon
      10

> dat[f("Bran","Jon")]
Bran:Jon
      10

> dat[f("Bran","Jon")] = 11

> dat[f("Jon","Bran")]
Bran:Jon
      11

## Can accept list too:
> dat[f(c("Jon","Samwell"),c("Bran","Jon"))]
   Bran:Jon Jon:Samwell 
         11           2 

Checking the speed:

> set.seed(1)
> a<-sample(c("Jon","Samwell","Bran"),1000000,replace=T)
> b<-sample(c("Jon","Samwell","Bran"),1000000,replace=T)
> system.time(out<-dat[f(a,b)])
   user  system elapsed 
  1.212   0.028   1.244
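A possible variant of the flat-vector idea, sketched under assumptions and not benchmarked here, is to replace the string key with an arithmetic index into a packed lower triangle. With people numbered 1..n and i > j, the pair (i, j) can be stored at position (i - 1) * (i - 2) / 2 + j, so each pair occupies exactly one slot and no strings have to be pasted together. The names `people`, `ids`, `vals` and `pair_index` are illustrative only.

```r
people <- c("Jon", "Samwell", "Bran")
ids <- setNames(seq_along(people), people)   # name -> integer id
n <- length(people)
vals <- rep(NA_real_, n * (n - 1) / 2)       # one slot per unordered pair

# Position of the pair in the packed lower triangle.
# Assumes a != b; self-pairs are not supported by this formula.
pair_index <- function(a, b) {
  i <- ids[[a]]
  j <- ids[[b]]
  hi <- max(i, j)
  lo <- min(i, j)
  (hi - 1) * (hi - 2) / 2 + lo
}

vals[pair_index("Jon", "Bran")]     <- 10
vals[pair_index("Samwell", "Jon")]  <- 2
vals[pair_index("Bran", "Samwell")] <- 0

vals[pair_index("Bran", "Jon")]        # 10
vals[pair_index("Bran", "Jon")] <- 11  # update through either order
vals[pair_index("Jon", "Bran")]        # 11
```

If the simulation already works with integer ids, the `ids[[...]]` name lookup can be skipped and the index computed directly from the two numbers.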