How to in R data.table

Time: 2019-10-16 09:35:47

Tags: r data.table

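Edit: My first example had many problems, so I have reworked it here. This is mainly to credit the original responders, who cut my processing time by a factor of about 180 even though my example was poor. The question was put on hold for being unclear or not general enough, but I think it has value: data.table can do amazing things with the right syntax, yet that syntax can be elusive even with the available vignettes. From my own experience, more examples of how to use data.table would be helpful. Especially for those of us who started out in Excel, the VLOOKUP-like behavior here fills a gap that is not always easy to find.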

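Specific things in this example that may be of general interest are:

  1. Looking up values in one data.table from within another data.table
  2. Passing variables by name and by reference
  3. apply-like behavior within a data.table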
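
As a minimal sketch of these three points in isolation, assuming two small hypothetical tables (`prices` and `orders`, which are not part of my real data): an update join copies a value from one data.table into another much like VLOOKUP, the `..` prefix selects columns named in a character vector, and `by` gives apply-like evaluation per group.

```r
library(data.table)

# Hypothetical tables, not part of the original data
prices <- data.table(item = c("a", "b", "c"), price = c(1.5, 2.0, 3.25))
orders <- data.table(item = c("c", "a", "a", "d"), qty = c(2L, 1L, 4L, 1L))

# 1. VLOOKUP-like update join: add prices$price to orders by reference.
#    i.price is the price column of the table in i; items with no match stay NA.
orders[prices, on = .(item), price := i.price]

# 2. Column names held in a character vector: the .. prefix tells data.table
#    to look `wanted` up in the calling scope rather than among the columns.
wanted <- c("item", "price")
picked <- orders[, ..wanted]

# 3. Apply-like behavior: evaluate an expression once per group with by=.
qty_per_item <- orders[, .(total_qty = sum(qty)), by = item]
```

Because `:=` modifies `orders` in place, no copy of the table is made, which is a large part of why this style tends to be faster than calling a lookup function row by row.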

Modified example of the original question (limited rows):

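I am looking for help with the arcane world of data.table: passing functions and using fast lookups across multiple tables. I have a larger function which, when I profile it, seems to spend all of its time in this one area, doing some fairly simple lookup and summing operations. I am not proficient enough at profiling to work out exactly which sub-parts of the call are causing the problem, but my guess is that I am unintentionally doing something computationally expensive that I do not need to do. data.table syntax is still a mystery to me, so I am seeking help here to speed this process up.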

Small example:

library(data.table)
set.seed(seed = 911)
##Other parts of the analysis generate all of these data.tables
#A data table containing id values (the real version has other things too)
whoamI<-data.table(id=1:5)
#The result of another calculation; it tells me how many neighbors I will be interested in
#the real version has many more columns in it.
howmanyneighbors<-data.table(id=1:5,toCount=round(runif(5,min=1,max=3),0))
#Who the first three neighbors are for each id
#real version has hundreds of neighbors
myneighborsare<-data.table(id=1:5,matrix(1:5,ncol=3,nrow=5,byrow = TRUE))
colnames(myneighborsare)<-c("id","N1","N2","N3")
#How many of each group live at each location?
groupPops<-data.table(id=1:5,matrix(floor(runif(25,min=0,max=10)),ncol=5,nrow=5))
colnames(groupPops)<-c("id","ape","bat","cat","dog","eel")

whoamI
howmanyneighbors
myneighborsare
groupPops

> whoamI
   id
1:  1
2:  2
3:  3
4:  4
5:  5
> howmanyneighbors
   id toCount
1:  1       2
2:  2       1
3:  3       3
4:  4       3
5:  5       2
> myneighborsare
   id N1 N2 N3
1:  1  1  2  3
2:  2  4  5  1
3:  3  2  3  4
4:  4  5  1  2
5:  5  3  4  5
> groupPops
   id ape bat cat dog eel
1:  1   9   8   6   8   1
2:  2   9   8   0   9   8
3:  3   6   1   9   1   2
4:  4   6   1   9   0   3
5:  5   6   2   2   2   5

##At any given time I will only want the group populations for some of the groups
#I will always want 'ape' but other groups will vary. Here I have picked two
#I retain this because passing the column names by variable along with the pass of 'ape' was tricky
#and I don't want to lose that syntax in any new answer
animals<-c("bat","eel")
i<-2 #similarly, howmanyneighbors has many more columns in it and I need to pass a reference to one of them which I call i here


##Functions I will call on the above data
#Get the ids of my neighbors from myneighborsare. The number of ids returned will vary based on value in howmanyneighbors
getIDs<-function(a){myneighborsare[id==a,2:(as.numeric(howmanyneighbors[id==a,..i])+1)]} #so many coding fails here it pains me to put this in public view
#Sum the populations of my neighbors for groups I am interested in.
sumVals<-function(b){colSums(groupPops[id%in%b,c("ape",..animals)])} #cringe
#Wrap the first two together and put them into a format that works well with being returned as a row in a data.table
doBoth<-function(a){
  ro.ws<-getIDs(a)
  su.ms<-sumVals(ro.ws)
  answer<-lapply(split(su.ms,names(su.ms)),unname) #not too worried about this as it just mimics some things that happen in the original code at little time cost
  return(answer)
}

#Run the above function on my data
result<-data.table(whoamI)
result[,doBoth(id),by=id]

   id ape bat eel
1:  1  18  16   9
2:  2   6   1   3
3:  3  21  10  13
4:  4  24  18  14
5:  5  12   2   5

1 answer:

Answer 0 (score: 1)

This involves reshaping and a non-equi join.

library(data.table)

# reshape to long and add a grouping ID for a non-equi join later
molten_neighbors <- melt(myneighborsare, id.vars = 'id')[, grp_id := .GRP, by = variable]

#regular join by id
whoamI[howmanyneighbors,
       on = .(id)
#non-equi join - replaces getIDs(a)     
       ][molten_neighbors,
         on = .(id, toCount >= grp_id),
         nomatch = 0L
#regular join - next steps replace sumVals(ro.ws)        
         ][groupPops[, c('id','ape', ..animals)],
           on = .(value = id),
           .(id, ape, bat, eel),
           nomatch = 0L
           ][,
             lapply(.SD, sum),
             keyby = id 
             ]
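
To make the non-equi join step easier to follow on its own, here is a minimal sketch using hypothetical two-row versions of the tables (`counts` and `neighbors` are stand-ins for `howmanyneighbors` and `myneighborsare`; the values are illustrative). It only shows how `toCount >= grp_id` keeps the first `toCount` neighbor slots per id.

```r
library(data.table)

# Hypothetical miniature versions of the question's tables
counts    <- data.table(id = 1:2, toCount = c(2, 1))
neighbors <- data.table(id = 1:2, N1 = c(1L, 4L), N2 = c(2L, 5L), N3 = c(3L, 1L))

# Long format: one row per (id, neighbor slot); grp_id is 1 for N1, 2 for N2, ...
long <- melt(neighbors, id.vars = "id")[, grp_id := .GRP, by = variable]

# Keep only the slots whose position is within toCount for that id
kept <- counts[long, on = .(id, toCount >= grp_id), nomatch = 0L]
kept_ids <- kept[order(id, variable), .(id, neighbor = value)]
```

Here id 1 keeps neighbors 1 and 2 (its `toCount` is 2) and id 2 keeps only neighbor 4. Note that in the joined result the `toCount` column holds the matched `grp_id` values rather than the original counts.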

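I strongly recommend simplifying future questions. Using 10 rows would let you post the tables in the question itself. As it stands, this was somewhat hard to follow.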