Question

I am looking for a very efficient alternative to a for loop in R,

where data_papers is

data_papers <- c(1, 3, 47276, 77012, 77012, 79468, ...)

and paper_author is:

   paper_id author_id
1        1    521630
2        1    972575
3        1   1528710
4        1   1611750
5        2   1682088

For each paper in data_papers, I need to find its authors in paper_author. There are about 350,000 papers in data_papers and about 2,100,000 papers in paper_author.

So my output will be a list of the author_ids for the paper_ids in data_papers:

authors:
 [[1]]
 [1]     521630   972575  1528710  1611710

 [[2]]
 [1]     826   338038  788465 1256860 1671245 2164912

 [[3]]
 [1]     366653 1570981 1603466

The simplest way to do it is

authors<-vector("list",length(data_papers))
for(i in 1:length(data_papers)){
 authors[i]<-as.data.frame(paper_author$author_id[which(paper_author$paper_id%in%data_papers[i])])}

but the computation time is very long.

Another option, along the lines of "efficient programming in R", would be something like

i=1:length(data_papers)
authors[i]<-as.data.frame(paper_author$author_id[which(paper_author$paper_id%in%data_papers[i])])

but I could not get this to work.

How can this be done? Thanks.

Answer 1

with(paper_author, split(author_id,paper_id))
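A minimal sketch of how the split() result could be turned into the list the question asks for, one element per entry of data_papers. The toy data and the name by_paper are made up for illustration; only the first rows mirror the question. It assumes that looking up papers by the character form of their id is acceptable, and a paper with no authors in paper_author comes back as NULL.

```r
# Toy data (illustrative; not the asker's real data)
paper_author <- data.frame(
  paper_id  = c(1, 1, 1, 1, 2, 3, 3),
  author_id = c(521630, 972575, 1528710, 1611750, 1682088, 366653, 1570981)
)
data_papers <- c(1, 3, 3, 99)  # 99 has no authors in paper_author

# One pass over paper_author: a list of author vectors named by paper_id
by_paper <- with(paper_author, split(author_id, paper_id))

# Look up each requested paper by name; unknown ids yield NULL
authors <- unname(by_paper[as.character(data_papers)])
```

Because the grouping is done once rather than once per paper, this avoids rescanning the whole of paper_author for every element of data_papers.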

Answer 2

Or you could use R's merge function?

merge(data_papers, paper_author, by=1)
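merge() returns a data frame of matching rows rather than a list, so a further grouping step may be needed to get the output the question describes. The following is only a sketch on made-up toy data; the names wanted and joined are hypothetical, and it assumes duplicate ids in data_papers can be collapsed with unique().

```r
# Toy data (illustrative; not the asker's real data)
paper_author <- data.frame(
  paper_id  = c(1, 1, 2, 3, 3),
  author_id = c(521630, 972575, 1682088, 366653, 1570981)
)
data_papers <- c(1, 3, 3)

# Wrap the wanted ids in a one-column data frame so the join key has a name
wanted <- data.frame(paper_id = unique(data_papers))

# Inner join: keeps only the rows of paper_author whose paper_id is wanted
joined <- merge(wanted, paper_author, by = "paper_id")

# Group the joined author ids by paper to obtain a list
authors <- split(joined$author_id, joined$paper_id)
```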

Answer 3

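Why can't you use the second solution you mentioned? Some information about why it fails would be useful.

In any case, what you are trying to do is join two tables (data_papers and paper_author). Doing that with plain nested loops, as your sample code does, is very inefficient, whether the loops are R for loops or the C for loops underlying the vector operations. You could use some kind of indexed data structure, based on e.g. the hash package, but that is a lot of work.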
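To illustrate what such an index might look like, here is a rough sketch that uses a base R environment as the hash table in place of the hash package. The toy data and the name index are assumptions for illustration, and building the index with an R-level loop like this would itself be slow on millions of rows.

```r
# Toy data (illustrative; not the asker's real data)
paper_author <- data.frame(
  paper_id  = c(1, 1, 2, 3, 3),
  author_id = c(521630, 972575, 1682088, 366653, 1570981)
)
data_papers <- c(1, 3)

# Build the index once: key = paper_id as a string, value = author ids
index <- new.env(hash = TRUE)
for (k in seq_len(nrow(paper_author))) {
  key <- as.character(paper_author$paper_id[k])
  index[[key]] <- c(index[[key]], paper_author$author_id[k])
}

# Each lookup is now a hash probe instead of a scan of paper_author
authors <- lapply(as.character(data_papers), function(key) index[[key]])
```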

Instead, just use a database. They are built for this sort of thing. sqldf even lets you embed one in R.

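install.packages("sqldf")  # only needed once
library(sqldf)

# assumption: sqldf queries data frames, so wrap the id vector in one
# with a paper_id column to match the query below
data_papers <- data.frame(paper_id = data_papers)

# you probably want to dig into the indexing options available here as well
# returns the distinct authors of one paper (1234 is a placeholder id)
combined <- sqldf("select distinct author_id from paper_author pa inner join data_papers dp on dp.paper_id = pa.paper_id where dp.paper_id = 1234;")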

Alternative to a for loop in R
