I have two datasets.
merge_paper_author:
author_id paper_id
731 136
731 198
778 241
........
paper_author:
paper_id author_id
136 731
136 726
136 728
138 731
138 726
138 777
138 798
140 728
140 726
140 731
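In merge_paper_author, paper_id 136 has author_id 731. I need to take this pair (paper_id = 136, author_id = 731) and look it up in the paper_author table. There, paper_id = 136 has three authors: author_id = 731, 726 and 728. With 731 held fixed, that gives two author_id combinations: (731, 726) and (731, 728).
Next, I need to search all the other papers in paper_author for these combinations and count how often they occur. For example, the authors of paper_id = 138 include 731, 777 and 726, so the combination (731, 726) is counted once there. paper_id = 140 has author_ids 731, 728 and 726, so the count there is 2: (731, 726) and (731, 728). The total count is therefore 2 + 1 = 3. I also need the number of times this occurs.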
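To make the counting rule concrete, here is a minimal base R sketch that reproduces the worked example for the single pair (paper_id = 136, author_id = 731). It uses only the sample rows of paper_author listed above. The names target_paper, target_author, coauthors, other_papers, hits, pair_counts and total_count are introduced purely for illustration.

```r
# Sample rows copied from the paper_author table above
paper_author <- data.frame(
  paper_id  = c(136, 136, 136, 138, 138, 138, 138, 140, 140, 140),
  author_id = c(731, 726, 728, 731, 726, 777, 798, 728, 726, 731)
)

target_paper  <- 136  # illustrative name
target_author <- 731  # illustrative name

# Co-authors of the target author on the target paper (726 and 728)
coauthors <- setdiff(
  paper_author$author_id[paper_author$paper_id == target_paper],
  target_author
)

# Every other paper the target author appears on (138 and 140)
other_papers <- setdiff(
  paper_author$paper_id[paper_author$author_id == target_author],
  target_paper
)

# Rows of those other papers whose author is one of the co-authors
hits <- paper_author[
  paper_author$paper_id %in% other_papers &
    paper_author$author_id %in% coauthors,
]

pair_counts <- table(hits$author_id)  # occurrences per co-author
total_count <- nrow(hits)             # 3 for this sample
```

Here pair_counts holds 2 for co-author 726 and 1 for co-author 728, which matches the total of 2 + 1 = 3 in the example.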
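I need to do this for every such (author_id, paper_id) combination in merge_paper_author. I wrote some code for this, but it is very slow. I can use either R or an SQL query. My code is: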
data_papers <- merge_paper_author$paper_id
data_authors_sin <- merge_paper_author$author_id
n <- length(data_papers)
# Per-row results, pre-allocated (NA marks rows that are skipped below)
count_multiple_sin_pap <- rep(NA_integer_, n)    # co-authors seen more than once
co_authored_comb_sin_pap <- rep(NA_integer_, n)  # total co-author occurrences
no_coauthors_comb <- rep(NA_integer_, n)         # number of distinct co-authors
data_i <- integer(0)  # rows whose paper lists only one distinct author
for (i in seq_len(n)) {
  # All paper_author rows for the current paper
  index <- which(paper_author$paper_id == data_papers[i])
  if (length(index) > 1) {
    authors <- paper_author$author_id[index]
    if (length(unique(authors)) == 1) {
      data_i <- c(data_i, i)
    } else {
      # Co-authors: everyone on the paper except the current author.
      # Logical indexing avoids authors[-integer(0)], which returns nothing.
      authors <- authors[authors != data_authors_sin[i]]
      # Every paper by the current author, then every row of those papers
      index3 <- which(paper_author$author_id == data_authors_sin[i])
      index4 <- which(paper_author$paper_id %in% paper_author$paper_id[index3])
      # Keep the rows whose author is one of the co-authors
      dat <- paper_author$author_id[index4]
      dat <- dat[dat %in% authors]
      numdup <- table(dat)  # occurrences per co-author
      count_multiple_sin_pap[i] <- sum(numdup > 1)
      co_authored_comb_sin_pap[i] <- sum(numdup)
      no_coauthors_comb[i] <- length(numdup)
    }
  } else {
    # The paper has at most one row in paper_author, so no co-authors
    count_multiple_sin_pap[i] <- 0L
    co_authored_comb_sin_pap[i] <- 0L
    no_coauthors_comb[i] <- 0L
  }
}
The code above takes two inputs, data_papers and data_authors_sin:
dput(data_papers):
1696062L, 1696062L, 1701471L, 1701471L, 1707481L, 1707481L, 1722811L,
1722811L, 1754307L, 1754307L, 1768287L, 1768287L, 1791496L, 1791496L,
1795108L, 1795108L, 1816428L, 1816428L, 1844081L, 1844081L, 1850488L,
1856769L, 1856769L, 1931620L, 1931620L, 1947922L, 1955753L, 1955753L,
dput(data_authors_sin):
2292903L, 2292903L, 2292903L, 2292903L, 2292903L, 2292903L, 2292903L,
2292903L, 2292903L, 2292903L, 2292903L, 2292903L, 2292903L, 2292903L,
2292903L, 2292903L, 2292903L, 2292903L, 2292903L, 2292903L, 2292903L,
2292903L, 2292903L, 2292903L, 2292903L, 2292903L, 2292903L, 2292903L,
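For comparison, the row-by-row loop could perhaps be replaced by a few joins. The sketch below uses base R merge() and aggregate(). It is an assumption-laden illustration, not a tested solution for the full data. The helper name count_coauthor_pairs and the column names coauthor_id, other_paper_id and total are hypothetical. It follows the worked example (2 + 1 = 3) by counting only papers other than the target paper. The intermediate join can become large when an author has many papers.

```r
# Hypothetical helper: total co-author combination count per target pair
count_coauthor_pairs <- function(merge_paper_author, paper_author) {
  pa      <- unique(paper_author[, c("paper_id", "author_id")])
  targets <- unique(merge_paper_author[, c("paper_id", "author_id")])

  # 1. Pair each target author with the co-authors on the target paper
  co <- merge(targets, setNames(pa, c("paper_id", "coauthor_id")),
              by = "paper_id")
  co <- co[co$author_id != co$coauthor_id, ]

  # 2. Attach every other paper written by the target author
  cand <- merge(co, setNames(pa, c("other_paper_id", "author_id")),
                by = "author_id")
  cand <- cand[cand$other_paper_id != cand$paper_id, ]

  # 3. Keep only the other papers on which the co-author also appears
  hits <- merge(cand, setNames(pa, c("other_paper_id", "coauthor_id")),
                by = c("other_paper_id", "coauthor_id"))

  # 4. Sum the matches per target pair; pairs without matches stay at 0
  targets$total <- 0
  if (nrow(hits) > 0) {
    hits$n <- 1
    tot <- aggregate(n ~ paper_id + author_id, data = hits, FUN = sum)
    idx <- match(paste(targets$paper_id, targets$author_id),
                 paste(tot$paper_id, tot$author_id))
    targets$total[!is.na(idx)] <- tot$n[idx[!is.na(idx)]]
  }
  targets
}

# Usage on the small sample tables from the top of the post
merge_paper_author <- data.frame(
  author_id = c(731, 731, 778),
  paper_id  = c(136, 198, 241)
)
paper_author <- data.frame(
  paper_id  = c(136, 136, 136, 138, 138, 138, 138, 140, 140, 140),
  author_id = c(731, 726, 728, 731, 726, 777, 798, 728, 726, 731)
)
result <- count_coauthor_pairs(merge_paper_author, paper_author)
```

On this sample, result has one row per target pair, with total equal to 3 for (136, 731) and 0 for the two pairs whose papers do not appear in the sample paper_author rows.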
Thanks