How to determine author co-occurrence from a list of published papers

Date: 2016-03-08 19:05:37

Tags: r

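I want to see whether I can visualise who publishes with whom in peer-reviewed journals on a given subject. To do this, I entered the keyword "Barrett" into PubMed and downloaded a large file, which gave me two columns, Title and Author: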

structure(list(Title = structure(c(1L, 4L, 3L, 2L, 5L), .Label = c("A case of Barrett's adenocarcinoma with marked endoscopic morphological changes in Barrett's esophagus over a long follow-up period of 15\xe4\xf3\x8ayears.", 
"APE1-mediated DNA damage repair provides survival advantage for esophageal adenocarcinoma cells in response to acidic bile salts.", 
"Healthcare Cost of Over-Diagnosis of Low-Grade Dysplasia in Barrett's Esophagus.", 
"Radiofrequency ablation coupled with Roux-en-Y gastric bypass: a treatment option for morbidly obese patients with Barrett's esophagus.", 
"Risk factors for Barrett's esophagus."), class = "factor"), 
    Author = structure(c(3L, 5L, 4L, 2L, 1L), .Label = c("Arora Z, Garber A, Thota PN.", 
    "Hong J, Chen Z, Peng D, Zaika A, Revetta F, Washington MK, Belkhiri A, El-Rifai W.", 
    "Iwaya Y, Yamazaki T, Watanabe T, Seki A, Ochi Y, Hara E, Arakura N, Tanaka E, Hasebe O.", 
    "Lash RH, Deas TM Jr, Wians FH Jr.", "Parikh K, Khaitan L."
    ), class = "factor")), .Names = c("Title", "Author"), row.names = c(NA, 
5L), class = "data.frame")

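I want to count how many times one author has published with another. I think the best way to do this is to create a co-occurrence matrix (which I will later use with igraph).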

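I am having trouble understanding how to convert the data into such a matrix. I imagine it would involve listing all the authors as both column names and row names, then iterating over each row of the Auth data frame and recording the co-occurrence of each pair of names in the matrix. Is there a quick way to do this? I am lost as to how to approach it, so I tried this: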

1. Extract all the names into a long list from the Author column
2. Then create colnames from the Author list
3. Then create rownames from the Author list
4. Then somehow iterate through Auth[2] and count the name co-occurrence

...but I got stuck on the first step:

AuthSplit<-strsplit(Auth$Author, ",", fixed=T)
AuthSplit<-as.data.frame(AuthSplit)

But I got this error:

 Error in data.frame(c("Iwaya Y", " Yamazaki T", " Watanabe T", " Seki A",  : 
  arguments imply differing number of rows: 9, 2, 3, 8, 20, 5, 1, 11, 4, 23, 6, 15, 16, 7, 12, 10, 14, 21, 13, 18, 19, 17, 22

Surely there must be an easier way?

1 answer:

Answer 0 (score: 2)
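As background, here is a minimal sketch of why the attempt in the question fails, using two of the author strings from the sample data. strsplit() returns a list with one character vector per paper, and those vectors have different lengths, so as.data.frame() cannot turn them into columns. The variable names are only illustrative.

```r
# Two author strings taken from the sample data in the question
authors <- c("Parikh K, Khaitan L.", "Arora Z, Garber A, Thota PN.")

# One character vector per paper; the vectors differ in length
auth_split <- strsplit(authors, ",", fixed = TRUE)
n_authors <- lengths(auth_split)

# Converting the ragged list to a data frame fails with
# "arguments imply differing number of rows"
res <- try(as.data.frame(auth_split), silent = TRUE)
failed <- inherits(res, "try-error")
```

Note also that strsplit() needs a character vector, so if Author is stored as a factor (as in the dput output), it has to be wrapped in as.character() first. Keeping the result as a list avoids the conversion problem altogether.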

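If you have a large number of authors, the adjacency matrix can become very large. Instead, you can build a list of author pairs that igraph can use to create the graph. The basic approach is to form, for each paper, a vector of its individual authors, and then a data frame of author pairs for that paper, keeping only the pairs in which the first author sorts alphabetically before the second. The per-paper data frames are then combined into one larger data frame. From that, form a data frame of unique author pairs together with the number of papers for each pair. This data frame is used to create the graph, with the paper count stored in the graph as an edge attribute. The count can then be displayed on the plot of the graph.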
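To make the pair-forming step concrete, here is a small sketch for a single paper taken from the sample data. It assumes the author string is comma-separated, as in the PubMed export shown in the question; the variable names are illustrative only.

```r
# One author string from the sample data
paper <- "Lash RH, Deas TM Jr, Wians FH Jr."

# Split on commas and strip the leading spaces
authors <- trimws(strsplit(paper, ",", fixed = TRUE)[[1]])

# All ordered combinations of authors, including self-pairs
pairs <- expand.grid(Var1 = authors, Var2 = authors, stringsAsFactors = FALSE)

# Keep each unordered pair once: first name sorts before the second
pairs <- pairs[pairs$Var1 < pairs$Var2, ]
```

With three authors this leaves three rows, one per unordered pair. The strict `<` comparison also drops the self-pairs, so an author is never linked to themselves.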

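I added a couple of papers to your list to cover the cases where the same author appears on several papers and where the same pair of authors shares more than one paper. The code looks like this: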

library(igraph)
#  add papers with authors from previous papers
  Auth <- rbind(Auth, 
              data.frame(Title=c("Paper A","Paper B"), 
                         Author=c("Iwaya Y, Parikh K, Lash RH", "Wians FH Jr., Lash RH")))

# create list of individual authors for each paper
  pub_auths <- sapply(Auth$Author, function(x) strsplit(as.character(x), split=","))
  pub_auths <- lapply(pub_auths, trimws)
# for each paper, form a data frame of unique author pairs 
  auth_pairs <- lapply(pub_auths, function(x) { z  <-  expand.grid(x, x, stringsAsFactors=FALSE);
                                        z[z$Var1 < z$Var2,]   })
# combine the per-paper data frames into one data frame
  auth_pairs <- do.call(rbind, auth_pairs)
# count papers for each author pair
  auth_count <- aggregate( paste(Var1, Var2)  ~ Var1 + Var2 , data=auth_pairs, length)
  colnames(auth_count) <- c("Author1","Author2","Paper_count")
# create graph from author pairs
  g <- graph_from_data_frame(auth_count, directed=FALSE)
# plot graph
   plot(g, edge.label=E(g)$Paper_count, edge.label.cex=1.4, vertex.label.cex=1.4)

In the plot, the paper counts are shown as edge labels. Note that Wians and Lash have two papers together, including the one added to the data.
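If the number of authors is small enough that a co-occurrence matrix is still wanted, as described in the question, one possible sketch uses base R only: build a paper-by-author incidence matrix with table() and multiply it by itself with crossprod(). The three author strings below are toy data assembled from names in the example, not the real PubMed export, and the variable names are hypothetical.

```r
# Toy data: three papers, authors separated by commas
author_strings <- c("Iwaya Y, Parikh K, Lash RH",
                    "Wians FH Jr., Lash RH",
                    "Lash RH, Deas TM Jr, Wians FH Jr.")
auth_list <- lapply(strsplit(author_strings, ",", fixed = TRUE), trimws)

# Long format: one row per (paper, author) combination
long <- data.frame(paper  = rep(seq_along(auth_list), lengths(auth_list)),
                   author = unlist(auth_list))

# Paper-by-author incidence matrix (1 if the author is on the paper)
incidence <- unclass(table(long$paper, long$author))

# Author-by-author matrix: entry [i, j] counts papers shared by i and j
cooc <- crossprod(incidence)
diag(cooc) <- 0   # the diagonal would otherwise hold each author's paper count

n_lash_wians <- cooc["Lash RH", "Wians FH Jr."]
```

A matrix like this can be passed to igraph's graph_from_adjacency_matrix() with mode = "undirected" and weighted = TRUE. It grows with the square of the number of authors, though, which is the reason the pair list was preferred above.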
