Self-joining a data frame in R

Time: 2017-05-08 16:40:56

Tags: sql r join dataframe

I have a table consisting of publication_id and author names.

I want to find all the co-authors of each author, i.e. everyone they have worked with.

I was able to get all the publications for each author:

   pubsperauthor <- sample_pubs_small %>% 
       group_by(cname) %>% 
       summarise(pubs = toString(sort(unique(publication_id))))

Now I want to get the names of all the co-authors on those publications. Any suggestions?

Here is the code for the data:

> dput(pubsperauthor)
structure(list(cname = c("AMEY S BAILEY", "JACK  SMITH", "JACK A SMITH", 
"JACK B SMITH", "JAMES  ANDERSON", "JAMES M ANDERSON", "RONALD  VALE", 
"RONALD A VALE", "RONALD DAVID VALE"), pubs = c("6", "6", "4", 
"5", "1, 2", "4, 5, 6", "3", "0", "1, 2")), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -9L), .Names = c("cname", 
"pubs"))
> dput(sample_pubs_small)
structure(list(publication_id = c(0L, 1L, 1L, 2L, 2L, 3L, 4L, 
4L, 5L, 5L, 6L, 6L, 6L), cname = c("RONALD A VALE", "JAMES  ANDERSON", 
"RONALD DAVID VALE", "JAMES  ANDERSON", "RONALD DAVID VALE", 
"RONALD  VALE", "JAMES M ANDERSON", "JACK A SMITH", "JAMES M ANDERSON", 
"JACK B SMITH", "JAMES M ANDERSON", "AMEY S BAILEY", "JACK  SMITH"
)), row.names = c(NA, -13L), class = c("tbl_df", "tbl", "data.frame"
), .Names = c("publication_id", "cname"))

Edit

Here is the sample output:

1     AMEY S BAILEY                          JACK  SMITH, JAMES M ANDERSON
2       JACK  SMITH                        AMEY S BAILEY, JAMES M ANDERSON
3      JACK A SMITH                                       JAMES M ANDERSON
4      JACK B SMITH                                       JAMES M ANDERSON
5   JAMES  ANDERSON                                      RONALD DAVID VALE
6  JAMES M ANDERSON AMEY S BAILEY, JACK  SMITH, JACK A SMITH, JACK B SMITH
7 RONALD DAVID VALE                                        JAMES  ANDERSON
8     RONALD A VALE
9       RONALD VALE

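3 answers:

Answer 0: (score: 3)

Here is one way to get the list of co-authors for each author. Note that it drops authors who have no co-authors, so depending on the final data structure you need, you may want to do another join against the full list of authors.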

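library(dplyr)

coauthor <- sample_pubs_small %>%
  # self-join: pair every author with every author on the same publication
  left_join(sample_pubs_small, by = "publication_id") %>%
  # drop the rows where an author is paired with themselves
  subset(cname.x != cname.y) %>%
  group_by(cname.x) %>%
  summarise(Coauthors = toString(sort(unique(cname.y))))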
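
The "another join" mentioned above could look roughly like the sketch below. It rebuilds the question's data as a plain data frame so it runs on its own; the names `all_authors` and `coauthor_full` are made up for illustration. Recent dplyr versions may warn about a many-to-many relationship in the self-join, which is expected here.

```r
library(dplyr)

# Data from the question, rebuilt as a plain data frame
sample_pubs_small <- data.frame(
  publication_id = c(0L, 1L, 1L, 2L, 2L, 3L, 4L, 4L, 5L, 5L, 6L, 6L, 6L),
  cname = c("RONALD A VALE", "JAMES  ANDERSON", "RONALD DAVID VALE",
            "JAMES  ANDERSON", "RONALD DAVID VALE", "RONALD  VALE",
            "JAMES M ANDERSON", "JACK A SMITH", "JAMES M ANDERSON",
            "JACK B SMITH", "JAMES M ANDERSON", "AMEY S BAILEY",
            "JACK  SMITH"),
  stringsAsFactors = FALSE
)

# Co-authors per author; authors without co-authors drop out at this step
coauthor <- sample_pubs_small %>%
  left_join(sample_pubs_small, by = "publication_id") %>%
  filter(cname.x != cname.y) %>%
  group_by(cname.x) %>%
  summarise(Coauthors = toString(sort(unique(cname.y))))

# Full author list (hypothetical helper name)
all_authors <- distinct(sample_pubs_small, cname)

# Join back so solo authors reappear, with an empty co-author string
coauthor_full <- all_authors %>%
  left_join(coauthor, by = c("cname" = "cname.x")) %>%
  mutate(Coauthors = ifelse(is.na(Coauthors), "", Coauthors)) %>%
  arrange(cname)

print(coauthor_full)
```

With this data, `coauthor_full` has one row per author (nine in total), and the authors who only appear alone on a publication carry an empty `Coauthors` string.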

Answer 1: (score: 1)

Here is how to keep the authors who have no co-authors, using dplyr.

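library(dplyr)
sample_pubs_small %>%
  left_join(sample_pubs_small, by = "publication_id") %>%
  # blank out self-pairs instead of dropping the rows, so solo authors stay
  mutate(cname.y = ifelse(cname.x == cname.y, NA, cname.y)) %>%
  group_by(cname.x) %>%
  # sort() drops the NAs, so a solo author ends up with an empty string
  summarise(coauthors = toString(sort(unique(cname.y))))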

            cname.x                                              coauthors
              <chr>                                                  <chr>
1     AMEY S BAILEY                          JACK  SMITH, JAMES M ANDERSON
2       JACK  SMITH                        AMEY S BAILEY, JAMES M ANDERSON
3      JACK A SMITH                                       JAMES M ANDERSON
4      JACK B SMITH                                       JAMES M ANDERSON
5   JAMES  ANDERSON                                      RONALD DAVID VALE
6  JAMES M ANDERSON AMEY S BAILEY, JACK  SMITH, JACK A SMITH, JACK B SMITH
7      RONALD  VALE                                                       
8     RONALD A VALE                                                       
9 RONALD DAVID VALE                                        JAMES  ANDERSON

Answer 2: (score: 1)

Ian Wesley has already given the answer, but I'll add something that may be useful.

You can use aggregate to get pubsperauthor:

pubsperauthor <- aggregate(publication_id ~ cname, sample_pubs_small, c)

You can do the same thing to get authorsperpub, which gives all the authors of each publication (who are, in a sense, co-authors):

authorsperpub <- aggregate(cname ~ publication_id, sample_pubs_small, c)
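
To go from these per-publication and per-author groupings to an actual co-author list, something like the following base R sketch might work. It swaps `aggregate()` for `split()` so that the groups come back as plain named lists, and the names `authors_by_pub`, `pubs_by_author` and `coauthors` are made up for illustration.

```r
# Data from the question, rebuilt as a plain data frame
sample_pubs_small <- data.frame(
  publication_id = c(0L, 1L, 1L, 2L, 2L, 3L, 4L, 4L, 5L, 5L, 6L, 6L, 6L),
  cname = c("RONALD A VALE", "JAMES  ANDERSON", "RONALD DAVID VALE",
            "JAMES  ANDERSON", "RONALD DAVID VALE", "RONALD  VALE",
            "JAMES M ANDERSON", "JACK A SMITH", "JAMES M ANDERSON",
            "JACK B SMITH", "JAMES M ANDERSON", "AMEY S BAILEY",
            "JACK  SMITH"),
  stringsAsFactors = FALSE
)

# Authors of each publication, as a list named by publication_id
authors_by_pub <- split(sample_pubs_small$cname,
                        sample_pubs_small$publication_id)

# Publications of each author, as a list named by author
pubs_by_author <- split(sample_pubs_small$publication_id,
                        sample_pubs_small$cname)

# For each author: everyone on any of their publications, minus themselves
coauthors <- lapply(names(pubs_by_author), function(author) {
  pubs <- as.character(pubs_by_author[[author]])
  setdiff(unique(unlist(authors_by_pub[pubs])), author)
})
names(coauthors) <- names(pubs_by_author)

# Collapse each vector into one comma-separated string per author
print(sapply(coauthors, toString))
```

Unlike the self-join that filters out self-pairs, this keeps authors without co-authors: they simply get an empty character vector.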