How do I join/merge two tables using character values?

Time: 2015-07-09 12:19:16

Tags: r

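I want to combine two tables based on first name, last name, and year, and create a new binary variable indicating whether each row of table 1 is present in the second table.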

The first table is a panel data set of some attributes of NBA players during a season:

   firstname<-c("Michael","Michael","Michael","Magic","Magic","Magic","Larry","Larry")
   lastname<-c("Jordan","Jordan","Jordan","Johnson","Johnson","Johnson","Bird","Bird")
   year<-c("1991","1992","1993","1991","1992","1993","1992","1992")

   season<-data.frame(firstname,lastname,year)


    firstname   lastname        year
  1 Michael      Jordan         1991
  2 Michael      Jordan         1992
  3 Michael      Jordan         1993
  4 Magic        Johnson        1991
  5 Magic        Johnson        1992
  6 Magic        Johnson        1993
  7 Larry        Bird           1992
  8 Larry        Bird           1992

The second data.frame is a panel data set of some attributes of NBA players who played in the All-Star game:

   firstname<-c("Michael","Michael","Michael","Magic","Magic","Magic")
   lastname<-c("Jordan","Jordan","Jordan","Johnson","Johnson","Johnson")
   year<-c("1991","1992","1993","1991","1992","1993")

    ALLSTARS<-data.frame(firstname,lastname,year)



     firstname  lastname    year
  1 Michael     Jordan    1991
  2 Michael     Jordan    1992
  3 Michael     Jordan    1993
  4 Magic       Johnson   1991
  5 Magic       Johnson   1992
  6 Magic       Johnson   1993

The result I want looks like this:

    firstname   lastname        year    allstars
  1 Michael      Jordan         1991    1
  2 Michael      Jordan         1992    1
  3 Michael      Jordan         1993    1
  4 Magic        Johnson        1991    1
  5 Magic        Johnson        1992    1
  6 Magic        Johnson        1993    1
  7 Larry        Bird           1992    0
  8 Larry        Bird           1992    0

I tried using a left join, but I am not sure whether that makes sense:

    test<-join(season, ALLSTARS, by =c("lastname","firstname","year") , type = "left", match = "all")

3 Answers:

Answer 0 (score: 4)

Here is a simple solution using a data.table binary join, which lets you update a column by reference while joining:

library(data.table)
setkey(setDT(season), firstname, lastname, year)[ALLSTARS, allstars := 1L]
season
#    firstname lastname year allstars
# 1:     Larry     Bird 1992       NA
# 2:     Larry     Bird 1992       NA
# 3:     Magic  Johnson 1991        1
# 4:     Magic  Johnson 1992        1
# 5:     Magic  Johnson 1993        1
# 6:   Michael   Jordan 1991        1
# 7:   Michael   Jordan 1992        1
# 8:   Michael   Jordan 1993        1
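
The join above leaves NA for unmatched players and reorders the rows by key, whereas the question asks for 0 and shows the original order. As a minimal sketch of one way to get 0/1 directly, assuming data.table 1.9.6 or later (for the `on=` argument), the column can be initialised to 0 and then overwritten for matching rows. The data are rebuilt here only so that the block runs on its own:

```r
library(data.table)

season <- data.table(
  firstname = c("Michael", "Michael", "Michael", "Magic", "Magic", "Magic", "Larry", "Larry"),
  lastname  = c("Jordan", "Jordan", "Jordan", "Johnson", "Johnson", "Johnson", "Bird", "Bird"),
  year      = c("1991", "1992", "1993", "1991", "1992", "1993", "1992", "1992")
)
ALLSTARS <- data.table(
  firstname = c("Michael", "Michael", "Michael", "Magic", "Magic", "Magic"),
  lastname  = c("Jordan", "Jordan", "Jordan", "Johnson", "Johnson", "Johnson"),
  year      = c("1991", "1992", "1993", "1991", "1992", "1993")
)

# Start every row at 0, then set the rows that match ALLSTARS to 1 by reference.
season[, allstars := 0L]
season[ALLSTARS, on = c("firstname", "lastname", "year"), allstars := 1L]
season
```

Because `on=` joins without setting a key, `season` keeps its original row order.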

Or using dplyr:

library(dplyr)
ALLSTARS %>% 
  mutate(allstars = 1L) %>%
  right_join(., season)
#   firstname lastname year allstars
# 1   Michael   Jordan 1991        1
# 2   Michael   Jordan 1992        1
# 3   Michael   Jordan 1993        1
# 4     Magic  Johnson 1991        1
# 5     Magic  Johnson 1992        1
# 6     Magic  Johnson 1993        1
# 7     Larry     Bird 1992       NA
# 8     Larry     Bird 1992       NA
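
The dplyr result also has NA where the question asks for 0. A hedged sketch of one way to finish the job, assuming a dplyr version that provides `coalesce()` (0.5.0 or later): join from `season` so its row order is kept, then replace the missing flags. The `distinct()` call is a precaution I have added, not part of the answer above, in case the lookup table ever contains repeated keys, which would otherwise duplicate rows in the join.

```r
library(dplyr)

season <- data.frame(
  firstname = c("Michael", "Michael", "Michael", "Magic", "Magic", "Magic", "Larry", "Larry"),
  lastname  = c("Jordan", "Jordan", "Jordan", "Johnson", "Johnson", "Johnson", "Bird", "Bird"),
  year      = c("1991", "1992", "1993", "1991", "1992", "1993", "1992", "1992"),
  stringsAsFactors = FALSE
)
ALLSTARS <- data.frame(
  firstname = c("Michael", "Michael", "Michael", "Magic", "Magic", "Magic"),
  lastname  = c("Jordan", "Jordan", "Jordan", "Johnson", "Johnson", "Johnson"),
  year      = c("1991", "1992", "1993", "1991", "1992", "1993"),
  stringsAsFactors = FALSE
)

# Flag every all-star row with 1, one row per key combination.
flags <- ALLSTARS %>%
  distinct() %>%
  mutate(allstars = 1L)

# Left join keeps all rows of season; unmatched rows get NA, which becomes 0.
result <- season %>%
  left_join(flags, by = c("firstname", "lastname", "year")) %>%
  mutate(allstars = coalesce(allstars, 0L))
result
```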

Answer 1 (score: 2)

In base R:

ALLSTARS$allstars <- 1L
newdf <- merge(season, ALLSTARS, by=c('firstname', 'lastname', 'year'), all.x=TRUE)
newdf$allstars[is.na(newdf$allstars)] <- 0L 
newdf

Or a different approach that I like:

season$allstars <- (apply(season, 1, function(x) paste(x, collapse='')) %in%
  apply(ALLSTARS, 1, function(x) paste(x, collapse='')))+0L
season
#   firstname lastname year allstars
# 1   Michael   Jordan 1991        1
# 2   Michael   Jordan 1992        1
# 3   Michael   Jordan 1993        1
# 4     Magic  Johnson 1991        1
# 5     Magic  Johnson 1992        1
# 6     Magic  Johnson 1993        1
# 7     Larry     Bird 1992        0
# 8     Larry     Bird 1992        0
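
Two things can trip up the paste-everything approach: it pastes every column, so if `ALLSTARS` already carries the `allstars` column from the `merge()` example, that value ends up in the pasted string and the keys no longer line up; and `collapse=''` can in principle make different rows look identical (for example "ab" + "c" versus "a" + "bc"). A minimal sketch of a more defensive variant, restricted to the key columns and using a separator that is unlikely to occur in the data. The helper name `make_key` and the choice of separator are my own assumptions, not part of the answer:

```r
season <- data.frame(
  firstname = c("Michael", "Michael", "Michael", "Magic", "Magic", "Magic", "Larry", "Larry"),
  lastname  = c("Jordan", "Jordan", "Jordan", "Johnson", "Johnson", "Johnson", "Bird", "Bird"),
  year      = c("1991", "1992", "1993", "1991", "1992", "1993", "1992", "1992"),
  stringsAsFactors = FALSE
)
ALLSTARS <- data.frame(
  firstname = c("Michael", "Michael", "Michael", "Magic", "Magic", "Magic"),
  lastname  = c("Jordan", "Jordan", "Jordan", "Johnson", "Johnson", "Johnson"),
  year      = c("1991", "1992", "1993", "1991", "1992", "1993"),
  stringsAsFactors = FALSE
)

keys <- c("firstname", "lastname", "year")

# Hypothetical helper: build one string per row from the key columns only,
# separated by a character that should not appear in names or years.
make_key <- function(df) do.call(paste, c(df[keys], sep = "\r"))

# TRUE/FALSE membership converted to 1/0.
season$allstars <- as.integer(make_key(season) %in% make_key(ALLSTARS))
season
```

Unlike `merge()`, which sorts the result by the key columns by default, this keeps `season` in its original row order.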

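Answer 2 (score: 1)

It looks like you are using join() from the plyr package. You are almost there: just precede your command with ALLSTARS$allstars <- 1. Then perform the join as you wrote it, and finally convert the NA values to 0. So: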

ALLSTARS$allstars <- 1
test <- join(season, ALLSTARS, by =c("lastname","firstname","year") , type = "left", match = "all")
test$allstars[is.na(test$allstars)] <- 0

Result:

  firstname lastname year allstars
1   Michael   Jordan 1991        1
2   Michael   Jordan 1992        1
3   Michael   Jordan 1993        1
4     Magic  Johnson 1991        1
5     Magic  Johnson 1992        1
6     Magic  Johnson 1993        1
7     Larry     Bird 1992        0
8     Larry     Bird 1992        0

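Personally, though, I would use left_join or right_join from the dplyr package, as in David's answer, rather than plyr's join(). Also note that in this case you do not actually need the by argument of join(): by default the function joins on all fields with common names, which is exactly what you want here.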