Join two data frames so that one column contains multiple values

Time: 2019-06-19 10:41:19

Tags: r dplyr tidyverse tidyr spread

My data looks like this:

df1
#>           Artist          Album Year
#> 1        Beatles  Sgt. Pepper's 1967
#> 2 Rolling Stones Sticky Fingers 1971

df2
#>    Artist Members
#> 1 Beatles  George
#> 2 Beatles   Ringo
#> 3 Beatles    Paul
#> 4 Beatles    John

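I'd like to join these two data frames in what I think is a "dumb" way. Although it isn't very tidy, it would be really helpful for the final output to look like the example below, where each band (Artist) takes up only one row and the band members are all placed in a single column, separated by commas: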

Desired Output
#>           Artist          Album                   Members Year
#> 1        Beatles  Sgt. Pepper's George, Ringo, Paul, John 1967
#> 2 Rolling Stones Sticky Fingers                           1971

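I have been able to get close to a solution (below), but:

  1. Is there an easier way?
  2. How can I generalize my code so that it still works if a band has 11 or 13 members?
  3. Where data is missing, as for the Rolling Stones, the values are "NA". Is there an easy way to make them blank instead?
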
library(tidyverse)
df1 <- data.frame(stringsAsFactors=FALSE,
      Artist = c("Beatles", "Rolling Stones"),
       Album = c("Sgt. Pepper's", "Sticky Fingers"),
        Year = c(1967, 1971)
)

df2 <- data.frame(stringsAsFactors=FALSE,
       Artist = c("Beatles", "Beatles", "Beatles", "Beatles"),
    Members = c("George", "Ringo", "Paul", "John")
)

df <- left_join(df1, df2, by = "Artist")
df <- df %>% group_by(Artist) %>% mutate(member_number = seq_along(Members))
df <- spread(df, key = "member_number", value = "Members", sep = "_")
df <- df %>% unite(col = "members", member_number_1:member_number_4, sep = ",")

which gives the output:

df
#> # A tibble: 2 x 4
#> # Groups:   Artist [2]
#>   Artist         Album           Year members               
#>   <chr>          <chr>          <dbl> <chr>                 
#> 1 Beatles        Sgt. Pepper's   1967 George,Ringo,Paul,John
#> 2 Rolling Stones Sticky Fingers  1971 NA,NA,NA,NA

4 answers:

Answer 0 (score: 3)

A slightly different approach:

library(dplyr)


left_join(df1, df2) %>%
  group_by(Artist, Album, Year) %>%
  summarise(members = paste(Members, collapse = ","))

# A tibble: 2 x 4
# Groups:   Artist, Album [?]
  Artist         Album           Year members               
  <chr>          <chr>          <dbl> <chr>                 
1 Beatles        Sgt. Pepper's   1967 George,Ringo,Paul,John
2 Rolling Stones Sticky Fingers  1971 NA  
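
For the third point in the question (a blank instead of NA), one option is to drop the missing values before pasting. This is only a minimal sketch, assuming the `df1` and `df2` from the question; the result name `res` is just for illustration:

```r
library(dplyr)

# Data as given in the question
df1 <- data.frame(
  Artist = c("Beatles", "Rolling Stones"),
  Album = c("Sgt. Pepper's", "Sticky Fingers"),
  Year = c(1967, 1971),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Artist = rep("Beatles", 4),
  Members = c("George", "Ringo", "Paul", "John"),
  stringsAsFactors = FALSE
)

res <- left_join(df1, df2, by = "Artist") %>%
  group_by(Artist, Album, Year) %>%
  # Remove NA before collapsing; a group with no members collapses to ""
  summarise(Members = paste(Members[!is.na(Members)], collapse = ", ")) %>%
  ungroup()
```

Because `paste()` on an empty character vector with `collapse` returns an empty string, the Rolling Stones row ends up blank rather than "NA", and the code does not depend on how many members a band has.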

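Answer 1 (score: 2)

We can left_join first and then summarise multiple columns, collapsing each one into a comma-separated string of its unique values. Note that toString turns every summarised column, including Year, into character, as the output below shows.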

library(dplyr)

left_join(df1, df2, by = "Artist") %>%
   group_by(Artist) %>%
   summarise_at(vars(Album:Members), ~toString(unique(.)))

# A tibble: 2 x 4
#  Artist         Album          Year  Members                  
#  <chr>          <chr>          <chr> <chr>                    
#1 Beatles        Sgt. Pepper's  1967  George, Ringo, Paul, John
#2 Rolling Stones Sticky Fingers 1971  NA                       
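
If Year should stay numeric, a possible variation is to collapse `df2` on its own first and join afterwards, so that only Members is ever converted to a string. This is a sketch under the assumption that `df1` and `df2` are as in the question; `members` and `res2` are illustrative names:

```r
library(dplyr)

# Data as given in the question
df1 <- data.frame(
  Artist = c("Beatles", "Rolling Stones"),
  Album = c("Sgt. Pepper's", "Sticky Fingers"),
  Year = c(1967, 1971),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Artist = rep("Beatles", 4),
  Members = c("George", "Ringo", "Paul", "John"),
  stringsAsFactors = FALSE
)

# One row per artist, members collapsed into a single string
members <- df2 %>%
  group_by(Artist) %>%
  summarise(Members = toString(Members))

res2 <- df1 %>%
  left_join(members, by = "Artist") %>%
  # Artists without a match get NA from the join; replace it with ""
  mutate(Members = coalesce(Members, ""))
```

Since the aggregation happens before the join, the columns of `df1` are left untouched and the only NA to deal with is the one the join introduces.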

Answer 2 (score: 2)

Using data.table:

library(data.table)
setDT(df2)[df1, on = .(Artist)][, .(members = toString(Members)),
   .(Artist, Album, Year)]
#           Artist          Album Year                   members
#1:        Beatles  Sgt. Pepper's 1967 George, Ringo, Paul, John
#2: Rolling Stones Sticky Fingers 1971                        NA
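
The same idea of dropping NA before collapsing should carry over to data.table if a blank is preferred. The following is a sketch only; it builds fresh tables `dt1` and `dt2` (illustrative names) from the question's data instead of converting `df2` in place with `setDT()`:

```r
library(data.table)

# Data as given in the question, created directly as data.tables
dt1 <- data.table(
  Artist = c("Beatles", "Rolling Stones"),
  Album = c("Sgt. Pepper's", "Sticky Fingers"),
  Year = c(1967, 1971)
)
dt2 <- data.table(
  Artist = rep("Beatles", 4),
  Members = c("George", "Ringo", "Paul", "John")
)

# Join dt2 onto every row of dt1, then collapse the members per album,
# skipping the NA that unmatched artists receive from the join
res3 <- dt2[dt1, on = .(Artist)][
  , .(members = toString(Members[!is.na(Members)])),
  by = .(Artist, Album, Year)
]
```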

Answer 3 (score: 0)

My package safejoin lets you aggregate the joined table by the join variables as part of the join:

# devtools::install_github("moodymudskipper/safejoin")
library(safejoin)
library(dplyr)
df1 %>% eat(df2, .agg = toString)
# Joining, by = "Artist"
#           Artist          Album Year                   Members
# 1        Beatles  Sgt. Pepper's 1967 George, Ringo, Paul, John
# 2 Rolling Stones Sticky Fingers 1971                      <NA>