spread() with non-unique values: putting duplicates into new columns

Time: 2019-06-21 08:21:13

Tags: r dataframe reshape tidyr spread

I have some data that looks like this (code to input it is at the end):

#>           artist          album year  source                     id
#> 1        Beatles  Sgt. Pepper's 1967  amazon             B0025KVLTM
#> 2        Beatles  Sgt. Pepper's 1967 spotify 6QaVfG1pHYl1z15ZxkvVDW
#> 3        Beatles  Sgt. Pepper's 1967  amazon             B06WGVMLJY
#> 4 Rolling Stones Sticky Fingers 1971 spotify 29m6DinzdaD0OPqWKGyMdz

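I want to tidy up the `id` column, which holds IDs from multiple sources (as indicated by the `source` column).

This should be a simple spread(), but the complication is that we sometimes get duplicate IDs from the exact same source: see rows 1 and 3 above.

Is there a simple way to spread() and put the duplicate IDs into new columns?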

My desired result is:


#>           artist          album year  source  amazon_id amazon_id_2
#> 1        Beatles  Sgt. Pepper's 1967  amazon B0025KVLTM  B06WGVMLJY
#> 2 Rolling Stones Sticky Fingers 1971 spotify       <NA>        <NA>
#>                  spotify
#> 1 6QaVfG1pHYl1z15ZxkvVDW
#> 2 29m6DinzdaD0OPqWKGyMdz

Code to input the example data:

df <- data.frame(stringsAsFactors=FALSE,
      artist = c("Beatles", "Beatles", "Beatles", "Rolling Stones"),
       album = c("Sgt. Pepper's", "Sgt. Pepper's", "Sgt. Pepper's",
                 "Sticky Fingers"),
        year = c(1967, 1967, 1967, 1971),
      source = c("amazon", "spotify", "amazon", "spotify"),
          id = c("B0025KVLTM", "6QaVfG1pHYl1z15ZxkvVDW", "B06WGVMLJY",
                 "29m6DinzdaD0OPqWKGyMdz")
)
df

4 answers:

Answer 0 (score: 3)

This can be done in one (longish) line using dcast from data.table. Still, I think it is elegant.

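library(data.table)
# as.data.table() makes sure data.table's own dcast method is used; value.var names the column to spread
dcast(as.data.table(df), artist + album + year ~ paste(source, rowid(artist, source), sep = "_"), value.var = "id")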
#          artist          album year   amazon_1   amazon_2              spotify_1
#1        Beatles  Sgt. Pepper's 1967 B0025KVLTM B06WGVMLJY 6QaVfG1pHYl1z15ZxkvVDW
#2 Rolling Stones Sticky Fingers 1971       <NA>       <NA> 29m6DinzdaD0OPqWKGyMdz

Answer 1 (score: 2)

One possibility is:

library(dplyr)
library(tidyr)

df %>%
 group_by(artist, album, year, source) %>%
 mutate(source2 = paste(source, row_number(), sep = "_")) %>%
 spread(source2, id) %>%
 ungroup()

  artist         album           year source  amazon_1   amazon_2   spotify_1             
  <chr>          <chr>          <dbl> <chr>   <chr>      <chr>      <chr>                 
1 Beatles        Sgt. Pepper's   1967 amazon  B0025KVLTM B06WGVMLJY <NA>                  
2 Beatles        Sgt. Pepper's   1967 spotify <NA>       <NA>       6QaVfG1pHYl1z15ZxkvVDW
3 Rolling Stones Sticky Fingers  1971 spotify <NA>       <NA>       29m6DinzdaD0OPqWKGyMdz

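Note that the output here has three rows: `source` is still one of the columns, and spotify is a distinct "source" for the Beatles album, so that row cannot be merged with the amazon row.

However, if you want two rows, you can do the following: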

df %>%
 group_by(artist, album, year, source) %>%
 mutate(source2 = paste(source, row_number(), sep = "_")) %>%
 ungroup() %>%
 select(-source) %>%
 spread(source2, id) 

  artist         album           year amazon_1   amazon_2   spotify_1             
  <chr>          <chr>          <dbl> <chr>      <chr>      <chr>                 
1 Beatles        Sgt. Pepper's   1967 B0025KVLTM B06WGVMLJY 6QaVfG1pHYl1z15ZxkvVDW
2 Rolling Stones Sticky Fingers  1971 <NA>       <NA>       29m6DinzdaD0OPqWKGyMdz

If you also want the `source` column:

df %>%
 group_by(artist, album, year, source) %>%
 mutate(source2 = paste(source, row_number(), sep = "_")) %>%
 group_by(artist, album, year) %>%
 mutate(source = toString(unique(source))) %>%
 spread(source2, id) %>%
 ungroup()

  artist         album           year source          amazon_1  amazon_2  spotify_1            
  <chr>          <chr>          <dbl> <chr>           <chr>     <chr>     <chr>                
1 Beatles        Sgt. Pepper's   1967 amazon, spotify B0025KVL… B06WGVML… 6QaVfG1pHYl1z15ZxkvV…
2 Rolling Stones Sticky Fingers  1971 spotify         <NA>      <NA>      29m6DinzdaD0OPqWKGyM…
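
As a hedged aside: in tidyr 1.0.0 and later, spread() is superseded by pivot_wider(), and the two-row variant above could be sketched as follows. This assumes tidyr >= 1.0.0 is installed; the counter column `n` and the result name `wide` are names chosen here for illustration, not part of the original answer.

```r
library(dplyr)
library(tidyr)

# Example data from the question
df <- data.frame(
  artist = c("Beatles", "Beatles", "Beatles", "Rolling Stones"),
  album  = c("Sgt. Pepper's", "Sgt. Pepper's", "Sgt. Pepper's", "Sticky Fingers"),
  year   = c(1967, 1967, 1967, 1971),
  source = c("amazon", "spotify", "amazon", "spotify"),
  id     = c("B0025KVLTM", "6QaVfG1pHYl1z15ZxkvVDW", "B06WGVMLJY",
             "29m6DinzdaD0OPqWKGyMdz"),
  stringsAsFactors = FALSE
)

wide <- df %>%
  # number the IDs within each album/source combination (n is an illustrative name)
  group_by(artist, album, year, source) %>%
  mutate(n = row_number()) %>%
  ungroup() %>%
  # new column names are built from source and n, joined with "_" by default
  pivot_wider(
    id_cols = c(artist, album, year),
    names_from = c(source, n),
    values_from = id
  )

wide
```

Because `id_cols` lists only artist, album and year, the `source` column does not split the Beatles album across rows, which is the same effect as the select(-source) step above.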

Answer 2 (score: 2)

This is also possible in base R using ave and reshape.

# Append a running count per source, e.g. "amazon_1", "amazon_2"
df$source <- with(df, paste(source, 
                            ave(artist, source, FUN=function(i) 
                              cumsum(duplicated(i)) + 1), sep="_"))
reshape(df, timevar="source", idvar=c("artist", "album", "year"), direction="wide")
#           artist          album year id.amazon_1           id.spotify_1 id.amazon_2 id.amazon_3
# 1        Beatles  Sgt. Pepper's 1967  B0025KVLTM 6QaVfG1pHYl1z15ZxkvVDW  B06WGVMLJY     SoMeFoO
# 4 Rolling Stones Sticky Fingers 1971        <NA> 29m6DinzdaD0OPqWKGyMdz        <NA>        <NA>

Data

df <- structure(list(artist = c("Beatles", "Beatles", "Beatles", "Rolling Stones"
), album = c("Sgt. Pepper's", "Sgt. Pepper's", "Sgt. Pepper's", 
"Sticky Fingers"), year = c(1967, 1967, 1967, 1971), source = c("amazon", 
"spotify", "amazon", "spotify"), id = c("B0025KVLTM", "6QaVfG1pHYl1z15ZxkvVDW", 
"B06WGVMLJY", "29m6DinzdaD0OPqWKGyMdz")), class = "data.frame", row.names = c(NA, 
-4L))
df <- rbind(df, df[1, ])
df[5, 5] <- "SoMeFoO"
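
A minimal variant of the same base R idea, offered as a sketch rather than as the answer's own method: the counter is computed with ave() and seq_along over the full artist/album/year/source combination, and the original `source` column is left untouched. The names `n`, `key`, `long` and `wide_base` are hypothetical helpers introduced here, and the block redefines `df` with the four-row data from the question so that it runs on its own.

```r
# Example data from the question (four rows, without the extra "SoMeFoO" row)
df <- data.frame(
  artist = c("Beatles", "Beatles", "Beatles", "Rolling Stones"),
  album  = c("Sgt. Pepper's", "Sgt. Pepper's", "Sgt. Pepper's", "Sticky Fingers"),
  year   = c(1967, 1967, 1967, 1971),
  source = c("amazon", "spotify", "amazon", "spotify"),
  id     = c("B0025KVLTM", "6QaVfG1pHYl1z15ZxkvVDW", "B06WGVMLJY",
             "29m6DinzdaD0OPqWKGyMdz"),
  stringsAsFactors = FALSE
)

# Running count within each artist/album/year/source combination
n <- with(df, ave(seq_along(id), artist, album, year, source, FUN = seq_along))

# key is an illustrative helper column: source plus counter, e.g. "amazon_2"
long <- data.frame(df[c("artist", "album", "year", "id")],
                   key = paste(df$source, n, sep = "_"),
                   stringsAsFactors = FALSE)

# One row per album; reshape() prefixes the new columns with "id."
wide_base <- reshape(long, timevar = "key",
                     idvar = c("artist", "album", "year"),
                     direction = "wide")
wide_base
```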

Answer 3 (score: 1)

Here is one way to do it.

library(dplyr)
library(tidyr)

df %>% 
  group_by(artist,source) %>%  
  mutate(rownum = row_number()) %>% 
  unite(source, source, rownum, sep="_") %>% 
  spread(source,id)

# A tibble: 2 x 6
# Groups:   artist [2]
  artist         album           year amazon_1   amazon_2   spotify_1             
  <chr>          <chr>          <dbl> <chr>      <chr>      <chr>                 
1 Beatles        Sgt. Pepper's   1967 B0025KVLTM B06WGVMLJY 6QaVfG1pHYl1z15ZxkvVDW
2 Rolling Stones Sticky Fingers  1971 NA         NA         29m6DinzdaD0OPqWKGyMdz