Using str_split and removing duplicate values within the same record in R

Time: 2018-06-29 20:12:57

Tags: r stringr

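I have a data frame of book genres. It starts out with two columns: one with the title and one with a string containing multiple genres, something like this: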

library(tibble)

titles <- c("Harry Potter 1", "To Kill A Mockingbird", "The Hunger Games 1")
genres <- c("Fantasy, Young Adult, Fantasy, Magic", "Classics, Fiction, Historical, Historical Fiction, Academic", "Young Adult, Fiction, Science Fiction, Dystopia, Science Fiction")
books <- tibble(
  title = titles,
  genre = genres)
books

# A tibble: 3 x 2
  title                 genre                                                           
  <chr>                 <chr>                                                           
1 Harry Potter 1        Fantasy, Young Adult, Fantasy, Magic                            
2 To Kill A Mockingbird Classics, Fiction, Historical, Historical Fiction, Academic     
3 The Hunger Games 1    Young Adult, Fiction, Science Fiction, Dystopia, Science Fiction  

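Currently the genres are ordered by the number of people who shelved the book under that genre. I want to split the genre string into multiple columns indicating the primary genre, secondary genre, and so on, but with duplicates removed. Splitting the genres into multiple columns is easy enough, and I'm sure there is some way to make a function like unique() work row-wise and drop the duplicates, but I'm stuck. The desired output would look like this: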

# A tibble: 3 x 6
  title                                genre1      genre2      genre3          genre4             genre5  
  <chr>                                <chr>       <chr>       <chr>           <chr>              <chr>   
1 Harry Potter and the Sorcerors Stone Fantasy     Young Adult Magic           NA                 NA      
2 To Kill A Mockingbird                Classics    Fiction     Historical      Historical Fiction Academic
3 The Hunger Games                     Young Adult Fiction     Science Fiction Dystopia           NA  

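5 answers:

Answer 0 (score: 1)

You can do this with stringr::str_split to create a list of genres. genre then becomes a list-column of character vectors, which you can unnest and then reduce to the distinct observations.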

library(tidyverse)

books %>%
  mutate(genre = str_split(genre, ", ")) %>%
  unnest(genre) %>%
  distinct()
#> # A tibble: 12 x 2
#>    title                 genre             
#>    <chr>                 <chr>             
#>  1 Harry Potter 1        Fantasy           
#>  2 Harry Potter 1        Young Adult       
#>  3 Harry Potter 1        Magic             
#>  4 To Kill A Mockingbird Classics          
#>  5 To Kill A Mockingbird Fiction           
#>  6 To Kill A Mockingbird Historical        
#>  7 To Kill A Mockingbird Historical Fiction
#>  8 To Kill A Mockingbird Academic          
#>  9 The Hunger Games 1    Young Adult       
#> 10 The Hunger Games 1    Fiction           
#> 11 The Hunger Games 1    Science Fiction   
#> 12 The Hunger Games 1    Dystopia

A shortcut I often forget here is separate_rows, which does the splitting and unnesting in one step:

books %>%
  separate_rows(genre, sep = ", ") %>%
  distinct()

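This is equivalent to the previous block.

To get this into wide format, you can use tidyr::spread. To create the column names "genre1", "genre2", etc. dynamically, I group by title and then number the unique genres within each title. This way you don't need to know in advance how many genre columns are required, as you would if you used tidyr::separate to split the column.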
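One way to check that claim is to run both pipelines and compare the results. This is a minimal sketch, assuming dplyr, tidyr and stringr are installed; it rebuilds the books tibble from the question so it runs on its own, and the names via_unnest, via_separate_rows and same_result are only illustrative.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Rebuild the example data from the question
books <- tibble(
  title = c("Harry Potter 1", "To Kill A Mockingbird", "The Hunger Games 1"),
  genre = c(
    "Fantasy, Young Adult, Fantasy, Magic",
    "Classics, Fiction, Historical, Historical Fiction, Academic",
    "Young Adult, Fiction, Science Fiction, Dystopia, Science Fiction"
  )
)

# Route 1: split into a list-column, unnest, drop duplicate rows
via_unnest <- books %>%
  mutate(genre = str_split(genre, ", ")) %>%
  unnest(genre) %>%
  distinct()

# Route 2: separate_rows splits and unnests in one call
via_separate_rows <- books %>%
  separate_rows(genre, sep = ", ") %>%
  distinct()

# Compare as plain data frames so only the contents are compared
same_result <- isTRUE(all.equal(
  as.data.frame(via_unnest),
  as.data.frame(via_separate_rows)
))
same_result
```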

books %>%
  mutate(genre = str_split(genre, ", ")) %>%
  unnest(genre) %>%
  distinct() %>%
  group_by(title) %>%
  mutate(num = row_number() %>% paste0("genre", .)) %>%
  spread(key = num, value = genre)
#> # A tibble: 3 x 6
#> # Groups:   title [3]
#>   title                 genre1      genre2      genre3    genre4    genre5
#>   <chr>                 <chr>       <chr>       <chr>     <chr>     <chr> 
#> 1 Harry Potter 1        Fantasy     Young Adult Magic     <NA>      <NA>  
#> 2 The Hunger Games 1    Young Adult Fiction     Science … Dystopia  <NA>  
#> 3 To Kill A Mockingbird Classics    Fiction     Historic… Historic… Acade…

Created on 2018-06-29 by the reprex package (v0.2.0).

Answer 1 (score: 1)
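In tidyr 1.0.0 and later, spread is superseded by pivot_wider. The following is a sketch of the same reshaping under the assumption that such a version is installed; it is not part of the original answer, and the object name wide is illustrative. Unlike spread, pivot_wider keeps the titles in their order of first appearance instead of sorting them.

```r
library(dplyr)
library(tidyr)

# Rebuild the example data from the question
books <- tibble(
  title = c("Harry Potter 1", "To Kill A Mockingbird", "The Hunger Games 1"),
  genre = c(
    "Fantasy, Young Adult, Fantasy, Magic",
    "Classics, Fiction, Historical, Historical Fiction, Academic",
    "Young Adult, Fiction, Science Fiction, Dystopia, Science Fiction"
  )
)

wide <- books %>%
  separate_rows(genre, sep = ", ") %>%   # one row per title/genre pair
  distinct() %>%                         # drop repeated genres within a title
  group_by(title) %>%
  mutate(num = row_number()) %>%         # rank of each unique genre per title
  ungroup() %>%
  pivot_wider(
    names_from = num,
    values_from = genre,
    names_prefix = "genre"               # yields genre1, genre2, ...
  )

wide
```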

You can remove the non-unique genres before using separate.
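The code for this answer is not preserved here, so the following is only one possible sketch of the idea, assuming dplyr and tidyr are installed. The helper dedupe_genres and the names n_cols and wide are hypothetical, introduced purely for illustration.

```r
library(dplyr)
library(tidyr)

# Rebuild the example data from the question
books <- tibble(
  title = c("Harry Potter 1", "To Kill A Mockingbird", "The Hunger Games 1"),
  genre = c(
    "Fantasy, Young Adult, Fantasy, Magic",
    "Classics, Fiction, Historical, Historical Fiction, Academic",
    "Young Adult, Fiction, Science Fiction, Dystopia, Science Fiction"
  )
)

# Hypothetical helper: drop repeated genres inside each string,
# keeping the first occurrence of each one
dedupe_genres <- function(x) {
  vapply(
    strsplit(x, ", ", fixed = TRUE),
    function(g) paste(unique(g), collapse = ", "),
    character(1)
  )
}

deduped <- dedupe_genres(books$genre)

# separate needs the number of target columns up front
n_cols <- max(lengths(strsplit(deduped, ", ", fixed = TRUE)))

wide <- books %>%
  mutate(genre = deduped) %>%
  separate(
    genre,
    into = paste0("genre", seq_len(n_cols)),
    sep = ", ",
    fill = "right"   # pad shorter rows with NA on the right
  )

wide
```

The trade-off compared with the long-format approach is that the number of columns has to be computed before calling separate.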


Answer 2 (score: 1)

Here is a solution using data.table and base R.

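library(data.table)
setDT(books)  # convert books to a data.table by reference

# Split each genre string by title (the result column is named V1), then drop duplicate rows
books = unique(books[, strsplit(genre, ", "), by = title])
# Number the remaining genres within each title: genre_1, genre_2, ...
books[, genre:= paste0("genre_", seq_along(V1)), by = title]
# Reshape to wide format, one column per genre rank
dcast(books, title ~ genre, value.var = "V1")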
#                    title     genre_1     genre_2         genre_3            genre_4  genre_5
# 1:        Harry Potter 1     Fantasy Young Adult           Magic               <NA>     <NA>
# 2:    The Hunger Games 1 Young Adult     Fiction Science Fiction           Dystopia     <NA>
# 3: To Kill A Mockingbird    Classics     Fiction      Historical Historical Fiction Academic
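For comparison, the same split, deduplicate and widen steps can be sketched in base R alone, which is close to the row-wise unique() idea from the question. This is an illustrative sketch, not part of the original answer; the names genre_list, n_cols, genre_matrix and wide are made up for the example.

```r
# Rebuild the example data from the question as a plain data frame
books <- data.frame(
  title = c("Harry Potter 1", "To Kill A Mockingbird", "The Hunger Games 1"),
  genre = c(
    "Fantasy, Young Adult, Fantasy, Magic",
    "Classics, Fiction, Historical, Historical Fiction, Academic",
    "Young Adult, Fiction, Science Fiction, Dystopia, Science Fiction"
  ),
  stringsAsFactors = FALSE
)

# Split each string and apply unique() to every row's vector of genres
genre_list <- lapply(strsplit(books$genre, ", ", fixed = TRUE), unique)

# Widest row decides how many genre columns are needed
n_cols <- max(lengths(genre_list))

# Indexing past the end of a vector pads it with NA,
# so every row ends up with exactly n_cols entries
genre_matrix <- do.call(rbind, lapply(genre_list, function(g) g[seq_len(n_cols)]))
colnames(genre_matrix) <- paste0("genre", seq_len(n_cols))

wide <- data.frame(title = books$title, genre_matrix, stringsAsFactors = FALSE)
wide
```

This version keeps the titles in their original row order, whereas dcast sorts them.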

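Answer 3 (score: 1)

We can paste the columns together and use some data.table::fread magic, then rename our fields. Note that, as the output shows, this only splits the genres into columns; the duplicates are still present.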

library(data.table)
dt <- fread(paste(books$title, books$genre, sep=", ",collapse="\n"),header = FALSE,fill=TRUE,sep=",")
setNames(as.data.frame(dt),c("title",paste0("genre",seq(ncol(dt)-1))))
#                   title      genre1      genre2          genre3             genre4          genre5
# 1        Harry Potter 1     Fantasy Young Adult         Fantasy              Magic                
# 2 To Kill A Mockingbird    Classics     Fiction      Historical Historical Fiction        Academic
# 3    The Hunger Games 1 Young Adult     Fiction Science Fiction           Dystopia Science Fiction

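Answer 4 (score: 0)

It is generally easier to work with data in long format. So one option is to use tidyr::gather to reshape the data into long format, then remove the duplicates before spreading it back to wide format.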

library(tidyverse)
library(splitstackshape)

books %>% cSplit("genre") %>% mutate_if(is.factor, as.character) %>%
  gather(key, value, - title) %>% distinct(title, value) %>%
  group_by(title) %>%
  mutate(key = paste0("genre",row_number())) %>%
  spread(key, value) %>% as.data.frame()

#                   title      genre1      genre2          genre3             genre4   genre5
# 1        Harry Potter 1     Fantasy Young Adult           Magic               <NA>     <NA>
# 2    The Hunger Games 1 Young Adult     Fiction Science Fiction           Dystopia     <NA>
# 3 To Kill A Mockingbird    Classics     Fiction      Historical Historical Fiction Academic