I have a data frame of books. It starts with two columns, one with the title and one with a string containing multiple genres, like this:
library(tibble)
titles <- c("Harry Potter 1", "To Kill A Mockingbird", "The Hunger Games 1")
genres <- c("Fantasy, Young Adult, Fantasy, Magic", "Classics, Fiction, Historical, Historical Fiction, Academic", "Young Adult, Fiction, Science Fiction, Dystopia, Science Fiction")
books <- tibble(
  title = titles,
  genre = genres)
books
# A tibble: 3 x 2
  title                 genre
  <chr>                 <chr>
1 Harry Potter 1        Fantasy, Young Adult, Fantasy, Magic
2 To Kill A Mockingbird Classics, Fiction, Historical, Historical Fiction, Academic
3 The Hunger Games 1    Young Adult, Fiction, Science Fiction, Dystopia, Science Fiction
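Currently, the genres are ordered by the number of people who classified the book as that genre. I want to split the genre string into multiple columns to indicate the primary genre, the secondary genre, and so on, but with duplicates removed. Splitting the genres into multiple columns is easy enough, and I'm sure there is some way to make a function like unique() work row-wise and drop the duplicates, but I'm stuck. The desired output would be something like this: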
# A tibble: 3 x 6
  title                 genre1      genre2      genre3          genre4             genre5
  <chr>                 <chr>       <chr>       <chr>           <chr>              <chr>
1 Harry Potter 1        Fantasy     Young Adult Magic           NA                 NA
2 To Kill A Mockingbird Classics    Fiction     Historical      Historical Fiction Academic
3 The Hunger Games 1    Young Adult Fiction     Science Fiction Dystopia           NA
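The row-wise unique() idea mentioned in the question can be sketched in base R roughly as follows. This is only a minimal illustration on the sample data; the name unique_genres is made up for the example. It relies on unique() keeping the first occurrence of each value, so the popularity ordering is preserved.

```r
# Same sample genre strings as in the question, rebuilt so the sketch is
# self-contained.
genres <- c(
  "Fantasy, Young Adult, Fantasy, Magic",
  "Classics, Fiction, Historical, Historical Fiction, Academic",
  "Young Adult, Fiction, Science Fiction, Dystopia, Science Fiction"
)

# strsplit() returns one character vector per row; applying unique() to each
# vector drops repeats while keeping the first-seen order.
# unique_genres is a hypothetical name used only in this sketch.
unique_genres <- lapply(strsplit(genres, ", ", fixed = TRUE), unique)

# Number of distinct genres per title.
lengths(unique_genres)
```

The result is still a list with one vector per title, so a further step is needed to spread it into genre1, genre2, ... columns.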
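Answer 0 (score: 1)
You can do this with stringr::str_split to create a list of genres. genre then becomes a list-column of character vectors, which you can unnest and then reduce to distinct observations.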
library(tidyverse)
books %>%
  mutate(genre = str_split(genre, ", ")) %>%
  unnest(genre) %>%
  distinct()
#> # A tibble: 12 x 2
#> title genre
#> <chr> <chr>
#> 1 Harry Potter 1 Fantasy
#> 2 Harry Potter 1 Young Adult
#> 3 Harry Potter 1 Magic
#> 4 To Kill A Mockingbird Classics
#> 5 To Kill A Mockingbird Fiction
#> 6 To Kill A Mockingbird Historical
#> 7 To Kill A Mockingbird Historical Fiction
#> 8 To Kill A Mockingbird Academic
#> 9 The Hunger Games 1 Young Adult
#> 10 The Hunger Games 1 Fiction
#> 11 The Hunger Games 1 Science Fiction
#> 12 The Hunger Games 1 Dystopia
A shortcut that I often forget about here is separate_rows, which does the splitting and unnesting in one step:
books %>%
  separate_rows(genre, sep = ", ") %>%
  distinct()
This is equivalent to the previous block.
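To get this into wide format, you can use tidyr::spread. To create the column names "genre1", "genre2", etc. dynamically, I group by title and then number the unique genres within each title. This way you don't need to know in advance how many genre columns are required, as you would if you used tidyr::separate to split the column.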
books %>%
  mutate(genre = str_split(genre, ", ")) %>%
  unnest(genre) %>%
  distinct() %>%
  group_by(title) %>%
  mutate(num = row_number() %>% paste0("genre", .)) %>%
  spread(key = num, value = genre)
#> # A tibble: 3 x 6
#> # Groups: title [3]
#> title genre1 genre2 genre3 genre4 genre5
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 Harry Potter 1 Fantasy Young Adult Magic <NA> <NA>
#> 2 The Hunger Games 1 Young Adult Fiction Science … Dystopia <NA>
#> 3 To Kill A Mockingbird Classics Fiction Historic… Historic… Acade…
Created on 2018-06-29 by the reprex package (v0.2.0).
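In tidyr 1.0.0 and later, spread() is superseded by pivot_wider(). The following is a hedged sketch of the same pipeline using pivot_wider(), assuming a recent tidyr and dplyr are installed. The variable name wide is chosen only for this example.

```r
library(tibble)
library(dplyr)
library(tidyr)

# Sample data from the question, rebuilt so the sketch is self-contained.
books <- tibble(
  title = c("Harry Potter 1", "To Kill A Mockingbird", "The Hunger Games 1"),
  genre = c(
    "Fantasy, Young Adult, Fantasy, Magic",
    "Classics, Fiction, Historical, Historical Fiction, Academic",
    "Young Adult, Fiction, Science Fiction, Dystopia, Science Fiction"
  )
)

wide <- books %>%
  separate_rows(genre, sep = ", ") %>%          # one row per title/genre pair
  distinct() %>%                                # drop repeated genres per title
  group_by(title) %>%
  mutate(num = paste0("genre", row_number())) %>%  # genre1, genre2, ... per title
  ungroup() %>%
  pivot_wider(names_from = num, values_from = genre)

wide
```

Titles with fewer distinct genres than the widest title get NA in the trailing columns.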
Answer 1 (score: 1)
You can remove the non-unique genres before using separate.
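The code for this answer is not preserved here, so the following is only one possible reading of the idea: deduplicate each genre string first, then split it with tidyr::separate. The helper dedupe_genres and the variables n_cols and wide are hypothetical names introduced for this sketch.

```r
library(tibble)
library(tidyr)

# Sample data from the question, rebuilt so the sketch is self-contained.
books <- tibble(
  title = c("Harry Potter 1", "To Kill A Mockingbird", "The Hunger Games 1"),
  genre = c(
    "Fantasy, Young Adult, Fantasy, Magic",
    "Classics, Fiction, Historical, Historical Fiction, Academic",
    "Young Adult, Fiction, Science Fiction, Dystopia, Science Fiction"
  )
)

# Hypothetical helper: collapse each genre string to its unique entries,
# keeping the first-seen order.
dedupe_genres <- function(x) {
  vapply(
    strsplit(x, ", ", fixed = TRUE),
    function(g) paste(unique(g), collapse = ", "),
    character(1)
  )
}

books$genre <- dedupe_genres(books$genre)

# separate() needs the target column names up front, so derive the count
# from the title with the most distinct genres.
n_cols <- max(lengths(strsplit(books$genre, ", ", fixed = TRUE)))

# fill = "right" pads shorter rows with NA on the right without a warning.
wide <- separate(
  books, genre,
  into = paste0("genre", seq_len(n_cols)),
  sep = ", ", fill = "right"
)

wide
```

Unlike the long-format approach, this version has to compute the number of genre columns before splitting.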
Answer 2 (score: 1)
Here is a solution using data.table and base R.
library(data.table)
setDT(books)
books = unique(books[, strsplit(genre, ", "), by = title])
books[, genre:= paste0("genre_", seq_along(V1)), by = title]
dcast(books, title ~ genre, value.var = "V1")
# title genre_1 genre_2 genre_3 genre_4 genre_5
# 1: Harry Potter 1 Fantasy Young Adult Magic <NA> <NA>
# 2: The Hunger Games 1 Young Adult Fiction Science Fiction Dystopia <NA>
# 3: To Kill A Mockingbird Classics Fiction Historical Historical Fiction Academic
Answer 3 (score: 1)
We can paste the columns together and use some data.table::fread magic, then rename our fields.
library(data.table)
dt <- fread(paste(books$title, books$genre, sep=", ",collapse="\n"),header = FALSE,fill=TRUE,sep=",")
setNames(as.data.frame(dt),c("title",paste0("genre",seq(ncol(dt)-1))))
# title genre1 genre2 genre3 genre4 genre5
# 1 Harry Potter 1 Fantasy Young Adult Fantasy Magic
# 2 To Kill A Mockingbird Classics Fiction Historical Historical Fiction Academic
# 3 The Hunger Games 1 Young Adult Fiction Science Fiction Dystopia Science Fiction
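Answer 4 (score: 0)
It is always better to work with data in long format. So one option is to use tidyr::gather to reshape the data into long format and then remove the duplicates before converting the data back to wide format.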
library(tidyverse)
library(splitstackshape)
books %>%
  cSplit("genre") %>%
  mutate_if(is.factor, as.character) %>%
  gather(key, value, -title) %>%
  distinct(title, value) %>%
  group_by(title) %>%
  mutate(key = paste0("genre", row_number())) %>%
  spread(key, value) %>%
  as.data.frame()
# title genre1 genre2 genre3 genre4 genre5
# 1 Harry Potter 1 Fantasy Young Adult Magic <NA> <NA>
# 2 The Hunger Games 1 Young Adult Fiction Science Fiction Dystopia <NA>
# 3 To Kill A Mockingbird Classics Fiction Historical Historical Fiction Academic