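I have a dataset in which one column is genre (chr), with values like "Drama | Musical | Crime". I need to split this value and create a new row for each entry. For example, a value that holds three genres should become three rows, with all the other columns of the data frame repeated on each of them.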
imdbId <- "tt0118578"
title <-"Albela"
releaseYear<- 2010
releaseDate <- "2-12-2010"
genre <- "Adventure | Drama | Musical"
writers <- "Ashutosh Gowariker (story) | Ashutosh Gowariker (screenplay) | Kumar Dave (screenplay) | Sanjay Dayma (screenplay) | K.P. Saxena (dialogue)"
actors <-"Aamir Khan | Gracy Singh | Rachel Shelley | Paul Blackthorne"
directors<-"Ashutosh Gowariker"
sequel <-"No"
hitFlop <-2
df <- data.frame(imdbId, title, releaseYear, releaseDate, genre,
                 writers, actors, directors, sequel, hitFlop,
                 stringsAsFactors = FALSE)
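This is the structure (str) of the data frame. Now I need to split the data and create a separate entry for each movie per individual genre value.
Answer 0: (score: 0)
Something like this might work:
Data: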
multiChar<-
"tt0169102
Lagaan: Once Upon a Time in India
2001
08 May 2002
Adventure | Drama | Musical
Ashutosh Gowariker (story) | Ashutosh Gowariker (screenplay) | Kumar Dave (screenplay) | Sanjay Dayma (screenplay) | K.P. Saxena (dialogue)
Aamir Khan | Gracy Singh | Rachel Shelley | Paul Blackthorne
Ashutosh Gowariker
0
6"
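Code:
library(magrittr)
# Match the whole line that contains at least one known genre followed by a separator
patterni <- "(?i)(?<=\\n).*(adventure|drama|musical)(\\s+?(\\|)?\\s+?).*(?=\\n)"
# Extract the genre line, split it on "|", flatten to a vector and trim whitespace
getGenres <- stringr::str_extract(multiChar, patterni) %>%
  stringr::str_split("\\|", simplify = TRUE) %>% c %>% trimws
# Build one copy of the record per genre, replacing the genre line with that genre
result <- purrr::map(getGenres, ~sub(patterni, ., multiChar, perl = TRUE))
Result: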
lapply(result,cat)
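Note:
You may have to come up with a more precise pattern for patterni.
The pattern below takes the fifth line (the genre). If your genre always sits on the fifth line, this is your pattern.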
patterni <- "^(.*?\\n){4}.*(?=\\n)"
getGenres <- stringr::str_extract(multiChar, patterni) %>% sub(".*\\n", "", .) %>%
  stringr::str_split("\\|", simplify = TRUE) %>% c %>% trimws
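If the genre really is always on the fifth line, the same idea can be written without a regular expression by splitting the record into lines first. This is only a sketch in base R: the record below is an abridged copy of the data above (the writers and actors lines are left out for brevity), and the names record, lines and expanded are made up for the illustration.

```r
# Abridged copy of the record above; the genre is still on line 5 (assumption)
record <- paste(
  "tt0169102",
  "Lagaan: Once Upon a Time in India",
  "2001",
  "08 May 2002",
  "Adventure | Drama | Musical",
  "Ashutosh Gowariker",
  "0",
  "6",
  sep = "\n"
)

# Split the record into its lines and take the fifth one as the genre line
lines <- strsplit(record, "\n", fixed = TRUE)[[1]]
genres <- trimws(strsplit(lines[5], "|", fixed = TRUE)[[1]])

# Build one copy of the record per genre, with line 5 replaced by that genre
expanded <- vapply(genres, function(g) {
  out <- lines
  out[5] <- g
  paste(out, collapse = "\n")
}, character(1))

# Print the expanded records separated by blank lines
cat(expanded, sep = "\n\n")
```

Splitting by position is easier to read than the pattern, but it breaks as soon as a record has a missing or extra line, so it only fits data with a fixed layout.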
Answer 1: (score: 0)
Answering a question is easy... if the question is well framed. No code was provided, so let's assume a data frame:
title <- "Lagaan: Once Upon a Time in India"
year <- 2001
genre <- "Adventure | Drama | Musical"
df <- data.frame(title, year, genre, stringsAsFactors=FALSE)
Add or copy as many rows as you need, then replace the values in the genre column accordingly.
For a single vector of genre names:
genres <- strsplit(df$genre, " \\| ")[[1]]
For a list of genre-name vectors (one per row):
genres <- strsplit(df$genre, " \\| ")
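The "copy rows, then replace the genre" step described above can be sketched in base R as follows. The second movie ("Example Movie") is a made-up row added only to show that rows with a single genre are handled too, and long is a hypothetical name for the result.

```r
# Assumed sample data; the second row is invented for illustration
df <- data.frame(
  title = c("Lagaan: Once Upon a Time in India", "Example Movie"),
  year  = c(2001, 2010),
  genre = c("Adventure | Drama | Musical", "Romance"),
  stringsAsFactors = FALSE
)

# One character vector of genres per row
genres <- strsplit(df$genre, " \\| ")

# Repeat each row once per genre it contains ...
long <- df[rep(seq_len(nrow(df)), lengths(genres)), ]
# ... and overwrite the genre column with the individual genres
long$genre <- unlist(genres)
rownames(long) <- NULL

print(long)
```

Because rep() repeats row i exactly lengths(genres)[i] times, the flattened genre vector lines up with the repeated rows.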
Answer 2: (score: 0)
I wrote a function that splits a column with stringr, given a pattern and a name prefix for the resulting columns.
split_into_multiple <- function(column, pattern = ", ", into_prefix) {
  # Split every string into as many pieces as needed; shorter rows are padded with ""
  cols <- stringr::str_split_fixed(column, pattern, n = Inf)
  # Replace the ""s produced by filling the matrix to the right with NAs, which are more useful
  cols[which(cols == "")] <- NA
  # Name the columns 'into_prefix_1', 'into_prefix_2', ..., 'into_prefix_m',
  # where m is the number of columns of 'cols'
  m <- dim(cols)[2]
  colnames(cols) <- paste(into_prefix, 1:m, sep = "_")
  # as_tibble() replaces the deprecated as.tibble()
  tibble::as_tibble(cols)
}
Then we can use split_into_multiple in a dplyr pipeline like this:
after <- BollywoodMovieDetail %>%
bind_cols(split_into_multiple(.$genre,"\\|", "genre")) %>%
# selecting those that start with 'genre_' will remove the original 'genre' column
select(imdbId, starts_with("genre_"))
> after
# A tibble: 1,284 x 4
imdbId genre_1 genre_2 genre_3
<chr> <chr> <chr> <chr>
1 tt0118578 Romance NA NA
2 tt0169102 "Adventure " " Drama " " Musical"
3 tt0187279 "Action " " Comedy" NA
4 tt0222024 "Drama " " Romance" NA
# ... with 1,274 more rows
Then we can tidy the result with gather...
> after %>%
+ gather(key, val, -imdbId, na.rm = TRUE)
# A tibble: 2,826 x 3
imdbId key val
* <chr> <chr> <chr>
1 tt0118578 genre_1 Romance
2 tt0169102 genre_1 "Adventure "
3 tt0187279 genre_1 "Action "
4 tt0222024 genre_1 "Drama "
5 tt0227194 genre_1 "Action "
# ... with 2,816 more rows
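As an alternative to the split-then-gather route above, tidyr also has separate_rows(), which goes straight from one delimited column to one row per value. The sketch below is not part of the original answer: it assumes tidyr is installed, and the small movies data frame simply reuses two imdbId/genre pairs from the output shown above.

```r
library(tidyr)

# Two rows taken from the output above, rebuilt by hand for the example
movies <- data.frame(
  imdbId = c("tt0118578", "tt0169102"),
  genre  = c("Romance", "Adventure | Drama | Musical"),
  stringsAsFactors = FALSE
)

# Split on "|" together with any surrounding whitespace, so values come out trimmed
long <- separate_rows(movies, genre, sep = "\\s*\\|\\s*")

print(long)
```

Including the whitespace in the separator avoids the padded values such as "Adventure " that appear in the gather output, and every other column is repeated automatically for each new row.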