Reading input from a csv file leaves me with an odd field that contains multiple values, e.g.
  Title                Genres
1     A [Item1, Item2, Item3]
2     B
3     C        [Item4, Item1]
df <- data.frame(c("A","B","C"), c("[Item1, Item2, Item3]","","[Item4, Item1]"),
stringsAsFactors = FALSE)
colnames(df) <- c("Title","Genres")
A function to extract the individual tokens:
extractGenre <- function(genreVector){
strsplit(substring(genreVector,2,nchar(genreVector)-1),", ")
}
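I'm a bit stuck on how to turn Item1, ... Item4 into factors and attach them to the data frame. apply lets me run the function on each row, but what would the next step be?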
Answer 0 (score: 1)
I'm not sure this is exactly what you're looking for, but I approached it differently, using dplyr and grepl:
df <- data.frame(c("A","B","C"), c("[Item1, Item2, Item3]","","[Item4, Item1]"),
stringsAsFactors = FALSE)
colnames(df) <- c("Title","Genres")
df
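library(dplyr)  # provides %>% and mutate()
# One logical column per item: TRUE where the token appears in Genres
df1 <- df %>%
  mutate(Item1 = ifelse(grepl("Item1", Genres), TRUE, FALSE),
         Item2 = ifelse(grepl("Item2", Genres), TRUE, FALSE),
         Item3 = ifelse(grepl("Item3", Genres), TRUE, FALSE),
         Item4 = ifelse(grepl("Item4", Genres), TRUE, FALSE))
df1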
  Title                Genres Item1 Item2 Item3 Item4
1     A [Item1, Item2, Item3]  TRUE  TRUE  TRUE FALSE
2     B                       FALSE FALSE FALSE FALSE
3     C        [Item4, Item1]  TRUE FALSE FALSE  TRUE
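If the item names are not known in advance, the columns could also be derived from the data instead of being typed out one by one. The following is only a sketch in base R under that assumption; the names `tokens`, `all_items`, `flags` and `result` are mine and not part of the original code.

```r
# Same example data as in the question
df <- data.frame(Title = c("A", "B", "C"),
                 Genres = c("[Item1, Item2, Item3]", "", "[Item4, Item1]"),
                 stringsAsFactors = FALSE)

# Strip the surrounding brackets, then split each row on ", "
tokens <- strsplit(gsub("^\\[|\\]$", "", df$Genres), ", ", fixed = TRUE)

# Every distinct token that occurs anywhere in the column, sorted
all_items <- sort(unique(unlist(tokens)))

# One logical column per token: TRUE where the row's token list contains it
flags <- sapply(all_items, function(item) {
  vapply(tokens, function(row_tokens) item %in% row_tokens, logical(1))
})

# Attach the indicator columns to the original data frame
result <- cbind(df, flags)
result
```

Because each row is compared against whole tokens with `%in%`, a new item in the csv simply produces one more column without touching the code.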
Hope this helps.
Answer 1 (score: 1)
library(dplyr)
library(tidyr)
df %>% mutate(Genres = gsub('\\[|\\]|\\s+', '', Genres)) %>%           # remove brackets and whitespace
  separate(Genres, paste0('Gen', 1:3), sep = ',', fill = 'right') %>%  # split into up to three columns, padding with NA
  gather(key, Genres, -Title) %>% select(-key) %>%                     # stack them back into one Genres column
  filter(!is.na(Genres)) %>% arrange(Title, Genres) %>%                # drop the NA padding, then sort
  mutate(Genres = as.factor(Genres))                                   # convert the tokens to a factor
  Title Genres
1     A  Item1
2     A  Item2
3     A  Item3
4     B
5     C  Item1
6     C  Item4
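For comparison, a similar long format can be built in base R without fixing the number of genres per row at three. This is a rough sketch rather than a drop-in replacement: the names `tokens` and `long` are made up for illustration, and unlike the output above, titles without any genre (here B) are dropped instead of being kept with an empty value.

```r
# Same example data as in the question
df <- data.frame(Title = c("A", "B", "C"),
                 Genres = c("[Item1, Item2, Item3]", "", "[Item4, Item1]"),
                 stringsAsFactors = FALSE)

# Strip the brackets and split each row into its tokens
tokens <- strsplit(gsub("^\\[|\\]$", "", df$Genres), ", ", fixed = TRUE)

# Repeat each title once per token and turn the tokens into a factor
long <- data.frame(
  Title  = rep(df$Title, lengths(tokens)),
  Genres = factor(unlist(tokens)),
  stringsAsFactors = FALSE
)
long
```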
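Answer 2 (score: 0)
You can use the function separate() as Uwe suggested, but your genres don't always seem to appear in the same order. One option is to create the new columns with mutate() and use grepl() to determine whether each token is present.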
df %>%
mutate(
Item1 = grepl('Item1', Genres),
Item2 = grepl('Item2', Genres),
Item3 = grepl('Item3', Genres),
Item4 = grepl('Item4', Genres)
)
#   Title                Genres Item1 Item2 Item3 Item4
# 1     A [Item1, Item2, Item3]  TRUE  TRUE  TRUE FALSE
# 2     B                       FALSE FALSE FALSE FALSE
# 3     C        [Item4, Item1]  TRUE FALSE FALSE  TRUE
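One thing to watch with grepl() is that it matches substrings, so a pattern such as 'Item1' would also hit a token like 'Item10'. The sample data has no such token; the `genres` vector below is a hypothetical example I made up to show the effect, and word boundaries are just one possible way to guard against it.

```r
# Hypothetical values, not taken from the question's data
genres <- c("[Item1, Item2]", "[Item10]", "")

# Plain substring match: "Item1" is also found inside "Item10"
substring_match <- grepl("Item1", genres, fixed = TRUE)

# Word boundaries restrict the match to the whole token
whole_token_match <- grepl("\\bItem1\\b", genres, perl = TRUE)

substring_match
whole_token_match
```

In the mutate() call above, the same guarded pattern could be used in place of the plain item name if overlapping names are a concern.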