Converting a multi-value field into factors

Time: 2018-08-09 16:53:21

Tags: r

Reading input from a CSV file leaves me with an odd field that holds multiple values, for example:

  Title                Genres
1     A [Item1, Item2, Item3]
2     B                      
3     C        [Item4, Item1]


df <- data.frame(c("A","B","C"), c("[Item1, Item2, Item3]","","[Item4, Item1]"), 
           stringsAsFactors = FALSE)
colnames(df) <- c("Title","Genres")

A function to extract the individual tokens:

extractGenre <- function(genreVector){
  strsplit(substring(genreVector,2,nchar(genreVector)-1),", ")
} 

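I'm a bit stuck on how to turn Item1, ..., Item4 into factors and attach them to the data frame. apply lets me run the function on every row, but what would the next step be?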

3 Answers:

Answer 0 (score: 1)

I'm not sure this is exactly what you're looking for, but I approached it differently. I used dplyr and grepl:

    library(dplyr)

    df <- data.frame(c("A","B","C"), c("[Item1, Item2, Item3]","","[Item4, Item1]"), 
                     stringsAsFactors = FALSE)
    colnames(df) <- c("Title","Genres")
    df
    # one logical column per item: TRUE when the item name occurs in Genres
    df1 <- df %>%
      mutate(Item1 = ifelse(grepl("Item1", Genres), TRUE, FALSE),
             Item2 = ifelse(grepl("Item2", Genres), TRUE, FALSE),
             Item3 = ifelse(grepl("Item3", Genres), TRUE, FALSE),
             Item4 = ifelse(grepl("Item4", Genres), TRUE, FALSE))
    df1

  Title                Genres Item1 Item2 Item3 Item4
1     A [Item1, Item2, Item3]  TRUE  TRUE  TRUE FALSE
2     B                       FALSE FALSE FALSE FALSE
3     C        [Item4, Item1]  TRUE FALSE FALSE  TRUE
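
The columns above are written out by hand, one per item. If the set of items is not known in advance, the same indicator columns can probably be derived from the data itself. The following is only a base R sketch under that assumption; the names `tokens`, `items`, `flags`, and `df_flags` are made up for illustration and do not come from the original answer.

```r
# Same sample data as in the question.
df <- data.frame(Title = c("A", "B", "C"),
                 Genres = c("[Item1, Item2, Item3]", "", "[Item4, Item1]"),
                 stringsAsFactors = FALSE)

# Strip the surrounding brackets, then split each row on ", ".
# An empty field yields character(0), i.e. no tokens for that row.
tokens <- strsplit(gsub("^\\[|\\]$", "", df$Genres), ", ", fixed = TRUE)

# Every distinct token seen in the data, sorted for a stable column order.
items <- sort(unique(unlist(tokens)))

# One logical column per token: TRUE when the row's token list contains it.
flags <- sapply(items, function(item) {
  vapply(tokens, function(tk) item %in% tk, logical(1))
})

# Attach the indicator columns to the original data frame.
df_flags <- cbind(df, flags)
df_flags
```

Because each token is compared with `%in%` rather than searched for as a substring, a token only counts when it matches exactly.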

Hope this helps.

Answer 1 (score: 1)

library(dplyr)
library(tidyr)

df %>% mutate(Genres=gsub('\\[|\\]|\\s+','',Genres)) %>%  #remove brackets and whitespace
       separate(Genres,paste0('Gen',1:3),sep=',',fill='right') %>%  #split into up to 3 columns, padding with NA
       gather(key,Genres,-Title) %>% select(-key) %>%     #stack the Gen columns back into one Genres column
       filter(!is.na(Genres)) %>% arrange(Title,Genres) %>%    #drop the NA padding and sort
       mutate(Genres=as.factor(Genres))                   #convert Genres to a factor


   Title Genres
1     A  Item1
2     A  Item2
3     A  Item3
4     B       
5     C  Item1
6     C  Item4              
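
The pipeline above assumes at most three genres per row, since the column names are fixed to `Gen1` through `Gen3`. As a rough alternative that makes no such assumption, the same long format can be sketched in base R. Note that this variant behaves slightly differently: rows with an empty Genres field (title "B" here) are dropped instead of being kept with an empty value. The names `tokens` and `long` are illustrative only.

```r
# Same sample data as in the question.
df <- data.frame(Title = c("A", "B", "C"),
                 Genres = c("[Item1, Item2, Item3]", "", "[Item4, Item1]"),
                 stringsAsFactors = FALSE)

# Strip the surrounding brackets, then split each row on ", ".
tokens <- strsplit(gsub("^\\[|\\]$", "", df$Genres), ", ", fixed = TRUE)

# Repeat each title once per token and flatten the token list.
# A row with zero tokens is repeated zero times, so it disappears.
long <- data.frame(Title = rep(df$Title, lengths(tokens)),
                   Genres = factor(unlist(tokens)),
                   stringsAsFactors = FALSE)
long
levels(long$Genres)
```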

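Answer 2 (score: 0)

You could use the function separate() as Uwe suggested, but the order of your genres does not seem to always be the same. One option is to create new columns with mutate() and use grepl() to check whether each token is present.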

library(dplyr)

df %>% 
    mutate(
        Item1 = grepl('Item1', Genres),
        Item2 = grepl('Item2', Genres),
        Item3 = grepl('Item3', Genres),
        Item4 = grepl('Item4', Genres)
    )

#   Title                Genres Item1 Item2 Item3 Item4
# 1     A [Item1, Item2, Item3]  TRUE  TRUE  TRUE FALSE
# 2     B                       FALSE FALSE FALSE FALSE
# 3     C        [Item4, Item1]  TRUE FALSE FALSE  TRUE
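
One caveat with grepl() is that it matches substrings, so a pattern such as 'Item1' would also match a longer token that merely starts with it. The sample data contains no such token, so the example below uses a hypothetical 'Item10' purely to illustrate the difference. A word-boundary pattern is one possible way to require an exact token match.

```r
# Hypothetical data: "Item10" is not in the question's sample,
# it is added here only to show the substring pitfall.
genres <- c("[Item1, Item2]", "[Item10]", "")

# Plain substring match: "Item1" is also found inside "Item10".
loose <- grepl("Item1", genres)

# Word-boundary match: only the exact token "Item1" counts.
exact <- grepl("\\bItem1\\b", genres)

data.frame(genres, loose, exact)
```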