Updating a set outside the scope of a function

Time: 2018-02-25 23:00:23

Tags: r set

I am trying to add genres to my genres set. However, my genres set ends up as NULL.

Function:

install.packages("sets"); library(sets)
genres = set()
find_all_genres = function(genres_string) {
  if (genres_string == "N/A") {
    return(NA)
  }
  genres_list = strsplit(genres_string, ",\\s+")[[1]]
  for (genre in genres_list) {
    genres = genres | set(genre)
  }
}

sapply(df2$Genre, FUN = find_all_genres)

Sample:

> head(df2$Genre)
[1] "Documentary, Biography, Romance" "Short, Thriller"                 "Documentary"                     "Drama, Romance"                  "War, Short"                     
[6] "Documentary, Biography"  

The expected output would be the individual genres:

genres = {"Action", "Drama", "Comedy"}

With many more genres, of course.

Also, how can I speed up my function? I am new to R.

1 answer:

Answer 0: (score: 1)

Use scan to read the strings in and unique to remove duplicates. The input g is given in the Note at the end. No packages are used.

unique(scan(text = g, what = "", sep = ",", na.strings = "N/A", 
  strip.white = TRUE, quiet = TRUE))

giving:

[1] "Documentary" "Biography"   "Romance"     "Short"       "Thriller"   
[6] "Drama"       "War" 

If you want the result sorted, use sort.
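As a minimal sketch of the sorted variant, the block below wraps the same scan/unique call in sort. The input g is copied from the Note so the block runs on its own; the variable name genres is only illustrative.

```r
# Input copied from the Note so this block is self-contained.
g <- c("Documentary, Biography, Romance", "Short, Thriller", "Documentary",
  "Drama, Romance", "War, Short", "Documentary, Biography")

# Split on commas, trim whitespace, then drop duplicates.
genres <- unique(scan(text = g, what = "", sep = ",", na.strings = "N/A",
  strip.white = TRUE, quiet = TRUE))

# sort() returns a new, alphabetically ordered vector; genres itself is unchanged.
sorted_genres <- sort(genres)
sorted_genres
```

Assign the result back (for example genres <- sort(genres)) if the sorted order should be kept.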

Function

If you want to add to some previous values, write the whole thing as a function:

add <- function(...) {
    unique(scan(text = c(...), what = "", sep = ",", na.strings = "N/A", 
      strip.white = TRUE, quiet = TRUE))
}

# examples
g_split <- add(g)

G <- c("Drama", "Comedy")
G <- add(G, g)
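As a side note on why the loop in the question leaves genres unchanged, here is a minimal sketch. It is not part of the original answer. It assumes a plain character vector with base R union in place of the sets package, and the names add_local and add_global are hypothetical. An assignment inside a function creates a local variable, and a function whose last expression is a for loop returns NULL.

```r
# Input copied from the Note so this block is self-contained.
g <- c("Documentary, Biography, Romance", "Short, Thriller", "Documentary",
  "Drama, Romance", "War, Short", "Documentary, Biography")

genres <- character(0)

# Hypothetical helper: `<-` assigns to a local copy of genres only.
add_local <- function(genres_string) {
  for (genre in strsplit(genres_string, ",\\s*")[[1]]) {
    genres <- union(genres, genre)
  }
}
res <- add_local(g[1])
is.null(res)                  # a for loop evaluates to NULL
n_before <- length(genres)    # still 0: the outer vector was not modified

# Hypothetical helper: `<<-` assigns to genres in the enclosing environment.
add_global <- function(genres_string) {
  for (genre in strsplit(genres_string, ",\\s*")[[1]]) {
    genres <<- union(genres, genre)
  }
  invisible(genres)
}
for (s in g) add_global(s)
genres
```

In practice, returning a value as add does above is usually preferable to modifying a variable outside the function, and it avoids the loop as well.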

Note

The input in reproducible form is:

g <- c("Documentary, Biography, Romance", "Short, Thriller", "Documentary", 
  "Drama, Romance", "War, Short", "Documentary, Biography")