I am trying to add genres to my genres set. However, my genres end up as NULL.
Function:
install.packages("sets"); library(sets)
genres = set()
find_all_genres = function(genres_string) {
  if (genres_string == "N/A") {
    return(NA)
  }
  genres_list = strsplit(genres_string, ",\\s+")[[1]]
  for (genre in genres_list) {
    genres = genres | set(genre)
  }
}
sapply(df2$Genre, FUN = find_all_genres)
Sample:
> head(df2$Genre)
[1] "Documentary, Biography, Romance" "Short, Thriller" "Documentary" "Drama, Romance" "War, Short"
[6] "Documentary, Biography"
The expected output would contain each genre as a separate entry:
genres = {"Action", "Drama", "Comedy"}
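Of course there would be many more genres.
Also, how can I speed up my function? I am new to R.
Answer 0: (score: 1)
Use scan to read it in and unique to remove the duplicates. g is given in the Note at the end. No packages are used.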
unique(scan(text = g, what = "", sep = ",", na.strings = "N/A",
            strip.white = TRUE, quiet = TRUE))
giving:
[1] "Documentary" "Biography" "Romance" "Short" "Thriller"
[6] "Drama" "War"
If you want it sorted, use sort.
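A minimal sketch of the sorted variant, assuming the same g as in the Note at the end (repeated here so the snippet runs on its own). The name genres_sorted is only illustrative and does not appear in the question.

```r
# Input repeated from the Note so the snippet is self-contained
g <- c("Documentary, Biography, Romance", "Short, Thriller", "Documentary",
       "Drama, Romance", "War, Short", "Documentary, Biography")

# Split on commas, trim whitespace, drop duplicates, then sort alphabetically
genres_sorted <- sort(unique(scan(text = g, what = "", sep = ",",
                                  na.strings = "N/A", strip.white = TRUE,
                                  quiet = TRUE)))
genres_sorted
# genres_sorted holds: Biography, Documentary, Drama, Romance, Short, Thriller, War
```

Note that sort drops NA values by default, so missing entries would not appear in the sorted result.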
If you want to add to some previous values, write the whole thing as a function:
add <- function(...) {
  unique(scan(text = c(...), what = "", sep = ",", na.strings = "N/A",
              strip.white = TRUE, quiet = TRUE))
}
# examples
g_split <- add(g)
G <- c("Drama", "Comedy")
G <- add(G, g)
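To make the effect of the examples concrete, here is a self-contained sketch that repeats add and g and shows what G should contain afterwards. The comments describing the result are my own reading of how scan and unique behave, not output copied from the original answer.

```r
# Input repeated from the Note so the snippet runs on its own
g <- c("Documentary, Biography, Romance", "Short, Thriller", "Documentary",
       "Drama, Romance", "War, Short", "Documentary, Biography")

# Same function as above: every argument is split on commas and deduplicated
add <- function(...) {
  unique(scan(text = c(...), what = "", sep = ",", na.strings = "N/A",
              strip.white = TRUE, quiet = TRUE))
}

G <- c("Drama", "Comedy")  # previously collected genres
G <- add(G, g)             # earlier values keep their position, new ones follow
G
# G now holds: Drama, Comedy, Documentary, Biography, Romance, Short, Thriller, War
```

Because unique keeps the first occurrence of each value, the previously collected genres stay at the front and "Drama" is not duplicated.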
Note: the input in reproducible form is:
g <- c("Documentary, Biography, Romance", "Short, Thriller", "Documentary",
       "Drama, Romance", "War, Short", "Documentary, Biography")
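For comparison with the loop in the question, here is a sketch that stays closer to the original strsplit approach. As far as I can tell, the question's function never updates the global genres because the assignment inside the function creates a local variable, and a function whose last expression is a for loop returns NULL. Splitting all strings at once and returning the result avoids both problems and also removes the per-element sapply call. The name find_all_genres is reused from the question; the vectorized body is an assumption about what was intended, not the asker's code.

```r
g <- c("Documentary, Biography, Romance", "Short, Thriller", "Documentary",
       "Drama, Romance", "War, Short", "Documentary, Biography")

find_all_genres <- function(genres_strings) {
  # Drop missing values and the "N/A" placeholder before splitting
  keep <- !is.na(genres_strings) & genres_strings != "N/A"
  # strsplit is vectorized: it returns one list element per input string
  pieces <- strsplit(genres_strings[keep], ",\\s*")
  # Flatten and deduplicate; the value is returned rather than assigned globally
  unique(unlist(pieces))
}

genres <- find_all_genres(g)
genres
# genres holds: Documentary, Biography, Romance, Short, Thriller, Drama, War
```

With a data frame like the one in the question, the call would be genres <- find_all_genres(df2$Genre), assuming df2$Genre is a character vector.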