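I have the following data set. I used devtools::reproduce() to put a sample of the data here; the only column I need help with is genres.
Each movie lists several genres. I'm working with a movie database and I only want to use the first genre listed (the format is Genre1 | Genre2 | Genre3).
How can I parse this string data the way I want, using stringr or another package?
The end goal is to use the result in a regression model.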
> dput(droplevels(head(movie.cpi,4)))
structure(list(num_critic_for_reviews = c(723L, 302L, 813L, 462L
), director_facebook_likes = c(0L, 563L, 22000L, 475L), actor_3_facebook_likes = c(855L,
1000L, 23000L, 530L), actor_1_facebook_likes = c(1000L, 40000L,
27000L, 640L), gross = c(866161204.765035, 364628240.876025,
476821933.103659, 77736216.375), genres = structure(c(2L, 1L,
4L, 3L), .Label = c("Action|Adventure|Fantasy", "Action|Adventure|Fantasy|Sci-Fi",
"Action|Adventure|Sci-Fi", "Action|Thriller"), class = "factor"),
num_voted_users = c(886204L, 471220L, 1144337L, 212204L),
cast_total_facebook_likes = c(4834L, 48350L, 106759L, 1873L
), facenumber_in_poster = c(0L, 0L, 0L, 1L), num_user_for_reviews = c(3054L,
1238L, 2701L, 738L), content_rating = structure(c(1L, 1L,
1L, 1L), .Label = "PG-13", class = "factor"), budget = c(269925874.125874,
353545586.107091, 266006097.560976, 280583231.707317), title_year = c(2009L,
2007L, 2012L, 2012L), actor_2_facebook_likes = c(936L, 5000L,
23000L, 632L), imdb_score = c(7.9, 7.1, 8.5, 6.6), movie_facebook_likes = c(33000L,
0L, 164000L, 24000L)), .Names = c("num_critic_for_reviews",
"director_facebook_likes", "actor_3_facebook_likes", "actor_1_facebook_likes",
"gross", "genres", "num_voted_users", "cast_total_facebook_likes",
"facenumber_in_poster", "num_user_for_reviews", "content_rating",
"budget", "title_year", "actor_2_facebook_likes", "imdb_score",
"movie_facebook_likes"), row.names = c(NA, 4L), class = "data.frame")
Answer 0 (score: 0)
With the following command you can extract the first genre from that column (I store it in a new column named genre_1):
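# Delete everything from the first "|" to the end of the string, leaving only the first genre
movie.cpi$genre_1 <- gsub("\\|.*$", "", movie.cpi$genres)
# Show the original column next to the extracted one
movie.cpi[, c("genres", "genre_1")]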
##                            genres genre_1
## 1 Action|Adventure|Fantasy|Sci-Fi  Action
## 2        Action|Adventure|Fantasy  Action
## 3                 Action|Thriller  Action
## 4         Action|Adventure|Sci-Fi  Action
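Since the question mentions stringr and a regression model, here is a minimal sketch of two alternatives to the regex replacement, followed by a conversion to a factor. The sample vector is hypothetical (it only mimics the Genre1|Genre2|Genre3 format), and the stringr branch runs only if that package happens to be installed.

```r
# Hypothetical sample in the same "Genre1|Genre2|Genre3" format as the question
genres <- factor(c("Action|Adventure|Fantasy|Sci-Fi",
                   "Comedy|Romance",
                   "Drama",
                   "Action|Thriller"))

# Base R: split on the literal "|" and keep the first piece of each element
first_base <- vapply(strsplit(as.character(genres), "|", fixed = TRUE),
                     `[`, character(1), 1)

# stringr equivalent: extract the run of characters before the first "|"
if (requireNamespace("stringr", quietly = TRUE)) {
  first_stringr <- stringr::str_extract(as.character(genres), "^[^|]+")
  stopifnot(identical(first_base, first_stringr))
}

# Store the result as a factor so that lm() or glm() dummy-codes it
genre_1 <- factor(first_base)
levels(genre_1)
```

Note that in the four-row sample from the question every first genre is Action, so a model fitted on that sample alone could not estimate a genre effect; the full data set would need at least two distinct first genres.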