I have this sample dataset:
structure(list(Title = c("Isn't It Romantic", "Isn't It Romantic",
"Isn't It Romantic", "Isn't It Romantic", "Isn't It Romantic",
"Isn't It Romantic", "Gully Boy", "Gully Boy", "Gully Boy", "Gully Boy",
"Gully Boy", "Gully Boy", "The Wandering Earth", "The Wandering Earth",
"The Wandering Earth", "The Wandering Earth", "The Wandering Earth",
"The Wandering Earth", "How to Train Your Dragon: The Hidden World",
"How to Train Your Dragon: The Hidden World", "How to Train Your Dragon: The Hidden World",
"How to Train Your Dragon: The Hidden World", "How to Train Your Dragon: The Hidden World",
"How to Train Your Dragon: The Hidden World", "American Woman",
"American Woman", "Us", "Us", "Us", "Us", "Us", "Us", "The Wolf's Call",
"The Wolf's Call", "Avengers: Endgame", "Avengers: Endgame",
"Avengers: Endgame", "Avengers: Endgame", "Avengers: Endgame",
"Avengers: Endgame", "The Silence", "The Silence", "The Silence",
"The Silence", "The Silence", "The Silence", "My Little Pony: Equestria Girls: Spring Breakdown",
"My Little Pony: Equestria Girls: Spring Breakdown"), Ratings = c("Internet Movie Database",
"5.9/10", "Rotten Tomatoes", "68%", "Metacritic", "60/100", "Internet Movie Database",
"8.4/10", "Rotten Tomatoes", "100%", "Metacritic", "65/100",
"Internet Movie Database", "6.4/10", "Rotten Tomatoes", "74%",
"Metacritic", "62/100", "Internet Movie Database", "7.6/10",
"Rotten Tomatoes", "91%", "Metacritic", "71/100", "Rotten Tomatoes",
"57%", "Internet Movie Database", "7.1/10", "Rotten Tomatoes",
"94%", "Metacritic", "81/100", "Internet Movie Database", "7.6/10",
"Internet Movie Database", "8.7/10", "Rotten Tomatoes", "94%",
"Metacritic", "78/100", "Internet Movie Database", "5.2/10",
"Rotten Tomatoes", "23%", "Metacritic", "25/100", "Internet Movie Database",
"7.7/10")), row.names = c(NA, -48L), class = c("tbl_df", "tbl",
"data.frame"))
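The Ratings column holds up to three different ratings for each movie (IMDb, Rotten Tomatoes, and Metacritic). Each rating takes two rows, the source name followed by its value, so a movie with all three ratings spans 6 rows.
I want to wrangle this dataset so that each movie gets a new column named rottentomatoes_rating whose value is that movie's Rotten Tomatoes rating. So in my sample dataset, Isn't It Romantic would have 68% under rottentomatoes_rating, Gully Boy would have 100%, and so on.
For movies that have no Rotten Tomatoes rating, I would like NA under rottentomatoes_rating.
I have considered using spread from tidyr, but because the variable names and the values sit in the same column in my case, I can't quite work out how to do it!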
Answer 0 (score: 2)
If the data is formatted the same way throughout the dataset, the following code should work:
library(dplyr)
df %>% group_by(Title) %>%
  # within each movie, take the row right after the "Rotten Tomatoes" label
  slice(match("Rotten Tomatoes", Ratings) + 1) %>%
  rename(rottentomatoes_rating = Ratings)
This gives:
# A tibble: 2 x 6
# Groups:   Title [2]
  Title             Year  Rated     Released   Runtime rottentomatoes_rating
  <chr>             <chr> <chr>     <date>     <chr>   <chr>
1 Gully Boy         2019  Not Rated 2019-02-14 153 min 100%
2 Isn't It Romantic 2019  PG-13     2019-02-13 89 min  68%
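As for the NAs: if the original data always has the RT score in the row right after the "Rotten Tomatoes" string, this should give you NA by default.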
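If a movie has no "Rotten Tomatoes" row at all, it can help to build the NA explicitly instead of relying on row slicing. The following is a minimal sketch, not the answerer's code: it assumes a small stand-in df with only the Title and Ratings columns, and it uses summarise so that every movie keeps exactly one row.

```r
library(dplyr)

# Hypothetical stand-in for the question's data (Title and Ratings only).
df <- tibble(
  Title   = c(rep("Gully Boy", 4), rep("The Wolf's Call", 2)),
  Ratings = c("Internet Movie Database", "8.4/10", "Rotten Tomatoes", "100%",
              "Internet Movie Database", "7.6/10")
)

rt <- df %>%
  group_by(Title) %>%
  # match() returns NA when the label is absent; indexing with NA yields NA,
  # so movies without a Rotten Tomatoes row are kept with an NA rating
  summarise(rottentomatoes_rating = Ratings[match("Rotten Tomatoes", Ratings) + 1])

print(rt)
```

The result can then be joined back onto the movie-level columns by Title if those are needed.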
Answer 1 (score: 2)
sumshyftw's answer is good.
However, if you only want the Rotten Tomatoes percentage, here is a data.table version:
library(data.table)
dt <- as.data.table(df)
dt <- dt[Ratings %like% "%"]  # keep only rows whose value contains "%"
setnames(dt, "Ratings", "rottentomatoes_rating")  # renames by reference
Output:
# A tibble: 2 x 6
  Title             Year  Rated     Released   Runtime rottentomatoes_rating
  <chr>             <chr> <chr>     <date>     <chr>   <chr>
1 Isn't It Romantic 2019  PG-13     2019-02-13 89 min  68%
2 Gully Boy         2019  Not Rated 2019-02-14 153 min 100%
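I used %like% "%" because I assume the full data looks like your sample, that is, only the Rotten Tomatoes values contain a percent sign.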
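Filtering on "%" keeps only movies that have a percentage row, so movies without one disappear from the result. If they should appear with NA instead, one option is to join the filtered rows back onto the full list of titles. This is a rough sketch under the same assumption (only Rotten Tomatoes values contain "%"); the small dt below is a made-up stand-in for the real data.

```r
library(data.table)

# Hypothetical stand-in for the question's data (Title and Ratings only).
dt <- data.table(
  Title   = c(rep("Gully Boy", 4), rep("The Wolf's Call", 2)),
  Ratings = c("Internet Movie Database", "8.4/10", "Rotten Tomatoes", "100%",
              "Internet Movie Database", "7.6/10")
)

# Rows whose value contains "%" are taken to be Rotten Tomatoes scores.
rt <- dt[Ratings %like% "%", .(Title, rottentomatoes_rating = Ratings)]

# Join onto every distinct title; titles with no match get NA.
out <- rt[unique(dt[, .(Title)]), on = "Title"]

print(out)
```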
Answer 2 (score: 2)
Assuming your dataset is called dt, you can use this process to get a tidy version of the dataset:
library(tidyverse)
# specify indexes of Rating companies
ids = seq(1, nrow(dt), 2)
# get rows of Rating companies
dt %>% slice(ids) %>%
# combine with the rating values
cbind(dt %>% slice(-ids) %>% select(RatingsValue = Ratings)) %>%
# reshape dataset
spread(Ratings, RatingsValue)
# Title Year Rated Released Runtime Internet Movie Database Metacritic Rotten Tomatoes
# 1 Gully Boy 2019 Not Rated 2019-02-14 153 min 8.4/10 65/100 100%
# 2 Isn't It Romantic 2019 PG-13 2019-02-13 89 min 5.9/10 60/100 68%
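The same odd/even pairing can also be written with pivot_wider, which tidyr recommends over spread in current versions. This is only a sketch of that variant, assuming the rows strictly alternate between a source name and its value; the df below and the column names Source and Value are illustrative and not part of the original data.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the question's data (Title and Ratings only).
df <- tibble(
  Title   = c(rep("Gully Boy", 4), rep("The Wolf's Call", 2)),
  Ratings = c("Internet Movie Database", "8.4/10", "Rotten Tomatoes", "100%",
              "Internet Movie Database", "7.6/10")
)

odd <- seq(1, nrow(df), by = 2)  # rows holding the source names

wide <- tibble(
  Title  = df$Title[odd],
  Source = df$Ratings[odd],      # e.g. "Rotten Tomatoes"
  Value  = df$Ratings[odd + 1]   # the rating on the following row
) %>%
  # one column per source; combinations that do not occur become NA
  pivot_wider(names_from = Source, values_from = Value)

print(wide)
```

Renaming the "Rotten Tomatoes" column to rottentomatoes_rating afterwards gives the column asked for in the question.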
Answer 3 (score: 1)
A new version that fills in NA where a rating is missing:
# using data.table
library(data.table)
dt <- as.data.table(df)
# Index will hold whether the row is a Provider eg Rotten Tomatoes, or a value
dt[, Index:=rep(c("Provider", "Value"), .N/2)]
# Need an index to bind these together
dt[, Provider.Id:=rep(1:(.N/2), each=2), by=Title]
dt[1:6,]
# segment out the Provider & Values in to columns
out <- dcast(dt, Title+Provider.Id~Index, value.var = "Ratings")
out[, Provider.Id := NULL]  # drop the helper id; Provider is still needed below
# now convert to full wide format
out_df <- as.data.frame(dcast(out, Title~Provider, value.var="Value", fill=NA))
out_df
Answer 4 (score: 0)
Using data.table:
# using data.table
library(data.table)
dt <- as.data.table(df)
# groups the data set with by, and extracts the Ratings
# makes use of logic that the odd indices hold the name of the provider,
# the even ones hold the values. Only works if this holds.
# It can probably be optimised a bit. dcast converts from long to required wide
# format
splitRatings <- function(Ratings){
# e.g. Ratings=dt$Ratings[1:6]
N <- length(Ratings)
split_dt <- data.table(DB=Ratings[1:N %% 2 == 1],
Values=Ratings[1-(1:N %% 2) == 1])
out <- dcast(split_dt, .~DB, value.var = "Values")
out[, ".":=NULL]
out
}
# applies the function based on the by clause, returning the table embedded
dt2 <- dt[, splitRatings(Ratings), by=.(Title, Year, Rated, Released, Runtime)]
# convert back
out <- as.data.frame(dt2)
Answer 5 (score: 0)
Here is one version.
library(tidyverse)  # provides dplyr, tidyr and stringr
df %>%
mutate(Value = ifelse(str_detect(Ratings, "\\d"), Ratings, NA)) %>%
fill(Value, .direction = "up") %>%
filter(!str_detect(Ratings, "\\d")) %>%
spread(Ratings, Value)
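To see why the pipeline above works, it may help to look at the intermediate result after the mutate and fill steps. The sketch below runs just those two steps on a made-up four-row df; it assumes, as the answer does, that values contain a digit and source names do not.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical four-row sample: two source names, each followed by its value.
df <- tibble(
  Title   = rep("Gully Boy", 4),
  Ratings = c("Internet Movie Database", "8.4/10", "Rotten Tomatoes", "100%")
)

step <- df %>%
  # copy the rows that look like values (they contain a digit); others get NA
  mutate(Value = ifelse(str_detect(Ratings, "\\d"), Ratings, NA)) %>%
  # fill upwards so each source-name row picks up the value from the row below
  fill(Value, .direction = "up")

print(step)
```

After this step every source-name row carries its own value, so dropping the value rows and spreading Ratings against Value yields one column per source.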