Wrangling a dataset by picking Rotten Tomatoes movie ratings out of a column

Time: 2019-06-20 15:32:33

Tags: r tidyr

I have this sample dataset:

structure(list(Title = c("Isn't It Romantic", "Isn't It Romantic", 
"Isn't It Romantic", "Isn't It Romantic", "Isn't It Romantic", 
"Isn't It Romantic", "Gully Boy", "Gully Boy", "Gully Boy", "Gully Boy", 
"Gully Boy", "Gully Boy", "The Wandering Earth", "The Wandering Earth", 
"The Wandering Earth", "The Wandering Earth", "The Wandering Earth", 
"The Wandering Earth", "How to Train Your Dragon: The Hidden World", 
"How to Train Your Dragon: The Hidden World", "How to Train Your Dragon: The Hidden World", 
"How to Train Your Dragon: The Hidden World", "How to Train Your Dragon: The Hidden World", 
"How to Train Your Dragon: The Hidden World", "American Woman", 
"American Woman", "Us", "Us", "Us", "Us", "Us", "Us", "The Wolf's Call", 
"The Wolf's Call", "Avengers: Endgame", "Avengers: Endgame", 
"Avengers: Endgame", "Avengers: Endgame", "Avengers: Endgame", 
"Avengers: Endgame", "The Silence", "The Silence", "The Silence", 
"The Silence", "The Silence", "The Silence", "My Little Pony: Equestria Girls: Spring Breakdown", 
"My Little Pony: Equestria Girls: Spring Breakdown"), Ratings = c("Internet Movie Database", 
"5.9/10", "Rotten Tomatoes", "68%", "Metacritic", "60/100", "Internet Movie Database", 
"8.4/10", "Rotten Tomatoes", "100%", "Metacritic", "65/100", 
"Internet Movie Database", "6.4/10", "Rotten Tomatoes", "74%", 
"Metacritic", "62/100", "Internet Movie Database", "7.6/10", 
"Rotten Tomatoes", "91%", "Metacritic", "71/100", "Rotten Tomatoes", 
"57%", "Internet Movie Database", "7.1/10", "Rotten Tomatoes", 
"94%", "Metacritic", "81/100", "Internet Movie Database", "7.6/10", 
"Internet Movie Database", "8.7/10", "Rotten Tomatoes", "94%", 
"Metacritic", "78/100", "Internet Movie Database", "5.2/10", 
"Rotten Tomatoes", "23%", "Metacritic", "25/100", "Internet Movie Database", 
"7.7/10")), row.names = c(NA, -48L), class = c("tbl_df", "tbl", 
"data.frame"))

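The Ratings column holds three different kinds of ratings for each movie (IMDb, Rotten Tomatoes, and Metacritic), spread over six rows per movie.

I want to wrangle this dataset so that each movie gets a new column called rottentomatoes_rating whose value is its Rotten Tomatoes rating. So in my sample dataset, Isn't It Romantic would have 68% under rottentomatoes_rating, Gully Boy would have 100%, and so on.

For movies that have no Rotten Tomatoes rating, I want NA under rottentomatoes_rating.

I have considered using spread from tidyr, but since in my case the variable names and the values sit in the same column, I can't quite figure out how to do it!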

6 answers:

Answer 0 (score: 2)

If the data is formatted the same way throughout the dataset, the following code should work:

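library(dplyr)

# Within each movie, take the row right after the "Rotten Tomatoes" label.
# Use the grouped Ratings column here, not df$Ratings, so the match is per movie.
df %>% group_by(Title) %>% 
  slice(match("Rotten Tomatoes", Ratings) + 1) %>%
  rename(rottentomatoes_rating = Ratings)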

This gives:

# A tibble: 2 x 6
# Groups:   Title [2]
  Title             Year  Rated     Released   Runtime rottentomatoes_rating
  <chr>             <chr> <chr>     <date>     <chr>   <chr>                
1 Gully Boy         2019  Not Rated 2019-02-14 153 min 100%                 
2 Isn't It Romantic 2019  PG-13     2019-02-13 89 min  68%     

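As for the NAs: as long as the RT score always sits in the row right after the "Rotten Tomatoes" string in the original data, movies without one should come out as NA by default.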
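
slice() only returns rows that exist, so a title with no "Rotten Tomatoes" row may simply be absent from the result. If those titles need to appear explicitly with NA, one option is summarise() instead of slice(), since it returns one row per group. This is a minimal sketch on a made-up two-movie subset of the question's data, not part of the original answer:

```r
library(dplyr)

# Toy data in the same layout as the question: a source name, then its value.
df <- tibble(
  Title = c(rep("Gully Boy", 4), rep("The Wolf's Call", 2)),
  Ratings = c("Internet Movie Database", "8.4/10", "Rotten Tomatoes", "100%",
              "Internet Movie Database", "7.6/10")
)

rt <- df %>%
  group_by(Title) %>%
  # match() returns NA when the group has no "Rotten Tomatoes" row,
  # and indexing a vector with NA yields NA, so the title keeps its row.
  summarise(rottentomatoes_rating = Ratings[match("Rotten Tomatoes", Ratings) + 1])

rt
```

The result can then be joined back onto any other per-movie columns by Title.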

Answer 1 (score: 2)

sumshyftw's answer is good.

However, if you only want the Rotten Tomatoes percentages, here is a data.table version:

dt <- dt[dt$Ratings %like% "%",]
dt <- setnames(dt, "Ratings", "rottentomatoes_rating")

Output:

# A tibble: 2 x 6
  Title             Year  Rated     Released   Runtime rottentomatoes_rating
  <chr>             <chr> <chr>     <date>     <chr>   <chr>                
1 Isn't It Romantic 2019  PG-13     2019-02-13 89 min  68%                  
2 Gully Boy         2019  Not Rated 2019-02-14 153 min 100%  

I used %like% "%" because I assume the full data looks like your example, i.e. only the Rotten Tomatoes scores contain a percent sign.
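
Filtering on "%" drops every movie that has no Rotten Tomatoes score. If those should show up with NA, as the question asks, one possible follow-up is to join the filtered rows back onto the full list of titles. This is a sketch on a made-up two-movie subset, under the same assumption that only Rotten Tomatoes values contain "%":

```r
library(data.table)

# Toy data in the same layout as the question.
dt <- data.table(
  Title = c(rep("Gully Boy", 4), rep("The Wolf's Call", 2)),
  Ratings = c("Internet Movie Database", "8.4/10", "Rotten Tomatoes", "100%",
              "Internet Movie Database", "7.6/10")
)

# Keep the rows whose value contains "%" and rename the column on the way.
rt <- dt[Ratings %like% "%", .(Title, rottentomatoes_rating = Ratings)]

# Join onto every distinct title; titles without a match get NA.
res <- rt[unique(dt[, .(Title)]), on = "Title"]

res
```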

Answer 2 (score: 2)

Assuming your dataset is called dt, you can use this process to get a tidy version of it:

library(tidyverse)

# specify indexes of Rating companies
ids = seq(1, nrow(dt), 2)

# get rows of Rating companies
dt %>% slice(ids) %>%
  # combine with the rating values
  cbind(dt %>% slice(-ids) %>% select(RatingsValue = Ratings)) %>%
  # reshape dataset
  spread(Ratings, RatingsValue)

#                Title Year     Rated   Released Runtime Internet Movie Database Metacritic Rotten Tomatoes
# 1         Gully Boy 2019 Not Rated 2019-02-14 153 min                  8.4/10     65/100            100%
# 2 Isn't It Romantic 2019     PG-13 2019-02-13  89 min                  5.9/10     60/100             68%
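
For reference, the same odd/even pairing can also be written with pivot_wider(), which tidyr (1.0.0 and later) offers as the successor to spread(). This is a sketch on a made-up two-movie subset, not the answer's original code, and it relies on the same assumption that source names sit on odd rows and their values on the following even rows:

```r
library(tibble)
library(tidyr)

# Toy data in the same layout as the question.
df <- tibble(
  Title = c(rep("Gully Boy", 4), rep("The Wolf's Call", 2)),
  Ratings = c("Internet Movie Database", "8.4/10", "Rotten Tomatoes", "100%",
              "Internet Movie Database", "7.6/10")
)

# Odd rows hold the rating source, the row after each holds its value.
odd <- seq(1, nrow(df), by = 2)
paired <- tibble(
  Title  = df$Title[odd],
  Source = df$Ratings[odd],
  Value  = df$Ratings[odd + 1]
)

# One column per source; sources a movie lacks are filled with NA.
wide <- pivot_wider(paired, names_from = Source, values_from = Value)

wide
```

The requested column can then be picked out by referring to the `Rotten Tomatoes` column with backticks.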

Answer 3 (score: 1)

A new version that fills in NA where a rating is missing:

# using data.table
library(data.table)
dt <- as.data.table(df)

# Index will hold whether the row is a Provider eg Rotten Tomatoes, or a value
dt[, Index:=rep(c("Provider", "Value"), .N/2)]
# Need an index to bind these together
dt[, Provider.Id:=rep(1:(.N/2), each=2), by=Title]
dt[1:6,]

# segment out the Provider & Values in to columns
out <- dcast(dt, Title+Provider.Id~Index, value.var = "Ratings")
# drop the helper id; the Provider column is still needed for the next dcast
out[, Provider.Id := NULL]

# now convert to full wide format 
out_df <- as.data.frame(dcast(out, Title~Provider, value.var="Value", fill=NA))
out_df

Answer 4 (score: 0)

To get all the metrics using data.table:

# using data.table
library(data.table)
dt <- as.data.table(df)

# groups the data set with by, and extracts the Ratings
# makes use of logic that the odd indices hold the name of the provider,
# the even ones hold the values. Only works if this holds.
# It can probably be optimised a bit. dcast converts from long to required wide
# format
splitRatings <- function(Ratings){
  # e.g. Ratings=dt$Ratings[1:6]
  N <- length(Ratings)
  split_dt <- data.table(DB=Ratings[1:N %% 2 == 1],
                         Values=Ratings[1-(1:N %% 2) == 1])
  out <- dcast(split_dt, .~DB, value.var = "Values")
  out[, ".":=NULL]
  out
}

# applies the function based on the by clause, returning the table embedded
dt2 <- dt[, splitRatings(Ratings), by=.(Title, Year, Rated, Released, Runtime)]

# convert back
out <- as.data.frame(dt2)

Answer 5 (score: 0)

Here is another version (it needs dplyr, tidyr and stringr, e.g. via library(tidyverse)):

df %>% 
  mutate(Value = ifelse(str_detect(Ratings, "\\d"), Ratings, NA)) %>% 
  fill(Value, .direction = "up") %>% 
  filter(!str_detect(Ratings, "\\d")) %>% 
  spread(Ratings, Value)