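I'm trying to scrape movie titles, ratings, and years from MovieMeter so that I can compare them with IMDb. I managed to get the IMDb top 250 into a data frame with title, rating, rank, and year, but I can't get the MovieMeter list to work.
Here is my code: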
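# rvest::html() is deprecated; read_html() is its replacement
page <- rvest::read_html("https://www.moviemeter.nl/list/")
# Each film on the list sits in an element with class "film_row"
scrapemoviemeter <- rvest::html_nodes(x = page, css = ".film_row")
head(scrapemoviemeter)
# Extract the visible text of each row, trimming surrounding whitespace
moviemeter <- rvest::html_text(scrapemoviemeter, trim = TRUE)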
Now I have the MovieMeter values:
head(moviemeter)
[1] "4,42 (15202)1. The Shawshank Redemption (1994)"
[2] "4,36 (9761)2. The Godfather (1972)Alternatieve titel: Mario Puzo's The Godfather"
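How do I get this data into a data frame with separate columns for rating, title, and year?
Answer 0 (score: 1)
If you have the IMDB IDs, use the MovieMeter API instead of scraping: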
library(moviemeter) # devtools::install_github("hrbrmstr/moviemeter")
library(purrr)
imdb_ids <- c("tt1107846", "tt0282552", "tt0048199")
map_df(imdb_ids, function(x) {
  mm <- mm_get_movie_info(x)
  mm <- map(mm, ~. %||% NA) # the JSON response contains nulls, so replace them with NA
  mm[c(1:11)] # keep the first 11 fields; drops posters, countries, genres, actors and directors
}) -> df
dplyr::glimpse(df)
## Observations: 3
## Variables: 11
## $ id <int> 57161, 6465, 33351
## $ url <chr> "https://www.moviemeter.nl/film/57161", "https://www.moviemeter.nl/film/6465", "https://www.moviemeter.nl/film/33351"
## $ year <int> 2007, 2002, 1955
## $ imdb <chr> "tt1107846", "tt0282552", "tt0048199"
## $ title <chr> "Theft", "Riders", "Illegal"
## $ display_title <chr> "Theft", "Riders", "Illegal"
## $ alternative_title <chr> NA, "Steal", NA
## $ plot <chr> "Een naïeve dorpsjongen wordt verliefd op een crimineel. Guy was altijd een nette beschaafde jongen, wie had er ooi...
## $ duration <int> 90, 83, 88
## $ votes_count <int> 1, 293, 20
## $ average <dbl> 2.00, 2.55, 3.42
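If you're trying to compare the IMDB top 250 with the MovieMeter top 250, then you'll have to scrape, since their API is quite limited.
Remember to credit them in anything you produce from this work, and be wary of scraping IMDB. LinkedIn sued a bunch of scrapers in 2016, and people are going to take intellectual property much more seriously in the coming months/years.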
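Once both lists are in data frames, the comparison itself can be done with a join on title and year. The snippet below is only a rough sketch in base R: the two data frames are hand-typed stand-ins for the scraped results, the column names `imdb_rating` and `mm_rating` are made up for illustration, and the IMDB ratings are placeholder values rather than scraped data.

```r
# Hand-typed stand-ins for the two scraped data frames (hypothetical column names)
imdb <- data.frame(
  title = c("The Shawshank Redemption", "The Godfather"),
  year = c(1994L, 1972L),
  imdb_rating = c(9.3, 9.2),  # placeholder values, not scraped
  stringsAsFactors = FALSE
)
mm <- data.frame(
  title = c("The Shawshank Redemption", "The Godfather"),
  year = c(1994L, 1972L),
  mm_rating = c(4.42, 4.36),  # taken from the head(moviemeter) output in the question
  stringsAsFactors = FALSE
)

# Join on title and year; all = TRUE keeps films that appear in only one list
both <- merge(imdb, mm, by = c("title", "year"), all = TRUE)
print(both)
```

Matching on the title string is fragile (alternative titles, translated titles), so rows that end up with NA in one of the rating columns are worth checking by hand.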
Answer 1 (score: 0)
I think it's easier with XPath. Try this:
library(rvest)
library(stringi)
# rvest::html() is deprecated; read_html() is its replacement
page <- rvest::read_html("https://www.moviemeter.nl/list/")
# Ratings: the <span class="score"> inside each row
scores <- rvest::html_nodes(x = page, xpath = "/html/body/div[1]/div[4]/div/div[3]/*//span[@class='score']")
scores <- rvest::html_text(scores, trim = TRUE)
# Titles: the <a class="tooltip"> link inside each row
names <- rvest::html_nodes(x = page, xpath = "/html/body/div[1]/div[4]/div/div[3]/*//a[@class='tooltip']")
names <- rvest::html_text(names, trim = TRUE)
# Years: take the bare text nodes of each row and pull out the four-digit number
years <- rvest::html_nodes(x = page, xpath = "/html/body/div[1]/div[4]/div/div[3]//div[@class='film_row']/text()")
years <- rvest::html_text(years, trim = TRUE)
years <- stri_extract_first_regex(years, "\\b\\d{4}\\b")
years <- years[!is.na(years)]
# html_text() already returns character vectors, so unlist() is not needed.
# data.frame() stops with an error if the three vectors differ in length.
df <- data.frame(names, scores, years, stringsAsFactors = FALSE)
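If you would rather work from the `moviemeter` character vector the question already has, the strings can also be split with a regular expression. This is a minimal sketch in base R that assumes every row looks like the two samples in the question ("rating (votes)rank. title (year)", optionally followed by an alternative title); the pattern and the column names are my own and have not been checked against the full list.

```r
# Sample strings copied from the head(moviemeter) output in the question
moviemeter <- c(
  "4,42 (15202)1. The Shawshank Redemption (1994)",
  "4,36 (9761)2. The Godfather (1972)Alternatieve titel: Mario Puzo's The Godfather"
)

# Capture groups: rating, votes, rank, title, year.
# The pattern is not anchored at the end, so text after the year is ignored.
pattern <- "^(\\d+,\\d+) \\((\\d+)\\)(\\d+)\\. (.*?) \\((\\d{4})\\)"
parts <- regmatches(moviemeter, regexec(pattern, moviemeter, perl = TRUE))

# Each match is a vector of 6 strings (full match + 5 groups); stack them as rows.
# Strings that do not match contribute no row.
mat <- do.call(rbind, parts)

df <- data.frame(
  rating = as.numeric(sub(",", ".", mat[, 2], fixed = TRUE)),  # decimal comma -> point
  votes = as.integer(mat[, 3]),
  rank = as.integer(mat[, 4]),
  title = mat[, 5],
  year = as.integer(mat[, 6]),
  stringsAsFactors = FALSE
)
print(df)
```

The lazy `(.*?)` stops at the first " (four digits)" it finds, so a title that itself contains a year in parentheses would be cut short and needs a stricter pattern.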