Web scraping MovieMeter in R

Date: 2016-10-08 13:28:09

Tags: r web-scraping

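I am trying to scrape movie titles, ratings, and years from MovieMeter so that I can compare them with IMDb. I managed to get the IMDb top 250 movies into a data frame with title, rating, rank, and year, but I can't get the MovieMeter list to work.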

Here is my code:

# read_html() replaces the deprecated rvest::html()
url <- rvest::read_html("https://www.moviemeter.nl/list/")
scrapemoviemeter <- rvest::html_nodes(x = url, css = ".film_row")
head(scrapemoviemeter)
moviemeter <- rvest::html_text(scrapemoviemeter, trim = TRUE)

This gives me the MovieMeter values:
head(moviemeter)
[1] "4,42 (15202)1. The Shawshank Redemption (1994)"                                                    
[2] "4,36 (9761)2. The Godfather (1972)Alternatieve titel: Mario Puzo's The Godfather" 

How can I get this data into a data frame with separate columns for rating, title, and year?

2 answers:

Answer 0 (score: 1)

If you have the IMDb IDs, use the MovieMeter API instead of scraping:

library(moviemeter) # devtools::install_github("hrbrmstr/moviemeter")
library(purrr)

imdb_ids <- c("tt1107846", "tt0282552", "tt0048199")

map_df(imdb_ids, function(x) {
  mm <- mm_get_movie_info(x)
  mm <- map(mm, ~. %||% NA)  # the javascript has nulls, so get rid of them
  mm[c(1:11)]                # remove posters, countries, genres, actors and directors
}) -> df

dplyr::glimpse(df)
## Observations: 3
## Variables: 11
## $ id                <int> 57161, 6465, 33351
## $ url               <chr> "https://www.moviemeter.nl/film/57161", "https://www.moviemeter.nl/film/6465", "https://www.moviemeter.nl/film/33351"
## $ year              <int> 2007, 2002, 1955
## $ imdb              <chr> "tt1107846", "tt0282552", "tt0048199"
## $ title             <chr> "Theft", "Riders", "Illegal"
## $ display_title     <chr> "Theft", "Riders", "Illegal"
## $ alternative_title <chr> NA, "Steal", NA
## $ plot              <chr> "Een naïeve dorpsjongen wordt verliefd op een crimineel. Guy was altijd een nette beschaafde jongen, wie had er ooi...
## $ duration          <int> 90, 83, 88
## $ votes_count       <int> 1, 293, 20
## $ average           <dbl> 2.00, 2.55, 3.42
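
The `%||% NA` step above matters because a record with a missing field would otherwise not line up with the others when the rows are bound together. A minimal base-R sketch of the same idea follows; the record and the `null_to_na` helper are made up for illustration and are not part of the moviemeter package:

```r
# Hypothetical record shaped like an API response with one missing field
rec <- list(id = 6465L, title = "Riders", alternative_title = NULL)

# Illustrative helper: swap NULL for NA so every field keeps a value
null_to_na <- function(x) if (is.null(x)) NA else x

clean <- lapply(rec, null_to_na)

# All three fields survive as columns in a one-row data frame
df_clean <- as.data.frame(clean, stringsAsFactors = FALSE)
print(df_clean)
```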

If you are trying to compare the IMDb top 250 with the MovieMeter top 250, then you will have to scrape, since their API is very limited.

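Remember to credit them in anything you make from this work, and be wary of scraping IMDb. LinkedIn sued a bunch of scrapers in 2016, and people will be taking intellectual property much more seriously over the coming months and years.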

Answer 1 (score: 0)

I think it is easier with XPath. Try this:

library(rvest)
library(stringi)

# read_html() replaces the deprecated rvest::html()
url <- rvest::read_html("https://www.moviemeter.nl/list/")

# Scores, titles, and the loose text nodes that contain the year
scores <- rvest::html_nodes(x = url, xpath = "/html/body/div[1]/div[4]/div/div[3]/*//span[@class='score']")
scores <- rvest::html_text(scores, trim = TRUE)
names <- rvest::html_nodes(x = url, xpath = "/html/body/div[1]/div[4]/div/div[3]/*//a[@class='tooltip']")
names <- rvest::html_text(names, trim = TRUE)
years <- rvest::html_nodes(x = url, xpath = "/html/body/div[1]/div[4]/div/div[3]//div[@class='film_row']/text() ")
years <- rvest::html_text(years, trim = TRUE)

# Keep only the text nodes that hold a four-digit year
years <- stri_extract_first_regex(years, "\\b\\d{4}\\b")
years <- years[!is.na(years)]

# html_text() already returns character vectors, so unlist() is not needed
df <- data.frame(names, scores, years, stringsAsFactors = FALSE)
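
Another option is to keep the `.film_row` text from the question and split it with a regular expression. The sketch below uses only base R and the two sample strings shown in the question. It assumes every row follows the pattern "rating (votes)rank. title (year)"; the pattern and the column names are my own guesses based on those two lines, not something MovieMeter documents, so rows that do not match are dropped.

```r
# Sample values copied from the head(moviemeter) output in the question
moviemeter <- c(
  "4,42 (15202)1. The Shawshank Redemption (1994)",
  "4,36 (9761)2. The Godfather (1972)Alternatieve titel: Mario Puzo's The Godfather"
)

# Assumed layout: rating (votes)rank. title (year), then optional extra text
pattern <- "^(\\d+,\\d+)\\s*\\((\\d+)\\)\\s*(\\d+)\\.\\s*(.*?)\\s*\\((\\d{4})\\)"

# regexec() returns the full match plus the five capture groups per string
parts <- regmatches(moviemeter, regexec(pattern, moviemeter, perl = TRUE))
parts <- parts[lengths(parts) == 6]   # drop rows that did not match
parts <- do.call(rbind, parts)

movies <- data.frame(
  rank   = as.integer(parts[, 4]),
  title  = parts[, 5],
  year   = as.integer(parts[, 6]),
  # The site uses a decimal comma, so convert it before as.numeric()
  rating = as.numeric(sub(",", ".", parts[, 2], fixed = TRUE)),
  votes  = as.integer(parts[, 3]),
  stringsAsFactors = FALSE
)

print(movies)
```

Because each row is parsed as a whole, the rating, title, and year always come from the same film, which is harder to guarantee when three separately scraped vectors are bound together.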