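I am trying to scrape multiple pages of the same game review website to collect the ratings.
I tried running and adapting the code from one of the answers to "R web scraping across multiple pages":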
library(tidyverse)
library(rvest)
url_base <- "https://www.metacritic.com/browse/games/score/metascore/all/ps4?sort=desc&page=0"
map_df(1:17, function(i) {
  cat(".")
  pg <- read_html(sprintf(url_base, i))
  data.frame(Name = html_text(html_nodes(pg, "#main .product_title a")),
             MetaRating = as.numeric(html_text(html_nodes(pg, "#main .positive"))),
             UserRating = as.numeric(html_text(html_nodes(pg, "#main .textscore"))),
             stringsAsFactors = FALSE)
}) -> ps4games_metacritic
The result is that the first page gets scraped 17 times, instead of the site's 17 pages being scraped once each.
Answer 0 (score: 0)
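I made three changes to your code:
1. map_df(1:17... should be map_df(0:16..., because the site's page numbering starts at 0.
2. url_base should be set like this, with a %d placeholder so that sprintf() can insert the page number:
url_base <- "https://www.metacritic.com/browse/games/score/metascore/all/ps4?sort=desc&page=%d"
3. If you use "#main .positive", the code will break on page 7, because games without a positive score start there. Unless you only want to scrape the games with positive scores (which would need different code), you should use "#main .game" instead.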
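The first two changes can be checked without touching the network. The snippet below is only a small sketch in base R: it builds the 17 URLs with and without the %d placeholder and counts how many distinct URLs come out. The variable names url_fixed, url_template and pages are made up for this illustration.

```r
# Base R only; nothing is downloaded here.
url_fixed    <- "https://www.metacritic.com/browse/games/score/metascore/all/ps4?sort=desc&page=0"
url_template <- "https://www.metacritic.com/browse/games/score/metascore/all/ps4?sort=desc&page=%d"

pages <- 0:16  # 17 pages, numbered from 0

# Without a placeholder the page number is ignored. Newer R versions also
# warn about the unused argument, so the warning is suppressed here.
urls_fixed <- suppressWarnings(sprintf(url_fixed, pages))

# With %d each page number is substituted into the query string.
urls_template <- sprintf(url_template, pages)

length(unique(urls_fixed))     # 1: every request would hit the same page
length(unique(urls_template))  # 17: one URL per page
```

This matches the symptom in the question: with the original url_base every iteration requests page=0.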
library(tidyverse)
library(rvest)
# %d is filled in by sprintf() with the page number
url_base <- "https://www.metacritic.com/browse/games/score/metascore/all/ps4?sort=desc&page=%d"
# Pages are numbered from 0, so 0:16 covers all 17 pages
map_df(0:16, function(i) {
  cat(".")  # progress indicator, one dot per page
  pg <- read_html(sprintf(url_base, i))
  # ".game" matches the metascore node whether or not the score is positive
  data.frame(Name = html_text(html_nodes(pg, "#main .product_title a")),
             MetaRating = as.numeric(html_text(html_nodes(pg, "#main .game"))),
             UserRating = as.numeric(html_text(html_nodes(pg, "#main .textscore"))),
             stringsAsFactors = FALSE)
}) -> ps4games_metacritic
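As for why the ".positive" selector breaks on a later page: one likely mechanism (an assumption, not something confirmed against the live site) is that the selector returns fewer score nodes than there are game names, and data.frame() refuses columns of incompatible lengths. The sketch below reproduces that with made-up vectors standing in for the html_text() results. build_page_table is a hypothetical helper, not part of rvest, that fails early with a message naming the page.

```r
# Hypothetical values standing in for html_text() results from one page.
game_names  <- c("Game A", "Game B", "Game C")
meta_scores <- c("91", "88")  # one score node did not match the selector

# data.frame() rejects columns whose lengths cannot be recycled evenly.
mismatch <- tryCatch(
  data.frame(Name = game_names, MetaRating = as.numeric(meta_scores)),
  error = function(e) conditionMessage(e)
)
print(mismatch)

# Hypothetical helper: check lengths first and report which page failed.
build_page_table <- function(page, game_names, meta_scores) {
  if (length(game_names) != length(meta_scores)) {
    stop(sprintf("page %d: %d names but %d scores",
                 page, length(game_names), length(meta_scores)))
  }
  data.frame(Name = game_names,
             MetaRating = as.numeric(meta_scores),
             stringsAsFactors = FALSE)
}

ok <- build_page_table(0L, c("Game A", "Game B"), c("91", "88"))
print(ok)
```

A check like this could be called inside the map_df() function in place of the bare data.frame() call, so that a selector mismatch points at the offending page instead of stopping with a generic error.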