How to use rvest to scrape all movie reviews from IMDb

Time: 2016-12-21 06:41:03

Tags: r web-scraping

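I'm new to web scraping and want to use it for sentiment analysis. I have successfully scraped the first 10 reviews. For the other 280 reviews, I'm reluctant to repeat the process below more than 20 times... Is there a package or function that would let me scrape all the reviews in a simpler way? Many thanks!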

library(rvest)
library(XML)
library(plyr)
HouseofCards_IMDb <- read_html("http://www.imdb.com/title/tt1856010/reviews?ref_=tt_urv")

# Used SelectorGadget to find the CSS selector
reviews <- HouseofCards_IMDb %>%
  html_nodes("#pagecontent") %>%
  html_nodes("div+p") %>%
  html_text()

# Perform data cleaning on user reviews
reviews <- gsub("\r?\n|\r", " ", reviews)                # replace line breaks with spaces
reviews <- tolower(gsub("[^[:alnum:] ]", " ", reviews))  # strip punctuation, then lowercase
print(reviews)

1 Answer:

Answer 0: (score: 2)

Welcome to SO.

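If you go to the second page of reviews, you'll notice that the URL changes from http://www.imdb.com/title/tt1856010/reviews to http://www.imdb.com/title/tt1856010/reviews?start=10.

The last page is: http://www.imdb.com/title/tt1856010/reviews?start=290

All you have to do is loop over the pages: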

library(rvest)

result <- c()
# start=1 covers the first page; start=10, 20, ..., 290 cover the rest
for (i in c(1, seq(10, 290, 10))) {
  link <- paste0("http://www.imdb.com/title/tt1856010/reviews?start=", i)
  HouseofCards_IMDb <- read_html(link)

  # Used SelectorGadget to find the CSS selector
  reviews <- HouseofCards_IMDb %>%
    html_nodes("#pagecontent") %>%
    html_nodes("div+p") %>%
    html_text()

  # Perform data cleaning on user reviews
  reviews <- gsub("\r?\n|\r", " ", reviews)
  reviews <- tolower(gsub("[^[:alnum:] ]", " ", reviews))

  # Append this page's reviews to the running result
  result <- c(result, reviews)
}
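
The cleaning step inside the loop could also be pulled out into a small helper so it can be checked on its own before running the full scrape. The sketch below is only one way to do that: `clean_reviews` is a hypothetical name, and collapsing repeated spaces and trimming the ends are additions that the original code does not perform.

```r
# Hypothetical helper (not part of rvest): normalise raw review text.
clean_reviews <- function(x) {
  x <- gsub("\r?\n|\r", " ", x)        # replace line breaks with spaces
  x <- gsub("[^[:alnum:] ]", " ", x)   # replace punctuation and symbols with spaces
  x <- gsub(" +", " ", x)              # assumption: collapse runs of spaces
  tolower(trimws(x))                   # assumption: trim ends, then lowercase
}

cleaned <- clean_reviews(c("Great show!\r\nLoved it.", "  Meh...  "))
print(cleaned)
```

Inside the loop, the two `gsub` lines would then become `reviews <- clean_reviews(reviews)`.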

Note that we start from http://www.imdb.com/title/tt1856010/reviews?start=1, which is similar to http://www.imdb.com/title/tt1856010/reviews
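
As a rough sketch, the same loop can be split into a step that builds the page URLs up front and a step that fetches them, which makes the offsets easy to inspect before any request is sent. The names `review_urls`, `scrape_page` and `scrape_all` are hypothetical, and the `delay` pause and the `tryCatch` fallback are assumptions added here, not part of the original answer. The CSS selectors are the ones from the question and may no longer match IMDb's current markup.

```r
# Build one URL per page of reviews, using the same offsets as the loop above.
review_urls <- function(base_url, last_start = 290, step = 10) {
  starts <- c(1, seq(step, last_start, by = step))
  paste0(base_url, "?start=", starts)
}

# Fetch one page and return its review texts (requires the rvest package).
# Assumption: on a failed request, warn and return an empty character vector.
scrape_page <- function(url) {
  tryCatch({
    page <- rvest::read_html(url)
    content <- rvest::html_nodes(page, "#pagecontent")
    rvest::html_text(rvest::html_nodes(content, "div+p"))
  }, error = function(e) {
    warning("Failed to read ", url, ": ", conditionMessage(e))
    character(0)
  })
}

# Fetch every page in turn, pausing between requests (assumed courtesy delay).
scrape_all <- function(urls, delay = 1) {
  pages <- lapply(urls, function(u) {
    Sys.sleep(delay)
    scrape_page(u)
  })
  unlist(pages)
}

urls <- review_urls("http://www.imdb.com/title/tt1856010/reviews")
# result <- scrape_all(urls)  # uncomment to send one HTTP request per URL
```

Collecting the pages with `lapply` and flattening once with `unlist` avoids growing `result` inside the loop, which matters little for 30 pages but scales better for longer review lists.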