I'm new to web scraping and want to use it for sentiment analysis. I've successfully scraped the first 10 reviews. For the remaining 280 reviews, I'm hesitant to repeat the process below more than 20 times... I'm wondering whether there is a package/function that would let me scrape all the reviews in a simpler way? Many thanks!
library(rvest)
library(XML)
library(plyr)
HouseofCards_IMDb <- read_html("http://www.imdb.com/title/tt1856010/reviews?ref_=tt_urv")
#Used SelectorGadget as the CSS Selector
reviews <- HouseofCards_IMDb %>% html_nodes("#pagecontent") %>%
html_nodes("div+p") %>%
html_text()
#perform data cleaning on user reviews
reviews <- gsub("\r?\n|\r", " ", reviews)
reviews <- tolower(gsub("[^[:alnum:] ]", " ", reviews))
sapply(reviews, function(x){}) # note: this call has no effect and can be removed
print(reviews)
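As a quick offline check, here is what the two `gsub` cleaning steps above do to a made-up sample string:

```r
# Quick check of the cleaning steps on a made-up sample string
x <- "Great show!\r\nSeason 2 was *amazing*."
x <- gsub("\r?\n|\r", " ", x)                 # newlines -> spaces
x <- tolower(gsub("[^[:alnum:] ]", " ", x))   # strip punctuation, lowercase
print(x)
# "great show  season 2 was  amazing  "
```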
Answer 0 (score: 2)
Welcome to SO.
If you go to the second page of reviews, you'll notice that the URL changes from http://www.imdb.com/title/tt1856010/reviews to http://www.imdb.com/title/tt1856010/reviews?start=10, and the last page is http://www.imdb.com/title/tt1856010/reviews?start=290.
All you have to do is loop over the pages:
result <- c()
for(i in c(1, seq(10, 290, 10))) {
link <- paste0("http://www.imdb.com/title/tt1856010/reviews?start=",i)
HouseofCards_IMDb <- read_html(link)
# Used SelectorGadget as the CSS Selector
reviews <- HouseofCards_IMDb %>% html_nodes("#pagecontent") %>%
html_nodes("div+p") %>%
html_text()
# perform data cleaning on user reviews
reviews <- gsub("\r?\n|\r", " ", reviews)
reviews <- tolower(gsub("[^[:alnum:] ]", " ", reviews))
result <- c(result, reviews)
}
Note that we start at http://www.imdb.com/title/tt1856010/reviews?start=1, which gives the same page as http://www.imdb.com/title/tt1856010/reviews.
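The same pagination idea can also be written without growing `result` inside a loop. A minimal sketch, assuming the same IMDb URL scheme as above (the actual fetch is left commented out so the snippet runs without network access):

```r
# rvest is assumed to be attached already (library(rvest) above)

# Same page offsets as in the loop: 1, 10, 20, ..., 290
starts <- c(1, seq(10, 290, 10))
urls   <- paste0("http://www.imdb.com/title/tt1856010/reviews?start=", starts)

# One page -> one character vector of review texts
scrape_page <- function(url) {
  read_html(url) %>%
    html_nodes("#pagecontent") %>%
    html_nodes("div+p") %>%
    html_text()
}

# Fetch all 30 pages and flatten into a single character vector:
# result <- unlist(lapply(urls, scrape_page))
```

Building the URL vector first keeps the fetching logic in one small function and avoids repeatedly copying `result` with `c()` on every iteration.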