I am trying to scrape several pages of TripAdvisor reviews for an academic project.
Here is my attempt using R:
#Load libraries
library(rvest)
library(RSelenium)
# main url for stadium
urlmainlist=c(
hampdenpark="http://www.tripadvisor.com.ph/Attraction_Review-g186534-d214132-Reviews-Hampden_Park-Glasgow_Scotland.html"
)
# Specify how many search pages and counter
morepglist=list(
hampdenpark=seq(10,360,10)
)
#----------------------------------------------------------------------------------------------------------
# create pickstadium variable
pickstadium="hampdenpark"
# get list of urllinks corresponding to different pages
# url link for first search page
urllinkmain=urlmainlist[pickstadium]
# counter for additional pages
morepg=as.numeric(morepglist[[pickstadium]])
urllinkpre=paste(strsplit(urllinkmain,"Reviews-")[[1]][1],"Reviews",sep="")
urllinkpost=strsplit(urllinkmain,"Reviews-")[[1]][2]
urllink=rep(NA,length(morepg)+1)
urllink[1]=urllinkmain
for(i in 1:length(morepg)){
  urllink[i+1]=paste(urllinkpre,"-or",morepg[i],"-",urllinkpost,sep="")
}
head(urllink)
write.csv(urllink,'urllink.csv')
##########
#SCRAPING#
##########
library(rvest)
library(RSelenium)
#install.packages('RSelenium')
# Read the URLs back in: write.csv stored the row names in column 1 and the URLs in column 2
testurl <- read.csv("urllink.csv", stringsAsFactors = FALSE)
# Named urls rather than list, to avoid masking base::list
urls <- testurl[[2]]
tripadvisor <- NULL
#Scrape
for(i in seq_along(urls)){
  # All review containers on the current page
  reviews <- urls[i] %>%
    read_html() %>%
    html_nodes("#REVIEWS .innerBubble")
  id <- reviews %>%
    html_node(".quote a") %>%
    html_attr("id")
  rating <- reviews %>%
    html_node(".rating .rating_s_fill") %>%
    html_attr("alt") %>%
    gsub(" of 5 stars", "", .) %>%
    as.integer()
  date <- reviews %>%
    html_node(".rating .ratingDate") %>%
    html_attr("title") %>%
    strptime("%b %d, %Y") %>%
    as.POSIXct()
  review <- reviews %>%
    html_node(".entry .partial_entry") %>%
    html_text() %>%
    as.character()
  # One row per review on this page, stacked onto the results so far
  rowthing <- data.frame(id, review, rating, date, stringsAsFactors = FALSE)
  tripadvisor <- rbind(rowthing, tripadvisor)
}
However, this results in an empty tripadvisor data frame. Any help resolving this would be greatly appreciated.
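As a first debugging step, I could count how many nodes each of my CSS selectors matches on a page, since a selector that matches nothing would explain the empty result. This is only a minimal sketch: `sample_html` is a made-up stand-in, not real TripAdvisor markup, and `match_counts` is just an illustrative name. On the real site I would pass `read_html(urllink[1])` instead.

```r
library(rvest)

# Hypothetical stand-in for one downloaded review page (NOT real TripAdvisor markup)
sample_html <- '<html><body><div id="REVIEWS">
  <div class="innerBubble">
    <div class="quote"><a id="rn1">Great stadium</a></div>
    <div class="entry"><p class="partial_entry">Good tour...</p></div>
  </div>
</div></body></html>'

page <- read_html(sample_html)

# Selectors copied from the scraping loop above
selectors <- c("#REVIEWS .innerBubble",
               ".quote a",
               ".rating .rating_s_fill",
               ".rating .ratingDate",
               ".entry .partial_entry")

# Count the nodes each selector matches; a 0 marks a selector that finds nothing
match_counts <- vapply(selectors,
                       function(css) length(html_nodes(page, css)),
                       integer(1))
print(match_counts)
```

If the first selector already returns 0 on the live page, the markup has probably changed or the reviews are not present in the HTML that `read_html()` downloads.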
Additional question
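I would like to capture the full text of each review, since my code currently captures only the partial entries. For each review, I want to automatically click the "More" link and then extract the full review.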
Any help with this would be greatly appreciated as well.
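What I have in mind is roughly the following, as an untested sketch. It assumes `remDr` is an RSelenium remote driver client, and `get_expanded_page` is a hypothetical helper of my own. The `more_css` default is a guess at the selector for the "More" link, which I have not verified against the live page.

```r
library(rvest)

# Hypothetical helper: open a page in the browser, click every "More" link,
# then hand the rendered HTML to rvest.
# more_css is an ASSUMED selector, not checked against the real site.
get_expanded_page <- function(remDr, url, more_css = ".partial_entry .moreLink") {
  remDr$navigate(url)
  Sys.sleep(2)  # crude fixed wait for the page to load
  more_links <- remDr$findElements(using = "css selector", value = more_css)
  for (link in more_links) {
    # A link may be hidden or stale; skip it rather than abort the whole page
    try(link$clickElement(), silent = TRUE)
  }
  Sys.sleep(1)  # give the expanded text time to appear
  read_html(remDr$getPageSource()[[1]])
}

# Usage (needs a local browser and driver, so it is left commented out):
# library(RSelenium)
# rD <- rsDriver(browser = "firefox")
# remDr <- rD$client
# page <- get_expanded_page(remDr, urllink[1])
# full_reviews <- page %>% html_nodes(".entry .partial_entry") %>% html_text()
# remDr$close()
# rD$server$stop()
```

The fixed `Sys.sleep()` calls are just placeholders for a proper wait, and I am not sure whether the expanded text ends up under the same `.partial_entry` nodes.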