I am using this code to extract data from TripAdvisor.
install.packages("rvest")
library(rvest)
install.packages("xmlparsedata")
library(xmlparsedata)
install.packages("xml2")
library(xml2)
install.packages("XML")
library(XML)
url.1 <- "https://www.tripadvisor.ie/Restaurant_Review-g186605-d4046860-Reviews-The_Stage_Door_Cafe-Dublin_County_Dublin.html"
reviews <- url.1 %>%
read_html() %>%
html_nodes("#REVIEWS .innerBubble")
id <- reviews %>%
html_node(".quote a") %>%
html_attr("id")
quote <- reviews %>%
html_node(".quote span") %>%
html_text()
rating <- reviews %>%
html_node(".rating .rating.bubble") %>%
html_attr("alt") %>%
gsub(" of 5 stars", "", .) %>%
as.integer()
date <- reviews %>%
html_node(".ratingDate .relativeDate") %>%
html_attr("title") %>%
strptime("%b %d, %Y") %>%
as.POSIXct()
review <- reviews %>%
html_node(".entry .partial_entry" ) %>%
html_text()
a.1 <- data.frame(id, quote, rating, date, review, stringsAsFactors = FALSE)
The problem I am facing is the "More" button in the reviews. I cannot use the RSelenium package to click it, because RSelenium has been archived on CRAN.
install.packages("seleniumPipes")
library(seleniumPipes)
install.packages("devtools")
library(devtools)
ra <- "https://cran.r-project.org/src/contrib/Archive/rappdirs/rappdirs_0.3.tar.gz"
install.packages(ra, repos=NULL, type="source", dependencies = TRUE)
library(rappdirs)
sem <- "https://cran.r-project.org/src/contrib/Archive/semver/semver_0.1.0.tar.gz"
install.packages(sem, repos=NULL, type="source", dependencies = TRUE)
library(semver)
bin <- "https://cran.r-project.org/src/contrib/Archive/binman/binman_0.0.7.tar.gz"
install.packages(bin, repos=NULL, type="source", dependencies = TRUE)
library(binman)
sub <- "https://cran.r-project.org/src/contrib/subprocess_0.8.2.tar.gz"
install.packages(sub, repos=NULL, type="source")
library(subprocess)
wd <- "https://cran.r-project.org/src/contrib/Archive/wdman/wdman_0.2.2.tar.gz"
install.packages(wd, repos=NULL, type="source", dependencies = TRUE)
library(wdman)
packageurl <- "https://cran.r-project.org/src/contrib/Archive/RSelenium/RSelenium_1.6.2.tar.gz"
install.packages(packageurl, repos=NULL, type="source")
library(RSelenium)
I have tried installing all of the archived packages manually, but to no avail: I cannot get Selenium to start. I also tried running Selenium in Docker, with no luck.
remDr <- RSelenium::remoteDriver(remoteServerAddr = "192.168.43.66",
                                 port = 4444L,
                                 browserName = "phantomjs")
This call succeeds, but when I run
remDr$open()
I get the following error:
[1] "Connecting to remote server"
Error in checkError(res) :
  Undefined error in httr call. httr output: Failed to connect to 10.3.100.207 port 4444: Network is unreachable
Is there any other workaround that would let me click the "More" button using the rvest package? RSelenium seems somewhat outdated.
Here is a link to a screenshot of the "More" button.