How to scrape text from a web page that requires interaction, in R

Time: 2018-05-09 23:04:47

Tags: r web-scraping rvest

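I am trying to scrape reviews from a web page to determine word frequency. However, when a review is long, only part of it is displayed. You have to click "More" to make the page show the full review. Below is the code I use to extract the review text. How can I "click" More to get the full reviews?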

library(rvest)

tripAdvisorURL <- "https://www.tripadvisor.com/Hotel_Review-g33657-d85704-Reviews-Hotel_Bristol-Steamboat_Springs_Colorado.html#REVIEWS"

webpage <- read_html(tripAdvisorURL)

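# html_nodes() replaces the deprecated xml_nodes(); the XPath matches class "partial_entry"
reviewData <- html_nodes(webpage, xpath = '//*[contains(concat( " ", @class, " " ), concat( " ", "partial_entry", " " ))]')

head(reviewData)

# Text of the first review; long reviews are truncated and end in "...More"
html_text(reviewData[[1]])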

[1] "The rooms were clean and we slept so good we had room 10 and 12 we didn’t use 12 but it joins 10 .kind of strange but loved the hotel ..me personally I would take the hot tub out it was kinda old..the lady that...More"

1 Answer:

Answer 0 (score: 1)

As mentioned in the comments, you can use RSelenium together with rvest to get the interactivity you need:

library(RSelenium)
library(rvest)  # provides read_html(), html_nodes(), html_text() and %>% used below

rmDr <- rsDriver(browser = "chrome")

myclient <- rmDr$client
tripAdvisorURL <- "https://www.tripadvisor.com/Hotel_Review-g33657-d85704-Reviews-Hotel_Bristol-Steamboat_Springs_Colorado.html#REVIEWS"
myclient$navigate(tripAdvisorURL)
# Select every "More" link, then loop over them and click each one to expand the review
webEles <- myclient$findElements(using = "css",value = ".ulBlueLinks")
for (webEle in webEles) {
    webEle$clickElement()
}

mypagesource <- myclient$getPageSource()

read_html(mypagesource[[1]]) %>%
    html_nodes(".partial_entry") %>%
    html_text()
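
Since the goal in the question is word frequency, here is a minimal base-R sketch of that last step. It assumes the pipeline above has returned a character vector of full review texts; the `reviews` vector below is a made-up stand-in for that result, and `word_frequency()` is a hypothetical helper name, not a function from rvest or RSelenium.

```r
# Hypothetical stand-in for the character vector returned by html_text()
reviews <- c(
  "The rooms were clean and the hotel was quiet.",
  "Clean rooms, friendly staff, old hot tub."
)

# Illustrative helper (an assumption, not part of any package)
word_frequency <- function(texts) {
  # Normalise curly apostrophes so words like "didn’t" stay in one piece
  texts <- gsub("\u2019", "'", texts, fixed = TRUE)
  # Lowercase, then split on any run of characters that is not a letter or apostrophe
  words <- unlist(strsplit(tolower(texts), "[^a-z']+"))
  # Drop empty strings produced when a text starts with a separator
  words <- words[nzchar(words)]
  # Count each word and put the most frequent first
  sort(table(words), decreasing = TRUE)
}

freq <- word_frequency(reviews)
head(freq, 5)
```

This splits on non-letters only, so it works for plain English text; removing stop words or handling other languages would need extra steps, for example with a text-mining package.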