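I use the XPath given below to extract date content from various pages, and on most of them I get the expected results. For this page, however, "http://eventsgeneva.strikingly.com//blog/agenda-geneve-something-you-should-never-miss", the XPath gives the desired result in the Chrome inspector but returns nothing when I use the same XPath in R.
Using the XPath below in Chrome: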
xpath = '((//h1/parent::*/following::*|//h1/ancestor::*[position()<3]/descendant-or-self::*)[position()<150 and (string-length(text())<150 and (contains(text(), "Jan") or contains(text(), "Feb") or contains(text(), "Mar") or contains(text(), "Apr") or contains(text(), "May") or contains(text(), "Jun") or contains(text(), "Jul") or contains(text(), "Aug") or contains(text(), "Sep") or contains(text(), "Oct") or contains(text(), "Nov") or contains(text(), "Dec")))])'
I get the desired result.
Using the same XPath in R with the xml2 library, I get an empty node set (0 nodes):
library(dplyr)
library(xml2)
html_page <- read_html("http://eventsgeneva.strikingly.com//blog/agenda-geneve-something-you-should-never-miss")
html_page %>%
xml_find_all(xpath = '((//h1/parent::*/following::*|//h1/ancestor::*[position()<3]/descendant-or-self::*)[position()<150 and (string-length(text())<150 and (contains(text(), "Jan") or contains(text(), "Feb") or contains(text(), "Mar") or contains(text(), "Apr") or contains(text(), "May") or contains(text(), "Jun") or contains(text(), "Jul") or contains(text(), "Aug") or contains(text(), "Sep") or contains(text(), "Oct") or contains(text(), "Nov") or contains(text(), "Dec")))])')
#> {xml_nodeset (0)}
Am I missing something?
Answer 0 (score: 1)
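A corollary to the above: read_html() only fetches the raw HTML and does not run the page's JavaScript, so the date element that Chrome shows is most likely not in the document that R parses. The options below either render the page first or pull the data out of the page's script.
decapitated
library(rvest)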
library(decapitated)
library(tidyverse)
doc <- decapitated::chrome_read_html("http://eventsgeneva.strikingly.com//blog/agenda-geneve-something-you-should-never-miss")
html_nodes(doc, xpath = '((//h1/parent::*/following::*|//h1/ancestor::*[position()<3]/descendant-or-self::*)[position()<150 and (string-length(text())<150 and (contains(text(), "Jan") or contains(text(), "Feb") or contains(text(), "Mar") or contains(text(), "Apr") or contains(text(), "May") or contains(text(), "Jun") or contains(text(), "Jul") or contains(text(), "Aug") or contains(text(), "Sep") or contains(text(), "Oct") or contains(text(), "Nov") or contains(text(), "Dec")))])')
## {xml_nodeset (1)}
## [1] <span class="s-blog-date">August 4, 2018</span>
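Please read the README and the package documentation, since you need Chrome (preferably a standalone Chromium binary, as described in the package) and some environment variables set up, and you will have to debug any setup issues on your own.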
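Before installing a renderer, it can help to confirm that client-side rendering really is the cause. The sketch below is only an illustration: has_class() is a hypothetical helper, and the two HTML strings are toy stand-ins, not the real page source. It checks whether an element with a given class exists in the HTML that xml2 actually parses.

```r
library(xml2)

# Hypothetical helper: TRUE if any element in `html` carries the CSS class `cls`.
# `html` can be literal HTML, a file path, or a URL (anything read_html() accepts).
has_class <- function(html, cls) {
  doc <- read_html(html)
  xp <- sprintf(
    "//*[contains(concat(' ', normalize-space(@class), ' '), ' %s ')]",
    cls
  )
  length(xml_find_all(doc, xp)) > 0
}

# Toy stand-ins (assumptions, not the real page):
# what the server might send, with the date buried in a script variable ...
static_html <- '<html><body><h1>Title</h1><script>window.$S = {"date": "August 4, 2018"};</script></body></html>'
# ... and what the browser shows after the script has built the DOM.
rendered_html <- '<html><body><h1>Title</h1><span class="s-blog-date">August 4, 2018</span></body></html>'

has_class(static_html, "s-blog-date")    # no such element before rendering
has_class(rendered_html, "s-blog-date")  # element exists after rendering
```

Calling has_class() with the page URL and "s-blog-date" should return FALSE if the date is injected by JavaScript, which is what the empty node set in the question suggests.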
splashr
The splashr package requires the reticulate package, Docker, and the Python docker module, so expect even more self-debugging if you run into problems:
library(rvest)
library(splashr)
library(tidyverse)
sp <- splashr::start_splash()
doc <- render_html(splash_local, "http://eventsgeneva.strikingly.com//blog/agenda-geneve-something-you-should-never-miss")
html_nodes(doc, xpath = '((//h1/parent::*/following::*|//h1/ancestor::*[position()<3]/descendant-or-self::*)[position()<150 and (string-length(text())<150 and (contains(text(), "Jan") or contains(text(), "Feb") or contains(text(), "Mar") or contains(text(), "Apr") or contains(text(), "May") or contains(text(), "Jun") or contains(text(), "Jul") or contains(text(), "Aug") or contains(text(), "Sep") or contains(text(), "Oct") or contains(text(), "Nov") or contains(text(), "Dec")))])')
## {xml_nodeset (1)}
## [1] <span class="s-blog-date">August 4, 2018</span>
killall_splash()
V8
To avoid external programs, you can use V8 to process the page variable and get the content:
library(rvest)
library(V8)
library(tidyverse)
ctx <- v8()
doc <- read_html("http://eventsgeneva.strikingly.com//blog/agenda-geneve-something-you-should-never-miss")
html_nodes(doc, xpath=".//script")[[1]] %>% # get 1st <script>
html_text() %>% # get contents of it
str_replace(regex("^.*window\\.", multiline=TRUE), "var $S = {};\n") %>% # make the variable usable in V8
ctx$eval() # evaluate the javascript
## [1] "[object Object]"
pg <- ctx$get("$S") # marshall it to R
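To make the substitution step easier to follow, here is a minimal sketch of the same idea on a toy script string. The string is a made-up stand-in for the page's first <script>, and base sub() is used in place of the stringr call above, so treat it as an illustration rather than a drop-in replacement.

```r
library(V8)

# Toy stand-in for the contents of the page's first <script> (not the real data).
script <- 'window.$S = {"blogPostData": {"content": {"showComments": true}}};'

# V8 has no browser `window` object, so swap the `window.` prefix for a
# declaration; the assignment that follows then targets a plain global `$S`.
js <- sub("window\\.", "var $S = {};\n", script)

ctx <- v8()
ctx$eval(js)         # runs: var $S = {}; $S = {...};
pg <- ctx$get("$S")  # converted from JavaScript to an R list via JSON

pg$blogPostData$content$showComments
```

The only moving part is the prefix swap: once `$S` is an ordinary global variable, ctx$get() can hand it back to R as a nested list.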
pg is a large structure, so examine it systematically:
str(pg, 1)
## List of 6
## $ globalConf :List of 26
## $ conf :List of 12
## $ miniProgramAppType: NULL
## $ blogPostData :List of 5
## $ siteData :List of 5
## $ stores :List of 3
str(pg$blogPostData, 1)
## List of 5
## $ blogPostMeta:List of 25
## $ pageMeta :List of 33
## $ content :List of 8
## $ settings :List of 2
## $ pageMode : NULL
str(pg$blogPostData$content, 1)
## List of 8
## $ type : chr "Blog.BlogData"
## $ id : chr "f_cc4ace2d-21ed-4b94-83a0-e83497e5afc4"
## $ defaultValue : NULL
## $ showComments : logi TRUE
## $ showShareButtons: NULL
## $ header :List of 6
## $ footer :List of 5
## $ sections :'data.frame': 9 obs. of 4 variables:
The content appears to be here:
str(pg$blogPostData$content$sections)
## 'data.frame': 9 obs. of 4 variables:
## $ type : chr "Blog.Section" "Blog.Section" "Blog.Section" "Blog.Section" ...
## $ id : chr "f_9ca5a1d7-ccb8-4315-9883-bcd43d271b9c" "f_4b7b30f1-387c-4cbe-aaed-ddaedea92cc1" "f_252813ac-b6cb-484b-81f5-64d7f0745c8e" "f_bd7412a4-b94b-4c5a-8cdd-a48931639dce" ...
## $ defaultValue: logi NA NA NA NA NA NA ...
## $ component :'data.frame': 9 obs. of 6 variables:
## ..$ type : chr "RichText" "RichText" "RichText" "RichText" ...
## ..$ id : chr "f_4e41d6f3-8449-4f66-b701-28d1bcfb08c9" "f_c27703de-8679-4916-9697-220cb8c7a74d" "f_c3c20474-99fc-434a-aff1-102d2a342450" "f_7b3e5247-39ef-42c7-b95c-f0be0b6e9728" ...
## ..$ defaultValue: logi FALSE NA NA NA NA NA ...
## ..$ value : chr "<p style=\"text-align: justify;\">We all make our plans beforehand in order to avoid any unnecessary issues. So"| __truncated__ "<p style=\"text-align: justify;\">Take a glance at the below-listed events and plan accordingly -</p>" "<p style=\"text-align: justify;\"><u>Siestes dominicales</u> – Here you are invited to groove on the grass and "| __truncated__ "<p style=\"text-align: justify;\"><u>Sonoboat ACT</u> – Neptune is one the most popular and historic sailing bo"| __truncated__ ...
## ..$ backupValue : logi NA NA NA NA NA NA ...
## ..$ version : int 1 NA NA NA NA NA NA 1 1
Either parse each value individually, or paste0() them all into one HTML block and parse that.
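Both options can be sketched as follows. The values vector is a shortened, hypothetical stand-in for pg$blogPostData$content$sections$component$value, and xml2 is assumed for the parsing (rvest's html_nodes() and html_text() would work the same way).

```r
library(xml2)

# Shortened stand-in for pg$blogPostData$content$sections$component$value
# (an assumption for illustration; the real column has 9 HTML fragments).
values <- c(
  '<p style="text-align: justify;">Take a glance at the below-listed events and plan accordingly -</p>',
  '<p style="text-align: justify;"><u>Siestes dominicales</u> are held on the grass</p>'
)

# Option 1: parse each fragment on its own and take its text.
texts_each <- vapply(
  values,
  function(v) xml_text(read_html(v)),
  character(1),
  USE.NAMES = FALSE
)

# Option 2: paste the fragments into one HTML block, parse once,
# then pull out the paragraphs.
block <- read_html(paste0(values, collapse = "\n"))
texts_block <- xml_text(xml_find_all(block, ".//p"))

texts_block
```

Parsing once is simpler when you want to run a single XPath over all sections; parsing each fragment separately keeps the result aligned with the rows of the sections data frame.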
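As an aside, Strikingly has one of the daftest and lowest content integrity/security publishing solutions I have seen in a while. I know you are only scraping it, but I would advise anyone considering using them not to.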