Question

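I'm new to web scraping, and I'm trying to build a scraper in R that accesses information in a website's source code/HTML.

Specifically, I want to be able to determine whether one (or more) websites have an id containing the specific text "google_ads_iframe". The id will always be longer than this, so I think I will have to use a wildcard.

I've tried several options (see below), but so far nothing has worked.

First approach: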

doc <- htmlTreeParse("http://www.funda.nl/") 

data <- xpathSApply(doc, "//div[contains(@id, 'google_ads_iframe')]", xmlValue, trim = TRUE)

The error message reads:

Error in UseMethod("xpathApply") : 
  no applicable method for 'xpathApply' applied to an object of class "XMLDocumentContent"

Second approach:

scrapestuff <- scrape(url = "http://www.funda.nl/", parse = T, headers = T)

x <- xpathSApply(scrapestuff[[1]],"//div[contains(@class, 'google_ads_iframe')]",xmlValue)

x is returned as an empty list.

Third approach:

scrapestuff <- read_html("http://www.funda.nl/")
hh <- htmlParse(scrapestuff, asText=T)
x <- xpathSApply(hh,"//div[contains(@id, 'google_ads_iframe')]",xmlValue)

Again, x is returned as an empty list.

I can't figure out what I'm doing wrong, so any help would be great!

Answer 1

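My ad blocker is probably preventing me from seeing Google ad iframes, but you don't have to waste cycles on extra R functions to test for the presence of something. Let the optimized C functions in libxml2 (which underpins both the rvest and xml2 packages) do the work for you by wrapping your XPath in boolean():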

library(xml2)

pg <- read_html("http://www.funda.nl/")

xml_find_lgl(pg, "boolean(.//div[contains(@class, 'featured')])")
## [1] TRUE

xml_find_lgl(pg, "boolean(.//div[contains(@class, 'futured')])")
## [1] FALSE
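As a minimal sketch of the difference, the following compares collecting the matching nodes in R with asking libxml2 for a single logical. The inline HTML and the variable names are illustrative assumptions, not taken from funda.nl.

```r
library(xml2)

# Inline HTML stands in for a downloaded page (illustrative only).
html <- '<html><body>
  <div class="featured-item">A</div>
  <div class="plain">B</div>
</body></html>'
pg <- read_html(html)

# Approach 1: collect every matching node in R, then count them.
via_nodes <- length(xml_find_all(pg, ".//div[contains(@class, 'featured')]")) > 0

# Approach 2: let libxml2 evaluate the test and return one logical value.
via_boolean <- xml_find_lgl(pg, "boolean(.//div[contains(@class, 'featured')])")

print(c(via_nodes = via_nodes, via_boolean = via_boolean))
```

Both approaches agree on the answer; the boolean() form simply avoids building a node set in R when only a yes/no result is needed.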

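Another issue you'll need to deal with is that the Google ad iframes are most likely generated by JavaScript after the page loads, which means using RSelenium to grab the page source (you can then apply this method to the resulting page source).

Update

I found a page example with google_ads_iframe in it: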

pg <- read_html("http://codepen.io/anon/pen/Jtizx.html")

xml_find_lgl(pg, "boolean(.//div[iframe[contains(@id, 'google_ads_iframe')]])")
## [1] TRUE

xml_find_num(pg, "count(.//div[iframe[contains(@id, 'google_ads_iframe')]])")
## [1] 3
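Since the question asks about one or more websites, the same check could be wrapped in a small function and applied to several pages. This is only a sketch: has_ad_iframe is a hypothetical helper, and the inline HTML strings stand in for URLs or page sources so the example runs offline.

```r
library(xml2)

# Hypothetical helper: TRUE if some div has a child iframe whose id
# contains "google_ads_iframe"; NA if the page cannot be read or parsed.
has_ad_iframe <- function(src) {
  tryCatch(
    xml_find_lgl(
      read_html(src),
      "boolean(.//div[iframe[contains(@id, 'google_ads_iframe')]])"
    ),
    error = function(e) NA
  )
}

# Illustrative inputs; read_html() also accepts URLs and file paths.
pages <- c(
  with_ad    = '<html><body><div><iframe id="google_ads_iframe_/1/x_0"></iframe></div></body></html>',
  without_ad = '<html><body><div id="content"><p>No ads here</p></div></body></html>'
)

result <- vapply(pages, has_ad_iframe, logical(1))
print(result)
```

For pages that build their ads with JavaScript, the strings passed in would need to be rendered page source rather than plain URLs.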

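That's a rendered page, though, and I suspect you'll still need to use RSelenium to do the page grabbing. Here's how to do that (if you're on a reasonable operating system and have phantomjs installed; otherwise use Firefox):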

library(RSelenium)
RSelenium::startServer()
phantom_js <- phantom(pjs_cmd='/usr/local/bin/phantomjs', extras=c("--ssl-protocol=any"))
capabilities <- list(phantomjs.page.settings.userAgent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.70 Safari/537.3")
remDr <- remoteDriver(browserName = "phantomjs", extraCapabilities=capabilities)
remDr$open()

remDr$navigate(URL)
raw_html <- remDr$getPageSource()[[1]]

pg <- read_html(raw_html)
...

# eventually (when done)
phantom_js$stop()

Note

The XPath I used in the codepen example (since it has a Google ad iframe) was necessary. Here's the snippet where the iframe exists:

<div id="div-gpt-ad-1379506098645-3" style="width:720px;margin-left:auto;margin-right:auto;display:none;"> <script type="text/javascript"> googletag.cmd.push(function() { googletag.display('div-gpt-ad-1379506098645-3'); }); </script> <iframe id="google_ads_iframe_/16833175/SmallPS_0" name="google_ads_iframe_/16833175/SmallPS_0" width="723" height="170" scrolling="no" marginwidth="0" marginheight="0" frameborder="0" src="javascript:"<html><body style='background:transparent'></body></html>"" style="border: 0px; vertical-align: bottom;"></iframe></div>

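The iframe tag is a child of the div, so if you want to target the div first, you then have to add the child as a target when the attribute you are looking for sits on the child.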
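To make the point concrete, here is a small sketch run against a simplified copy of the snippet above (attributes other than id are dropped for brevity, and the variable names are illustrative). It shows why testing the id on the div finds nothing, while testing it on the iframe, or on the div's child iframe, succeeds.

```r
library(xml2)

# Simplified version of the snippet: the div has its own id, and the
# google_ads_iframe id sits on the child iframe.
snippet <- '<div id="div-gpt-ad-1379506098645-3">
  <iframe id="google_ads_iframe_/16833175/SmallPS_0"></iframe>
</div>'
pg <- read_html(snippet)

# The id is tested on the div itself, which does not carry it.
on_div <- xml_find_lgl(pg, "boolean(.//div[contains(@id, 'google_ads_iframe')])")

# The id is tested on the iframe directly.
on_iframe <- xml_find_lgl(pg, "boolean(.//iframe[contains(@id, 'google_ads_iframe')])")

# The div is selected, but the id test is applied to its child iframe.
div_with_iframe <- xml_find_all(pg, ".//div[iframe[contains(@id, 'google_ads_iframe')]]")

print(c(on_div = on_div, on_iframe = on_iframe))
print(xml_attr(div_with_iframe, "id"))
```

The first expression has the same shape as the XPath in the question's first and third approaches, which is one reason they can come back empty even when the iframe is present in the parsed source.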
