In R

Time: 2016-01-05 16:19:33

Tags: r web-scraping rvest

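I have found that web-scraping tasks in R can usually be accomplished by fetching the HTML that generates the page and parsing it with the easy-to-use rvest package. However, this "usual" approach (as I might call it) seems to fall short when a website uses JavaScript to display the relevant data. As a working example, I would like to scrape the news headlines from this website (the URL is in the code below). The two main obstacles for the usual approach are the "load more" button at the bottom of the page and extracting the headlines with XPath. In particular: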

library(rvest)
library(magrittr)

url = "http://www.nestle.com/media/news-archive#agregator-search-results"
webs = read_html(url)

# Headline of the first news based on its xpath
webs %>% html_nodes(xpath="//*[@id='agregator-search-results']/span[2]/ul/li[1]/a/span[2]/span[1]") %>% html_text
#[1] ""

# Same for the description of the first news
webs %>% html_nodes(xpath="//*[@id='agregator-search-results']/span[2]/ul/li[1]/a/span[2]/span[2]") %>% html_text
#[1] ""

Perhaps someone can shed light on (some of) the following questions:

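  1. Am I missing something obvious? That is, is it possible in this case to scrape the headlines with the usual rvest-based approach? As far as I currently understand, it is not.
  2. Are RSelenium and PhantomJS the only way to go? In other words, can the task be done without RSelenium or PhantomJS? This could cover extracting the headlines, loading more headlines, or both.
  3. Any comments are appreciated.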
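
One quick way to probe the first question is to check whether a headline that is visible in the browser appears anywhere in the raw HTML that read_html() receives. If it does not, no XPath can find it. A minimal sketch in base R follows; in_static_html() is a hypothetical helper, and the markup string is a made-up stand-in for the downloaded source, not the real page:

```r
# Hypothetical helper: TRUE if `phrase` occurs verbatim in the raw page source.
in_static_html <- function(html, phrase) {
  grepl(phrase, html, fixed = TRUE)
}

# Made-up stand-in for the page source as delivered before any JavaScript runs.
static_html <- '<div id="agregator-search-results"><span></span><span><ul></ul></span></div>'

# The container is present, but a headline seen in the browser is not.
in_static_html(static_html, "agregator-search-results")
in_static_html(static_html, "2015 in pictures")

# Against the live page one could read the source first (not run here):
# static_html <- paste(readLines(url, warn = FALSE), collapse = "\n")
```

If the container exists but the headline text does not, the content is most likely filled in by a script after the page loads.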

1 Answer:

Answer 0: (score: 1)

IMO, it is sometimes better to look for the raw data behind the scenes:

library(jsonlite)
library(RCurl)
n <- 8 # number of news items to pull
useragent <- "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:42.0) Gecko/20100101 Firefox/42.0"
url <- sprintf("http://www.nestle.com/_handlers/advancedsearch.ashx?q=Nestle%%2Bdaterange%%3A..2016-01-05&index=0&num=%d&client=Nestle_Corp&site=Nestle_Corp_Media&requiredfields=MediaType:/media/pressreleases/allpressreleases|MediaType:/Media/NewsAndFeatures|MediaType:/Media/News&sort=date:D:R:d1&filter=p&access=p&entsp=a&oe=UTF-8&ie=UTF-8&ud=1&ProxyReload=1&exclude_apps=1&entqr=3&getfields=*", n)
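# Request the JSON from the search handler, sending a browser-like user agent
json <- getURL(url, useragent=useragent)
res <- fromJSON(json)
# The result rows sit in a nested element of the parsed response
df <- res$GSP$RES$R
# U = link, T = title (still contains HTML markup), FS$'@VALUE' = date
head(cbind(df[, c("U", "T")], df$FS$'@VALUE'))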
#                                                                                                U                                                                                 T df$FS$"@VALUE"
# 1                                   http://www.nestle.com/media/newsandfeatures/nestle-150-years &#39;Good Food, Good Life&#39;: Celebrating 150 years of <b>Nestlé</b> <b>...</b>     2016-01-01
# 2                                   http://www.nestle.com/media/newsandfeatures/2015-in-pictures                                           2015 in pictures | <b>Nestlé</b> Global     2015-12-23
# 3                         http://www.nestle.com/media/news/nescafe-dolce-gusto-expands-in-brazil                Coffee superstar: Nescafé Dolce Gusto expands in Brazil <b>...</b>     2015-12-17
# 4                        http://www.nestle.com/media/news/nestle-waters-new-bottling-plant-italy  <b>Nestlé</b> Waters needs youth, for its new bottling plant in Italy <b>...</b>     2015-12-10
# 5 http://www.nestle.com/media/news/nestle-launch-wellness-club-personalised-health-service-japan     Matcha made in nutritional heaven: <b>Nestlé</b> launches Wellness <b>...</b>     2015-12-08
# 6        http://www.nestle.com/media/news/nestle-completes-chf-8-billion-share-buyback-programme          <b>Nestlé</b> completes CHF 8 billion share buyback programme <b>...</b>     2015-12-07
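
The titles in column T still carry inline markup (<b> tags) and HTML entities (&#39;). A minimal base-R sketch for tidying them is shown here; clean_title() is a hypothetical helper that only handles the patterns visible in the output above, not a general HTML decoder:

```r
# Hypothetical helper: strip inline tags and decode the entities seen above.
clean_title <- function(x) {
  x <- gsub("<[^>]+>", "", x)                # drop tags such as <b> and </b>
  x <- gsub("&#39;", "'", x, fixed = TRUE)   # numeric entity for an apostrophe
  x <- gsub("&amp;", "&", x, fixed = TRUE)   # decode last to avoid double decoding
  trimws(gsub("\\s+", " ", x))               # collapse leftover whitespace
}

clean_title("&#39;Good Food, Good Life&#39;: Celebrating 150 years of <b>Nestlé</b> <b>...</b>")

# Applied to the data frame from the answer (not run here):
# df$T <- clean_title(df$T)
```

The trailing "..." is part of the text returned by the handler, so the helper leaves it in place.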

df contains more information, some of which has to be unnested if you want to use it.
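
The request URL also carries an index parameter next to num, which suggests a way to reproduce the "load more" button without a browser. The sketch below only builds the URLs for successive pages. It assumes, from the parameter names alone, that index is a zero-based offset of the first result and num is the page size; page_url() is a hypothetical helper, and none of this is documented behaviour of the handler:

```r
base <- "http://www.nestle.com/_handlers/advancedsearch.ashx"
# Same query string as in the answer, with index and num as placeholders.
query <- paste0(
  "?q=Nestle%%2Bdaterange%%3A..2016-01-05&index=%d&num=%d",
  "&client=Nestle_Corp&site=Nestle_Corp_Media",
  "&requiredfields=MediaType:/media/pressreleases/allpressreleases",
  "|MediaType:/Media/NewsAndFeatures|MediaType:/Media/News",
  "&sort=date:D:R:d1&filter=p&access=p&entsp=a&oe=UTF-8&ie=UTF-8",
  "&ud=1&ProxyReload=1&exclude_apps=1&entqr=3&getfields=*"
)

# Hypothetical helper. ASSUMPTION: `index` is the offset of the first result.
page_url <- function(index, num) sprintf(paste0(base, query), index, num)

page_size <- 8
offsets <- seq(0, by = page_size, length.out = 3)  # first three pages
urls <- page_url(offsets, page_size)               # sprintf is vectorised

# Fetching would then follow the answer's approach (not run here):
# pages <- lapply(urls, function(u) fromJSON(getURL(u, useragent = useragent))$GSP$RES$R)
```

If the assumption about index holds, each element of pages would be one batch of results; whether the handler caps num or the total number of results is not known from the source.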