Scraping a web page and its linked pages, and forming a table with R

Date: 2013-06-29 23:57:06

Tags: xml r web-scraping

Hello, I'm new to using R to scrape data from the Internet and, unfortunately, know little about HTML and XML. I'm trying to scrape each story link on the following parent page: http://www.who.int/csr/don/archive/year/2013/en/index.html. I don't care about any of the other links on the parent page, but I need to create a table with a row for each story URL and columns for the URL, the story title, the date (which always appears in the first sentence after the story title), and then the rest of the page's text (which can be several paragraphs).

I've tried to adapt the code from Scraping a wiki page for the "Periodic table" and all the links (and several related threads), but I've run into trouble. Any suggestions or pointers would be appreciated. Here's what I've tried so far ("?????" marks where I'm stuck):

rm(list=ls())
library(XML)
library(plyr) 

url = 'http://www.who.int/csr/don/archive/year/2013/en/index.html'
doc <- htmlParse(url)

links = getNodeSet(doc, ?????)

df = ldply(doc, function(x) {
  text = xmlValue(x)
  if (text=='') text=NULL

  symbol = xmlGetAttr(x, '?????')
  link = xmlGetAttr(x, 'href')
  if (!is.null(text) & !is.null(symbol) & !is.null(link))
    data.frame(symbol, text, link)
} )

df = head(df, ?????)

1 Answer:

Answer 0 (score: 6)

You can use xpathSApply (an apply-style function for parsed documents) to search the document given an XPath expression.

library(XML)
url = 'http://www.who.int/csr/don/archive/year/2013/en/index.html'
doc <- htmlParse(url)
dat <- data.frame(
  dates = xpathSApply(doc, '//*[@class="auto_archive"]/li/a', xmlValue),
  hrefs = xpathSApply(doc, '//*[@class="auto_archive"]/li/a', xmlGetAttr, 'href'),
  story = xpathSApply(doc, '//*[@class="link_info"]/text()', xmlValue),
  stringsAsFactors = FALSE)
head(dat, 4)

##               dates                                                hrefs
## 1      26 June 2013             /entity/csr/don/2013_06_26/en/index.html
## 2      23 June 2013             /entity/csr/don/2013_06_23/en/index.html
## 3      22 June 2013             /entity/csr/don/2013_06_22/en/index.html
## 4      17 June 2013             /entity/csr/don/2013_06_17/en/index.html

##                                                                                    story
## 1                       Middle East respiratory syndrome coronavirus (MERS-CoV) - update
## 2                       Middle East respiratory syndrome coronavirus (MERS-CoV) - update
## 3                       Middle East respiratory syndrome coronavirus (MERS-CoV) - update
## 4                       Middle East respiratory syndrome coronavirus (MERS-CoV) - update
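To see what those XPath expressions match without hitting the network, here is a minimal self-contained sketch using an inline HTML fragment that mimics the archive page's structure (the class names `auto_archive` and `link_info` come from the WHO page; the fragment itself is made up):

```r
library(XML)

# A made-up fragment with the same structure as the WHO archive page
html <- '
<ul class="auto_archive">
  <li><a href="/entity/csr/don/2013_06_26/en/index.html">26 June 2013</a>
      <span class="link_info">MERS-CoV - update</span></li>
</ul>'
doc <- htmlParse(html, asText = TRUE)

# //*[@class="auto_archive"]/li/a matches each date link;
# xmlValue extracts its text, xmlGetAttr its href attribute
xpathSApply(doc, '//*[@class="auto_archive"]/li/a', xmlValue)            # "26 June 2013"
xpathSApply(doc, '//*[@class="auto_archive"]/li/a', xmlGetAttr, 'href')  # the relative URL
```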

Edit: adding the text of each story

dat$text <- unlist(lapply(dat$hrefs, function(x) {
  # Turn the relative '/entity/...' links into absolute WHO URLs
  url.story <- gsub('/entity', 'http://www.who.int', x)
  # The story body lives in the element with id="primary"
  xpathSApply(htmlParse(url.story), '//*[@id="primary"]', xmlValue)
}))
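Putting the pieces together, a sketch of the table the question asks for, assuming the code above has already run and `dat` holds the dates, relative links, titles, and story text (the column selection and the `/entity` rewrite mirror the steps shown above; nothing new is assumed about the page):

```r
# Assemble one row per story: absolute URL, title, date, full text
dat$url <- gsub('/entity', 'http://www.who.int', dat$hrefs)
result <- dat[, c('url', 'story', 'dates', 'text')]
str(result)  # inspect the resulting columns
```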