Hello, I'm new to using R for scraping data from the Internet and, unfortunately, know little about HTML and XML. I'm trying to scrape every story link from the following parent page: http://www.who.int/csr/don/archive/year/2013/en/index.html. I don't care about any of the other links on the parent page, but I need to build a table with a row for each story and columns for the story URL, the story title, the date (which always appears in the first sentence after the story title), and then the rest of the page's text (which can be several paragraphs).
I've tried adapting the code from Scraping a wiki page for the "Periodic table" and all the links (and several related threads) but have run into trouble. Any advice or pointers would be greatly appreciated. Here is what I've tried so far (`?????` marks the places where I'm stuck):
rm(list = ls())
library(XML)
library(plyr)

url = 'http://www.who.int/csr/don/archive/year/2013/en/index.html'
doc <- htmlParse(url)
links = getNodeSet(doc, ?????)
df = ldply(doc, function(x) {
  text = xmlValue(x)
  if (text == '') text = NULL
  symbol = xmlGetAttr(x, '?????')
  link = xmlGetAttr(x, 'href')
  if (!is.null(text) & !is.null(symbol) & !is.null(link))
    data.frame(symbol, text, link)
})
df = head(df, ?????)
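For what it's worth, the usual first step here is to inspect the page source in a browser and find a class or id that uniquely identifies the links you care about, then pass that as an XPath expression to getNodeSet. A minimal exploratory sketch, assuming the archive list on that page is marked with class="auto_archive" (check the actual HTML, since the WHO site may have been reorganized since 2013):

library(XML)

url <- 'http://www.who.int/csr/don/archive/year/2013/en/index.html'
doc <- htmlParse(url)

## all <a> nodes inside the archive list; the class name is taken from
## the page source and may differ if the layout has changed
links <- getNodeSet(doc, '//*[@class="auto_archive"]/li/a')
length(links)                    # how many story links were matched
xmlValue(links[[1]])             # text of the first link (the date)
xmlGetAttr(links[[1]], 'href')   # its relative URL

Once the XPath matches the right nodes, extracting the text and href attributes row by row (as the answer below does in one step) is straightforward.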
Answer 0 (score: 6)
You can use xpathSApply (the sapply-style analogue of xpathApply) to search the document given an XPath expression.
library(XML)

url = 'http://www.who.int/csr/don/archive/year/2013/en/index.html'
doc <- htmlParse(url)
dat <- data.frame(
  dates = xpathSApply(doc, '//*[@class="auto_archive"]/li/a', xmlValue),
  hrefs = xpathSApply(doc, '//*[@class="auto_archive"]/li/a', xmlGetAttr, 'href'),
  story = xpathSApply(doc, '//*[@class="link_info"]/text()', xmlValue))
dat
## dates hrefs
## 1 26 June 2013 /entity/csr/don/2013_06_26/en/index.html
## 2 23 June 2013 /entity/csr/don/2013_06_23/en/index.html
## 3 22 June 2013 /entity/csr/don/2013_06_22/en/index.html
## 4 17 June 2013 /entity/csr/don/2013_06_17/en/index.html
## story
## 1 Middle East respiratory syndrome coronavirus (MERS-CoV) - update
## 2 Middle East respiratory syndrome coronavirus (MERS-CoV) - update
## 3 Middle East respiratory syndrome coronavirus (MERS-CoV) - update
## 4 Middle East respiratory syndrome coronavirus (MERS-CoV) - update
dat$text = unlist(lapply(dat$hrefs, function(x) {
  ## the archive hrefs are relative to /entity; rebuild the full URL
  url.story <- gsub('/entity', 'http://www.who.int', x)
  xpathSApply(htmlParse(url.story), '//*[@id="primary"]', xmlValue)
}))
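Since this loop fetches one page per story, it is worth being defensive about failures and polite to the server. A hedged variant of the same loop, assuming the same '//*[@id="primary"]' selector still matches the story body (the WHO site has been reorganized since this was written, so verify against the current HTML):

dat$text <- vapply(dat$hrefs, function(x) {
  url.story <- gsub('/entity', 'http://www.who.int', x)
  Sys.sleep(1)  # pause between requests to avoid hammering the server
  txt <- tryCatch(
    xpathSApply(htmlParse(url.story), '//*[@id="primary"]', xmlValue),
    error = function(e) NA_character_)
  ## return NA if the page yielded nothing; otherwise collapse to one string
  if (length(txt) == 0 || all(is.na(txt))) NA_character_
  else paste(txt, collapse = '\n')
}, character(1))

vapply is used instead of lapply/unlist so that the result is guaranteed to be a character vector of the same length as dat$hrefs, which keeps the column aligned with the rest of the table even when some pages fail to download.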