I'm doing some web scraping with the packages XML and HTML, and I need to isolate the country name and the two numeric values you see below:
<tr><td>Tonga</td>
<td class="RightAlign">3,000</td>
<td class="RightAlign">6,000</td>
</tr>
This is the code I've written so far. I think I just need the right regex?
# a vector to store the results
pages<-character(0)
country_names<-character(0)
# go through all 6 pages containing the info we want, and store
# the html in a list
for (page in 1:6) {
  who_search <- paste(who_url, page, '.html', sep='')
  page = htmlTreeParse(who_search, useInternalNodes = T)
  pages=c(page, pages)
  # extract the country names from each page
  country <- xpathSApply(page, "????", xmlValue)
  country_names<-c(country, country_names)
}
Answer 0 (score: 4)
There is no need to use xpathSApply here; use readHTMLTable instead:
library(XML)
library(RCurl)
page = htmlParse('http://www.who.int/diabetes/facts/world_figures/en/index4.html')
readHTMLTable(page)
     Country    2000    2030
1    Albania  86,000 188,000
2     Andora   6,000  18,000
3    Armenia 120,000 206,000
4    Austria 239,000 366,000
5 Azerbaijan 337,000 733,000
6    Belarus 735,000 922,000
Using xpathSApply (note the use of gsub to clean up the result):
country <- xpathSApply(page, '//*[@id="primary"]/table/tbody/tr',
                       function(x) gsub('\n', '', xmlValue(x)))
> country
[1] "Albania 86,000 188,000 "
[2] "Andora 6,000 18,000 "
[3] "Armenia 120,000 206,000 "
[4] "Austria 239,000 366,000 "
[5] "Azerbaijan 337,000 733,000 "
EDIT: As mentioned in the comments, we can use xpathSApply without gsub:
val = xpathSApply(page, '//tbody/tr/td', xmlValue)  ## get a vector of all table cell values
as.data.frame(matrix(val, ncol=3, byrow=TRUE))      ## reshape into rows of 3 cells, then convert to a data frame
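The cell-by-cell approach can be tried offline on the single row from the question. This is only a minimal sketch: the inline html string, the wrapping table element, and the column names country, y2000 and y2030 are assumptions for illustration (the years are taken from the table header in the readHTMLTable output). It also converts the comma-formatted figures to numbers, which the question asks for.

```r
library(XML)

# Inline copy of the row from the question, wrapped in a <table> so it can be
# parsed without network access (the wrapper is an assumption for illustration).
html <- '<table><tbody>
<tr><td>Tonga</td>
<td class="RightAlign">3,000</td>
<td class="RightAlign">6,000</td>
</tr>
</tbody></table>'

doc <- htmlParse(html, asText = TRUE)

# One character value per <td>, in document order
val <- xpathSApply(doc, '//tr/td', xmlValue)

# Reshape into rows of three cells: country, 2000 figure, 2030 figure
df <- as.data.frame(matrix(val, ncol = 3, byrow = TRUE),
                    stringsAsFactors = FALSE)
names(df) <- c("country", "y2000", "y2030")  # hypothetical column names

# Strip the thousands separators, then convert to numeric
df$y2000 <- as.numeric(gsub(",", "", df$y2000, fixed = TRUE))
df$y2030 <- as.numeric(gsub(",", "", df$y2030, fixed = TRUE))
```

Because the values are selected per cell rather than per row, no gsub is needed to remove newlines; the only cleanup left is removing the commas before as.numeric.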