How to read and parse the content of a web page in R

Date: 2009-12-04 04:18:46

Tags: html r screen-scraping html-content-extraction

I would like to read the content of a URL (e.g., http://www.haaretz.com/) in R. I am wondering how to do this.

3 Answers:

Answer 0 (score: 30)

I'm not sure exactly what you want to do with the page, since it's pretty messy. As we re-learned in this famous stackoverflow question, running regular expressions on HTML is not a good idea, so you will definitely want to parse it with the XML package instead.

Here's an example to get you started:

require(RCurl)
require(XML)

# download the page and split it into lines
webpage <- getURL("http://www.haaretz.com/")
webpage <- readLines(tc <- textConnection(webpage)); close(tc)

# parse the HTML into a tree, silently ignoring parse errors
pagetree <- htmlTreeParse(webpage, error=function(...){}, useInternalNodes = TRUE)

# extract the text of every table in the tree
x <- xpathSApply(pagetree, "//*/table", xmlValue)

# do some clean up with regular expressions: split on newlines,
# drop tabs, trim surrounding whitespace, remove empty entries
x <- unlist(strsplit(x, "\n"))
x <- gsub("\t","",x)
x <- sub("^[[:space:]]*(.*?)[[:space:]]*$", "\\1", x, perl=TRUE)
x <- x[!(x %in% c("", "|"))]

This produces a character vector that is mostly just the text of the web page (along with some javascript):

> head(x)
[1] "Subscribe to Print Edition"              "Fri., December 04, 2009 Kislev 17, 5770" "Israel Time: 16:48 (EST+7)"           
[4] "  Make Haaretz your homepage"          "/*check the search form*/"               "function chkSearch()" 
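Once the page is parsed into pagetree, the same XPath approach can target other elements. As a hedged sketch (the xmlGetAttr call and the //a query are illustrative, not part of the original answer), here is how all link destinations could be pulled from the same tree:

```
# extract the href attribute of every <a> element in the parsed tree
links <- xpathSApply(pagetree, "//a", xmlGetAttr, "href")
links <- unlist(links[!sapply(links, is.null)])  # drop anchors without an href
head(links)
```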

Answer 1 (score: 3)

Your best bet is probably the XML package -- see, for example, this previous question.
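A minimal sketch of the XML-package route (assuming RCurl for the download; the //title query is just a sanity check, not something the original answer specifies):

```
library(RCurl)
library(XML)

# fetch the page and parse it; htmlParse tolerates malformed HTML
doc <- htmlParse(getURL("http://www.haaretz.com/"), asText = TRUE)

# pull the page title to confirm the parse worked
xpathSApply(doc, "//title", xmlValue)
```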

Answer 2 (score: 2)

I know you asked about R, but maybe Python + BeautifulSoup is the way forward here? You could scrape the screen with BeautifulSoup and then do the analysis in R.
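For illustration, a minimal BeautifulSoup sketch (assuming the third-party bs4 package is installed; shown on an inline HTML snippet rather than the live site, whose markup will differ):

```python
from bs4 import BeautifulSoup

# stand-in for the downloaded page; in practice you would fetch the
# URL first (e.g. with urllib) and pass the HTML string in here
html = "<html><body><h1>Headline</h1><p>Story text.</p></body></html>"

soup = BeautifulSoup(html, "html.parser")

# get_text() strips the markup, leaving just the page text
text = soup.get_text(separator=" ", strip=True)
print(text)  # Headline Story text.
```

The extracted text could then be written out to a file and loaded into R for analysis.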