Using R2HTML with rvest / xml2

Date: 2015-06-22 16:25:42

Tags: xml r rvest

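I am reading this blog post about the new xml2 package. Previously, rvest depended on XML, and combining functions from the two packages made my work easier (at least for me): for example, I would use htmlParse from the XML package when I could not read an HTML page with html (now called read_html).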

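See this for an example; I could then use rvest's html_nodes and html_attr functions on the parsed page. Now that rvest depends on xml2, this is no longer possible (at least on the surface).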

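I just want to know what the basic difference between XML and xml2 is. Apart from crediting the author of the XML package in the post mentioned above, the package author does not explain how XML and xml2 differ.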

Another example:

library(R2HTML) #save page as html and read later
library(XML)
k1<-htmlParse("https://stackoverflow.com/questions/30897852/html-in-rvest-verses-htmlparse-in-xml")
head(getHTMLLinks(k1),5) #This works

[1] "//stackoverflow.com"           "http://chat.stackoverflow.com" "http://blog.stackoverflow.com" "//stackoverflow.com"          
[5] "http://meta.stackoverflow.com"

# But, I want to save HTML file now in my working directory and work later

HTML(k1,"k1") #Later I can work with this
rm(k1)
#read stored html file k1
head(getHTMLLinks("k1"),5)#This works too 

[1] "//stackoverflow.com"           "http://chat.stackoverflow.com" "http://blog.stackoverflow.com" "//stackoverflow.com"          
[5] "http://meta.stackoverflow.com"

# With read_html from the rvest package, this does not seem to be possible (as far as I know)
library(rvest)
library(R2HTML)
k2<-read_html("https://stackoverflow.com/questions/30897852/html-in-rvest-verses-htmlparse-in-xml")

#This works
df1<-k2 %>%
html_nodes("a")%>%
html_attr("href")

head(df1,5)
[1] "//stackoverflow.com"           "http://chat.stackoverflow.com" "http://blog.stackoverflow.com" "//stackoverflow.com"          
[5] "http://meta.stackoverflow.com"

# But, I want to save HTML file now in my working directory and work later
HTML(k2,"k2") #Later I can work with this
rm(k2,df1)
#Now extract webpages by reading back k2 html file
#This doesn't work
k2<-read_html("k2") 

df1<-k2 %>%
html_nodes("a")%>%
html_attr("href")

df1
character(0)

Update:

# I have the following package versions loaded (package names are case-sensitive, so "xml2" rather than "XML2"):
lapply(c("rvest","R2HTML","xml2","XML"),packageVersion)
[[1]]
[1] ‘0.2.0.9000’

[[2]]
[1] ‘2.3.1’

[[3]]
[1] ‘0.1.1’

[[4]]
[1] ‘3.98.1.2’

I am using Windows 8, R 3.2.1, and RStudio 0.99.441.

1 answer:

Answer 0 (score: 4)

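The R2HTML package appears to simply call capture.output on the XML object and then write the result back to disk. That does not seem like a reliable way to save HTML/XML data to disk. The two results probably differ because XML objects print differently from xml2 objects. You could define a method that calls as.character() rather than relying on capture.output: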

HTML.xml_document<-function(x, ...) HTML(as.character(x),...)
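To see why saving the printed form is fragile, the following minimal sketch compares what print() emits for an xml2 document with what as.character() returns. The inline HTML string is only a stand-in for a scraped page, and the exact printed layout may differ between xml2 versions.

```r
library(xml2)

# Small inline document standing in for a scraped page (illustrative only)
page <- read_html('<html><body><p>Some text</p><a href="http://example.com/a">A</a></body></html>')

# What print() shows: a console summary of the document, captured line by line
printed <- capture.output(print(page))

# What as.character() returns: the serialized markup as a single string
markup <- as.character(page)

cat(printed, sep = "\n")
cat(markup, "\n")
```

Only the second form is markup that a parser can be expected to read back, which is presumably why the wrapper above passes as.character(x) to HTML().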

Or you could probably skip R2HTML altogether and write the data out directly with write_xml from xml2.
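A minimal sketch of that round trip, assuming a small inline document and a temporary file name chosen purely for illustration (neither comes from the question):

```r
library(xml2)

# Inline HTML standing in for the page read from the web (illustrative only)
page <- read_html('<html><body><a href="http://example.com/a">A</a><a href="http://example.com/b">B</a></body></html>')

# Serialize the parsed document to disk with xml2 itself
out_file <- file.path(tempdir(), "k2.html")
write_xml(page, out_file)

# Later: read the saved file back and extract the link targets
page2 <- read_html(out_file)
links <- xml_attr(xml_find_all(page2, "//a"), "href")
print(links)
```

With rvest loaded, the last step could equally be written as page2 %>% html_nodes("a") %>% html_attr("href"), as in the question.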

Probably the best approach is to download the file first and then import it.

download.file("http://stackoverflow.com/questions/30897852/html-in-rvest-verses-htmlparse-in-xml", "local.html")
k2 <- read_html("local.html")