R

Asked: 2016-01-21 15:21:08

Tags: html xml r web-scraping

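I need to get some data from a web page, and I am trying to extract it with R.

Because the information is spread over several pages, I first wrote this code: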

require(XML)
contador<-c(1:200)
for(i in contador){
 myURL<-paste("http://www.europa-mop.com/excavadoras-usadas/2-1/anuncios-excavadoras.html?p=",i,sep="")
}

Next, I read the page with the following code:

web_url<-getURL(myURL)
web_url<-readLines(tc<-textConnection(web_url));close(tc)
webtree<-htmlTreeParse(web_url,error=function(...){})
body<-webtree$children$html$children$body
body

However, when I run the following command, I get an error:

precio<-xpathSApply(body,"//li[@class='label label-secondary text-bold']",xmlValue)

Input is not proper UTF-8, indicate encoding !
Bytes: 0xC2 0x3C 0x2F 0x64
Sequence ']]>' not allowed in content
Sequence ']]>' not allowed in content
internal error: detected an error in element content

I have tried different options, but I cannot scrape this information.

Thanks for any suggestions!

1 answer:

Answer 0 (score: 2)

I guess your XPath is broken. Assuming you want to read the span elements with class='label label-secondary text-bold', you can use //span[contains(concat( " ", @class, " " ), concat( " ", "text-bold", " " ))] as the XPath.
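To see why the XPath pads @class with spaces, here is a minimal offline sketch using rvest. The HTML fragment and the class name text-bolder are made up for illustration and are not taken from the real page. The padded form matches text-bold only as a whole class token, whereas a plain contains(@class, "text-bold") would also match text-bolder.

```r
library(rvest)

# Hypothetical HTML fragment standing in for a listing page (not the real site markup).
# &euro; is the HTML entity for the euro sign, which keeps this source ASCII-only.
html <- '<ul>
  <li><span class="label label-secondary text-bold">51.000 &euro;</span></li>
  <li><span class="label text-bolder">not a price</span></li>
  <li><span class="text-bold">11.000 &euro;</span></li>
</ul>'

doc <- read_html(html)

# Naive substring test: also matches the "text-bolder" span
naive <- doc %>% html_nodes(xpath = '//span[contains(@class, "text-bold")]')

# Token test: pad @class with spaces and look for " text-bold "
token_xpath <- '//span[contains(concat(" ", @class, " "), " text-bold ")]'
prices <- doc %>% html_nodes(xpath = token_xpath) %>% html_text()

# The CSS selector form is an equivalent, shorter way to ask for the same class token
css_prices <- doc %>% html_nodes("span.text-bold") %>% html_text()

length(naive)   # number of spans matched by the substring test
prices          # text of the spans matched by the token test
```

If the shorter form is acceptable, html_nodes("span.text-bold") should select the same nodes as the padded XPath.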

Reading it with rvest:

require(rvest)
i <- 1
myURL<-paste("http://www.europa-mop.com/excavadoras-usadas/2-1/anuncios-excavadoras.html?p=",i,sep="")
doc <- read_html(myURL)
doc %>% html_nodes(xpath='//span[contains(concat( " ", @class, " " ), concat( " ", "text-bold", " " ))]') %>% html_text()

You get:

 [1] "51.000 €"  "11.000 €"  "50.000 €"  "25.900 €"  "48.000 €"  "100.000 €" "60.000 €"  "25.000 €"  "20.888 €" 
[10] "29.999 €"  "26.000 €"  "11.000 €"  "42.500 €"  "12.000 €"  "41.000 €"  "30.500 €"  "40.000 €" 

You can do this for several pages with lapply, as follows:

doc <- lapply(1:10, function(x, base_url){
  read_html(paste0(base_url,x))
}, "http://www.europa-mop.com/excavadoras-usadas/2-1/anuncios-excavadoras.html?p=")

lapply(doc, . %>% html_nodes(xpath='//span[contains(concat( " ", @class, " " ), concat( " ", "text-bold", " " ))]') %>% html_text())

This gives you a list with one character vector of price text per page.
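If numbers are needed rather than text, the list can be flattened and cleaned with base R. The sketch below is only an illustration: price_text is a hypothetical stand-in for the lapply() result, and it assumes that "." is always a thousands separator and that prices have no decimal part.

```r
# Hypothetical stand-in for the lapply() result: one character vector per page.
# "\u20ac" is the euro sign, written as an escape so the source stays ASCII-only.
price_text <- list(
  c("51.000 \u20ac", "11.000 \u20ac"),
  c("25.900 \u20ac", "100.000 \u20ac")
)

# Flatten the per-page vectors into one character vector
all_prices <- unlist(price_text)

# Keep only the digits (assumption: "." is a thousands separator, no decimals),
# then convert to numeric
price_eur <- as.numeric(gsub("[^0-9]", "", all_prices))

summary(price_eur)
```

If the site ever shows decimal prices, the regular expression would need to change, so check a few raw strings before relying on the conversion.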