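I need to get some data from a web page, and I am trying to extract it with R.
Because the information is spread across several pages, I first wrote this code: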
require(XML)
contador<-c(1:200)
for(i in contador){
myURL<-paste("http://www.europa-mop.com/excavadoras-usadas/2-1/anuncios-excavadoras.html?p=",i,sep="")
}
Next, I read the page into web_url with the following code:
web_url<-getURL(myURL)
web_url<-readLines(tc<-textConnection(web_url));close(tc)
webtree<-htmlTreeParse(web_url,error=function(...){})
body<-webtree$children$html$children$body
body
However, when I run the following command, I get an error:
precio<-xpathSApply(body,"//li[@class='label label-secondary text-bold']",xmlValue)
Input is not proper UTF-8, indicate encoding !
Bytes: 0xC2 0x3C 0x2F 0x64
Sequence ']]>' not allowed in content
Sequence ']]>' not allowed in content
internal error: detected an error in element content
I have tried different options, but I cannot scrape this information.
Thanks for any advice!
Answer 0 (score: 2)
I guess your XPath is broken.
Assuming you want to read the span with class='label label-secondary text-bold', you can use
//span[contains(concat( " ", @class, " " ), concat( " ", "text-bold", " " ))]
as the XPath.
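To see why the contains(concat(...)) form is used, here is a minimal sketch on a small inline HTML snippet. The markup is made up for illustration and is not copied from the real page; it only assumes that the xml2 package (which rvest builds on) is installed. It compares an exact match on the whole class attribute, the token match from this answer, and a plain substring match.

```r
library(xml2)

# Hypothetical markup for illustration only, not taken from the real site
html <- '<ul>
  <li><span class="label label-secondary text-bold">51.000 €</span></li>
  <li><span class="label text-bolder">not a price</span></li>
  <li><span class="text-bold">11.000 €</span></li>
</ul>'
doc <- read_html(html, encoding = "UTF-8")

# Exact comparison: matches only when the whole attribute string is identical
exact <- xml_find_all(doc, "//span[@class='label label-secondary text-bold']")

# Token match: pad @class with spaces so "text-bold" must be a complete class name
token <- xml_find_all(doc, '//span[contains(concat(" ", @class, " "), " text-bold ")]')

# Plain substring match: also picks up "text-bolder"
naive <- xml_find_all(doc, '//span[contains(@class, "text-bold")]')

n_exact <- length(exact)  # 1
n_token <- length(token)  # 2
n_naive <- length(naive)  # 3
```

The token form is a bit more forgiving than the exact comparison if the site reorders or adds classes, and stricter than a bare contains().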
With rvest:
require(rvest)
i <- 1
myURL<-paste("http://www.europa-mop.com/excavadoras-usadas/2-1/anuncios-excavadoras.html?p=",i,sep="")
doc <- read_html(myURL)
doc %>% html_nodes(xpath='//span[contains(concat( " ", @class, " " ), concat( " ", "text-bold", " " ))]') %>% html_text()
you get:
[1] "51.000 €" "11.000 €" "50.000 €" "25.900 €" "48.000 €" "100.000 €" "60.000 €" "25.000 €" "20.888 €"
[10] "29.999 €" "26.000 €" "11.000 €" "42.500 €" "12.000 €" "41.000 €" "30.500 €" "40.000 €"
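If you need numbers rather than strings, a rough sketch along these lines may help. parse_price is a hypothetical helper, not part of rvest, and it assumes the prices are whole euros with "." as the thousands separator, as in the output above; it would give wrong values for prices that carry decimals.

```r
# Hypothetical helper: keep only the digits, then convert to numeric.
# Assumes whole-euro prices with "." as thousands separator (e.g. "51.000 €").
parse_price <- function(x) {
  as.numeric(gsub("[^0-9]", "", x))
}

prices <- parse_price(c("51.000 €", "11.000 €", "20.888 €"))
prices
```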
You can do this in a loop with lapply, like this:
doc <- lapply(1:10, function(x, base_url){
read_html(paste0(base_url,x))
}, "http://www.europa-mop.com/excavadoras-usadas/2-1/anuncios-excavadoras.html?p=")
lapply(doc, . %>% html_nodes(xpath='//span[contains(concat( " ", @class, " " ), concat( " ", "text-bold", " " ))]') %>% html_text())
This gives you a list containing the text, with one character vector per page.
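If a single table is easier to work with than a list, one possible way to flatten it is sketched below. prices_by_page is a made-up stand-in for the list returned by the lapply call, so the snippet runs without touching the site; the column names are my own choice.

```r
# Stand-in for the scraped result: one character vector per page
# (the third page is empty to show that pages without matches are handled)
prices_by_page <- list(
  c("51.000 €", "11.000 €"),
  c("50.000 €"),
  character(0)
)

# Repeat each page index once per price found on that page, then flatten
result <- data.frame(
  page  = rep(seq_along(prices_by_page), lengths(prices_by_page)),
  price = unlist(prices_by_page),
  stringsAsFactors = FALSE
)
result
```

Keeping the page index next to each price makes it easier to spot pages that returned nothing.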