Question

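I need to get some data from a web page. I am trying to extract it with R.

Because the information is spread over several pages, I first wrote this code: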

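library(XML)
library(RCurl)  # getURL() below comes from RCurl
contador <- 1:200
for (i in contador) {
  # note: myURL is overwritten on every pass, so only the last page (p=200) remains
  myURL <- paste("http://www.europa-mop.com/excavadoras-usadas/2-1/anuncios-excavadoras.html?p=", i, sep = "")
}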

Next, I read the page with the following code:

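web_url <- getURL(myURL)  # download the page as a single string
tc <- textConnection(web_url)
web_url <- readLines(tc)  # split the page into lines
close(tc)
webtree <- htmlTreeParse(web_url, error = function(...) {})  # parse, silencing parser messages
body <- webtree$children$html$children$body
body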

However, when I run the following command, I get an error:

precio<-xpathSApply(body,"//li[@class='label label-secondary text-bold']",xmlValue)

Input is not proper UTF-8, indicate encoding !
Bytes: 0xC2 0x3C 0x2F 0x64
Sequence ']]>' not allowed in content
Sequence ']]>' not allowed in content
internal error: detected an error in element content

I have tried different options, but I cannot scrape the information.

Thanks for any advice!

Answer 1

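I guess your XPath is broken. Assuming you want to read the span with class='label label-secondary text-bold', you can use //span[contains(concat( " ", @class, " " ), concat( " ", "text-bold", " " ))] as the XPath. It matches any span whose class list contains the token text-bold.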
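To see why the concat() padding matters, here is a minimal sketch on a tiny inline document. The markup is made up for illustration and is not taken from the real site; the second XPath is just a shorter spelling of the one above.

```r
library(rvest)

# Hypothetical markup, invented for this example
html <- '<html><body>
  <span class="label label-secondary text-bold">51.000</span>
  <span class="text-bolder">not a price</span>
  <span class="label">also not a price</span>
</body></html>'
doc <- read_html(html)

# Plain substring test: also matches the unrelated class "text-bolder"
loose <- doc %>%
  html_nodes(xpath = '//span[contains(@class, "text-bold")]') %>%
  html_text()

# Token test: pad the class list with spaces and look for " text-bold "
strict <- doc %>%
  html_nodes(xpath = '//span[contains(concat(" ", @class, " "), " text-bold ")]') %>%
  html_text()

length(loose)  # 2 matches
strict         # "51.000"
```

An exact comparison such as @class='label label-secondary text-bold' only works while the class attribute is written exactly like that, so the token test is usually the more robust choice.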

Reading it with rvest:

library(rvest)
i <- 1  # page number
myURL <- paste("http://www.europa-mop.com/excavadoras-usadas/2-1/anuncios-excavadoras.html?p=", i, sep = "")
doc <- read_html(myURL)
# select every span whose class list contains "text-bold" and return its text
doc %>% html_nodes(xpath='//span[contains(concat( " ", @class, " " ), concat( " ", "text-bold", " " ))]') %>% html_text()

This gives you:

 [1] "51.000 €"  "11.000 €"  "50.000 €"  "25.900 €"  "48.000 €"  "100.000 €" "60.000 €"  "25.000 €"  "20.888 €" 
[10] "29.999 €"  "26.000 €"  "11.000 €"  "42.500 €"  "12.000 €"  "41.000 €"  "30.500 €"  "40.000 €"
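If you need numbers rather than strings, something along these lines might work. It is only a sketch and assumes every price is a whole number of euros that uses "." as the thousands separator, as in the output above; the variable names are my own.

```r
# A few values copied from the output above
precios <- c("51.000 €", "11.000 €", "20.888 €")

# Drop everything that is not a digit (the "." separator, the space and the
# euro sign), then convert. This would be wrong for prices with decimals.
precio_num <- as.numeric(gsub("[^0-9]", "", precios))
precio_num
```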

You can do this in a loop with lapply, as follows:

# download and parse pages 1 to 10; base_url is passed as an extra argument
doc <- lapply(1:10, function(x, base_url){
  read_html(paste0(base_url, x))
}, "http://www.europa-mop.com/excavadoras-usadas/2-1/anuncios-excavadoras.html?p=")

# apply the same XPath to every parsed page
lapply(doc, . %>% html_nodes(xpath='//span[contains(concat( " ", @class, " " ), concat( " ", "text-bold", " " ))]') %>% html_text())

This gives you a list with one character vector of text per page.
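If a single table is easier to work with than a list, one possible way to flatten it is sketched below. The list here is a stand-in for the lapply result: the prices appear in the output above, but how they are split across pages is invented, and the names precios_por_pagina and resultado are my own.

```r
# Stand-in for the lapply() result: one character vector per page
precios_por_pagina <- list(
  c("51.000 €", "11.000 €"),
  c("25.900 €", "48.000 €", "60.000 €")
)

# lengths() counts the prices on each page; rep() repeats each page
# number that many times so both columns line up
resultado <- data.frame(
  pagina = rep(seq_along(precios_por_pagina), lengths(precios_por_pagina)),
  precio = unlist(precios_por_pagina),
  stringsAsFactors = FALSE
)
resultado
```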
