Scraping a JavaScript-rendered website

Date: 2014-03-05 17:06:17

Tags: javascript xml r web-scraping screen-scraping

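I can scrape data from basic HTML pages, but I'm having trouble scraping the site below. The data appears to be rendered via JavaScript, and I'm not sure how to approach that. I'd prefer to use R for the scraping if possible, but I could also use Python.

Any ideas/suggestions?

Edit: I need to grab the year/make/model, S/N, price, location, and short description (starting with "Auction:") for each listing.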

http://www.machinerytrader.com/list/list.aspx?bcatid=4&DidSearch=1&EID=1&LP=MAT&ETID=5&catid=1015&mdlx=Contains&Cond=All&SO=26&btnSearch=Search&units=imperial

2 answers:

Answer 0 (score: 3)

library(XML) 
library(relenium)

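## Load the page in a browser (so the JavaScript runs), then parse the rendered HTML
website <- firefoxClass$new()
website$get("http://www.machinerytrader.com/list/list.aspx?pg=1&bcatid=4&DidSearch=1&EID=1&LP=MAT&ETID=5&catid=1015&mdlx=Contains&Cond=All&SO=26&btnSearch=Search&units=imperial")
doc <- htmlParse(website$getPageSource())

## Read every table on the page and bind the listing information.
## The indices are specific to this page layout: tables 8, 10, ..., 56 supply the
## first four columns, and tables 9, 11, ..., 57 supply the auction description.
tables <- readHTMLTable(doc, stringsAsFactors = FALSE)
data <- do.call("rbind", tables[seq(from = 8, to = 56, by = 2)])
## Take the first entry of the second column of each description table
data <- cbind(data, sapply(lapply(tables[seq(from = 9, to = 57, by = 2)], '[[', i = 2), '[', 1))
rownames(data) <- NULL
names(data) <- c("year.man.model", "s.n", "price", "location", "auction")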

This gives you what you need for the first page (only the first two rows are shown here):

head(data,2)
      year.man.model      s.n      price location                                               auction
1 1972 AMERICAN 5530 GS14745W US $50,100       MI                   Auction: 1/9/2013; 4,796 Hours;  ..
2 AUSTIN-WESTERN 307      307  US $3,400       MT Auction: 12/18/2013;  AUSTIN-WESTERN track excavator.

To get all the pages, just loop over them, pasting pg=i into the address.
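
As a rough sketch of that loop, the page URLs can be generated with sprintf and each one handed to a per-page scraper. The names page_urls, scrape_all and scrape_page are hypothetical and not part of the answer; scrape_page stands for a function that wraps the per-page code above (website$get, htmlParse, readHTMLTable) and returns one data frame per page.

```r
# Listing URL from the question, with %d standing in for the page number.
base_url <- paste0(
  "http://www.machinerytrader.com/list/list.aspx?pg=%d&bcatid=4&DidSearch=1",
  "&EID=1&LP=MAT&ETID=5&catid=1015&mdlx=Contains&Cond=All&SO=26",
  "&btnSearch=Search&units=imperial"
)

# Hypothetical helper: one URL per requested page number.
page_urls <- function(pages) sprintf(base_url, pages)

# Hypothetical helper: apply a per-page scraper to every URL and stack the
# resulting data frames. scrape_page is assumed to take a URL and return a
# data frame with the same columns on every page.
scrape_all <- function(pages, scrape_page) {
  do.call(rbind, lapply(page_urls(pages), scrape_page))
}

urls <- page_urls(1:3)
```

How many pages exist is not stated in the answer, so the page range has to be chosen by inspecting the site.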

Answer 1 (score: 2)

Using relenium:

require(relenium) # More info: https://github.com/LluisRamon/relenium
require(XML)
firefox <- firefoxClass$new() # init browser
res <- NULL
pages <- 1:2
for (page in pages) {
  url <- sprintf("http://www.machinerytrader.com/list/list.aspx?pg=%d&bcatid=4&DidSearch=1&EID=1&LP=MAT&ETID=5&catid=1015&mdlx=Contains&Cond=All&SO=26&btnSearch=Search&units=imperial", page)
  firefox$get(url) 
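  doc <- htmlParse(firefox$getPageSource())
  # XPath 1.0 has no ends-with(), so substring(@id, string-length(@id) - N) takes
  # the last N + 1 characters of each table id and compares them with the suffix.
  res <- rbind(res, 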
               cbind(year_manu_model = xpathSApply(doc, '//table[substring(@id, string-length(@id)-15) = "tblListingHeader"]/tbody/tr/td[1]', xmlValue),
                     sn = xpathSApply(doc, '//table[substring(@id, string-length(@id)-15) = "tblListingHeader"]/tbody/tr/td[2]', xmlValue),
                     price = xpathSApply(doc, '//table[substring(@id, string-length(@id)-15) = "tblListingHeader"]/tbody/tr/td[3]', xmlValue),
                     loc = xpathSApply(doc, '//table[substring(@id, string-length(@id)-15) = "tblListingHeader"]/tbody/tr/td[4]', xmlValue),
                     auc = xpathSApply(doc, '//table[substring(@id, string-length(@id)-9) = "tblContent"]/tbody/tr/td[2]', xmlValue))
  )
}
# Preview the first 30 characters of each column
sapply(as.data.frame(res), substr, 1, 30)
#      year_manu_model                  sn               price         loc   auc                               
# [1,] " 1972 AMERICAN 5530"            "GS14745W"       "US $50,100"  "MI " "\n\t\t\t\t\tAuction: 1/9/2013; 4,796" 
# [2,] " AUSTIN-WESTERN 307"            "307"            "US $3,400"   "MT " "\n\t\t\t\t\tDetails & Photo(s)Video(" 
# ...
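
The preview above shows leading and trailing whitespace in several columns, and the second row's auc value begins with other link text rather than "Auction:". A minimal cleanup sketch follows, assuming base R 3.2.0 or later for trimws. The helper from_auction is hypothetical, and the last sample string is invented for illustration, not taken from the site.

```r
# Sample values: the first three are copied from the preview above,
# the last one is an invented string used only to exercise the helper.
raw <- c(" 1972 AMERICAN 5530", "MI ", "\n\t\t\t\t\tAuction: 1/9/2013; 4,796")
invented <- "Some link text Auction: 1/1/2014; example description"

# Strip leading and trailing spaces, tabs and newlines.
clean <- trimws(raw)

# Hypothetical helper: keep only the text from "Auction:" onward,
# returning NA when a value contains no "Auction:" marker.
from_auction <- function(x) {
  x <- trimws(x)
  has_marker <- grepl("Auction:", x, fixed = TRUE)
  # (?s) lets "." match newlines in case a description spans several lines
  ifelse(has_marker,
         sub("(?s)^.*?(Auction:.*)$", "\\1", x, perl = TRUE),
         NA_character_)
}

auction_text <- from_auction(c(raw[3], invented, "no marker here"))
```

Applied to the scraped matrix, the same idea would be something like apply(res, 2, trimws) followed by from_auction on the auc column, though that has not been run against the live site.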