R / Python: scraping data from a website

Time: 2014-05-12 08:02:57

Tags: python r web-scraping

I would like to summarize the data available on the Alsop website (http://www.auction.co.uk/residential/onlineCatalogue.asp).

Ideally, I would like to end up with a data.frame containing the following fields from the website:

Lot number, type, location/full address, guide price, number of bedrooms, and the URLs of any photos.

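I tried using Inspect Element in Google Chrome and htmlParse (which I usually use to find links), but I get the same URL for every lot number, i.e. http://www.auction.co.uk/residential/LotDetails.asp?A=877&MP=24&ID=877000001&S=L&O=A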

So I am a bit stumped, because my usual approach of scraping a site by looking for links no longer works.

I prefer R, but I would like to know whether Python would be more useful, and I am open to suggestions on how to achieve this.

1 answer:

Answer 0: (score: 1)

You can use Selenium to get the data.

library(RSelenium)
# startServer() belongs to the RSelenium release current when this was written;
# newer releases provide rsDriver() instead
RSelenium::startServer()
Sys.sleep(5)  # give the Selenium server time to start
appUrl <- "http://www.auction.co.uk/residential/onlineCatalogue.asp"
remDr <- remoteDriver()
remDr$open()
remDr$navigate(appUrl)
webElem <- remDr$findElement("css selector", '[href="onlineCatalogue.asp"]')
# check the element
webElem$highlightElement()
# click the link
webElem$clickElement()
# get the pages to click through
webElems <- remDr$findElements("css selector", "#Table7 a[href]")
appUrl <- c(appUrl, sapply(webElems, function(x) x$getElementAttribute("href")[[1]]))
out <- lapply(appUrl, function(x) {
  remDr$navigate(x)
  # get the table element
  webElem <- remDr$findElement("id", "Table6")
  # return the table's HTML
  webElem$getElementAttribute("outerHTML")[[1]]
})
remDr$close()
remDr$closeServer()

Now we can process the HTML:

# Process the HTML tables
library(XML)      # htmlParse, getNodeSet, xpathSApply, getChildrenStrings
library(selectr)  # querySelectorAll
asDF <- lapply(out, function(x) {
  xData <- htmlParse(x)
  lotAndLoc <- querySelectorAll(xData, "a.tooltip")
  # take every other match and pull out the lot number and image URL
  alsopLot <- lapply(lotAndLoc[c(TRUE, FALSE)], function(x) {
    lot <- getNodeSet(x, ".//span[@class = 'lotnum']")
    lot <- xmlValue(lot[[1]])
    img <- getNodeSet(x, ".//img")
    img <- xmlGetAttr(img[[1]], "src")
    data.frame(lot = lot, img = img)
  })
  alsopLot <- do.call(rbind.data.frame, alsopLot)
  # columns 2 and 4 hold type and price; [-1] drops the header row
  alsopType <- xpathSApply(xData, "//tr/td[2]", xmlValue)[-1]
  alsopPrice <- xpathSApply(xData, "//tr/td[4]", xmlValue)[-1]
  # strip stray encoding characters from the price text
  alsopPrice <- gsub("ÂÂ", "", alsopPrice)
  # join the non-empty pieces of each address with "~"
  alsopAddr <- xpathSApply(xData, "//tr/td[3]/*//span[@class='text']", function(x) {
    Addr <- getChildrenStrings(x)[names(getChildrenStrings(x)) %in% c("text", "span")]
    Addr <- gsub("\\n\\s*", "", Addr)
    Addr <- Addr[Addr != ""]
    paste(Addr, collapse = "~")
  })

  alsopDf <- data.frame(type = alsopType, price = alsopPrice, address = alsopAddr)
  cbind.data.frame(alsopLot, alsopDf)
})
asDF <- do.call(rbind.data.frame, asDF)
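The logical index applied to lotAndLoc in the code above relies on R recycling a short logical vector along a longer one, which keeps every other element. The sketch below shows that on a plain character vector; the labels are made up, and the idea that each lot produces two a.tooltip matches is an assumption, since the answer does not say so.

```r
# Hypothetical stand-in for the node list: two matches per lot
nodes <- c("lot1-a", "lot1-b", "lot2-a", "lot2-b", "lot3-a", "lot3-b")

# c(TRUE, FALSE) is recycled along the vector, keeping positions 1, 3, 5, ...
first_of_each <- nodes[c(TRUE, FALSE)]
print(first_of_each)

# The same selection written as an explicit numeric index
same <- nodes[seq(1, length(nodes), by = 2)]
```

Spelling out TRUE and FALSE is a little safer than T and F, because T and F are ordinary variables that can be reassigned.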

You will need to tidy up the address, but the rest of the data is what you wanted:

> head(asDF)
  lot                                                                   img
1   1 http://www.auction.co.uk/residential/data/bigimages/may2014/arbp1.jpg
2   2 http://www.auction.co.uk/residential/data/bigimages/may2014/arbp2.jpg
3   3 http://www.auction.co.uk/residential/data/bigimages/may2014/arbp3.jpg
4   4 http://www.auction.co.uk/residential/data/bigimages/may2014/arbp4.jpg
5   5 http://www.auction.co.uk/residential/data/bigimages/may2014/arbp5.jpg
6   6 http://www.auction.co.uk/residential/data/bigimages/may2014/arbp6.jpg
                            type               price
1        VACANT - Leasehold Flat           £225,000+
2        VACANT - Leasehold Flat           £160,000+
3     VACANT - Freehold Building           £250,000+
4        VACANT - Leasehold Flat           £180,000+
5                 Freehold House           £180,000+
6 INVESTMENT - Freehold Building £110,000 - £120,000
                                                                  address
1 1~London E2~Flat 14 Bridge Wharf~230 Old Ford Road~Bethnal Green~E2 9PR
2                                   2~London W3~17 York Road~Acton~W3 6TS
3                 3~London SE27~23 Thurlestone Road~West Norwood~SE27 0PE
4             4~London N16~Flat G~74 Darenth Road~Stoke Newington~N16 6ED
5                              5~Ilford~11 Cavenham Gardens~Essex~IG1 1XX
6                                  6~Ilford~52 Balfour Road~Essex~IG1 4JG
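One possible way to tidy the address column is to split each string on the "~" separator. This is only a sketch: tidy_address is a hypothetical helper, and it assumes, based solely on the sample rows above, that the first piece repeats the lot number, the second is the town or area, and the last is the postcode.

```r
# Sample addresses copied from the head(asDF) output
address <- c(
  "1~London E2~Flat 14 Bridge Wharf~230 Old Ford Road~Bethnal Green~E2 9PR",
  "2~London W3~17 York Road~Acton~W3 6TS",
  "5~Ilford~11 Cavenham Gardens~Essex~IG1 1XX"
)

# Hypothetical helper: split one "~"-delimited address into named parts
tidy_address <- function(x) {
  parts <- strsplit(x, "~", fixed = TRUE)[[1]]
  n <- length(parts)
  data.frame(
    lot      = parts[1],                                   # assumed: lot number
    town     = parts[2],                                   # assumed: town or area
    street   = paste(parts[-c(1, 2, n)], collapse = ", "), # everything in between
    postcode = parts[n],                                   # assumed: postcode
    stringsAsFactors = FALSE
  )
}

tidied <- do.call(rbind, lapply(address, tidy_address))
print(tidied)
```

Because asDF$address is a factor, it would need to be passed as as.character(asDF$address) rather than directly.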

The data frame asDF has the expected number of lots:

> str(asDF)
'data.frame':   347 obs. of  5 variables:
 $ lot    : Factor w/ 347 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ img    : Factor w/ 347 levels "http://www.auction.co.uk/residential/data/bigimages/may2014/arbp1.jpg",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ type   : Factor w/ 102 levels "Freehold Building",..: 30 30 23 30 2 5 23 1 1 19 ...
 $ price  : Factor w/ 151 levels "£1.25M - £1.5M",..: 31 19 33 21 21 9 54 68 68 68 ...
 $ address: Factor w/ 347 levels "1~London E2~Flat 14 Bridge Wharf~230 Old Ford Road~Bethnal Green~E2 9PR",..: 1 14 27 38 49 60 71 82 94 2 ...
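Every column in the str() output is a factor, so the values have to be converted to text before they can be used as numbers. The sketch below is one way that might be done for the lot number and the guide price. guide_low is a hypothetical helper, and the price formats it handles ("£225,000+", "£110,000 - £120,000", "£1.25M - £1.5M") are just the ones visible in the output above; other formats on the site may need extra handling.

```r
# Sample values copied from the output, stored as factors like in asDF
lot   <- factor(c("1", "2", "10"))
price <- factor(c("£225,000+", "£110,000 - £120,000", "£1.25M - £1.5M"))

# as.integer() on a factor returns the internal level codes,
# so go through as.character() to recover the actual lot numbers
lot_num <- as.integer(as.character(lot))

# Hypothetical helper: lower bound of a guide price, in pounds
guide_low <- function(x) {
  x <- as.character(x)                         # factor -> text
  first <- sub("\\s*-.*$", "", x)              # keep the part before any " - " range
  millions <- grepl("M", first, fixed = TRUE)  # values quoted as e.g. "£1.25M"
  num <- as.numeric(gsub("[^0-9.]", "", first))
  ifelse(millions, num * 1e6, num)
}

price_low <- guide_low(price)
print(data.frame(lot = lot_num, price_low = price_low))
```

Alternatively, passing stringsAsFactors = FALSE to the data.frame() calls in the processing code would avoid creating factors in the first place.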