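I am trying to summarise the data available on the Alsop website (http://www.auction.co.uk/residential/onlineCatalogue.asp).
Ideally, I would like to end up with a data.frame
containing the following fields from the site:
lot number, type, location/full address, guide price, number of bedrooms, and the URLs of any photos.
I tried using Google Chrome's Inspect Element together with htmlParse
(usually to find the links), but I get the same URL for every lot number, namely http://www.auction.co.uk/residential/LotDetails.asp?A=877&MP=24&ID=877000001&S=L&O=A
So I am a bit stumped, because my usual approach of scraping a site by looking for links no longer works.
I prefer R, but I would like to know whether Python would be more useful, and I am open to suggestions on how to achieve this.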
Answer 0 (score: 1)
You can use Selenium to get the data.
require(RSelenium)
RSelenium::startServer()
Sys.sleep(5)
appUrl <- "http://www.auction.co.uk/residential/onlineCatalogue.asp"
remDr <- remoteDriver()
remDr$open()
remDr$navigate("http://www.auction.co.uk/residential/onlineCatalogue.asp")
webElem <- remDr$findElement("css selector", '[href="onlineCatalogue.asp"]')
# check Element
webElem$highlightElement()
# click link
webElem$clickElement()
# get the pages to click thru
webElems <- remDr$findElements("css selector", "#Table7 a[href]")
appUrl <- c(appUrl, sapply(webElems, function(x){x$getElementAttribute("href")[[1]]}))
out <- lapply(appUrl, function(x){
remDr$navigate(x)
# get table data
webElem <- remDr$findElement("id", "Table6")
# get table html
appData <- webElem$getElementAttribute("outerHTML")[[1]]
}
)
remDr$close()
remDr$closeServer()
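Now we can process the HTML:
# Process the HTML tables
library(XML)      # htmlParse, getNodeSet, xpathSApply, xmlValue, xmlGetAttr
library(selectr)  # querySelectorAll
asDF <- lapply(out, function(x){
appData <- x
xData <- htmlParse(appData)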
# each lot appears to yield two a.tooltip anchors; keep every other one (the lot link)
lotAndLoc <- querySelectorAll(xData, "a.tooltip")
alsopLot <- lapply(lotAndLoc[c(TRUE, FALSE)], function(x){
lot <- getNodeSet(x, ".//span[@class = 'lotnum']")
lot <- xmlValue(lot[[1]])
img <- getNodeSet(x, ".//img")
img <- xmlGetAttr(img[[1]], "src")
data.frame(lot = lot, img = img)
})
alsopLot <- do.call(rbind.data.frame, alsopLot)
alsopType <- xpathSApply(xData, "//tr/td[2]", xmlValue)[-1]
alsopPrice <- xpathSApply(xData, "//tr/td[4]", xmlValue)[-1]
alsopPrice <- gsub("ÂÂ", "", alsopPrice)
alsopAddr <- xpathSApply(xData, "//tr/td[3]/*//span[@class='text']", function(x){
Addr <- getChildrenStrings(x)[names(getChildrenStrings(x)) %in% c("text", "span")]
Addr <- gsub("\\n\\s*", "", Addr)
Addr <- Addr[Addr != ""]
paste(Addr, collapse = "~")
})
alsopDf <- data.frame(type = alsopType, price = alsopPrice, address = alsopAddr)
alsopDf <- cbind.data.frame(alsopLot, alsopDf)
alsopDf
}
)
asDF <- do.call(rbind.data.frame, asDF)
You will need to tidy up the addresses, but the rest of the data is what you wanted:
> head(asDF)
lot img
1 1 http://www.auction.co.uk/residential/data/bigimages/may2014/arbp1.jpg
2 2 http://www.auction.co.uk/residential/data/bigimages/may2014/arbp2.jpg
3 3 http://www.auction.co.uk/residential/data/bigimages/may2014/arbp3.jpg
4 4 http://www.auction.co.uk/residential/data/bigimages/may2014/arbp4.jpg
5 5 http://www.auction.co.uk/residential/data/bigimages/may2014/arbp5.jpg
6 6 http://www.auction.co.uk/residential/data/bigimages/may2014/arbp6.jpg
type price
1 VACANT - Leasehold Flat £225,000+
2 VACANT - Leasehold Flat £160,000+
3 VACANT - Freehold Building £250,000+
4 VACANT - Leasehold Flat £180,000+
5 Freehold House £180,000+
6 INVESTMENT - Freehold Building £110,000 - £120,000
address
1 1~London E2~Flat 14 Bridge Wharf~230 Old Ford Road~Bethnal Green~E2 9PR
2 2~London W3~17 York Road~Acton~W3 6TS
3 3~London SE27~23 Thurlestone Road~West Norwood~SE27 0PE
4 4~London N16~Flat G~74 Darenth Road~Stoke Newington~N16 6ED
5 5~Ilford~11 Cavenham Gardens~Essex~IG1 1XX
6 6~Ilford~52 Balfour Road~Essex~IG1 4JG
The data frame asDF
has the required number of lots:
> str(asDF)
'data.frame': 347 obs. of 5 variables:
$ lot : Factor w/ 347 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
$ img : Factor w/ 347 levels "http://www.auction.co.uk/residential/data/bigimages/may2014/arbp1.jpg",..: 1 2 3 4 5 6 7 8 9 10 ...
$ type : Factor w/ 102 levels "Freehold Building",..: 30 30 23 30 2 5 23 1 1 19 ...
$ price : Factor w/ 151 levels "£1.25M - £1.5M",..: 31 19 33 21 21 9 54 68 68 68 ...
$ address: Factor w/ 347 levels "1~London E2~Flat 14 Bridge Wharf~230 Old Ford Road~Bethnal Green~E2 9PR",..: 1 14 27 38 49 60 71 82 94 2 ...
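As for tidying the addresses, here is a minimal sketch of one way to do it. It is not part of the original answer. It assumes that every "~"-joined address string has the layout seen in the head(asDF) output: lot number first, area second, postcode last, and the street lines in between. The helper name tidy_address and the output column names are hypothetical, and the layout should be checked against all 347 rows before relying on it.

```r
# Hypothetical helper (an assumption, not from the original answer):
# split "~"-joined address strings into lot, area, street and postcode.
tidy_address <- function(address) {
  # as.character() because the address column is stored as a factor
  parts <- strsplit(as.character(address), "~", fixed = TRUE)
  data.frame(
    lot = vapply(parts, function(p) p[1], character(1)),
    area = vapply(parts, function(p) p[2], character(1)),
    # everything between the area and the postcode is treated as street lines
    street = vapply(parts, function(p) {
      n <- length(p)
      if (n > 3) paste(p[3:(n - 1)], collapse = ", ") else NA_character_
    }, character(1)),
    # assumed: the last field is always the postcode
    postcode = vapply(parts, function(p) {
      if (length(p) > 0) p[length(p)] else NA_character_
    }, character(1)),
    stringsAsFactors = FALSE
  )
}

# Sample rows copied from the head(asDF) output
sample_addr <- c(
  "1~London E2~Flat 14 Bridge Wharf~230 Old Ford Road~Bethnal Green~E2 9PR",
  "2~London W3~17 York Road~Acton~W3 6TS"
)
tidy <- tidy_address(sample_addr)
print(tidy)
```

To apply this to the full result, something like tidy_address(asDF$address) could be bound back onto asDF with cbind, dropping the duplicated lot column first.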