Scraping URL directory IDs in R

Time: 2015-03-19 18:59:46

Tags: r url web-scraping rvest

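What is the best way to scrape everything in a URL directory in R when the pages are not numbered with sequential IDs? I want to get everything under http://www.metalmusicarchives.com/album/, but every page in that directory has a URL of the form http://www.metalmusicarchives.com/album/[BAND NAME]/[ALBUM NAME]. I tried to account for every name in the album directory, but my attempt only produces the warnings shown after the code: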

bandurls <- unlist(lapply(LETTERS, function(letter)
  xpathSApply(htmlParse(paste0("http://www.metalmusicarchives.com/ListArtistsAlpha.aspx?letter=", letter)), '//div[@class="artistsListContainer"]/ul/li/a', xmlGetAttr, "href")
))
bands <- setNames(sub(".*/(.*)", "\\1", bandurls), bandurls)

albums <- sapply(bands, function(band) {
  doc <- htmlParse(paste0("http://www.metalmusicarchives.com/artist/", band))
  sapply(doc[paste0('//div[@class="discographyContainer"]/a[starts-with(@href,"/album/', band, '")]')], xmlGetAttr, "href")
})


URL <- sprintf("http://www.metalmusicarchives.com", albums)

METAL.SCRAPER <- function(ID) {
  PaGE <- try(html(sprintf(URL, ID)), silent=TRUE)
  if (inherits(PaGE, "try-error")) {
    data.frame(Band=character(0), Year=character(0), Tracklist=character(0),Lineup=character(0),
           Release=character(0), Genre=character(0), Rating=character(0))
  } else {
    data.frame(Band=PaGE %>% html_nodes(xpath='//head') %>% html_text(),
           Year=PaGE %>% html_nodes(xpath='//h3[1]') %>% html_text(),
           Tracklist=PaGE %>% html_nodes(xpath='//div[@id="albumInfosDetails"]') %>% html_text(),
           Lineup=PaGE %>% html_nodes(xpath='//div[@id="albumInfosDetails"]') %>% html_text(),
           Release=PaGE %>% html_nodes(xpath='//div[@id="albumInfosDetails"]') %>% html_text(),
           Genre=PaGE %>% html_nodes(xpath='//span[@id="ctl00_MainContentPlaceHolder_AlbumInfosRepeater_ctl00_FiledUnderLabel"]') %>% html_text(),
           Rating=PaGE %>% html_nodes(xpath='//span[@itemprop="average"]') %>% html_text(),
           stringsAsFactors=FALSE)
 }
}

Sys.sleep(2)

DaTa <- rbindlist(pblapply(URL, METAL.SCRAPER))

Warning messages:
1: In if (grepl("^http", x)) { ... :
  the condition has length > 1 and only the first element will be used
2: In if (grepl("^http", x)) { ... :
  the condition has length > 1 and only the first element will be used
3: In if (grepl("^http", x)) { ... :
  the condition has length > 1 and only the first element will be used

1 Answer:

Answer 0 (score: 1)

In theory, this is one way to grab it:

library(XML)
# Collect the artist links from each A-Z index page
bandurls <- unlist(lapply(LETTERS, function(letter)
  xpathSApply(
    htmlParse(paste0("http://www.metalmusicarchives.com/ListArtistsAlpha.aspx?letter=", letter)),
    '//div[@class="artistsListContainer"]/ul/li/a', xmlGetAttr, "href")
))
# Keep only the part after the last slash, named by the original href
bands <- setNames(sub(".*/(.*)", "\\1", bandurls), bandurls)
# For each band, collect the album links found on its artist page
albums <- sapply(bands, function(band) {
  doc <- htmlParse(paste0("http://www.metalmusicarchives.com/artist/", band))
  sapply(doc[paste0('//div[@class="discographyContainer"]/a[starts-with(@href, "/album/', band, '")]')], xmlGetAttr, "href")
})
albums
# $`/artist/a-band-called-pain`
# [1] "/album/a-band-called-pain/broken-dreams" "/album/a-band-called-pain/broken-dreams"
# 
# $`/artist/a-band-of-orcs`
# [1] "/album/a-band-of-orcs/warchiefs-of-the-apocalypse(ep)"
# [2] "/album/a-band-of-orcs/warchiefs-of-the-apocalypse(ep)"
# [3] "/album/a-band-of-orcs/hall-of-the-frozen-dead(single)"
# [4] "/album/a-band-of-orcs/hall-of-the-frozen-dead(single)"
# ...
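
The paths in `albums` are relative, and the printed output shows that each one can appear more than once. A minimal sketch, in base R, of flattening such a list into unique absolute URLs follows. The list `albums_sample` is a hypothetical stand-in that copies a few entries from the output, and `base_url`, `album_paths` and `album_urls` are names chosen here for illustration:

```r
# Hypothetical stand-in for the `albums` list (entries copied from the output)
albums_sample <- list(
  "/artist/a-band-called-pain" = c(
    "/album/a-band-called-pain/broken-dreams",
    "/album/a-band-called-pain/broken-dreams"
  ),
  "/artist/a-band-of-orcs" = c(
    "/album/a-band-of-orcs/warchiefs-of-the-apocalypse(ep)",
    "/album/a-band-of-orcs/warchiefs-of-the-apocalypse(ep)",
    "/album/a-band-of-orcs/hall-of-the-frozen-dead(single)",
    "/album/a-band-of-orcs/hall-of-the-frozen-dead(single)"
  )
)

base_url <- "http://www.metalmusicarchives.com"

# Flatten the per-band list, drop repeated paths, then prepend the host
album_paths <- unique(unlist(albums_sample, use.names = FALSE))
album_urls <- paste0(base_url, album_paths)

print(album_urls)
```

With the real `albums` object in place of the sample, the same two lines should give one full URL per album page.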

However, before scraping the site, you should ask the webmaster whether that is allowed. \m/
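
On the warnings in the question: `the condition has length > 1` suggests that a whole vector of URLs reached a function that tests a single URL, and a `Sys.sleep(2)` placed outside the scraping function runs only once rather than between requests. A minimal sketch of handling one URL per call, with a pause before each request, might look like this. The names `scrape_all`, `fake_scraper` and `delay` are hypothetical, and the stub stands in for a real page parser so that the example runs without network access:

```r
# Hypothetical helper: call `scraper` on one URL at a time, pausing first
scrape_all <- function(urls, scraper, delay = 2) {
  lapply(urls, function(u) {
    stopifnot(length(u) == 1L)  # each call sees exactly one URL
    Sys.sleep(delay)            # pause before every request
    scraper(u)
  })
}

# Stub standing in for a real page parser; returns a one-row data frame
fake_scraper <- function(url) {
  data.frame(URL = url, Slug = basename(url), stringsAsFactors = FALSE)
}

urls <- c(
  "http://www.metalmusicarchives.com/album/a-band-called-pain/broken-dreams",
  "http://www.metalmusicarchives.com/album/a-band-of-orcs/hall-of-the-frozen-dead(single)"
)

# delay = 0 keeps the demonstration fast; a real run would keep a pause
pages <- scrape_all(urls, fake_scraper, delay = 0)
result <- do.call(rbind, pages)
print(result)
```

Replacing `fake_scraper` with a function like the question's `METAL.SCRAPER`, changed to accept a single full URL, would be one way to apply this pattern.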