How to collect multiple URLs from XHTML

Asked: 2015-08-05 08:23:34

Tags: r xpath web-scraping

I am quite new to XPath and to R in general, so I hope my question isn't too silly.

I want to collect multiple URLs (search results) from this web page: http://www.totaljobs.com/JobSearch/Results.aspx?Keywords=Leadership&LTxt=&Radius=10&RateType=0&JobType1=&CompanyType=&PageNum=


One of the links I want to extract looks like this in the page source:

<h2>
    <a id="resultsList_rptSearchResults_ctl00_lnkJobTitle" property="title" href="/JobSearch/JobDetails.aspx?JobId=63057920&amp;Keywords=Leadership&amp;LTxt=&amp;Radius=10&amp;RateType=0&amp;JobType1=&amp;CompanyType=&amp;PageNum=2">Adult Social Care - Senior Leadership (Mental Health)</a>
</h2>

The result of print(length(allLinks)) is mostly: [[n]] NULL

I have tried multiple XPath expressions (at least I think that is where the problem lies), including the one shown in the code. I also tried this:

library(RCurl)
library(XML)

pageNum <- 1:10
url <- "http://www.totaljobs.com/JobSearch/Results.aspx?Keywords=Leadership&LTxt=&Radius=10&RateType=0&JobType1=&CompanyType=&PageNum="

urls <- paste0(url, pageNum)
allLinks <- list()
for (url in urls) {
  doc <- getURLContent(url)[[1]]
  xmlDoc <- htmlParse(doc)
  xPath <- "//*[contains(concat(' ', @class, ' '), concat(' ', 'hd', ' '))]"
  linkToArticle <- XML::getNodeSet(xmlDoc, xPath)
  linkUrls <- sapply(linkToArticle, function(x) XML::xmlGetAttr(x, "href"))
  allLinks <- c(allLinks, linkUrls)
}

print(length(allLinks))

But it only gives me the result x for each URL of pages 1-10.
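As an aside, the token-wise class test that XPath is reaching for (pad @class with spaces so 'hd' matches only as a whole word) can be checked in isolation. A minimal sketch, shown in Python with lxml only because the XPath expression itself is language-agnostic; the snippet and href values below are made up:

```python
from lxml import html

# Made-up fragment mimicking one search result; 'hdx' is there to show
# that the padded test does not match it as a prefix of 'hd'.
snippet = """
<div>
  <h2 class="hd"><a href="/JobSearch/JobDetails.aspx?JobId=1">Job one</a></h2>
  <h2 class="hdx"><a href="/other">Should not match</a></h2>
</div>
"""

doc = html.fromstring(snippet)

# Pad @class with spaces so 'hd' matches only as a whole class token,
# then descend to the anchor and take its href.
xpath = "//*[contains(concat(' ', normalize-space(@class), ' '), ' hd ')]//a/@href"
print(doc.xpath(xpath))  # ['/JobSearch/JobDetails.aspx?JobId=1']
```

Note that the expression selects the h2 element, so descending to the a element (//a) is needed before reading href.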

If anyone could point me in the right direction, that would be awesome.

2 Answers:

Answer 0 (score: 1)

Obligatory Hadleyverse version:


library(rvest)
library(httr)
library(pbapply)

base_url <- "http://www.totaljobs.com/JobSearch/Results.aspx?Keywords=Leadership&LTxt=&Radius=10&RateType=0&JobType1=&CompanyType=&PageNum=%d"

unlist(pblapply(1:10, function(i) {

  # grab the page
  pg <- html_session(sprintf(base_url, i),
                     user_agent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.15 Safari/537.36"))

  # extract the links
  pg %>%
    html_nodes("a[id^='resultsList_rptSearchResults'][href^='/JobSearch']") %>%
    html_attr("href")

})) -> links

This uses CSS selectors instead of XPath, and pblapply gives you a progress bar for free. I needed to set the user agent because the site was blocking me (403).
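The CSS selector used here, a[id^='resultsList_rptSearchResults'][href^='/JobSearch'], corresponds to two starts-with() predicates in XPath. A quick illustration of that equivalence (Python with lxml; the fragment and href values are made up):

```python
from lxml import html

# Made-up fragment with the id/href pattern the CSS selector targets.
snippet = """
<div>
  <a id="resultsList_rptSearchResults_ctl00_lnkJobTitle"
     href="/JobSearch/JobDetails.aspx?JobId=1">Match</a>
  <a id="footerLink" href="/About">No match</a>
</div>
"""

doc = html.fromstring(snippet)

# CSS a[id^='...'][href^='...'] expressed as XPath starts-with() predicates.
xpath = ("//a[starts-with(@id, 'resultsList_rptSearchResults')]"
         "[starts-with(@href, '/JobSearch')]/@href")
print(doc.xpath(xpath))  # ['/JobSearch/JobDetails.aspx?JobId=1']
```

Filtering on both the id prefix and the href prefix keeps navigation links out of the result set.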

Answer 1 (score: 0)

You were almost there:

library(RCurl)
library(XML)

pageNum <- 1:10
url <- paste0("http://www.totaljobs.com/JobSearch/Results.aspx?Keywords=Leadership&LTxt=&Radius=10&RateType=0&JobType1=&CompanyType=&PageNum=") 
urls <- paste0(url, pageNum) 

allPages <- lapply(urls, function(x) getURLContent(x)[[1]])
xmlDocs <- lapply(allPages, function(x) XML::htmlParse(x))

ResultsPerPage <- 19

# Essentially this is the difference from your code
xPath <- paste0("//*[@id='resultsList_rptSearchResults_ctl", 
                ifelse(nchar(0:ResultsPerPage)==1, paste0("0", (0:ResultsPerPage)), (0:ResultsPerPage)),
               "_lnkJobTitle']")

linksToArticle <- unlist(lapply(xmlDocs, function(x) XML::getNodeSet(x, xPath)))
linkUrls <- lapply(linksToArticle, function (x) XML::xmlGetAttr(x, "href")) 

#Remove all objects except for linkUrls
rm(list=ls()[!(ls()=="linkUrls")])

length(linkUrls)
print(paste0("http://www.totaljobs.com", linkUrls))
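The final paste0 simply prepends the host to the scraped site-relative hrefs. The general form of that operation is URL joining, which also handles hrefs that are already absolute; a small sketch (Python standard library, with made-up JobId values):

```python
from urllib.parse import urljoin

# Site-relative hrefs as scraped from the results pages (values made up).
base = "http://www.totaljobs.com/"
link_urls = ["/JobSearch/JobDetails.aspx?JobId=63057920",
             "/JobSearch/JobDetails.aspx?JobId=63057921"]

# urljoin resolves each relative href against the site root.
full = [urljoin(base, u) for u in link_urls]
print(full[0])  # http://www.totaljobs.com/JobSearch/JobDetails.aspx?JobId=63057920
```

Unlike plain string concatenation, urljoin would leave an already-absolute href untouched instead of doubling the host.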