Scraping an .aspx website with R

Asked: 2013-05-30 01:09:27

Tags: r web-scraping

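I am trying to use R to scrape data from a website for a task.

  1. I want to go through every link on the following page: http://capitol.hawaii.gov/advreports/advreport.aspx?year=2013&report=deadline&rpt_type=&measuretype=hb&title=House Bills

  2. Select only the items whose Current Status reads "Transmitted to Governor". For example, http://capitol.hawaii.gov/measure_indiv.aspx?billtype=HB&billnumber=17&year=2013

  3. Then scrape the cells under STATUS TEXT for the entry that starts with "Passed Final Reading". For example: Passed Final Reading as amended in SD 2 with Representative(s) Fale, Jordan, Tsuji voting aye with reservations; Representative(s) Cabanilla, Morikawa, Oshiro, Tokioka voting no (4) and none excused (0).

  4. I have tried earlier examples that use the RCurl and XML packages (in R), but I don't know how to use them correctly on an aspx site. So what I would like is: 1. Some suggestions on how to build such code. 2. Advice on how to learn what is needed to carry out this kind of task.

    Thanks for your help,

    Tom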

2 answers:

Answer 0 (score: 5)

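library(httr)
library(XML)

basePage <- "http://capitol.hawaii.gov"

# one handle, reused below, so cookies and the connection persist across requests
h <- handle(basePage)

# initial request to the base page
GET(handle = h)

# request the report page that lists the House bills
res <- GET(handle = h, path = "/advreports/advreport.aspx?year=2013&report=deadline&rpt_type=&measuretype=hb&title=House")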

# parse content for "Transmitted to Governor" text
resXML <- htmlParse(content(res, as = "text"))
resTable <- getNodeSet(resXML, '//*/table[@id ="GridViewReports"]/tr/td[3]')
appRows <- sapply(resTable, xmlValue)
include <- grepl("Transmitted to Governor", appRows)
resUrls <- xpathSApply(resXML, '//*/table[@id ="GridViewReports"]/tr/td[2]//@href')

appUrls <- resUrls[include]

# look at just the first

res <- GET(handle = h, path = appUrls[1])

resXML <- htmlParse(content(res, as = "text"))


xpathSApply(resXML, '//*[text()[contains(.,"Passed Final Reading")]]', xmlValue)

[1] "Passed Final Reading as amended in SD 2 with Representative(s) Fale, Jordan,
 Tsuji voting aye with reservations; Representative(s) Cabanilla, Morikawa, Oshiro,
 Tokioka voting no (4) and none excused (0)."

Setting up a handle lets the httr package take care of all the background work.
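The filtering step can be tried offline on a small, made-up table before running it against the live site. The sketch below is only an illustration: the HTML fragment is hypothetical, and only the table id `GridViewReports` and the column order (link in the second cell, status in the third) are taken from the XPath expressions above.

```r
library(XML)

# Hypothetical stand-in for the report page (not real data from the site)
html <- '
<html><body>
<table id="GridViewReports">
  <tr><td>1</td><td><a href="/measure_indiv.aspx?billtype=HB&amp;billnumber=17&amp;year=2013">HB17</a></td><td>Transmitted to Governor.</td></tr>
  <tr><td>2</td><td><a href="/measure_indiv.aspx?billtype=HB&amp;billnumber=18&amp;year=2013">HB18</a></td><td>Referred to committee.</td></tr>
</table>
</body></html>'

doc <- htmlParse(html, asText = TRUE)

# status text from the third cell of every row
status <- xpathSApply(doc, '//table[@id="GridViewReports"]/tr/td[3]', xmlValue)

# link target from the second cell of every row
urls <- xpathSApply(doc, '//table[@id="GridViewReports"]/tr/td[2]//a',
                    xmlGetAttr, "href")

# keep only the links whose row status mentions the Governor
keep <- grepl("Transmitted to Governor", status, fixed = TRUE)
appUrls <- urls[keep]
print(appUrls)
```

This only lines up correctly when every row has exactly one link in its second cell, so it is worth comparing `length(status)` with `length(urls)` on the real page before trusting the index.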

If you want to loop over all 92 links:

 # get all the links returned as a list (will take some time)
 # print statement included for sanity
 res <- lapply(appUrls, function(x){print(sprintf("Got url no. %d",which(appUrls%in%x)));
                                   GET(handle = h, path = x)})
 resXML <- lapply(res, function(x){htmlParse(content(x, as = "text"))})
 appString <- sapply(resXML, function(x){
                   xpathSApply(x, '//*[text()[contains(.,"Passed Final Reading")]]', xmlValue)
                      })


 head(appString)

>  head(appString)
$href
[1] "Passed Final Reading as amended in SD 2 with Representative(s) Fale, Jordan, Tsuji voting aye with reservations; Representative(s) Cabanilla, Morikawa, Oshiro, Tokioka voting no (4) and none excused (0)."

$href
[1] "Passed Final Reading, as amended (CD 1). 25 Aye(s); Aye(s) with reservations: none . 0 No(es): none.  0 Excused: none."                                                  
[2] "Passed Final Reading as amended in CD 1 with Representative(s) Cullen, Har voting aye with reservations; Representative(s) McDermott voting no (1) and none excused (0)."

$href
[1] "Passed Final Reading, as amended (CD 1). 25 Aye(s); Aye(s) with reservations: none . 0 No(es): none.  0 Excused: none."                                 
[2] "Passed Final Reading as amended in CD 1 with none voting aye with reservations; Representative(s) Hashem, McDermott voting no (2) and none excused (0)."

$href
[1] "Passed Final Reading, as amended (CD 1). 24 Aye(s); Aye(s) with reservations: none . 0 No(es): none.  1 Excused: Ige."                    
[2] "Passed Final Reading as amended in CD 1 with none voting aye with reservations; none voting no (0) and Representative(s) Say excused (1)."

$href
[1] "Passed Final Reading, as amended (CD 1). 25 Aye(s); Aye(s) with reservations: none . 0 No(es): none.  0 Excused: none."                        
[2] "Passed Final Reading as amended in CD 1 with Representative(s) Johanson voting aye with reservations; none voting no (0) and none excused (0)."

$href
[1] "Passed Final Reading, as amended (CD 1). 25 Aye(s); Aye(s) with reservations: none . 0 No(es): none.  0 Excused: none."  
[2] "Passed Final Reading as amended in CD 1 with none voting aye with reservations; none voting no (0) and none excused (0)."
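Once the "Passed Final Reading" strings are collected, the vote counts can be pulled out with a regular expression. The sketch below uses base R only and two strings copied from the output above. The helper `count_after` is a hypothetical name introduced for illustration. It assumes the House-style wording ("voting no (4)", "excused (0)") and returns `NA` with a warning for lines worded differently, such as the "25 Aye(s); ..." entries.

```r
# Two strings copied from the output above
votes <- c(
  "Passed Final Reading as amended in SD 2 with Representative(s) Fale, Jordan, Tsuji voting aye with reservations; Representative(s) Cabanilla, Morikawa, Oshiro, Tokioka voting no (4) and none excused (0).",
  "Passed Final Reading as amended in CD 1 with none voting aye with reservations; none voting no (0) and Representative(s) Say excused (1)."
)

# Hypothetical helper: the number in parentheses that follows `label`
count_after <- function(text, label) {
  pattern <- paste0(".*", label, " \\(([0-9]+)\\).*")
  as.integer(sub(pattern, "\\1", text))
}

result <- data.frame(
  no      = count_after(votes, "voting no"),
  excused = count_after(votes, "excused"),
  stringsAsFactors = FALSE
)
print(result)
```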

Answer 1 (score: 0)

I am trying to read an .aspx page and followed your approach, but I am unable to get the data into R.

Here is the query:

res <- GET(handle = h, path = "https://www.morningstar.in/mutualfunds/f0gbr06rnd/hdfc-medium-term-debt-plan-growth/detailed-portfolio.aspx")

# parse content for the top 10 holdings

resXML <- htmlParse(content(res, as = "text"))

resTable <- getNodeSet(resXML, '/*[@id="quotePageContent"]/div/div/div[2]/div/div[2]/div[2]/div[1]/div[1]/table/tbody/tr[12]/tr')

appRows <- sapply(resTable, xmlValue)

include <- grepl("Top 10 Holdings", appRows)

The result is as follows:

  


> include
logical(0)

> appRows
list()

> resTable
NULL
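One possible cause of the empty result, not confirmed for this particular page, is the `tbody` step in the XPath. Browsers add `<tbody>` to the page they display, so an XPath copied from browser developer tools often contains it, while the raw HTML that `htmlParse` receives may not. The sketch below shows the effect on a hypothetical fragment. Only the id `quotePageContent` comes from the query above.

```r
library(XML)

# Hypothetical fragment: the raw markup has no <tbody> element
html <- '<html><body><div id="quotePageContent"><table>
  <tr><td>Top 10 Holdings</td></tr>
  <tr><td>Example row</td></tr>
</table></div></body></html>'

doc <- htmlParse(html, asText = TRUE)

# path as a browser would report it, including tbody
with_tbody <- getNodeSet(doc, '//*[@id="quotePageContent"]/table/tbody/tr')

# same path without the tbody step
without_tbody <- getNodeSet(doc, '//*[@id="quotePageContent"]/table/tr')

# relaxed path that does not depend on the exact nesting
relaxed <- getNodeSet(doc, '//*[@id="quotePageContent"]//tr')

print(length(with_tbody))
print(length(without_tbody))
print(length(relaxed))
```

If the relaxed path also returns nothing on the real page, the table is probably not in the downloaded HTML at all (for example because a script builds it in the browser), and printing `content(res, as = "text")` would confirm that.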