Using R and the GET function to query data from a website

Time: 2016-04-18 19:35:08

Tags: html r rvest

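I am new to web scraping, and I need to download data that only appears after several steps of a query. That is, I need to fill in two fields on the first page, then click the text shown in bold on the following page, then identify the data table by its capitalised labels and download it.

I started with the GET function and passed the names I wanted as a list to the "query" argument. However, even though I am a long-time R user, I cannot even decipher the error I get.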

GET("http://apps.kew.org/wcsp/advsearch.do;jsessionid=15925570A99B794122939889DE7DCDBC",path = "search", query =list(Genus="Imperata",Species="cylindrica"))


Response [http://apps.kew.org/search;jsessionid=15925570A99B794122939889DE7DCDBC?Genus=Imperata&Species=cylindrica]
  Date: 2016-04-18 18:29
  Status: 404
  Content-Type: text/html; charset=iso-8859-1
  Size: 445 B

404 Not Found

Not Found

The requested URL /search;jsessionid=15925570A99B794122939889DE7DCDBC was not found ...

Additionally, a 403 Forbidden error was encountered while trying to use an ErrorDocument to handle the request.

Apache/2.2.3 (Red Hat) Server at apps.kew.org Port 80
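
A note on the 404 itself: the response URL points at /search rather than /wcsp/advsearch.do. In httr, extra arguments to GET such as `path` and `query` are passed on to `modify_url()`, and `path` replaces the path component of the URL instead of appending to it. The following is a minimal offline sketch of that behaviour (it only builds URL strings and sends no request; the session id is left out for brevity):

```r
library(httr)

# Offline illustration: modify_url() only builds a URL string, no request is sent.
base_url <- "http://apps.kew.org/wcsp/advsearch.do"
params <- list(Genus = "Imperata", Species = "cylindrica")

# `path` replaces the whole path component, so "/wcsp/advsearch.do" is dropped.
replaced <- modify_url(base_url, path = "search", query = params)

# Omitting `path` keeps the original path and only adds the query string.
kept <- modify_url(base_url, query = params)

print(replaced)
print(kept)
```

This only explains why the server could not find the requested page; it does not by itself make the search work.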

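1 Answer:

Answer 0: (score: 0)

It is probably not working because the search form is submitted as a POST request rather than a GET request (by the way, you can use my curlconverter package to help work with these "hidden" APIs):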

library(httr)
library(rvest)

res <- httr::POST(url = "http://apps.kew.org/wcsp/advsearch.do", 
           body = list(page = "advancedSearch", 
                       AttachmentExist = "", 
                       family = "", 
                       placeOfPub = "", 
                       genus = "Imperata", 
                       yearPublished = "", 
                       species = "cylindrica", 
                       author = "", 
                       infraRank = "", 
                       infraEpithet = "", 
                       selectedLevel = "cont"), 
           encode = "form") 


pg <- content(res, as="parsed")

html_text(html_nodes(pg, "a.onwardnav"))

##  [1] "Imperata cylindrica (L.) P.Beauv., Ess. Agrostogr.: 165 (1812)."                                                
##  [2] "Imperata cylindrica var. africana (Andersson) C.E.Hubb., Joint Publ. Imp. Agric. Bur. 7: 10 (1944)."            
##  [3] "Imperata cylindrica var. condensata (Steud.) Hack., Anales Mus. Nac. Hist. Nat. Buenos Aires 21: 9 (1911)."     
##  [4] "Imperata cylindrica var. europaea (Andersson) Asch. & Graebn., Syn. Mitteleur. Fl. 2(1): 37 (1898)."            
##  [5] "Imperata cylindrica subsp. koenigii (Retz.) Masamura & Yanagih., Trans. Nat. Hist. Soc. Formosa 31: 326 (1941)."
##  [6] "Imperata cylindrica subvar. koenigii (Retz.) T.Durand & Schinz, Consp. Fl. Afric. 5: 694 (1894)."               
##  [7] "Imperata cylindrica var. koenigii (Retz.) Pilg., Fragm. Fl. Philipp. 1: 137 (1904)."                            
##  [8] "Imperata cylindrica var. latifolia (Hook.f.) C.E.Hubb., Joint Publ. Imp. Agric. Bur. 7: 14 (1944)."             
##  [9] "Imperata cylindrica var. major (Nees) C.E.Hubb., Grasses Mauritius: 96 (1940)."                                 
## [10] "Imperata cylindrica var. mexicana (Rupr. ex Galeotti) D.B.Ward, Novon 14: 368 (2004)."                          
## [11] "Imperata cylindrica f. pallida Honda, J. Fac. Sci. Univ. Tokyo, Sect. 3, Bot. 3: 374 (1930)."                   
## [12] "Imperata cylindrica var. parviflora Batt. & Trab., Bull. Soc. Bot. France 53: 32 (1906)."                       
## [13] "Imperata cylindrica var. pedicellata (Steud.) Debeaux, Actes Soc. Linn. Bordeaux 32: 52 (1878)."                
## [14] "Imperata cylindrica var. thunbergii (Retz.) T.Durand & Schinz, Consp. Fl. Afric. 5: 693 (1894), nom. superfl."  

lnks <- html_attr(html_nodes(pg, "a.onwardnav"), "href")
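
Since the link text and the `href` come from the same nodes, it can be handy to keep them side by side. Here is a small offline sketch of that idea; the HTML string, the `name_id` values, and the `demo_*` names are invented for illustration and are not taken from the real results page:

```r
library(rvest)

# Hypothetical markup imitating a results page (not copied from the real site).
demo_html <- '<html><body>
  <p><a class="onwardnav" href="/wcsp/namedetail.do?name_id=1">Imperata cylindrica</a></p>
  <p><a class="onwardnav" href="/wcsp/namedetail.do?name_id=2">Imperata cylindrica var. major</a></p>
  <p><a class="other" href="/wcsp/home.do">Home</a></p>
</body></html>'

demo_pg <- read_html(demo_html)

# Select only the anchors carrying the "onwardnav" class.
demo_nodes <- html_nodes(demo_pg, "a.onwardnav")

# Pair each link's visible text with its href attribute.
demo_results <- data.frame(
  name = html_text(demo_nodes),
  href = html_attr(demo_nodes, "href"),
  stringsAsFactors = FALSE
)

print(demo_results)
```

With a table like this, the row for the wanted name can be picked by its text instead of relying on its position in `lnks`.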

res2 <- GET(sprintf("http://apps.kew.org%s", lnks[1]))
pg2 <- content(res2, as="parsed")

trimws(gsub("[[:space:]]+", " ", html_text(html_nodes(pg2, "th + td"))))

## [1] "Medit. to Africa and Afghanistan 12 BAL COR FRA POR SAR SPA 13 ALB BUL GRC ITA KRI SIC TUE YUG 20 ALG EGY LBY MOR TUN 21 CNY CVI MDR 22 BEN BKN GAM GHA GNB GUI IVO LBR MLI NGA NGR SEN SIE TOG 23 BUR CAF CMN CON EQG GAB GGI RWA ZAI 24 CHA ETH SOC SUD 25 KEN TAN UGA 26 ANG MLW MOZ ZAM ZIM 27 BOT CPP LES NAM NAT OFS SWZ TVL 29 COM MAU MDG (32) kaz kgz tkm tzk uzb (33) ncs tcs 34 AFG CYP EAI IRN IRQ LBS PAL SIN TUR 35 KUW OMA? SAU YEM (36) chc chh chi chm chn chs cht chx (38) jap kor nns oga tai (40) ass ban ehm ind nep pak srl whm (41) and cbd lao mya ncb scs tha vie (42) bor cki jaw lsi mly mol phi sul sum xms (43) bis nwg sol (50) nfk (51) nzn (60) fij nwc sam ton van wal (62) mrn (73) ore (77) tex (78) ala fla geo lou msi sca vrg (79) mxs mxt"
## [2] "Hemicr. or rhizome geophyte"    
## [3] "Poaceae"                                     
## [4] "W.D.Clayton, R.Govaerts, K.T.Harman, H.Williamson & M.Vorontsova"
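
The `th + td` selector above picks every `td` that immediately follows a `th`, and the `gsub()`/`trimws()` pair collapses runs of whitespace and strips the ends. A minimal offline sketch of both steps, assuming a made-up table fragment (the labels `Family:` and `Lifeform:` and the helper `clean()` are hypothetical, chosen only for illustration):

```r
library(rvest)

# Hypothetical detail-page fragment; labels and layout are invented for illustration.
demo_html <- '<table>
  <tr><th>Family:</th><td>
      Poaceae   </td></tr>
  <tr><th>Lifeform:</th><td>Hemicr.  or
      rhizome geophyte</td></tr>
</table>'

demo_pg <- read_html(demo_html)

# Collapse any run of whitespace to a single space, then trim both ends.
clean <- function(x) trimws(gsub("[[:space:]]+", " ", x))

labels <- clean(html_text(html_nodes(demo_pg, "th")))
values <- clean(html_text(html_nodes(demo_pg, "th + td")))

# Name each value by its header cell so fields can be looked up by label.
fields <- setNames(values, labels)
print(fields)
```

Naming the values this way assumes every `th` is followed by exactly one `td`; if the real page breaks that pattern, the two vectors will not line up.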