Scraping a dynamic table in R using POST

Asked: 2017-06-01 16:49:56

Tags: r post web-scraping

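I am trying to scrape this table using R. So far I have only managed to get 27 rows with the code below. I would like to get all entries and, ideally, modify the request so that I can select certain years, etc. Other questions on SO address slightly different situations, and I would like to stay in the rvest-xml2-httr world if possible.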

library(magrittr)  # provides the %>% pipe used below

url <- "http://myfwc.com/wildlifehabitats/managed/alligator/harvest/data-export/"


# Fetch the page and pull out the hidden __VIEWSTATE value the form expects
view <- httr::POST(url) %>% 
  xml2::read_html() %>% 
  rvest::html_nodes("input[name='__VIEWSTATE']") %>% 
  rvest::html_attr("value")

param <- list(`__EVENTTARGET` =     "",
               `__EVENTARGUMENT` =  "",
               `__VIEWSTATE` = view,
               `ctl00$ctl00$ctl00$ctl00$ctl00$ContentPlaceHolderDefault$pageBody$pageBody$rightColumn$ctl01$AlligatorHarvestExport_6$RadGrid1$ctl00$ctl02$ctl00$RefreshButton` = "",
               `ctl00$ctl00$ctl00$ctl00$ctl00$ContentPlaceHolderDefault$pageBody$pageBody$rightColumn$ctl01$AlligatorHarvestExport_6$RadGrid1$ctl00$ctl02$ctl03$FilterTextBox_Year` = "",
               `ctl00$ctl00$ctl00$ctl00$ctl00$ContentPlaceHolderDefault$pageBody$pageBody$rightColumn$ctl01$AlligatorHarvestExport_6$RadGrid1$ctl00$ctl02$ctl03$FilterTextBox_AreaNumber` = "",
               `ctl00$ctl00$ctl00$ctl00$ctl00$ContentPlaceHolderDefault$pageBody$pageBody$rightColumn$ctl01$AlligatorHarvestExport_6$RadGrid1$ctl00$ctl02$ctl03$FilterTextBox_AreaName` =   "",
               `ctl00$ctl00$ctl00$ctl00$ctl00$ContentPlaceHolderDefault$pageBody$pageBody$rightColumn$ctl01$AlligatorHarvestExport_6$RadGrid1$ctl00$ctl03$ctl01$PageSizeComboBox` = "10000",
               `ctl00_ctl00_ctl00_ctl00_ctl00_ContentPlaceHolderDefault_pageBody_pageBody_rightColumn_ctl01_AlligatorHarvestExport_6_RadGrid1_ctl00_ctl03_ctl01_PageSizeComboBox_ClientState` = "",
               `ctl00_ctl00_ctl00_ctl00_ctl00_ContentPlaceHolderDefault_pageBody_pageBody_rightColumn_ctl01_AlligatorHarvestExport_6_RadGrid1_rfltMenu_ClientState` = "",
               `ctl00_ctl00_ctl00_ctl00_ctl00_ContentPlaceHolderDefault_pageBody_pageBody_rightColumn_ctl01_AlligatorHarvestExport_6_RadGrid1_ClientState` =    "",
               `__VIEWSTATEGENERATOR` = "CA0B0334")

request <- httr::POST(url,
                       body = param,
                       encode = 'form') %>% 
  xml2::read_html() %>% 
  rvest::html_table(fill = TRUE)

tib <- request[[1]]

> dim(tib)
[1] 27  9

1 answer:

Answer 0 (score: 2)

The table in question has an "Export to CSV" link:

table/link image

If you click it, you get the 6.36 MB CSV file directly, which is nice. I assume you need or want to do this programmatically, so here is what worked for me:

Steps to programmatically "click" Export to CSV
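  1. I am using Firefox, but Chrome has a similar tool: the Inspector. I opened it (Ctrl-Shift-I) and went to the "Network" tab.
  2. Click the "Export to CSV" button. You should see a new "POST" row appear in the inspector pane. When it has finished ...
  3. Right-click the "POST" row and select "Copy POST Data"; this gives: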

    __EVENTTARGET
    __EVENTARGUMENT
    __VIEWSTATE=...
    ctl00$ctl00$ctl00$ctl00$ctl00$ContentPlaceHolderDefault$pageBody$pageBody$rightColumn$ctl01$AlligatorHarvestExport_6$RadGrid1$ctl00$ctl02$ctl00$ExportToCsvButton=+
    ctl00$ctl00$ctl00$ctl00$ctl00$ContentPlaceHolderDefault$pageBody$pageBody$rightColumn$ctl01$AlligatorHarvestExport_6$RadGrid1$ctl00$ctl02$ctl03$FilterTextBox_Year
    ctl00$ctl00$ctl00$ctl00$ctl00$ContentPlaceHolderDefault$pageBody$pageBody$rightColumn$ctl01$AlligatorHarvestExport_6$RadGrid1$ctl00$ctl02$ctl03$FilterTextBox_AreaNumber
    ctl00$ctl00$ctl00$ctl00$ctl00$ContentPlaceHolderDefault$pageBody$pageBody$rightColumn$ctl01$AlligatorHarvestExport_6$RadGrid1$ctl00$ctl02$ctl03$FilterTextBox_AreaName
    ctl00$ctl00$ctl00$ctl00$ctl00$ContentPlaceHolderDefault$pageBody$pageBody$rightColumn$ctl01$AlligatorHarvestExport_6$RadGrid1$ctl00$ctl03$ctl01$PageSizeComboBox=20
    ctl00_ctl00_ctl00_ctl00_ctl00_ContentPlaceHolderDefault_pageBody_pageBody_rightColumn_ctl01_AlligatorHarvestExport_6_RadGrid1_ctl00_ctl03_ctl01_PageSizeComboBox_ClientState
    ctl00_ctl00_ctl00_ctl00_ctl00_ContentPlaceHolderDefault_pageBody_pageBody_rightColumn_ctl01_AlligatorHarvestExport_6_RadGrid1_rfltMenu_ClientState
    ctl00_ctl00_ctl00_ctl00_ctl00_ContentPlaceHolderDefault_pageBody_pageBody_rightColumn_ctl01_AlligatorHarvestExport_6_RadGrid1_ClientState
    __VIEWSTATEGENERATOR=CA0B0334
    

    (I replaced the long base64 string with "...".) The line to note is the fourth one, ending in $ExportToCsvButton=+. This is the parameter you need to add to your POST data (param).
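If you would rather not retype the field names, the copied POST data can be turned into a named list programmatically. The following is only a minimal sketch: `parse_post_data` and `copied` are hypothetical names that are not part of httr, and the field names in the sample are shortened stand-ins for the long `ctl00$...` names shown above.

```r
# Hypothetical helper (not part of httr): convert the text copied via
# "Copy POST Data" into a named list that can be passed as `body =`.
parse_post_data <- function(txt) {
  lines <- trimws(strsplit(txt, "\n", fixed = TRUE)[[1]])
  lines <- lines[nzchar(lines)]
  # Split at the first "=" only, since values such as base64 may contain "=".
  eq <- regexpr("=", lines, fixed = TRUE)
  keys <- ifelse(eq > 0, substr(lines, 1, eq - 1), lines)
  vals <- ifelse(eq > 0, substr(lines, eq + 1, nchar(lines)), "")
  stats::setNames(as.list(vals), keys)
}

# Shortened, made-up sample; real field names are much longer.
copied <- "__EVENTTARGET
__VIEWSTATE=abc==
grid$PageSizeComboBox=20
grid$ExportToCsvButton=+"

param_from_browser <- parse_post_data(copied)
names(param_from_browser)
```

Values are kept exactly as copied, and fields without an "=" become empty strings, which matches how the empty fields are written in the `param` list of the question.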

  4. Using your code above, up to and including the definition of param, continue with:

    param$`ctl00$ctl00$ctl00$ctl00$ctl00$ContentPlaceHolderDefault$pageBody$pageBody$rightColumn$ctl01$AlligatorHarvestExport_6$RadGrid1$ctl00$ctl02$ctl00$ExportToCsvButton` <- "+"
    request <- httr::POST(url, body = param, encode = 'form')
    
  5. You now have:

    request
    # Response [http://myfwc.com/wildlifehabitats/managed/alligator/harvest/data-export/]
    #   Date: 2017-06-01 18:09
    #   Status: 200
    #   Content-Type: text/csv; charset-UTF-8;
    #   Size: 6.36 MB
    # <U+FEFF>"Year","Area Number","Area Name","Carcass Size","Harvest Date","Location"
    # "2000","101","LAKE PIERCE","11 ft. 5 in.","09-22-2000",""
    # "2000","101","LAKE PIERCE","9 ft. 0 in.","10-02-2000",""
    # "2000","101","LAKE PIERCE","8 ft. 10 in.","10-06-2000",""
    # "2000","101","LAKE PIERCE","8 ft. 0 in.","09-25-2000",""
    # "2000","101","LAKE PIERCE","8 ft. 0 in.","10-07-2000",""
    # "2000","101","LAKE PIERCE","8 ft. 0 in.","09-22-2000",""
    # "2000","101","LAKE PIERCE","7 ft. 2 in.","09-21-2000",""
    # "2000","101","LAKE PIERCE","7 ft. 1 in.","09-21-2000",""
    # "2000","101","LAKE PIERCE","6 ft. 11 in.","09-25-2000",""
    # ...
    

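    Side note: the website starts the file with <U+FEFF>, a Unicode character (the byte order mark). This throws off read.csv and gives you a column name of X.U.FEFF.Year, which is purely cosmetic.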
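If the odd column name bothers you, one option is to strip the leading U+FEFF before parsing. This is a small sketch on made-up sample text rather than the real response; `csv_text` and `clean_text` are illustrative names.

```r
# Two sample CSV lines with a leading U+FEFF, mimicking the start of the file.
csv_text <- paste0(
  "\ufeff",
  "\"Year\",\"Area Number\",\"Area Name\"\n",
  "\"2000\",\"101\",\"LAKE PIERCE\"\n"
)

# Drop the byte order mark, if present, before handing the text to read.csv.
clean_text <- sub("^\ufeff", "", csv_text)

df <- read.csv(text = clean_text)
names(df)
```

When reading from a saved file instead, passing `fileEncoding = "UTF-8-BOM"` to `read.csv` should have the same effect.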

    Saving to a file

    If you do not care about the suggested file name, you can simply do

    write(as.character(request), file="quux.csv")
    

    If you want to use the file name the website suggests, you can find it with:

    httr::headers(request)$`content-disposition`
    # [1] "inline;filename=\"FWCAlligatorHarvestData.csv\""
    

    Parsing it should be straightforward.
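One way to do that parsing is a single regular expression. This is a sketch only: `filename_from_disposition` is a hypothetical helper, the fallback name "download.csv" is an arbitrary choice, and the header value is hard-coded here to the one shown above.

```r
# Header value as shown above; in practice it would come from
# httr::headers(request)$`content-disposition`.
disposition <- "inline;filename=\"FWCAlligatorHarvestData.csv\""

# Hypothetical helper: extract the file name from a Content-Disposition value,
# falling back to a default when the header is missing or has no file name.
filename_from_disposition <- function(x, default = "download.csv") {
  if (is.null(x) || length(x) != 1L || is.na(x)) return(default)
  m <- regmatches(x, regexec("filename=\"?([^\";]+)\"?", x))[[1]]
  # basename() discards any directory part the server might have sent.
  if (length(m) < 2L) default else basename(m[2])
}

out_file <- filename_from_disposition(disposition)
out_file
```

The resulting name could then be used with something like `writeBin(httr::content(request, as = "raw"), out_file)`, which writes the response bytes unchanged.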

    Consuming it immediately

    If you do not want or need to save to an intermediate file, you can always consume it right away:

    head(read.csv(textConnection(as.character(request))))
    # Invalid encoding : defaulting to UTF-8.
    #   X.U.FEFF.Year Area.Number   Area.Name Carcass.Size Harvest.Date Location
    # 1          2000         101 LAKE PIERCE 11 ft. 5 in.   09-22-2000         
    # 2          2000         101 LAKE PIERCE  9 ft. 0 in.   10-02-2000         
    # 3          2000         101 LAKE PIERCE 8 ft. 10 in.   10-06-2000         
    # 4          2000         101 LAKE PIERCE  8 ft. 0 in.   09-25-2000         
    # 5          2000         101 LAKE PIERCE  8 ft. 0 in.   10-07-2000         
    # 6          2000         101 LAKE PIERCE  8 ft. 0 in.   09-22-2000
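Once the data is in a data frame, the text columns may be easier to work with after conversion. The sketch below uses two rows copied from the output above and assumes that every size follows the "N ft. M in." pattern and every date the month-day-year pattern; the column `Carcass.Inches` is an illustrative addition, not something the site provides.

```r
# Two rows copied from the output above, used as stand-in data.
df <- read.csv(text = '"Year","Area Number","Area Name","Carcass Size","Harvest Date","Location"
"2000","101","LAKE PIERCE","11 ft. 5 in.","09-22-2000",""
"2000","101","LAKE PIERCE","9 ft. 0 in.","10-02-2000",""', stringsAsFactors = FALSE)

# Dates are written month-day-year.
df$Harvest.Date <- as.Date(df$Harvest.Date, format = "%m-%d-%Y")

# Convert "11 ft. 5 in." to total inches; anything unexpected becomes NA.
parts <- regmatches(
  df$Carcass.Size,
  regexec("^([0-9]+) ft\\. ([0-9]+) in\\.$", df$Carcass.Size)
)
df$Carcass.Inches <- vapply(
  parts,
  function(m) if (length(m) == 3L) 12 * as.numeric(m[2]) + as.numeric(m[3]) else NA_real_,
  numeric(1)
)

df[, c("Harvest.Date", "Carcass.Size", "Carcass.Inches")]
```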