How do I get a table from an HTML form using rvest or httr?

Asked: 2016-09-26 21:12:00

Tags: r post web-scraping html-form rvest

I am using R, version 3.3.1. I am trying to scrape data from the following website:

http://plovila.pomorstvo.hr/

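As you can see, it is an HTML form. I want to select a "Tip objekta" (object type), for example "Jahta" (yacht), and enter a "NIB" (an integer, for example 93567). You can try it yourself: just select "Jahta" and type 93567 into the NIB field.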

The method is POST and the content type is application/x-www-form-urlencoded. I tried three different approaches: rvest, POST (from the httr package), and postForm (RCurl). My rvest code is:

library(rvest)

# Open a session and take the first form on the page
session <- html_session("http://plovila.pomorstvo.hr")
form <- html_form(session)[[1]]
# Dropdown value 2 is the one used here for "Jahta"
form <- set_values(form,  `ctl00$Content_FormContent$uiTipObjektaDropDown` = 2,
                    `ctl00$Content_FormContent$uiOznakaTextBox` = "",
                    `ctl00$Content_FormContent$uiNibTextBox` = 93567)
x <- submit_form(session, form)

When I run this code I get a 200 status, but I don't understand how to retrieve the results table that the site displays after the search.

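The next step would be to submit the "Detalji" button to get additional information, but I can't see any of that information in the submission output stored in x.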

1 answer:

Answer 0: (score: 3)

I used the curlconverter package to take the "Copy as cURL" data from the XHR POST request and automatically convert it to:

httr::VERB(verb = "POST", url = "http://plovila.pomorstvo.hr/", 
    httr::add_headers(Origin = "http://plovila.pomorstvo.hr", 
        `Accept-Encoding` = "gzip, deflate", 
        `Accept-Language` = "en-US,en;q=0.8", 
        `X-Requested-With` = "XMLHttpRequest", 
        Connection = "keep-alive", 
        `X-MicrosoftAjax` = "Delta=true", 
        Pragma = "no-cache", `User-Agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.34 Safari/537.36", 
        Accept = "*/*", `Cache-Control` = "no-cache", 
        Referer = "http://plovila.pomorstvo.hr/", 
        DNT = "1"), httr::set_cookies(ASP.NET_SessionId = "b4b123vyqxnt4ygzcykwwvwr"), 
    body = list(`ctl00$uiScriptManager` = "ctl00$Content_FormContent$ctl00|ctl00$Content_FormContent$uiPretraziButton", 
        ctl00_uiStyleSheetManager_TSSM = ";|635908784800000000:d29ba49:3cef4978:9768dbb9", 
        `ctl00$Content_FormContent$uiTipObjektaDropDown` = "2", 
        `ctl00$Content_FormContent$uiImeTextBox` = "", 
        `ctl00$Content_FormContent$uiNibTextBox` = "93567", 
        `__EVENTTARGET` = "", `__EVENTARGUMENT` = "", 
        `__LASTFOCUS` = "", `__VIEWSTATE` = "/wEPDwUKMTY2OTIzNTI1MA9kFgJmD2QWAgIDD2QWAgIBD2QWAgICD2QWAgIDD2QWAmYPZBYIAgEPZBYCZg9kFgZmD2QWAgIBDxAPFgYeDURhdGFUZXh0RmllbGQFD05heml2VGlwT2JqZWt0YR4ORGF0YVZhbHVlRmllbGQFDElkVGlwT2JqZWt0YR4LXyFEYXRhQm91bmRnZBAVBAAHQnJvZGljYQVKYWh0YQbEjGFtYWMVBAEwATEBMgEzFCsDBGdnZ2cWAQICZAIBDw8WAh4HVmlzaWJsZWdkFgICAQ8PFgIfA2dkZAICDw8WAh8DaGQWAgIBDw8WBB4EVGV4dGUfA2hkZAIHDzwrAA4CABQrAAJkFwEFCFBhZ2VTaXplAgoBFgIWCw8CCBQrAAhkZGRkZDwrAAUBBAUHSWRVcGlzYTwrAAUBBAUISWRVbG9za2E8KwAFAQQFBlNlbGVjdGRlFCsAAAspelRlbGVyaWsuV2ViLlVJLkdyaWRDaGlsZExvYWRNb2RlLCBUZWxlcmlrLldlYi5VSSwgVmVyc2lvbj0yMDEzLjMuMTExNC40MCwgQ3VsdHVyZT1uZXV0cmFsLCBQdWJsaWNLZXlUb2tlbj0xMjFmYWU3ODE2NWJhM2Q0ATwrAAcACyl1VGVsZXJpay5XZWIuVUkuR3JpZEVkaXRNb2RlLCBUZWxlcmlrLldlYi5VSSwgVmVyc2lvbj0yMDEzLjMuMTExNC40MCwgQ3VsdHVyZT1uZXV0cmFsLCBQdWJsaWNLZXlUb2tlbj0xMjFmYWU3ODE2NWJhM2Q0ARYCHgRfZWZzZGQWBB4KRGF0YU1lbWJlcmUeBF9obG0LKwQBZGZkAgkPZBYCZg9kFgJmD2QWIAIBD2QWBAIDDzwrAAgAZAIFDzwrAAgAZAIDD2QWBAIDDzwrAAgAZAIFDzwrAAgAZAIFD2QWAgIDDzwrAAgAZAIHD2QWBAIDDzwrAAgAZAIFDzwrAAgAZAIJD2QWBAIDDzwrAAgAZAIFDzwrAAgAZAILD2QWBgIDDxQrAAI8KwAIAGRkAgUPFCsAAjwrAAgAZGQCBw8UKwACPCsACABkZAIND2QWBgIDDxQrAAI8KwAIAGRkAgUPFCsAAjwrAAgAZGQCBw8UKwACPCsACABkZAIPD2QWAgIDDxQrAAI8KwAIAGRkAhEPZBYGAgMPPCsACABkAgUPPCsACABkAgcPPCsACABkAhMPZBYGAgMPPCsACABkAgUPPCsACABkAgcPPCsACABkAhUPZBYCAgMPPCsACABkAhcPZBYGAgMPPCsACABkAgUPPCsACABkAgcPPCsACABkAhkPPCsADgIAFCsAAmQXAQUIUGFnZVNpemUCBQEWAhYLZGRlFCsAAAsrBAE8KwAHAAsrBQEWAh8FZGQWBB8GZR8HCysEAWRmZAIbDzwrAA4CABQrAAJkFwEFCFBhZ2VTaXplAgUBFgIWC2RkZRQrAAALKwQBPCsABwALKwUBFgIfBWRkFgQfBmUfBwsrBAFkZmQCHQ88KwAOAgAUKwACZBcBBQhQYWdlU2l6ZQIFARYCFgtkZGUUKwAACysEATwrAAcACysFARYCHwVkZBYEHwZlHwcLKwQBZGZkAiMPPCsADgIAFCsAAmQXAQUIUGFnZVNpemUCBQEWAhYLZGRlFCsAAAsrBAE8KwAHAAsrBQEWAh8FZGQWBB8GZR8HCysEAWRmZAILD2QWAmYPZBYCZg9kFgICAQ88KwAOAgAUKwACZBcBBQhQYWdlU2l6ZQIFARYCFgtkZGUUKwAACysEATwrAAcACysFARYCHwVkZBYEHwZlHwcLKwQBZGZkZIULy2JISPTzELAGqWDdBkCVyvvKIjo/wm/iG9PT1dlU", 
        `__VIEWSTATEGENERATOR` = "CA0B0334", 
        `__PREVIOUSPAGE` = "jGgYHmJ3-6da6PzGl9Py8IDr-Zzb75YxIFpHMz4WQ6iQEyTbjWaujGRHZU-1fqkJcMyvpGRkWGStWuj7Uf3NYv8Wi0KSCVwn435kijCN2fM1", 
        `__ASYNCPOST` = "true", 
        `ctl00$Content_FormContent$uiPretraziButton` = "Pretraži"), 
    encode = "form") -> res

You can look at the result with:

httr::content(res, as="text") # returns raw HTML 

httr::content(res, as="parsed") # returns something you can use with `rvest` / `xml2`
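To go from that text to a data frame, something along these lines may work. This is only a minimal sketch: `sample_html` is a made-up stand-in for the string returned by `content(res, as="text")`, and it assumes the response contains an ordinary `<table>` element. The real markup from this site will differ and may need a more specific selector.

```r
library(rvest)  # read_html(), html_nodes(), html_table()

# Hypothetical stand-in for content(res, as = "text"); not the site's real output.
sample_html <- '
<div id="results">
  <table>
    <tr><th>NIB</th><th>Ime</th></tr>
    <tr><td>93567</td><td>EXAMPLE</td></tr>
  </table>
</div>'

page <- read_html(sample_html)

# html_table() on a set of nodes returns a list with one table per node.
tables <- html_table(html_nodes(page, "table"), header = TRUE)
result <- tables[[1]]
print(result)
```

If the page holds several tables, inspect `length(tables)` and pick the one you need by index.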

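Unfortunately, this is yet another useless SharePoint website of the kind that "eGov" sites around the world have bought into. That means you have to go through trial and error to figure out which of these parameters are actually necessary, since that differs on virtually every site. I tried a minimal set, to no avail.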
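One way to organize that trial and error is to drop one field at a time and see whether the request still works. The helper below is a hypothetical sketch, not part of httr or rvest: `required_fields()` and `fake_works()` are names invented here, and the "server" is faked so the example runs offline.

```r
# Return the names of the fields that cannot be dropped individually.
# `fields` is a named list of form fields; `works` is a function that takes
# such a list and returns TRUE when the resulting request is usable.
required_fields <- function(fields, works) {
  stopifnot(works(fields))               # the full set has to work to begin with
  needed <- character(0)
  for (nm in names(fields)) {
    trial <- fields[names(fields) != nm] # same list without one field
    if (!works(trial)) needed <- c(needed, nm)
  }
  needed
}

# Fake check (assumption for the demo): pretend only two fields matter.
fake_works <- function(f) all(c("__VIEWSTATE", "nib") %in% names(f))

demo <- list(`__VIEWSTATE` = "abc", `__LASTFOCUS` = "", nib = "93567")
print(required_fields(demo, fake_works))
```

For the real site, `works` would wrap the POST call and check that the response contains the results you expect. Note that this only tests single omissions, so fields that matter only in combination will not be detected.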

You may even have to issue a GET request to the main site first to establish a session.
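A rough sketch of that GET-then-POST idea follows. It reads fresh hidden fields (such as `__VIEWSTATE`) from the first response instead of hard-coding them. `hidden_fields()` and `search_vessel()` are names made up for this sketch, and whether the site accepts such a plain postback without the AJAX headers is an assumption I have not verified.

```r
library(rvest)

# Collect name/value pairs of all hidden <input> elements on a parsed page.
hidden_fields <- function(page) {
  inputs <- html_nodes(page, "input[type='hidden']")
  nms    <- html_attr(inputs, "name")
  values <- html_attr(inputs, "value")
  values[is.na(values)] <- ""            # inputs without a value attribute
  keep <- !is.na(nms)                    # skip inputs without a name
  as.list(setNames(values[keep], nms[keep]))
}

# Hypothetical wrapper, not run here: it needs network access, and the site
# may require more fields or headers than this sketch sends.
search_vessel <- function(nib, tip = "2", url = "http://plovila.pomorstvo.hr/") {
  # httr reuses one handle per host, so cookies set by the GET
  # (for example the ASP.NET session id) are sent again with the POST.
  first <- httr::GET(url)
  httr::stop_for_status(first)
  page <- read_html(httr::content(first, as = "text", encoding = "UTF-8"))

  body <- hidden_fields(page)
  body[["ctl00$Content_FormContent$uiTipObjektaDropDown"]] <- tip
  body[["ctl00$Content_FormContent$uiNibTextBox"]] <- as.character(nib)
  body[["ctl00$Content_FormContent$uiPretraziButton"]] <- "Pretraži"

  httr::POST(url, body = body, encode = "form")
}
```

Calling `res <- search_vessel(93567)` would then give a response to inspect in the same way as before. If the site rejects it, compare the fields sent here with the ones in the captured request above.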

But this should get you going in the right direction.