rvest: How to change a radio button value in a form

Date: 2019-01-10 17:28:30

Tags: r web-scraping rvest

I am doing web scraping with rvest and practicing on Tripadvisor. I cannot set the radio button to the appropriate value in order to get all the reviews:

library(rvest)
url <- "https://www.tripadvisor.com/Restaurant_Review-g187438-d12699400-Reviews-Trattoria_Mamma_Franca-Malaga_Costa_del_Sol_Province_of_Malaga_Andalucia.html"
session <- html_session(url)
pgform <- html_form(session)[[3]]

which gives the form

<form> 'taplc_location_review_filter_controls_0_form' (POST /SetReviewFilter#REVIEWS)
  <input checkbox> 'filterRating': 5
  <input checkbox> 'filterRating': 4
  <input checkbox> 'filterRating': 3
  <input checkbox> 'filterRating': 2
  <input checkbox> 'filterRating': 1
  <input hidden> 'filterRating': 
  <input checkbox> 'filterSegment': 3
  <input checkbox> 'filterSegment': 2
  <input checkbox> 'filterSegment': 5
  <input checkbox> 'filterSegment': 1
  <input checkbox> 'filterSegment': 4
  <input hidden> 'filterSegment': 
  <input checkbox> 'filterSeasons': 1
  <input checkbox> 'filterSeasons': 2
  <input checkbox> 'filterSeasons': 3
  <input checkbox> 'filterSeasons': 4
  <input hidden> 'filterSeasons': 
  <input radio> 'filterLang': ALL
  <input radio> 'filterLang': en
  <input radio> 'filterLang': es
  <input radio> 'filterLang': it
  <input radio> 'filterLang': fr
  <input radio> 'filterLang': nl
  <input radio> 'filterLang': ru
  <input radio> 'filterLang': sv
  <input radio> 'filterLang': da
  <input radio> 'filterLang': de
  <input radio> 'filterLang': no
  <input radio> 'filterLang': pl
  <input radio> 'filterLang': pt
  <input hidden> 'returnTo': #REVIEWS

I want to set filterLang to ALL:

filledform <- set_values(pgform,
                         filterLang = "ALL")
submit_form(session,filledform)

which gives me the error:

Error: Could not find possible submission target.

Which submission method should I use? Can I do this with rvest, or should I try something like this?

1 Answer:

Answer 0 (score: 1)

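The error message you get is not related to the radio button. It is caused by the fact that the form you want to submit lacks a submit button, which rvest looks for when it tries to submit the form.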
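To see why the submission fails, it can help to list the field types of the form and check for a submit field. The following is a minimal sketch in base R: `mock_form` is a hypothetical stand-in that only mimics the `fields` structure of the object returned by `html_form()`, and the helpers `field_types()` and `has_submit_field()` are illustrative names, not part of rvest. It assumes that only fields of type `submit` count as submission targets; the exact rule may differ between rvest versions.

```r
# Hypothetical stand-in for html_form(session)[[3]]: a list whose `fields`
# element holds one list per input, each carrying a `type`.
mock_form <- list(
  name = "taplc_location_review_filter_controls_0_form",
  fields = list(
    filterLang = list(name = "filterLang", type = "radio",  value = "ALL"),
    returnTo   = list(name = "returnTo",   type = "hidden", value = "#REVIEWS")
  )
)

# Collect the declared type of every field (NA when a field has no type).
field_types <- function(form) {
  vapply(
    form$fields,
    function(f) if (is.null(f$type)) NA_character_ else tolower(f$type),
    character(1)
  )
}

# TRUE when at least one field has type "submit" (assumed submission target).
has_submit_field <- function(form) {
  any(field_types(form) == "submit", na.rm = TRUE)
}

has_submit_field(mock_form)  # FALSE: the form has no submit field
```

Running the same check on the real `pgform` should show only checkbox, radio and hidden fields, which matches the form listing in the question.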

As a workaround, you can change the type of the field returnTo to submit and set its value to the URL of the page itself, like this:

pgform$fields[['returnTo']]$type = 'submit'
pgform$fields[['returnTo']]$value = url
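The two assignments above can be wrapped in a small helper so that the original form object stays untouched. This is only a sketch: `promote_to_submit()` is a hypothetical name, and `mock_form` is a hand-built list that imitates the `fields` structure of an rvest form so the example runs without network access.

```r
# Return a copy of `form` in which `field` acts as the submit target.
# `field` is the name of an existing field and `target` the value to send.
promote_to_submit <- function(form, field, target) {
  if (is.null(form$fields[[field]])) {
    stop("No field named '", field, "' in the form.", call. = FALSE)
  }
  form$fields[[field]]$type  <- "submit"
  form$fields[[field]]$value <- target
  form
}

# Hypothetical stand-in for html_form(session)[[3]].
mock_form <- list(
  fields = list(
    filterLang = list(name = "filterLang", type = "radio",  value = "ALL"),
    returnTo   = list(name = "returnTo",   type = "hidden", value = "#REVIEWS")
  )
)

patched <- promote_to_submit(mock_form, "returnTo", "https://www.tripadvisor.com/")
patched$fields[["returnTo"]]$type    # "submit"
mock_form$fields[["returnTo"]]$type  # still "hidden"; the input is not modified
```

With the real form, the call would be `promote_to_submit(pgform, "returnTo", url)`.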

Then you can set the language option as expected, e.g.

filledform <- set_values(pgform, filterLang = 'it')

filledform <- set_values(pgform, filterLang = 'ALL')

should set the language filter to Italian or to all languages, respectively.

Similarly to the case linked here, when you submit the form like this:

url <- 'https://www.tripadvisor.com/Restaurant_Review-g187438-d12699400-Reviews-Trattoria_Mamma_Franca-Malaga_Costa_del_Sol_Province_of_Malaga_Andalucia.html'
session <- html_session(url)
pgform <- html_form(session)[[3]]
pgform$fields[['returnTo']]$type = 'submit'
pgform$fields[['returnTo']]$value = url
filledform <- set_values(pgform, filterLang = 'ALL')
result <- submit_form(session, filledform)

you get the whole page back, whereas the following code returns only the content:

url <- 'https://www.tripadvisor.com/Restaurant_Review-g187438-d12699400-Reviews-Trattoria_Mamma_Franca-Malaga_Costa_del_Sol_Province_of_Malaga_Andalucia.html'
session <- html_session(url)
pgform <- html_form(session)[[3]]
pgform$fields[['returnTo']]$type = 'submit'
pgform$fields[['returnTo']]$value = url
filledform <- set_values(pgform, filterLang = 'ALL')
result <- submit_form(session, filledform, submit = NULL, httr::add_headers('x-requested-with' = 'XMLHttpRequest'))

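Since you are trying to interact with a fairly complex website that makes heavy use of JavaScript and XMLHttpRequest, it may be better to switch from rvest to a tool with better support for such techniques, for example RSelenium.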
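A rough sketch of the RSelenium route is shown below. It is untested against the live site and rests on several assumptions: a local Firefox and driver that `rsDriver()` can start, a radio input that is actually clickable (it may be hidden behind a styled label), and a fixed sleep as a crude substitute for a proper wait. `radio_selector()` and `fetch_with_language()` are hypothetical helper names; the selector is built from the `filterLang` field name and value shown in the form listing.

```r
# Build a CSS selector for one radio input, e.g. filterLang = "ALL".
radio_selector <- function(name, value) {
  sprintf("input[type='radio'][name='%s'][value='%s']", name, value)
}

# Hypothetical helper: open the page in a real browser, click the radio
# button, and return the rendered HTML as a single string.
fetch_with_language <- function(url, lang = "ALL", wait_seconds = 3) {
  driver <- RSelenium::rsDriver(browser = "firefox", verbose = FALSE)
  # Always close the browser and stop the Selenium server on exit.
  on.exit({
    driver$client$close()
    driver$server$stop()
  }, add = TRUE)

  browser <- driver$client
  browser$navigate(url)

  radio <- browser$findElement(
    using = "css selector",
    value = radio_selector("filterLang", lang)
  )
  radio$clickElement()

  Sys.sleep(wait_seconds)  # crude wait for the XMLHttpRequest to finish
  browser$getPageSource()[[1]]
}

# Usage (not run here; requires RSelenium and a working browser driver):
# page <- xml2::read_html(fetch_with_language(url, "ALL"))
```

The returned page source can then be parsed with the usual rvest functions such as `html_nodes()` and `html_text()`.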