I am using rvest for web scraping and am practicing on Tripadvisor. I cannot set the radio button to the appropriate value so that I get all reviews:
library(rvest)
url <- "https://www.tripadvisor.com/Restaurant_Review-g187438-d12699400-Reviews-Trattoria_Mamma_Franca-Malaga_Costa_del_Sol_Province_of_Malaga_Andalucia.html"
session <- html_session(url)
pgform <- html_form(session)[[3]]
which gives the form
<form> 'taplc_location_review_filter_controls_0_form' (POST /SetReviewFilter#REVIEWS)
<input checkbox> 'filterRating': 5
<input checkbox> 'filterRating': 4
<input checkbox> 'filterRating': 3
<input checkbox> 'filterRating': 2
<input checkbox> 'filterRating': 1
<input hidden> 'filterRating':
<input checkbox> 'filterSegment': 3
<input checkbox> 'filterSegment': 2
<input checkbox> 'filterSegment': 5
<input checkbox> 'filterSegment': 1
<input checkbox> 'filterSegment': 4
<input hidden> 'filterSegment':
<input checkbox> 'filterSeasons': 1
<input checkbox> 'filterSeasons': 2
<input checkbox> 'filterSeasons': 3
<input checkbox> 'filterSeasons': 4
<input hidden> 'filterSeasons':
<input radio> 'filterLang': ALL
<input radio> 'filterLang': en
<input radio> 'filterLang': es
<input radio> 'filterLang': it
<input radio> 'filterLang': fr
<input radio> 'filterLang': nl
<input radio> 'filterLang': ru
<input radio> 'filterLang': sv
<input radio> 'filterLang': da
<input radio> 'filterLang': de
<input radio> 'filterLang': no
<input radio> 'filterLang': pl
<input radio> 'filterLang': pt
<input hidden> 'returnTo': #REVIEWS
I want to set filterLang to ALL:
filledform <- set_values(pgform,
filterLang = "ALL")
submit_form(session,filledform)
This gives me the error:
Error: Could not find possible submission target.
How should I submit the form? Can I do this with rvest, or should I try something like this?
Answer 0 (score: 1)
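The error message you receive has nothing to do with the radio buttons. It is raised because the form you want to submit has no submit button, which rvest looks for when it tries to submit the form.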
As an example of a workaround, you can change the type of the field returnTo to submit and set its value to the URL of the page itself, like this:
pgform$fields[['returnTo']]$type = 'submit'
pgform$fields[['returnTo']]$value = url
You can then set the language option as expected, e.g.
filledform <- set_values(pgform, filterLang = 'it')
or
filledform <- set_values(pgform, filterLang = 'ALL')
which should set the language filter to Italian or to all languages, respectively.
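To illustrate why changing the field type helps, here is a minimal offline sketch. It assumes that rvest only accepts fields of type submit, image, or button as submission targets, which matches the error above. The HTML snippet, `toy_form`, and the helper `is_submit_field` are made up for illustration and are not part of rvest or of the Tripadvisor page.

```r
library(rvest)

# Hypothetical miniature of the filter form: two radio buttons and a hidden
# field, but no submit button (as in the form printed in the question).
doc <- xml2::read_html('
  <form action="/SetReviewFilter" method="post">
    <input type="radio" name="filterLang" value="ALL">
    <input type="radio" name="filterLang" value="it">
    <input type="hidden" name="returnTo" value="#REVIEWS">
  </form>')
toy_form <- html_form(doc)[[1]]

# Illustrative helper (not an rvest function): flags every field whose type
# is assumed to be usable as a submission target.
is_submit_field <- function(form) {
  vapply(form$fields,
         function(f) tolower(f$type) %in% c("submit", "image", "button"),
         logical(1))
}

# Before the workaround: no field qualifies as a submission target.
has_target_before <- any(is_submit_field(toy_form))

# The workaround from the answer: relabel the hidden field as a submit field.
toy_form$fields[["returnTo"]]$type <- "submit"
has_target_after <- any(is_submit_field(toy_form))
```

Under these assumptions, `has_target_before` should be FALSE and `has_target_after` should be TRUE. The same check can be run on `pgform` to confirm that the real form has no submit field before it is patched.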
Similar to here, when you do something like this
url <- 'https://www.tripadvisor.com/Restaurant_Review-g187438-d12699400-Reviews-Trattoria_Mamma_Franca-Malaga_Costa_del_Sol_Province_of_Malaga_Andalucia.html'
session <- html_session(url)
pgform <- html_form(session)[[3]]
pgform$fields[['returnTo']]$type = 'submit'
pgform$fields[['returnTo']]$value = url
filledform <- set_values(pgform, filterLang = 'ALL')
result <- submit_form(session, filledform)
you get the whole page back, whereas with the following code you get only the review content
url <- 'https://www.tripadvisor.com/Restaurant_Review-g187438-d12699400-Reviews-Trattoria_Mamma_Franca-Malaga_Costa_del_Sol_Province_of_Malaga_Andalucia.html'
session <- html_session(url)
pgform <- html_form(session)[[3]]
pgform$fields[['returnTo']]$type = 'submit'
pgform$fields[['returnTo']]$value = url
filledform <- set_values(pgform, filterLang = 'ALL')
result <- submit_form(session, filledform, submit = NULL, httr::add_headers('x-requested-with' = 'XMLHttpRequest'))
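Since you are trying to interact with a fairly complex website that makes heavy use of JavaScript and XMLHttpRequest, it may be better to switch from rvest to an approach with better support for such techniques, such as RSelenium.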
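As a rough sketch of what the RSelenium route could look like, the function below starts a browser, clicks the language radio button, and returns the rendered page. It is untested against the live site. The function name `fetch_all_language_reviews`, the `wait_seconds` parameter, the choice of Firefox, and the CSS selector are all assumptions; the selector is guessed from the field name and value shown in the form output above.

```r
# Sketch only: drive a real browser so that the page's JavaScript runs.
# Requires the RSelenium and xml2 packages and a local Firefox installation.
fetch_all_language_reviews <- function(url, wait_seconds = 3) {
  driver <- RSelenium::rsDriver(browser = "firefox", verbose = FALSE)
  # Always shut down the browser and the Selenium server, even on error.
  on.exit({
    driver$client$close()
    driver$server$stop()
  }, add = TRUE)

  remote <- driver$client
  remote$navigate(url)

  # Assumed selector: the radio input named filterLang with value ALL.
  radio <- remote$findElement(
    using = "css selector",
    value = "input[name='filterLang'][value='ALL']"
  )
  radio$clickElement()

  # Crude fixed wait for the reviews to reload; a polling loop would be sturdier.
  Sys.sleep(wait_seconds)

  # Hand the rendered HTML back to xml2/rvest for parsing.
  xml2::read_html(remote$getPageSource()[[1]])
}
```

It would be called as `page <- fetch_all_language_reviews(url)`, after which the usual rvest functions can be applied to `page`.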