Trouble using rvest + polite to handle HTML forms, specifically parsing search results

Asked: 2019-04-19 13:45:41

Tags: r web-scraping rvest

I am trying to develop an R function to scrape search results from our online library. To test the function, I am trying it out on ordinary websites with search boxes: Google and Wikipedia. I am using the R packages polite, rvest, and tidyverse to connect and scrape.

I am following a simple algorithm:

  • Build the query
  • Submit the query and capture the results
  • Display the query results

I have the following requirements:

  • Runs entirely within RStudio
  • No API / back-end database access required
  • The end result is a usable data table in R, with no hard-coded values

The example assumes that, for whatever reason, I need data about the Moon. The goal is an R function of the form searchFunction("Moon").

library(rvest)
library(polite)  
library(tidyverse)

#establish the search engine where the form is located
gBow <- bow("https://www.google.com/")

#fill out form
gSearchForm <- scrape(gBow) %>%
  html_node("form") %>%
  html_form() %>%
  set_values(q = "Moon")

#get results of query
results <- submit_form(gBow, gSearchForm, submit = "btnG")

Format and display query results:

#display in browser works fine
resultsPath <- results %>% .[["url"]]
resultsPath %>% browseURL()

#scrape results throws an error "No scraping allowed here"
scrape(results)

#Nod does not grant "access" to the results page, which was my first thought
gSearchNod <- nod(gBow, resultsPath)
#Resulting session URL is still www.google.com, not the updated URL.
scrape(gSearchNod)
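
My current theory is that polite refuses because Google's robots.txt disallows the search-results path, not because of anything I did with the form. A diagnostic sketch using the robotstxt package (which, as I understand it, polite builds on; it is not loaded above, so this is an extra dependency):

library(robotstxt)

#check whether generic bots may fetch the results path
#I expect FALSE here, since Google's robots.txt disallows /search,
#which would explain "No scraping allowed here"
paths_allowed("https://www.google.com/search?q=Moon")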

#And yet I can still navigate the results page with the rvest commands just fine
results %>% 
  follow_link("Moon - Wikipedia") %>% 
  html_node(".infobox") %>% 
  html_table(fill=T) %>% 
  select("Stat"=X1, "Dist"=X2) %>% 
  filter(Stat %in% c("Perigee", "Apogee"))
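
For contrast, plain rvest with no politeness layer never consults robots.txt, so the same results URL can be fetched and parsed directly. A minimal sketch (the "a" selector is just for illustration):

#plain rvest, no robots.txt check: fetching the same URL raises no objection
read_html(resultsPath) %>%
  html_nodes("a") %>%
  length()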

So, to rule things out, let's try querying Wikipedia directly. Note that the display-URL step behaves differently here, but otherwise this is broadly the same problem as above.

#establish the search engine where the form is located
wikiBow <- bow("https://en.wikipedia.org/wiki/Main_Page")

#fill out form
wikiSearchForm <- scrape(wikiBow) %>%
  html_node("form") %>%
  html_form() %>%
  set_values(search = "Moon")

#get results of query
results <- submit_form(wikiBow, wikiSearchForm, submit = "fulltext")

Displaying it in the browser takes you to the internal search results page; it does not forward you on to the "wiki/Moon" page.

That isn't actually a problem, if only I could parse this page properly.

resultsPath <- results %>% .[["url"]]
resultsPath %>% browseURL()

#scrape results throws an error "No scraping allowed here", same error
scrape(results)

#Nod does not grant "access" to the results page, which was my first thought
wSearchNod <- nod(wikiBow, resultsPath)
scrape(wSearchNod)
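
The same diagnostic applies here, I think: the internal search results live under /w/, which Wikipedia's robots.txt disallows, while article pages under /wiki/ are allowed. A sketch, again assuming the robotstxt package:

library(robotstxt)

#I expect FALSE: /w/ (home of index.php?search=...) is disallowed
paths_allowed(resultsPath)

#whereas an article page under /wiki/ should come back TRUE
paths_allowed("https://en.wikipedia.org/wiki/Moon")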

#And yet I can still navigate the results page with the rvest commands just fine
results %>% 
  follow_link("Moon") %>% 
  html_node(".infobox") %>% 
  html_table(fill=T) %>% 
  select("Stat"=X1, "Dist"=X2) %>% 
  filter(Stat %in% c("Perigee", "Apogee"))

I can reproduce the error reliably across search engines, including the internal one I am designing this for. I believe the error has to do with permissions on the results page, but I can't seem to find a fix that isn't just hard-coding the URL directly.
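
For what it's worth, the closest I've come to avoiding a hard-coded URL is deriving the endpoint from the form itself. The names below (form_action, wSearchNod2) are mine, and I expect the final scrape() to be refused for the same reason as above:

#sketch: pull the search endpoint from the form's action attribute,
#so nothing about the results URL is typed in by hand
form_action <- scrape(wikiBow) %>%
  html_node("form") %>%
  html_attr("action")

wSearchNod2 <- nod(wikiBow, form_action)
scrape(wSearchNod2)  #presumably the same "No scraping allowed here"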

#the goal (functionally)

searchFunction <- function(searchTerm){
  s <- bow("www.internalLibrary.com")

  scrape(s) %>% 
    html_node("form") %>% 
    html_form() %>% 
    set_values(q = searchTerm) %>%
    submit_form(s, .) %>%
    ??????????????????????????????????????
    nod() %>%
    scrape() %>%
    consider_life_choices_that_led_me_here() %>%
    ??????????????????????????????????????
    html_node("#results") %>%
    html_table() %>%
    select("FY19" = cost, "Date" = approval) %>%
    data.frame() %>%
    return()
}

An alternative:
... %>%
xml_node("#results") %>%
xml_text() %>%
str_replace_all(...
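
Spelled out, that fallback might look like the sketch below; the "#results" selector and the whitespace regex are placeholders for illustration, not tested against any real results page:

#hypothetical text-based fallback: grab raw text, then clean it up
results %>%
  html_node("#results") %>%
  html_text() %>%
  str_replace_all("\\s+", " ") %>%
  str_trim()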

Either ending output is fine; usable data can be coerced as needed.

Please and thank you.

0 Answers:

No answers yet.