I'm trying to develop an R function to scrape search results from our online library. To test the functionality, I'm trying it out on ordinary websites with search boxes: Google and Wikipedia. I'm using the R packages polite, rvest, and tidyverse for the connection and the scraping.
I'm following a simple algorithm, laid out in the comments of the code below. The example assumes that, for whatever reason, I need data on the Moon. The goal is an R function of the form searchFunction("Moon").
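A version note, in case it matters for reproduction: everything below uses the pre-1.0 rvest form API. In rvest >= 1.0 the same steps are spelled session(), html_form_set(), and session_submit(), so a rough modern equivalent of the form-filling step (untested on my end) would be:
#sketch of the same form fill under rvest >= 1.0, not what I'm actually running
gSession <- session("https://www.google.com/")
gSearchForm <- gSession %>%
html_node("form") %>%
html_form() %>%
html_form_set(q = "Moon")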
library(rvest)
library(polite)
library(tidyverse)
#establish the search engine where the form is located
gBow <- bow("https://www.google.com/")
#fill out form
gSearchForm <- scrape(gBow) %>%
html_node("form") %>%
html_form() %>%
set_values(q = "Moon")
#get results of query
results <- submit_form(gBow, gSearchForm, submit = "btnG")
Format and display query results
#display in browser works fine
resultsPath <- results %>% .[["url"]]
resultsPath %>% browseURL()
#scrape results throws an error "No scraping allowed here"
scrape(results)
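My working hypothesis is that polite is honoring Google's robots.txt for the results path, while plain rvest never checks it. That hypothesis is easy to probe with the robotstxt package; the FALSE below is what I'd expect if /search is disallowed for generic bots, not a captured output:
#check whether robots.txt permits the results path at all
library(robotstxt)
paths_allowed(paths = resultsPath)
#expecting FALSE here if Google disallows /search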
#Nod does not allow "access" to results page, which was my first thought
gSearchNod <- nod(gBow, resultsPath)
#Resulting session URL is still www.google.com, not the updated URL.
scrape(gSearchNod)
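One variation I haven't fully explored: nod() is documented as taking a path relative to the bow()ed host, and scrape() can carry query parameters itself. If permissions were the only issue, a sketch like this (where "search" and q are my guesses at Google's results path and parameter) ought to work:
#sketch: re-point the polite session at the results path, let scrape() pass the query
gSearchNod2 <- nod(gBow, path = "search")
scrape(gSearchNod2, query = list(q = "Moon"))
#presumably this still refuses if robots.txt disallows /search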
#And yet I can still navigate the results page with the rvest commands just fine
results %>%
follow_link("Moon - Wikipedia") %>%
html_node(".infobox") %>%
html_table(fill=T) %>%
select("Stat"=X1, "Dist"=X2) %>%
filter(Stat %in% c("Perigee", "Apogee"))
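For what it's worth, the two objects aren't the same kind of session, which may be part of why polite and rvest disagree here. The classes in the comments are my expectation from the polite docs, not verified output:
#compare session types: bow() builds a polite session, submit_form() a plain rvest one
class(gBow) #expecting something like "polite" "session"
class(results) #expecting a plain "session"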
So, to eliminate Google from the equation, let's query Wikipedia directly. Note that the display-URL step behaves slightly differently here, but otherwise this runs into broadly the same problem as above.
#establish the search engine where the form is located
wikiBow <- bow("https://en.wikipedia.org/wiki/Main_Page")
#fill out form
wikiSearchForm <- scrape(wikiBow) %>%
html_node("form") %>%
html_form() %>%
set_values(search = "Moon")
#get results of query
results <- submit_form(wikiBow, wikiSearchForm, submit = "fulltext")
Displaying the result in a browser takes you to Wikipedia's internal search results page rather than forwarding you straight on to the wiki/Moon article. That isn't actually a problem, if only I could parse this page properly:
resultsPath <- results %>% .[["url"]]
resultsPath %>% browseURL()
#scrape results throws an error "No scraping allowed here", same error
scrape(results)
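The robots.txt hypothesis is just as easy to check here; Wikipedia disallowing the /w/ script path would explain getting the identical error (again, FALSE is my expectation, not a captured output):
#same diagnostic as in the Google case
library(robotstxt)
paths_allowed(paths = resultsPath)
#expecting FALSE if /w/index.php is disallowed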
#Nod does not allow "access" to results page, which was my first thought
wSearchNod <- nod(wikiBow, resultsPath)
scrape(wSearchNod)
#And yet I can still navigate the results page with the rvest commands just fine
results %>%
follow_link("Moon") %>%
html_node(".infobox") %>%
html_table(fill=T) %>%
select("Stat"=X1, "Dist"=X2) %>%
filter(Stat %in% c("Perigee", "Apogee"))
I can reproduce this error reliably across search engines, including the internal one the function is actually being designed for. I assume the error has something to do with permissions on the results page, but I can't seem to find a way around it that doesn't amount to hard-coding the URLs directly.
#the goal (functionally)
searchFunction <- function(searchTerm) {
  s <- bow("www.internalLibrary.com")
  scrape(s) %>%
    html_node("form") %>%
    html_form() %>%
    set_values(q = searchTerm) %>%
    submit_form(s, .) %>%
    ??????????????????????????????????????
    nod() %>%
    scrape() %>%
    consider_life_choices_that_led_me_here() %>%
    ??????????????????????????????????????
    html_node("#results") %>%
    html_table() %>%
    select("FY19" = cost, "Date" = approval) %>%
    data.frame() %>%
    return()
}
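If nod() plus scrape(query = ...) behaves the way I hope, the whole thing might collapse into something like the sketch below. The host, the "search" path, the q parameter, and the cost/approval columns are all placeholders for our internal library, and I haven't managed to run anything like this end to end; that's the question:
#a possible shape, assuming the internal results path is allowed by robots.txt
searchFunction <- function(searchTerm) {
  s <- bow("https://www.internalLibrary.com/")
  nod(s, path = "search") %>%
    scrape(query = list(q = searchTerm)) %>%
    html_node("#results") %>%
    html_table() %>%
    select("FY19" = cost, "Date" = approval) %>%
    data.frame()
}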
An alternate:
... %>%
xml_node("#results") %>%
xml_text() %>%
str_replace_all(...
Either ending's output is fine; I can coerce the result into usable data as needed.
Please and thank you.