Web scraping with rvest: advice needed

Time: 2017-03-23 19:02:55

Tags: r web-scraping rvest

I am trying to scrape the website listed below. My initial code is:

library(rvest)

session = html_session("https://www.umass.edu/peoplefinder/")

session %>%
  html_form %>%
  .[[3]] %>%
  set_values(search_text = "John") %>%
  submit_form(session, .) %>%
  html_node("table") 

It doesn't seem to work at all. Does anyone have any suggestions?

3 Answers:

Answer 0: (score: 1)

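library(rvest)
library(jsonlite)

# Open a session on the search page, then POST the query to the JSON endpoint.
# request_POST is an unexported rvest function, hence the triple colon.
page <- html_session("https://www.umass.edu/peoplefinder")
details <- rvest:::request_POST(page, url = "https://www.umass.edu/peoplefinder/engine/", body = list("q" = "John"))

# Parse the JSON text of the response and convert it to a data frame.
s <- jsonlite::fromJSON(httr::content(details$response, "text"))
df <- as.data.frame(s)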

This gives you a usable data frame, df, for further processing.
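
To illustrate the JSON-to-data-frame step without hitting the server, here is a minimal offline sketch. The JSON string is made up: the top-level Results field mirrors the one accessed in another answer in this thread, but the Name and Email fields are assumptions and not the real response schema.

```r
library(jsonlite)

# Hypothetical response body; the fields the server really returns may differ.
sample_json <- '{"Results": [
  {"Name": "John Doe",   "Email": "jdoe@example.edu"},
  {"Name": "John Smith", "Email": "jsmith@example.edu"}
]}'

# fromJSON() simplifies an array of uniform objects into a data frame.
parsed <- fromJSON(sample_json)
df <- as.data.frame(parsed$Results)

nrow(df)   # one row per object in the Results array
names(df)  # column names come from the JSON keys
```

If the real response nests the rows under a field like this, converting that field rather than the whole parsed list tends to give a cleaner data frame.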

Answer 1: (score: 0)

The target page contains no table nodes. You can confirm this by looking for something else on the page, for example:

> session %>% html_form %>% .[[3]] %>% set_values(search_text = "John") %>% submit_form(session, .) %>% html_node("ul")
Submitting with 'pf_search'
{xml_node}
<ul class="menu">
[1] <li class="first leaf go-umass"><a title="" href="https://go.umass.edu/">Go.UMass</a></li>
[2] <li class="leaf email"><a title="" href="//www.oit.umass.edu/email">Email</a></li>
[3] <li class="leaf spire"><a title="" href="https://www.spire.umass.edu/">SPIRE</a></li>
[4] <li class="leaf moodle"><a title="" href="https://moodle.umass.edu/">Moodle</a></li>
[5] <li class="leaf umassonline"><a title="" href="https://uma.umassonline.net/">Blackboard Learn</a ...
[6] <li class="last leaf udrive"><a title="" href="https://udrive.oit.umass.edu/">UDrive</a></li>
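
A quick way to check whether a selector matches anything at all is to count the matches. The sketch below runs offline against a small hand-written HTML snippet that stands in for the real page (the snippet is an assumption, not the actual page source).

```r
library(rvest)

# Hypothetical stand-in for the page: a menu list but no table.
page <- read_html('<html><body>
  <ul class="menu"><li><a href="https://go.umass.edu/">Go.UMass</a></li></ul>
</body></html>')

# html_nodes() returns every match, so a length of zero means "no such node".
n_tables <- length(html_nodes(page, "table"))
n_lists  <- length(html_nodes(page, "ul"))

n_tables  # zero for this snippet
n_lists   # one for this snippet
```

Running the same count on the submitted session would show directly whether the results are missing from the returned HTML.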

Answer 2: (score: 0)

This gets the results:

library(rvest)
library(jsonlite)

# POST the query q to the search endpoint and return the Results element.
umass_people_find = function(q)
  "https://www.umass.edu/peoplefinder" %>%
    html_session %>%
    rvest:::request_POST(url = "https://www.umass.edu/peoplefinder/engine/",
                         body = list("q" = q)) %>%
    .$response %>%
    httr::content("text") %>%
    fromJSON %>%
    .$Results
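
Because rvest:::request_POST is an unexported function, it may not be available in every rvest version. A possible alternative is to call httr directly, sketched below. The names umass_people_find_httr and parse_people_json are hypothetical, and the sketch assumes the endpoint accepts the POST without cookies from a prior visit to the search page and still returns JSON with a Results field.

```r
library(httr)
library(jsonlite)

# Parse the JSON text returned by the search endpoint and pull out Results.
parse_people_json <- function(txt) {
  fromJSON(txt)$Results
}

# Hypothetical httr-only variant: POST the query, then parse the reply.
umass_people_find_httr <- function(q) {
  resp <- POST("https://www.umass.edu/peoplefinder/engine/", body = list(q = q))
  stop_for_status(resp)  # fail early on an HTTP error
  parse_people_json(content(resp, "text", encoding = "UTF-8"))
}

# Usage (needs network access, and assumes the endpoint is still live):
# people <- umass_people_find_httr("John")
# head(people)
```

Keeping the parsing in its own function makes it possible to test that step against a saved response without touching the network.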