Web scraping data from FamilySearch.org using R

Time: 2015-11-17 13:45:37

Tags: r web-scraping rvest

I am trying to use R, more specifically the rvest package, to scrape a table of Brazilian records from FamilySearch.org (see the URL below).

First I used SelectorGadget on the site. Depending on how I click, the selector it returns is either "#hr-data-table" or "td". Neither seems to work:

library(rvest)
url <- 'https://familysearch.org/search/record/results?count=75&englishSubcountryName=Brasil&query=%2Brecord_country%3ABrazil%20%2Brecord_subcountry%3A'
url %>% html() %>%  html_node("#hr-data-table") %>% html_text()

Error in xml_apply(x, XML::xmlValue, ..., .type = character(1)) : 
  Unknown input of class: NULL

url %>% html() %>%  html_node("td") %>% html_text()

[1] ""

#replacing html_text() with html_table() also does not work. 

Any ideas on how to make this work, preferably in R?

1 Answer:

Answer 0: (score: 2)

Posting this as a non-answer answer, since it needs more space than a comment allows.

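The site populates the table dynamically using XHR requests, so the HTML that rvest downloads does not contain the records yet. You will need to either use Selenium (RSelenium) or open Developer Tools and look at the requests being made (reload the page after opening Developer Tools, Firebug, or your browser's equivalent).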
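To illustrate why both selectors come back empty, here is a minimal sketch that runs rvest against an inline HTML string. The markup is a made-up stand-in for a page whose table rows are filled in later by JavaScript; it is not the actual FamilySearch source. It uses read_html(), which replaced the older html() function in later rvest releases.

```r
library(rvest)  # provides read_html(), html_node(), html_text(), and the %>% pipe

# Hypothetical stand-in for the HTML the server sends before any JavaScript
# runs: a table skeleton exists, but its cell is still empty and there is
# no element with the id "hr-data-table" yet.
static_html <- '<html><body><div id="results"><table><tbody><tr><td></td></tr></tbody></table></div></body></html>'

page <- read_html(static_html)

# The first <td> is present but has no text.
cell_text <- page %>% html_node("td") %>% html_text()

# The id selector matches nothing in the static markup.
missing <- page %>% html_node("#hr-data-table")
```

In this sketch `cell_text` is an empty string and `missing` is a "missing node" object, which mirrors the two symptoms in the question: an empty result for "td" and nothing usable for "#hr-data-table".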

Here is the "Copy as cURL" version of the XHR request for the table data:

curl 'https://familysearch.org/search/records?count=75&query=%2Brecord_country%3ABrazil' \
  -H 'Cookie: fs_experiments=u%3D-anon-%2Ca%3Dshared-ui%2Cs%3D4ad72a2f25d45cd9ae7c92e412c176e5%2Cv%3D010011001101000000011111111100111010110100001010000110011000110001011000000010011000000%2Cb%3D82%26a%3Dhome%2Cs%3D32f8f352ce4eaac984ab4a66aca8f354%2Cv%3D1101100110000110000000000100110011101%2Cb%3D19%26a%3Dcampaign%2Cs%3D296f4e3066991d8e9584fd6eb21e8c7c%2Cv%3D0101011111110010110001111%2Cb%3D34%26a%3Dsearch%2Cs%3D7acb79194da98cdfc4a29ecf17854668%2Cv%3D1111111110011110111111111111111101000000000000000000%2Cb%3D83; fs_search_history=https%3A//familysearch.org/search/record/results%3Fcount%3D75%26englishSubcountryName%3DBrasil%26query%3D%252Brecord_country%253ABrazil%2520%252Brecord_subcountry%253A' \
  -H 'Accept-Encoding: gzip, deflate, sdch' \
  -H 'Accept-Language: en-US,en;q=0.8' \
  -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36' \
  -H 'Content-Type: application/json' \
  -H 'Accept: */*' \
  -H 'Referer: https://familysearch.org/search/record/results?count=75&englishSubcountryName=Brasil&query=%2Brecord_country%3ABrazil%20%2Brecord_subcountry%3A' \
  -H 'X-Requested-With: XMLHttpRequest' \
  -H 'Connection: keep-alive' \
  --compressed

Some or all of these headers may be required. You will have to test it.
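As a rough starting point, the same request could be set up from R with the httr package. This is only a sketch: the URL and header values are copied from the cURL command, the choice of which headers to keep is a guess, `fetch_records` is a hypothetical helper name, and nothing here confirms that the site will accept the request or what format the response takes.

```r
library(httr)

# Endpoint and query string copied from the cURL command above.
xhr_url <- "https://familysearch.org/search/records?count=75&query=%2Brecord_country%3ABrazil"

# A subset of the headers from the cURL command. Which of them are actually
# required is unknown and has to be found by trial (the cookie is left out here).
xhr_headers <- c(
  `Content-Type`     = "application/json",
  `Accept`           = "*/*",
  `X-Requested-With` = "XMLHttpRequest",
  `Referer`          = "https://familysearch.org/search/record/results?count=75&englishSubcountryName=Brasil&query=%2Brecord_country%3ABrazil%20%2Brecord_subcountry%3A"
)

# Hypothetical helper: sends the request and returns the response body as text.
fetch_records <- function(url = xhr_url, headers = xhr_headers) {
  res <- GET(url, add_headers(.headers = headers))
  stop_for_status(res)  # raise an R error on HTTP 4xx/5xx
  content(res, as = "text", encoding = "UTF-8")
}

# Not run here: needs network access, and the site may reject the request.
# body <- fetch_records()
```

If the body turns out to be JSON (an assumption), something like `jsonlite::fromJSON(body)` would be the natural next step. Removing headers one at a time from `xhr_headers` is a simple way to find out which ones the server insists on.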

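It is clear that they have an API powering the site, so you may want to consider writing to them to see whether you can get access to the unofficial/private API and use it the same way you would use the scrape. Scraping usually puts more load on the server than API calls do.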

You can either work out the private API call requirements or use RSelenium to click through and scrape the rendered DOM elements.
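For the RSelenium route, a minimal sketch might look like the following. It assumes a Selenium server is already running on localhost port 4444 with Firefox available, and it uses a fixed pause as a crude way to wait for the XHR to finish. `scrape_rendered_cells` and its `wait_seconds` parameter are hypothetical names, and the "td" selector is simply the one from the question, so treat this as an outline rather than a tested solution for this site.

```r
library(RSelenium)
library(rvest)

# Hypothetical helper: load the page in a real browser, wait for the
# JavaScript to populate the table, then parse the rendered DOM with rvest.
scrape_rendered_cells <- function(url, wait_seconds = 5) {
  # Assumes a Selenium server is listening on localhost:4444.
  remDr <- remoteDriver(remoteServerAddr = "localhost",
                        port = 4444L,
                        browserName = "firefox")
  remDr$open()
  on.exit(remDr$close(), add = TRUE)  # always close the browser session

  remDr$navigate(url)
  Sys.sleep(wait_seconds)  # crude wait for the XHR-driven table to render

  # getPageSource() returns a list whose first element is the HTML string.
  rendered <- remDr$getPageSource()[[1]]

  read_html(rendered) %>% html_nodes("td") %>% html_text()
}

# Not run here: needs a running Selenium server and network access.
# cells <- scrape_rendered_cells(url)
```

The point of going through the browser is that `getPageSource()` returns the DOM after the scripts have run, so the same selectors that failed against the static HTML have something to match.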