I am trying to use R, more specifically the rvest package, to scrape a table of Brazilian records from familySearch.org (see the URL below).
First I ran the "selector gadget" on the site. Depending on what I clicked, the selector it returned was either "#hr-data-table" or "td". Neither seems to work:
library(rvest)
url <- 'https://familysearch.org/search/record/results?count=75&englishSubcountryName=Brasil&query=%2Brecord_country%3ABrazil%20%2Brecord_subcountry%3A'
url %>% html() %>% html_node("#hr-data-table") %>% html_text()
Error in xml_apply(x, XML::xmlValue, ..., .type = character(1)) :
Unknown input of class: NULL
url %>% html() %>% html_node("td") %>% html_text()
[1] ""
#replacing html_text() with html_table() also does not work.
Any ideas on how to make this work, preferably in R?
Answer (score: 2)
Posting this as a non-answer answer since it needs more room than a comment allows.
The site populates the table dynamically with an XHR request. You will need either Selenium (RSelenium) or to open the browser developer tools (or Firebug) and look at the requests being made (reload the site after opening them).
Here is the "Copy as cURL" version of the XHR request for the table data:
curl 'https://familysearch.org/search/records?count=75&query=%2Brecord_country%3ABrazil' \
-H 'Cookie: fs_experiments=u%3D-anon-%2Ca%3Dshared-ui%2Cs%3D4ad72a2f25d45cd9ae7c92e412c176e5%2Cv%3D010011001101000000011111111100111010110100001010000110011000110001011000000010011000000%2Cb%3D82%26a%3Dhome%2Cs%3D32f8f352ce4eaac984ab4a66aca8f354%2Cv%3D1101100110000110000000000100110011101%2Cb%3D19%26a%3Dcampaign%2Cs%3D296f4e3066991d8e9584fd6eb21e8c7c%2Cv%3D0101011111110010110001111%2Cb%3D34%26a%3Dsearch%2Cs%3D7acb79194da98cdfc4a29ecf17854668%2Cv%3D1111111110011110111111111111111101000000000000000000%2Cb%3D83; fs_search_history=https%3A//familysearch.org/search/record/results%3Fcount%3D75%26englishSubcountryName%3DBrasil%26query%3D%252Brecord_country%253ABrazil%2520%252Brecord_subcountry%253A' \
-H 'Accept-Encoding: gzip, deflate, sdch' \
-H 'Accept-Language: en-US,en;q=0.8' \
-H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36' \
-H 'Content-Type: application/json' \
-H 'Accept: */*' \
-H 'Referer: https://familysearch.org/search/record/results?count=75&englishSubcountryName=Brasil&query=%2Brecord_country%3ABrazil%20%2Brecord_subcountry%3A' \
-H 'X-Requested-With: XMLHttpRequest' \
-H 'Connection: keep-alive' \
--compressed
Some or all of those headers may be needed; you will have to test it.
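If you would rather test that request from R than from the shell, a minimal httr sketch along these lines could work. It is assumption-laden: it presumes the endpoint returns JSON and that you paste in a fresh cookie value captured from your own browser session (the one shown above is session-specific and will have expired).

library(httr)
library(jsonlite)

# Hypothetical reconstruction of the XHR request shown in the cURL command above.
resp <- GET(
  "https://familysearch.org/search/records?count=75&query=%2Brecord_country%3ABrazil",
  add_headers(
    Accept             = "*/*",
    `Content-Type`     = "application/json",
    `X-Requested-With` = "XMLHttpRequest",
    Referer            = "https://familysearch.org/search/record/results?count=75&englishSubcountryName=Brasil&query=%2Brecord_country%3ABrazil%20%2Brecord_subcountry%3A",
    `User-Agent`       = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
  ),
  set_cookies(fs_experiments = "PASTE-A-FRESH-COOKIE-VALUE-HERE")  # replace with your own cookie
)
stop_for_status(resp)

# If the response really is JSON, parse it into R structures and inspect it.
dat <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
str(dat, max.level = 1)

Drop headers from add_headers() one at a time to see which ones the server actually insists on.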
They pretty obviously have an API powering the site, so you might consider writing to them to see whether you can get access to the unofficial/private API and use that directly (scraping usually puts more load on a server than API calls do).
You can either work out the requirements of that private API call, or use RSelenium to click through and scrape the "rendered" DOM elements, as sketched below.
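A minimal sketch of the RSelenium route might look like the following. It assumes a reasonably recent RSelenium (rsDriver() starts a local Selenium-driven browser for you), reuses the url defined in the question, and reuses the "#hr-data-table" selector, which may or may not still match the rendered page; the fixed Sys.sleep() is a crude stand-in for a proper wait on the table element.

library(RSelenium)
library(rvest)

# Start a local Selenium-driven browser (downloads the driver on first run).
rd    <- rsDriver(browser = "firefox", verbose = FALSE)
remDr <- rd$client

remDr$navigate(url)   # 'url' as defined in the question
Sys.sleep(10)         # crude wait for the XHR to populate the table

# Parse the fully rendered DOM with rvest and pull the table out of it.
page    <- read_html(remDr$getPageSource()[[1]])
results <- page %>% html_node("#hr-data-table") %>% html_table()

remDr$close()
rd$server$stop()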