Question

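I am doing research on World Bank (WB) projects in developing countries. To that end, I am scraping their website to collect the data I am interested in.

The pages I want to scrape are structured as follows: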

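1. Country list: the list of all countries in which WB has developed projects.

1.1. Clicking a single country in 1 gives that country's project list, which spans many web pages and includes all the projects in that country. Of course, I have only included one page for one country here, but every country has several pages dedicated to this.

1.1.1. Clicking a single project in 1.1 gives, among other things, the project's overview, which is the part I am interested in.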

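In other words, my problem is to find a way to build a data frame containing all the countries, the complete list of projects for each country, and the overview of each individual project.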
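To make the target shape concrete, here is a minimal sketch of the data frame I have in mind: one row per project, carrying its country and its overview text. Everything in it is a hypothetical stand-in (the country codes, `get_projects()`, and `get_overview()` are placeholders, not data or functions from the WB site); it only illustrates how the three levels would be combined.

```r
library(tibble)
library(purrr)
library(dplyr)

# Level 1: hypothetical country codes (placeholders, not real WB codes).
countries <- tibble(country_code = c("AA", "BB"))

# Level 1.1: hypothetical stand-in for "list the projects of one country".
get_projects <- function(code) {
  tibble(project_id = paste0(code, "-", 1:2),
         title      = paste("Project", 1:2, "in", code))
}

# Level 1.1.1: hypothetical stand-in for "read the overview of one project".
get_overview <- function(project_id) {
  paste("Overview of", project_id)
}

# Build one table per country, tag it with the country code, then stack them.
projects <- map(countries$country_code, function(code) {
  proj <- get_projects(code)
  proj$country_code <- code
  proj
}) %>% bind_rows()

# Add the overview for every project row.
projects$overview <- map_chr(projects$project_id, get_overview)
```

In a real scrape, the two stand-in functions would be replaced by functions that download and parse the corresponding pages.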

However, this is the code I have written so far (without success):

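library(rvest)   # read_html(), html_nodes(), html_node(), html_text(), html_attr()
library(purrr)   # map(), map_df()
library(dplyr)   # mutate(), %>%
library(tibble)  # tibble()

WB_links <- "http://projects.worldbank.org/country?lang=en&page=projects"

# Scrape the project search results for one country code
WB_proj <- function(x) {
  Sys.sleep(5)  # pause between requests
  url <- sprintf("http://projects.worldbank.org/search?lang=en&searchTerm=&countrycode_exact=%s", x)
  html <- read_html(url)

  tibble(title = html_nodes(html, ".grid_20") %>% html_text(trim = TRUE),
         project_url = html_nodes(html, ".grid_20") %>% html_attr("href"))
}

# Then try to fetch a description for each project_url
WB_scrape <- map_df(1:5, WB_proj) %>%
  mutate(study_description =
           map(project_url,
               ~ read_html(sprintf("http://projects.worldbank.org/search?lang=en&searchTerm=&countrycode_exact=%s", .x)) %>%
                 html_node() %>%
                 html_text()))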

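Any suggestions?

Note: I apologize if this question seems trivial, but I am new to R and did not find help by looking around (although I may well have missed something).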
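For reference, the three-level crawl described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not a tested solution for the WB site: the CSS selectors `a.country`, `a.project`, and `.overview` are hypothetical placeholders, the helper names `extract_links()` and `scrape_levels()` are made up for this example, links are assumed to be absolute URLs, and pagination of the per-country project list is not handled. The `fetch` argument exists so the logic can be tried on in-memory HTML before pointing it at a live site.

```r
library(rvest)
library(purrr)
library(tibble)
library(dplyr)

# Return the link text and href of every node matching a CSS selector.
extract_links <- function(page, css) {
  nodes <- html_nodes(page, css)
  tibble(text = html_text(nodes, trim = TRUE),
         url  = html_attr(nodes, "href"))
}

# Walk the levels: country list -> project list per country -> overview per project.
# The selectors are hypothetical and must be adapted to the real pages.
scrape_levels <- function(start_url, fetch = read_html, pause = 5) {
  countries <- extract_links(fetch(start_url), "a.country")
  per_country <- map(seq_len(nrow(countries)), function(i) {
    Sys.sleep(pause)  # be polite between requests
    projects <- extract_links(fetch(countries$url[i]), "a.project")
    projects$country <- rep(countries$text[i], nrow(projects))
    projects$overview <- map_chr(projects$url, function(u) {
      Sys.sleep(pause)
      html_text(html_node(fetch(u), ".overview"), trim = TRUE)
    })
    projects
  })
  bind_rows(per_country)
}

# Try the logic on made-up in-memory pages instead of the live site.
pages <- list(
  start = '<a class="country" href="c1">Country A</a>',
  c1    = '<a class="project" href="p1">Road</a><a class="project" href="p2">Water</a>',
  p1    = '<div class="overview">Overview one</div>',
  p2    = '<div class="overview">Overview two</div>'
)
fake_fetch <- function(url) read_html(pages[[url]])
result <- scrape_levels("start", fetch = fake_fetch, pause = 0)
```

Passing the page loader in as an argument keeps the crawling logic separate from the network access; swapping `fake_fetch` for the default `read_html` would run the same steps against real URLs once the selectors match the actual markup.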

Scraping web pages at different levels

0 answers: