R web scraping: charts across multiple pages

Date: 2021-02-19 14:24:45

Tags: r web-scraping gplots

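Maybe this topic has been covered in other posts, but I could not find a solution to my problem. I am trying to scrape data from https://tradingeconomics.com/indicators. Specifically, I want the indicator data: the country names and the plots shown on each country's page.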

tradec = function(tradelink) {
  trade_page = read_html(tradelink)
  trade_element = trade_page %>% html_nodes(".primary_photo+ td a") %>%
    html_text() %>% paste(collapse = ",")
  return(trade_element)
}

main_page <- read_html("https://tradingeconomics.com/country-list/gdp-growth-rate")
country_list <- main_page %>%
  html_nodes("#ctl00_ContentPlaceHolder1_ctl01_UpdatePanel1 a") %>%
  html_text() %>%
  trimws %>%
  gsub(" ", "-", .)

tradec_df = data.frame()

for (i in country_list) {
  link = paste0("https://tradingeconomics.com/", i, "/gdp-growth")
  page = read_html(link)

  country = page %>% html_nodes("#SelectCountries") %>% html_text()
  tradec_charts = page %>% html_nodes("#ImageChart") %>% html_text

  tradec_df = rbind(tradec_df, data.frame(country, tradec_charts, stringsAsFactors = FALSE))
  print(paste("Page:", country_list))
}

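In an ideal world, I would like to print one page per country, containing the country name and the plot. I am fairly sure the plots can be scraped and displayed somehow, but I do not know how. Any suggestions?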

1 Answer:

Answer 0: (score: 1)

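It does not work because each element of country_list carries leading and trailing whitespace (carriage returns, newlines, and spaces), which is not valid in a URL: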

 [1] "\r\n                                        South Africa\r\n                                    "          
 [2] "\r\n                                        Peru\r\n                                    "                  
 [3] "\r\n                                        Botswana\r\n                                    "   

So all you need to do is strip those characters with trimws(), so that the values look like this:

country_list
 [1] "South Africa"           "Peru"                   "Botswana"               "India"                  "Turkey"                
 [6] "New Zealand"            "Argentina"              "Malta"                  "Slovenia"               "El Salvador"           
[11] "Ireland"                "Rwanda"                 "Albania"                "Luxembourg"             "Nigeria"               
[16] "Canada"                 "Jamaica"                "Uruguay"                "Brazil"                 "Paraguay"  
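
As a minimal sketch of that cleanup step, the snippet below runs trimws() on a few hard-coded strings. They are stand-ins shaped like the scraped values, not real scraped output. The gsub() call and the URL pattern are taken from the question's code; whether the site accepts this exact casing and hyphenation is an assumption that the thread does not confirm.

```r
# Stand-in values shaped like the raw html_text() output shown above
raw <- c(
  "\r\n                    South Africa\r\n                ",
  "\r\n                    Peru\r\n                "
)

# trimws() removes leading and trailing spaces, tabs, "\r" and "\n" by default
clean <- trimws(raw)

# Replace inner spaces with hyphens, as the question's code does
slug <- gsub(" ", "-", clean)

# Build one URL per country, following the pattern used in the question
links <- paste0("https://tradingeconomics.com/", slug, "/gdp-growth")

print(clean)
print(links)
```

Only base R is used here, so the cleanup can be checked without touching the network.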

This works. The only line I changed was adding a pipe into trimws():

library(tidyverse)
library(rvest)

# Helper from the question; it is not called in the loop below.
tradec = function(tradelink) {
  trade_page = read_html(tradelink)
  trade_element = trade_page %>% html_nodes(".primary_photo+ td a") %>%
    html_text() %>% paste(collapse = ",")
  return(trade_element)
}

main_page <- read_html("https://tradingeconomics.com/country-list/gdp-growth-rate")
country_list <- main_page %>%
  html_nodes("#ctl00_ContentPlaceHolder1_ctl01_UpdatePanel1 a") %>%
  html_text() %>%
  trimws()  # strip the leading/trailing "\r\n" and spaces from each name

tradec_df = data.frame()

for (i in country_list) {
  # Build the per-country URL from the cleaned name
  link = paste0("https://tradingeconomics.com/", i, "/gdp-growth")
  page = read_html(link)

  country = page %>% html_nodes("#SelectCountries") %>% html_text()
  tradec_links = page %>% html_nodes("#ImageChart") %>% html_text()
}
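
On the question of getting at the plots themselves: an image node normally has no text content, so html_text() on it tends to return an empty string. One possible approach is to read the node's src attribute with html_attr() instead. The sketch below assumes that #ImageChart is an <img> element whose src points at the chart image, which the thread does not confirm. The HTML string, the h1 element, and the example.com URL are hypothetical stand-ins, not the real page markup.

```r
library(rvest)

# Hypothetical stand-in for one country page; the real markup may differ
sample_html <- '<html><body>
  <h1 id="country">Peru</h1>
  <img id="ImageChart" src="https://example.com/charts/peru-gdp-growth.png">
</body></html>'

page <- read_html(sample_html)

country <- page %>% html_nodes("#country") %>% html_text()

# html_text() reads text content, which an <img> node does not have
chart_text <- page %>% html_nodes("#ImageChart") %>% html_text()

# html_attr() reads an attribute; here, the image location
chart_src <- page %>% html_nodes("#ImageChart") %>% html_attr("src")

# One row per country; rows like this could be appended inside the loop
row <- data.frame(country, chart_src, stringsAsFactors = FALSE)
print(row)
```

If the assumption holds for the real pages, the collected chart_src values could then be passed to download.file() to save each chart locally.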