Question

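I am using the rvest package in R to fetch a table from a web page. However, the details I get back are not formatted, and I also want to save them in a CSV file. My code is below. How can I view and save the results in Excel or CSV format?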

library(rvest)

url <- "https://www.moneycontrol.com/india/stockpricequote/metals-non-ferrous/hindustancopper/HC07"
url %>%
  read_html() %>%
  html_nodes('#mktdet_1') %>%
  html_text()

Answer 1

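Here is a general solution you can adapt. There are many ways to parse this information and store it in a data frame or write it to a text file; which one is best really depends on your use case. The first goal, however, is to extract each cell of the table into its own element of a vector. Your code is a good start. We can build on it by adding one extra CSS selector, which makes things much easier.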
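To see what the extra selector changes, here is a minimal offline sketch. The HTML fragment is made up for illustration: it only assumes that the container with id mktdet_1 holds cells carrying the classes PA7 and brdb, and the labels and numbers in it are invented, not taken from the site.

```r
library(rvest)

# Made-up fragment imitating the assumed layout of the page (not real data)
html <- '
<div id="mktdet_1">
  <div class="PA7 brdb"><div>MARKET CAP (Rs Cr)</div><div>1,234.56</div></div>
  <div class="PA7 brdb"><div>P/E</div><div>12.34</div></div>
</div>'

page <- read_html(html)

# Selecting only the container returns one long string for the whole block
whole <- page %>%
  html_nodes("#mktdet_1") %>%
  html_text()

# Adding the class selector returns one string per cell
cells <- page %>%
  html_nodes("#mktdet_1") %>%
  html_nodes(".PA7.brdb") %>%
  html_text()

length(whole)  # number of strings without the extra selector
length(cells)  # number of strings with the extra selector
```

With one string per cell, each label/value pair can be cleaned and split on its own rather than being picked out of a single block of text.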

library(rvest)
library(dplyr)
library(xml2)
library(stringr)

# Define the list of URLs to scrape
url_vec <- list(hindustan_copper = "https://www.moneycontrol.com/india/stockpricequote/metals-non-ferrous/hindustancopper/HC07",
                reliance = "https://www.moneycontrol.com/india/stockpricequote/refineries/relianceindustries/RI",
                dhcf = "https://www.moneycontrol.com/india/stockpricequote/finance-housing/dewanhousingfinancecorporation/DHF")

#Define empty dataframe
result_df = data.frame(name = character(),property = character(),value = numeric())

#For each url
for(name in names(url_vec)){
  table = url_vec %>%
    .[[name]] %>%               #Extract the URL
    read_html() %>%             # Read the HTML
    html_nodes('#mktdet_1')%>%  # Extract the table ID
    html_nodes(".PA7.brdb")%>%  # Extract each of the elements in the tables
    html_text() %>%             # Convert to text
    str_replace_all("[\t\r\n]", " ") %>%   # Replace tabs, carriage returns and newlines with spaces
    str_squish()  # Remove White space


  # The character class must be a-zA-Z; the range A-z also matches [ \ ] ^ _ and `
  text = gsub("^([a-zA-Z\\(\\)%/. ]+)[0-9,\\.%]+$","\\1",table) #Extract the property elements

  value = gsub("^[a-zA-Z\\(\\)%/. ]+([0-9,\\.%]+)$","\\1",table)  #Extract the numbers
  value_num = as.numeric(gsub("[%, ]","",value)) # Convert numbers in character format to numeric

  tbl = data.frame(name = rep(name,length(text)),property = text,value = value_num) #Create a temp dataframe
  result_df = rbind(result_df,tbl) #Row bind with the original dataframe

  #Deliverables are NA because they need to be extracted from the name. Use the appropriate regex to do this
}

write.csv(result_df, file = "stock_stats.csv", row.names = FALSE)

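The result stored in table is simply a character vector in which each element has its own index. text and value just separate the column label from the value. You can then store the result in whatever way suits your purpose.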
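As a rough illustration of that split, the sketch below runs the same kind of regular expressions on three invented strings and writes the result to a temporary CSV file. The strings, the numbers in them and the file name stock_stats_demo.csv are assumptions made up for this example; real cells from the page may need a different pattern.

```r
# Invented strings in the shape the scraping loop is assumed to produce
table <- c("MARKET CAP (Rs Cr) 1,234.56", "P/E 12.34", "DIV YIELD.(%) 1.50%")

# Label: leading run of letters, brackets, %, /, dots and spaces
property <- trimws(sub("^([a-zA-Z()%/. ]+)[0-9,.%]+$", "\\1", table))

# Value: trailing run of digits, commas, dots and percent signs
value <- sub("^[a-zA-Z()%/. ]+([0-9,.%]+)$", "\\1", table)

# Drop thousands separators and percent signs before converting to numeric
value_num <- as.numeric(gsub("[%, ]", "", value))

stats <- data.frame(property = property, value = value_num)

# Write to a temporary file so the example leaves nothing in the working directory
out_file <- file.path(tempdir(), "stock_stats_demo.csv")
write.csv(stats, out_file, row.names = FALSE)

# Read the file back to view the saved result
read.csv(out_file)
```

The resulting CSV can be opened directly in Excel. Any string that does not match the pattern is returned unchanged by sub(), so as.numeric() turns it into NA with a warning, which is how the NA rows mentioned in the code comment above arise.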

Saving a table scraped from a web page with rvest to a CSV file in R