Question

Can someone explain the R web scraping code below?

I found the following code on Stack Overflow for scraping Apple's financial information from Yahoo Finance.

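Specifically:

Step 2: how was the ".fi-row" node found? Using the Inspect feature in Google Chrome, I can't find it. How do you locate this node in practice?
Step 4: how does the code inside this loop actually work? It seems to be doing all of the scraping. Can anyone explain what is happening in this code?
Step 5: how are the headers extracted? The code seems extremely complicated.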
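To make the step 2 question concrete: one way to see which class names rvest actually receives is to list the class attributes in the parsed document and count the matches for a selector. The sketch below does this on a hand-written fragment; the markup is hypothetical and is not Yahoo's real page, which may be structured differently from what the browser inspector shows.

```r
library(rvest)

# Hypothetical markup standing in for the downloaded page; the real
# Yahoo Finance HTML is different and changes over time.
page <- read_html('
  <div class="fi-row"><span title="Total Revenue">Total Revenue</span></div>
  <div class="fi-row other"><span title="Net Income">Net Income</span></div>
  <div class="footer">ignored</div>
')

# Every class attribute in the parsed document, split into single class names.
classes <- page %>% html_nodes("[class]") %>% html_attr("class")
class_names <- sort(unique(unlist(strsplit(classes, "\\s+"))))
print(class_names)

# Number of nodes that the selector from step 2 matches in this document.
n_rows <- length(html_nodes(page, ".fi-row"))
print(n_rows)
```

If the count is zero on the real page, the class is probably not present in the HTML that read_html() downloaded, whatever the browser displays.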

Please note that I wrote the comments above each piece of code to help me understand it, so they may be incorrect.

library(rvest)
library(stringr)
library(magrittr)

## 1 ## read URL into HTML
url <- read_html('https://finance.yahoo.com/quote/AAPL/financials?p=AAPL')

## 2 ## set specific nodes
nodes <- url %>% html_nodes(".fi-row")

## 3 ## create blank dataframe
df = NULL

## 4 ## loop within nodes to extract tabular financial data
for (i in nodes) {
  r <- list(i %>% html_nodes("[title],[data-test='fin-col']") %>% html_text())
  df <- rbind(df, as.data.frame(matrix(r[[1]], ncol = length(r[[1]]), byrow = TRUE), stringsAsFactors = FALSE))
}

## 5 ## extract column heading names 
matches <- str_match_all(url %>% html_node('#Col1-3-Financials-Proxy') %>% html_text(), '\\d{1,2}/\\d{1,2}/\\d{4}')

## 6 ## combine custom column names with column names from step 5  
headers <- c('Breakdown','TTM', matches[[1]][,1]) 

## 7 ## set dataframe column names
names(df) <- headers

View(df)
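As far as I can tell, steps 4 to 7 can be reproduced offline on a small fragment, which makes each piece easier to inspect. The sketch below is only that: the HTML, the "#header" id and the numbers are made up for illustration, and only the two selectors and the date pattern are taken from the code above. It uses lapply() with do.call(rbind, ...) in place of the for loop, and matrix(cells, nrow = 1), which gives the same one-row matrix as ncol = length(cells) with byrow = TRUE.

```r
library(rvest)
library(stringr)

# Hypothetical fragment shaped like the table the code expects; the
# attribute names come from the selectors above, the values are made up.
page <- read_html('
  <div id="header">Breakdown ttm 9/30/2019 9/30/2018</div>
  <div class="fi-row">
    <span title="Total Revenue">Total Revenue</span>
    <div data-test="fin-col">100</div>
    <div data-test="fin-col">90</div>
    <div data-test="fin-col">80</div>
  </div>
  <div class="fi-row">
    <span title="Net Income">Net Income</span>
    <div data-test="fin-col">30</div>
    <div data-test="fin-col">25</div>
    <div data-test="fin-col">20</div>
  </div>
')

rows <- html_nodes(page, ".fi-row")

# Step 4: for each row, collect the label and the value cells as text,
# then turn that character vector into a one-row data frame.
row_frames <- lapply(rows, function(row) {
  cells <- row %>% html_nodes("[title],[data-test='fin-col']") %>% html_text()
  as.data.frame(matrix(cells, nrow = 1), stringsAsFactors = FALSE)
})
df <- do.call(rbind, row_frames)

# Step 5: pull every date-like token (d/m/yyyy shape) out of the header text.
header_text <- page %>% html_node("#header") %>% html_text()
dates <- str_match_all(header_text, "\\d{1,2}/\\d{1,2}/\\d{4}")[[1]][, 1]

# Steps 6 and 7: two fixed names followed by the extracted dates.
names(df) <- c("Breakdown", "TTM", dates)
print(df)
```

Every value ends up as character, because html_text() returns strings; converting the numeric columns would be a separate step.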

Any clarification is greatly appreciated.

Cheers

Understanding what is happening in this R web scraping code
