Scraping multiple tables from a web page in R

Time: 2015-04-29 06:00:22

Tags: r data.table screen-scraping

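I am trying to pull mutual fund data into R. My code works when a page contains a single table, but it fails when the web page contains multiple tables.

Link - https://in.finance.yahoo.com/q/pm?s=115748.BO

My code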

url <- "https://in.finance.yahoo.com/q/pm?s=115748.BO"
library(XML)
perftable <- readHTMLTable(url, header = T, which = 1, stringsAsFactors = F)

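But I get an error message.

Error in (function (classes, fdef, mtable) :
  unable to find an inherited method for function 'readHTMLTable' for signature '"NULL"'
In addition: Warning message:
XML content does not seem to be XML: 'https://in.finance.yahoo.com/q/pm?s=115748.BO'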

My questions are

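  1. How do I extract a specific table from this web page?
  2. How do I extract all tables from this web page?
  3. When there are multiple links, what is an easy way to extract a specific table from each web page?

    Ahttps://in.finance.yahoo.com/q/pm?s=115748.BO
    Ahttps://in.finance.yahoo.com/q/pm?s=115749.BO
    Ahttps://in.finance.yahoo.com/q/pm?s=115750.BO

    Remove the "A" from the links to use them.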

2 Answers:

Answer 0: (score: 4)

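Base R cannot access https. You can use a package such as RCurl. The headers on the tables are actually separate tables; the page is in fact made up of more than 30 tables. The data you want is most likely in the tables with class = yfnc_datamodoutline1.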
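The approach described above can be sketched as follows. This is a minimal illustration rather than the answer's original code: `page_src` is a hypothetical stand-in for the downloaded page source (for the live page, something like `RCurl::getURL(url)` would presumably supply that string), and the table contents are made up to resemble the real page.

```r
library(XML)

# Hypothetical stand-in for the page source; the real page would be
# fetched over https first (e.g. with RCurl::getURL) and passed in as text.
page_src <- '<html><body>
<table class="other"><tr><td>Header table</td></tr></table>
<table class="yfnc_datamodoutline1">
  <tr><td>Year-to-Date Return:</td><td>2.77%</td></tr>
  <tr><td>Number of Years Up:</td><td>4</td></tr>
</table>
</body></html>'

# Parse the string as HTML instead of handing readHTMLTable an https URL
doc <- htmlParse(page_src, asText = TRUE)

# Keep only the tables carrying the class named in the answer
nodes <- getNodeSet(doc, "//table[@class='yfnc_datamodoutline1']")

# Convert the first matching table node into a data frame
perftable <- readHTMLTable(nodes[[1]], header = FALSE, stringsAsFactors = FALSE)
perftable
```

Selecting by class rather than by position (`which = 1`) avoids picking up one of the many header tables by accident.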

Answer 1: (score: 2)

Here is an rvest version, with the added feature of extracting a specific table from each fund page:

library(rvest)
library(dplyr)

pages <- c("https://in.finance.yahoo.com/q/pm?s=115748.BO", 
           "https://in.finance.yahoo.com/q/pm?s=115749.BO",
           "https://in.finance.yahoo.com/q/pm?s=115750.BO")


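# Return the tab_idx-th inner table from every page in `sources`
extract_tab <- function(sources, tab_idx) {

  data <- lapply(sources, function(x) {

    # read_html() replaces the deprecated html()
    pg <- read_html(x)
    # inner tables nested inside the tables of class yfnc_datamodoutline1
    pg %>% html_nodes(xpath="//table[@class='yfnc_datamodoutline1']//table") -> tabs
    html_table(tabs[[tab_idx]])

  })

  # name each result by its fund code, e.g. "115748.BO"
  names(data) <- gsub("pm\\?s=", "", basename(sources))

  data

}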

extract_tab(pages, 1)

## $`115748.BO`
##                                      X1      X2
## 1            Morningstar Return Rating:    2.00
## 2                  Year-to-Date Return:   2.77%
## 3                5-Year Average Return:   9.76%
## 4                   Number of Years Up:       4
## 5                 Number of Years Down:       1
## 6  Best 1 Yr Total Return (2014-12-31):  37.05%
## 7 Worst 1 Yr Total Return (2011-12-31): -27.26%
## 8         Best 3-Yr Total Return (N/A):  23.11%
## 9        Worst 3-Yr Total Return (N/A):  -0.33%
## 
## $`115749.BO`
##                                      X1      X2
## 1            Morningstar Return Rating:    2.00
## 2                  Year-to-Date Return:   2.77%
## 3                5-Year Average Return:   9.77%
## 4                   Number of Years Up:       4
## 5                 Number of Years Down:       1
## 6  Best 1 Yr Total Return (2014-12-31):  37.05%
## 7 Worst 1 Yr Total Return (2011-12-31): -27.22%
## 8         Best 3-Yr Total Return (N/A):  23.11%
## 9        Worst 3-Yr Total Return (N/A):  -0.30%
## 
## $`115750.BO`
##                               X1    X2
## 1     Morningstar Return Rating:      
## 2           Year-to-Date Return: 1.95%
## 3         5-Year Average Return: 8.92%
## 4            Number of Years Up:      
## 5          Number of Years Down:      
## 6     Best 1 Yr Total Return ():   N/A
## 7    Worst 1 Yr Total Return ():   N/A
## 8  Best 3-Yr Total Return (N/A):   N/A
## 9 Worst 3-Yr Total Return (N/A):   N/A
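
For the second question (extracting all tables rather than one), the same XPath selection can be mapped over every match. The sketch below is only an illustration under assumptions: `pg` is built from a hypothetical inline HTML snippet so the example is self-contained, whereas a live page would be loaded with `read_html(url)`.

```r
library(rvest)

# Hypothetical stand-in for one fund page
pg <- read_html('<html><body>
<table class="yfnc_datamodoutline1"><tr><td>
  <table><tr><td>Year-to-Date Return:</td><td>2.77%</td></tr></table>
</td></tr></table>
<table class="yfnc_datamodoutline1"><tr><td>
  <table><tr><td>Number of Years Up:</td><td>4</td></tr></table>
</td></tr></table>
</body></html>')

# Same selector as in extract_tab above
tabs <- html_nodes(pg, xpath = "//table[@class='yfnc_datamodoutline1']//table")

# One data frame per matched table, kept together in a list
all_tabs <- lapply(tabs, html_table)
length(all_tabs)
```

Indexing the resulting list (`all_tabs[[2]]`, for instance) then gives any individual table without re-reading the page.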