Using R

Time: 2017-11-26 23:36:04

Tags: r web-scraping html-table html-parsing rvest

I want to scrape the sector weightings table from the following link:

http://portfolios.morningstar.com/fund/summary?t=SPY&region=usa&culture=en-US&ownerCountry=USA

The table I want is the sixth table in the page's source code. I have the following script written in R:

 library(rvest)
 turl = 'http://portfolios.morningstar.com/fund/summary?t=SPY'
 turlr = read_html(turl) 
 df6<-html_table(html_nodes(turlr, 'table')[[6]], fill = TRUE) 

However, when I run the last line of the script, I get the following error message:

Error in out[j + k, ] : subscript out of bounds

1 answer:

Answer 0 (score: 1)

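Because the table you need is laid out differently from a regular HTML table, rvest cannot format it into a proper table. With the XML package, however, you can do it quite easily.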

library(XML)
library(dplyr)

#read required table
turl = 'http://portfolios.morningstar.com/fund/summary?t=SPY'
temp_table <- readHTMLTable(turl)[[6]]

#process table to readable format
final_table <- temp_table %>%
  select(V2, V3, V4, V5) %>%
  na.omit() %>%
  `colnames<-` (c("","% Stocks","Benchmark","Category Avg")) %>%
  `rownames<-` (seq_len(nrow(.)))
final_table

The output is:

                          % Stocks Benchmark Category Avg
1                Cyclical                                
2         Basic Materials     2.79      3.16         3.22
3       Consumer Cyclical    11.06     11.42        11.15
4      Financial Services    16.39     16.50        17.22
5             Real Estate     2.24      3.18         2.00
6               Sensitive                                
7  Communication Services     3.56      3.37         3.50
8                  Energy     5.83      5.79         5.79
9             Industrials    10.37     10.89        11.70
10             Technology    22.16     21.41        19.72
11              Defensive                                
12     Consumer Defensive     8.20      7.60         8.56
13             Healthcare    14.24     13.57        14.57
14              Utilities     3.15      3.11         2.59
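
If you would rather stay with rvest, one possible workaround is to skip html_table() and build the data frame from the individual rows yourself. The following is only a minimal sketch: it runs against a hypothetical, simplified HTML snippet rather than the real page, and the four-cell filter and the column names (Sector, Stocks, Benchmark, CategoryAvg) are assumptions made for illustration.

```r
library(rvest)

# Hypothetical, simplified stand-in for the page's markup (NOT the real HTML):
# group header rows use a single spanning cell, data rows have four <td> cells.
html <- '<table>
  <tr><th colspan="4">Cyclical</th></tr>
  <tr><td>Basic Materials</td><td>2.79</td><td>3.16</td><td>3.22</td></tr>
  <tr><td>Real Estate</td><td>2.24</td><td>3.18</td><td>2.00</td></tr>
  <tr><th colspan="4">Sensitive</th></tr>
  <tr><td>Energy</td><td>5.83</td><td>5.79</td><td>5.79</td></tr>
</table>'

page <- read_html(html)
rows <- html_nodes(page, "table tr")

# Collect the text of the <td> cells, one character vector per row
cells <- lapply(rows, function(r) html_text(html_nodes(r, "td"), trim = TRUE))

# Keep only rows with exactly four <td> cells (assumed to be the data rows)
data_rows <- Filter(function(x) length(x) == 4, cells)

# Bind the rows into a data frame and convert the value columns to numeric
sector_df <- as.data.frame(do.call(rbind, data_rows), stringsAsFactors = FALSE)
names(sector_df) <- c("Sector", "Stocks", "Benchmark", "CategoryAvg")
sector_df[-1] <- lapply(sector_df[-1], as.numeric)
sector_df
```

On the real page you would replace the inline string with read_html(turl) and adjust the selector and the cell-count filter to match the actual markup. Dropping the group header rows (Cyclical, Sensitive, Defensive) also leaves the value columns numeric rather than character.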

Hope it helps!