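I want to scrape the sector weightings table from the following link:
http://portfolios.morningstar.com/fund/summary?t=SPY&region=usa&culture=en-US&ownerCountry=USA
The table I want is table 6 in the page source. I have the following script written in R: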
library(rvest)
turl = 'http://portfolios.morningstar.com/fund/summary?t=SPY'
turlr = read_html(turl)
df6<-html_table(html_nodes(turlr, 'table')[[6]], fill = TRUE)
However, when I run the last line of the script, I get the following error message:
Error in out[j + k, ] : subscript out of bounds
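Answer 0 (score: 1)
Because the table you need is laid out differently from a regular HTML table, rvest cannot format it into a proper data frame. With the XML package, however, you can do it easily.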
library(XML)
library(dplyr)
#read required table
turl = 'http://portfolios.morningstar.com/fund/summary?t=SPY'
temp_table <- readHTMLTable(turl)[[6]]
#process table to readable format
final_table <- temp_table %>%
select(V2, V3, V4, V5) %>%
na.omit() %>%
`colnames<-` (c("","% Stocks","Benchmark","Category Avg")) %>%
`rownames<-` (seq_len(nrow(.)))
final_table
The output is:
                          % Stocks Benchmark Category Avg
1                Cyclical
2         Basic Materials     2.79      3.16         3.22
3       Consumer Cyclical    11.06     11.42        11.15
4      Financial Services    16.39     16.50        17.22
5             Real Estate     2.24      3.18         2.00
6               Sensitive
7  Communication Services     3.56      3.37         3.50
8                  Energy     5.83      5.79         5.79
9             Industrials    10.37     10.89        11.70
10             Technology    22.16     21.41        19.72
11              Defensive
12     Consumer Defensive     8.20      7.60         8.56
13             Healthcare    14.24     13.57        14.57
14              Utilities     3.15      3.11         2.59
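If you want to try the clean-up pipeline without hitting the website, here is a minimal sketch that runs the same steps on a small hand-built data frame. The shape of the stand-in table (an empty filler column V1 and an all-NA spacer row) is only an assumption about what readHTMLTable returns for this page; the three data rows are copied from the output above.

```r
library(dplyr)

# Hypothetical stand-in for readHTMLTable(turl)[[6]]. The V1 filler column
# and the all-NA spacer row are assumptions about the raw table's shape.
temp_table <- data.frame(
  V1 = c("", NA, "", ""),
  V2 = c("Cyclical", NA, "Basic Materials", "Consumer Cyclical"),
  V3 = c("", NA, "2.79", "11.06"),
  V4 = c("", NA, "3.16", "11.42"),
  V5 = c("", NA, "3.22", "11.15"),
  stringsAsFactors = FALSE
)

final_table <- temp_table %>%
  select(V2, V3, V4, V5) %>%       # keep the label and the three value columns
  na.omit() %>%                    # drop rows that contain any NA (the spacer row)
  `colnames<-`(c("", "% Stocks", "Benchmark", "Category Avg")) %>%
  `rownames<-`(seq_len(nrow(.)))   # renumber rows 1..n after the drop

final_table
```

The backtick-quoted `colnames<-` and `rownames<-` are the ordinary replacement functions called in prefix form, which lets them sit inside a pipe. The value columns are still character at this point, so apply as.numeric to them if you need to do arithmetic.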
Hope it helps!