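I am trying to use R to extract all the tables from this page; I passed "table" to html_node. The output in the console looks strange: the data is available on the web page, but it shows up as NA in the R console. Please tell me where I went wrong.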
library(xml2)
library(rvest)
url <- "https://www.iii.org/table-archive/21110"
page <- read_html(url) #Creates an html document from URL
table <- html_table(page, fill = TRUE) #Parses tables into data frames
table
Partial output:
X4 X5 X6
1 Direct premiums written (1) Market share (2) 1
2 Market share (2) <NA> NA
3 10.6% <NA> NA
4 6.0 <NA> NA
5 5.4 <NA> NA
6 5.4 <NA> NA
7 5.2 <NA> NA
8 4.5 <NA> NA
9 3.3 <NA> NA
10 3.2 <NA> NA
11 3.0 <NA> NA
12 2.2 <NA> NA
X7 X8 X9 X10
1 State Farm Mutual Automobile Insurance $51,063,111 10.6% 2
2 <NA> <NA> <NA> NA
3 <NA> <NA> <NA> NA
4 <NA> <NA> <NA> NA
5 <NA> <NA> <NA> NA
6 <NA> <NA> <NA> NA
7 <NA> <NA> <NA> NA
8 <NA> <NA> <NA> NA
9 <NA> <NA> <NA> NA
10 <NA> <NA> <NA> NA
11 <NA> <NA> <NA> NA
12 <NA> <NA> <NA> NA
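Answer 0: (score: 1)
There are two problems with these tables.
First, I think you will get better results if you specify the class of the table, in this case .tablesorter.
Second, you will notice that in some tables the second column header is Group, while in others it is Group/company. That is what causes the NA values, so you need to rename the columns of all the tables to make them consistent.
You can get a list of tables with renamed column headers like this: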
tables <- page %>%
html_nodes("table.tablesorter") %>%
html_table() %>%
lapply(., function(x) setNames(x, c("rank", "group_company",
"direct_premiums_written", "market_share")))
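Looking at the web page, we can see that the tables are for 2017, 2008 to 2011, and 2013 to 2016. So we can add those years to the list as names, and then bind the tables together with a Year column: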
library(dplyr)
tables <- setNames(tables, c(2017, 2008:2011, 2013:2016)) %>%
bind_rows(.id = "Year")
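To illustrate why the renaming step matters, here is a minimal offline sketch. The two small data frames are made-up stand-ins for scraped tables (their values are not from the page), and it only assumes that dplyr is installed. Binding tables whose headers differ keeps both header variants as separate columns and pads the gaps with NA; aligning the names first avoids that.

```r
library(dplyr)

# Toy stand-ins for two scraped tables; the values are invented for illustration
t_a <- data.frame(Rank = 1:2, Group = c("A", "B"),
                  check.names = FALSE, stringsAsFactors = FALSE)
t_b <- data.frame(Rank = 1:2, `Group/company` = c("C", "D"),
                  check.names = FALSE, stringsAsFactors = FALSE)
toy_tables <- list(`2017` = t_a, `2016` = t_b)

# Headers differ, so bind_rows() keeps both columns and fills the gaps with NA
mismatched <- bind_rows(toy_tables, .id = "Year")

# Give every table the same column names first, then bind
aligned <- toy_tables %>%
  lapply(setNames, c("rank", "group_company")) %>%
  bind_rows(.id = "Year")
```

Here mismatched ends up with separate Group and Group/company columns, each half NA, while aligned has a single group_company column with no missing values.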
Answer 1: (score: 1)
This puts all the tables into a single data frame:
library(tidyverse)
library(rvest)
url <- "https://www.iii.org/table-archive/21110"
df <- url %>%
read_html() %>%
html_nodes("table") %>%
html_table(fill = TRUE) %>%
lapply(., function(x) setNames(x, c("Rank", "Company", "Direct_premiums_written",
"Market_share")))
tables <- data.frame()
for (i in seq(2,18,2)) {
temp <- df[[i]]
tables <- bind_rows(tables, temp)
}
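As a possible alternative to hard-coding seq(2, 18, 2), the well-formed tables could be picked by shape instead of by position. This is only a sketch in base R on a made-up list (scraped, combined and the toy values are hypothetical), and it assumes the wanted tables are exactly the ones with four columns:

```r
# Toy stand-in for the scraped list: malformed wide tables alternate
# with clean four-column ones (all values are invented)
scraped <- list(
  data.frame(matrix(NA, nrow = 2, ncol = 6)),
  data.frame(Rank = 1:2, Company = c("A", "B"),
             Direct_premiums_written = c("10", "9"),
             Market_share = c("5.0", "4.5"), stringsAsFactors = FALSE),
  data.frame(matrix(NA, nrow = 2, ncol = 6)),
  data.frame(Rank = 1:2, Company = c("C", "D"),
             Direct_premiums_written = c("8", "7"),
             Market_share = c("4.0", "3.5"), stringsAsFactors = FALSE)
)

# Keep only the tables that have exactly four columns
keep <- Filter(function(x) ncol(x) == 4, scraped)

# Stack the kept tables row-wise; their column names already match
combined <- do.call(rbind, keep)
```

Selecting by column count keeps working if the page gains or loses a table, whereas fixed indices would silently pick the wrong elements.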
You can then subset this as needed. For example, let's extract the information from the third table, which represents 2009:
table_2009 <- tables[21:30,] %>%
mutate(Year = 2009)
To add all the years at once:
years <- c(2017, 2008, 2009, 2010, 2011, 2013, 2014, 2015, 2016)
tables <- tables %>%
mutate(Year = rep(years, each = 10))
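The rep(years, each = 10) call relies on every table having exactly ten rows. A small base R sketch of what it produces (rows_per_table and year_col are names introduced here for illustration):

```r
# Years in the order the tables appear on the page, as listed above
years <- c(2017, 2008, 2009, 2010, 2011, 2013, 2014, 2015, 2016)

# Assumption: each table contributes exactly ten rows
rows_per_table <- 10

# Repeat each year ten times, in order: 2017 x 10, then 2008 x 10, and so on
year_col <- rep(years, each = rows_per_table)

# The third block of ten rows (21:30) belongs to the third table
third_block <- unique(year_col[21:30])
```

If a table ever had a different number of rows, mutate() would stop with a length mismatch, so comparing length(year_col) with nrow(tables) beforehand is a cheap sanity check.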
Hope this helps.
Answer 2: (score: 0)
There are multiple items in the list that you have named table. (Not good practice: there is a function by that name.)
str(tbl)
List of 18
$ :'data.frame': 12 obs. of 45 variables:
..$ X1 : chr [1:12] "Rank\nGroup/company\nDirect premiums written (1)\nMarket share (2)\n1\nState Farm Mutual Automobile Insurance\n"| __truncated__ "Rank" "1" "2" ...
..$ X2 : chr [1:12] "Rank" "Group/company" "State Farm Mutual Automobile Insurance" "Berkshire Hathaway Inc." ...
..$ X3 : chr [1:12] "Group/company" "Direct premiums written (1)" "$64,892,583" "38,408,251" ...
snipped rest of long output
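As an aside on the naming remark, here is a small base R sketch of what happens when a list is bound to the name table. The toy list is made up. Calls to table(...) still work, because R skips non-function bindings when it looks up a function being called, but the code becomes harder to read, which is why a different name such as tbl is used here.

```r
# Bind a list to the name `table`, as in the question (toy contents)
table <- list(first = data.frame(x = 1), second = data.frame(x = 2))

# Calling table(...) still reaches base::table, because R ignores
# non-function bindings when looking up a function that is being called
counts <- table(c("a", "b", "a"))

# A distinct name avoids the ambiguity altogether
tbl <- table
rm(table)
```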
Perhaps you only want the last one?
tbl[[18]]
Rank Group/company
1 1 State Farm Mutual Automobile Insurance
2 2 Berkshire Hathaway Inc.
3 3 Liberty Mutual
4 4 Allstate Corp.
5 5 Progressive Corp.
6 6 Travelers Companies Inc.
7 7 Chubb Ltd.
8 8 Nationwide Mutual Group
9 9 Farmers Insurance Group of Companies (3)
10 10 USAA Insurance Group
Direct premiums written (1) Market share (2)
1 $62,189,311 10.2%
2 33,300,439 5.4
3 32,217,215 5.3
4 30,875,771 5.0
5 23,951,690 3.9
6 23,918,048 3.9
7 20,786,847 3.4
8 19,756,093 3.2
9 19,677,601 3.2
10 18,273,675 3.0
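No, going back to the page, it is clear that you want the first one, but its structure appears to have been misinterpreted: the data has been laid out "wide", with all of it sitting in the first row. As a result, some columns display properly while the rest of the data looks garbled; just take columns 2:4: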
tbl[[1]][ ,c('X2','X3','X4')]
X2 X3
1 Rank Group/company
2 Group/company Direct premiums written (1)
3 State Farm Mutual Automobile Insurance $64,892,583
4 Berkshire Hathaway Inc. 38,408,251
5 Liberty Mutual 33,831,726
6 Allstate Corp. 31,501,664
7 Progressive Corp. 27,862,882
8 Travelers Companies Inc. 24,875,076
9 Chubb Ltd. 21,266,737
10 USAA Insurance Group 20,151,368
11 Farmers Insurance Group of Companies (3) 19,855,517
12 Nationwide Mutual Group 19,218,907
X4
1 Direct premiums written (1)
2 Market share (2)
3 10.1%
4 6.0
5 5.3
6 4.9
7 4.3
8 3.9
9 3.3
10 3.1
11 3.1
12 3.0
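The extracted columns still carry two rows of header text and are labelled X2 to X4. A possible cleanup is sketched below in base R, using a toy data frame that copies only the first four rows of the output above. The names raw and clean and the new column names are chosen here for illustration.

```r
# Toy stand-in with the same shape as tbl[[1]][, c("X2", "X3", "X4")],
# limited to the first four rows shown above
raw <- data.frame(
  X2 = c("Rank", "Group/company",
         "State Farm Mutual Automobile Insurance", "Berkshire Hathaway Inc."),
  X3 = c("Group/company", "Direct premiums written (1)",
         "$64,892,583", "38,408,251"),
  X4 = c("Direct premiums written (1)", "Market share (2)", "10.1%", "6.0"),
  stringsAsFactors = FALSE
)

# Drop the two rows of stray header text and give the columns real names
clean <- raw[-(1:2), ]
names(clean) <- c("group_company", "direct_premiums_written", "market_share")
rownames(clean) <- NULL

# Strip "$", "," and "%" so the figures can be used as numbers
clean$direct_premiums_written <-
  as.numeric(gsub("[$,]", "", clean$direct_premiums_written))
clean$market_share <- as.numeric(sub("%", "", clean$market_share, fixed = TRUE))
```

The same steps should carry over to the full twelve-row result, provided the first two rows are always the header remnants.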