Question

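I am trying to scrape some data from a Wikipedia table on this page: https://en.wikipedia.org/wiki/Results_of_the_Indian_general_election,_2014 . I am interested in this table: Summary of the 2014 Indian general election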

I would also like to extract the party colors from the table. Here is what I have tried so far:

library("rvest")
url <- 
"https://en.wikipedia.org/wiki/Results_of_the_Indian_general_election,_2014"

electionstats <- read_html(url)
results <- html_nodes(electionstats, xpath='//*[@id="mw-content-text"]/div/table[79]') %>% html_table(fill = T)

party_colors <- electionstats %>% 
html_nodes(xpath='//*[@id="mw-content-text"]/div/table[3]') %>% 
html_table(fill = T)

Printing party_colors does not show any information about the colors.

So I tried:

party_colors <- electionstats %>% html_nodes(xpath='//*[@id="mw-content-text"]/div/table[3]') %>%  
html_nodes('tr')

Now, if I print party_colors, I get:

[1] <tr style="background-color:#E9E9E9">\n<th style="text-align:left;vertical-align:bottom;" rowspan="2"></th>\n<th style="text-align:left; ...
[2] <tr style="background-color:#E9E9E9">\n<th style="text-align:center;">No.</th>\n<th style="text-align:center;">+/-</th>\n<th style="text ...
[3] <tr>\n<td style="background-color:#FF9933"></td>\n<td style="text-align:left;"><a href="/wiki/Bharatiya_Janata_Party" title="Bharatiya J ...
[4] <tr>\n<td style="background-color:#00BFFF"></td>\n<td style="text-align:left;"><a href="/wiki/Indian_National_Congress" title="Indian Na ...
[5] <tr>\n<td style="background-color:#009900"></td>\n<td style="text-align:left;"><a href="/wiki/All_India_Anna_Dravida_Munnetra_Kazhagam"  ...

and so on...

However, I now don't know how to extract the colors from this data. I cannot convert the output above to a table with:

html_table(fill = T)

I get the error:

Error: html_name(x) == "table" is not TRUE

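I also tried various options with html_attrs, but I don't know which attribute I should pass.

I even tried SelectorGadget to work out the attribute, but if I select the first column of the table in question, SelectorGadget shows only "td".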

Answer 1

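I would get the table as you did, then add the color attribute as a column. The "wikitable sortable" class works on many pages, so get the first such table and remove the second header, which ends up in row 1.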

electionstats <- read_html(url) 
x <- html_nodes(electionstats, xpath='//table[@class="wikitable sortable"]')[[1]] %>% 
      html_table(fill=TRUE)
# paste names from 2nd row header and then remove
names(x)[6:14] <- paste(names(x)[6:14], x[1,6:14])
x <- x[-1,]
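
The header-merging step can be tried on a toy data frame without fetching the page. This is only a sketch: the data frame below is a hypothetical stand-in for what html_table returns when a two-row header is read, with the second header row landing in row 1 of the data.

```r
# Hypothetical stand-in for a table read with a two-row header:
# row 1 still holds the second header row ("No.", "%").
x <- data.frame(
  Party      = c("", "Bharatiya Janata Party", "Indian National Congress"),
  Candidates = c("No.", "428", "464"),
  Candidates = c("%", "78.82%", "85.45%"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Paste the second header row onto the matching column names ...
names(x)[2:3] <- paste(names(x)[2:3], unlist(x[1, 2:3]))
# ... then drop that row from the data.
x <- x[-1, ]

names(x)
```

Using unlist() makes it explicit that the first row is treated as a plain character vector before pasting.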

The colors are in the first td of each tr, and you can add them to the empty column 1 or 3 (see str(x)).

names(x)[3] <- "Color"
x$Color <- html_nodes(electionstats, xpath='//table[@class="wikitable sortable"][1]/tr/td[1]') %>% 
            html_attr("style") %>% gsub("background-color:", "", .)
## drop table footer, extra columns
x <- x[1:83, 2:14]   
head(x)
                                     Party   Color Alliance Abbr. Candidates No. Candidates +/- Candidates %
2                   Bharatiya Janata Party #FF9933      NDA   BJP            428             -5       78.82%
3                 Indian National Congress #00BFFF      UPA   INC            464             24       85.45%
4 All India Anna Dravida Munnetra Kazhagam #009900           ADMK             40             17        7.37%
5             All India Trinamool Congress #00FF00           AITC            131             96       24.13%
6                          Biju Janata Dal #006400            BJD             21              3        3.87%
7                                Shiv Sena #E3882D      NDA   SHS             24             11       10.68%
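
For anyone who wants to try the idea offline, here is a minimal sketch on an inline HTML fragment. The fragment is a hypothetical miniature of the Wikipedia table, not the real page, and the variable names (page, rows, result) are made up for illustration. It reads the style attribute of the first td in each row with html_attr and strips the "background-color:" prefix.

```r
library(rvest)

# Hypothetical miniature of the table: the first cell of each party row is
# empty and carries the party color in its inline style attribute.
html <- '<table class="wikitable sortable">
  <tr><th></th><th>Party</th><th>Abbr.</th></tr>
  <tr><td style="background-color:#FF9933"></td><td>Bharatiya Janata Party</td><td>BJP</td></tr>
  <tr><td style="background-color:#00BFFF"></td><td>Indian National Congress</td><td>INC</td></tr>
</table>'
page <- read_html(html)

# Keep only rows that contain td cells (this skips the header row).
rows <- html_nodes(page, xpath = "//table//tr[td]")

# html_node() returns one match per row, so names and colors stay aligned.
party <- html_text(html_node(rows, xpath = "td[2]"))
style <- html_attr(html_node(rows, xpath = "td[1]"), "style")
color <- sub("background-color:", "", style, fixed = TRUE)

result <- data.frame(Party = party, Color = color, stringsAsFactors = FALSE)
result
```

Working row by row is a design choice: a row whose first cell has no style attribute yields NA rather than shifting the remaining colors up by one.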

Answer 2

Your xml_nodeset appears to contain both tr and td nodes.

Process the trs and tds, converting them to data frames:

library(dplyr)  # bind_rows()
library(xml2)   # xml_attrs()

# attributes of every tr in the table, one data frame row per node
party_colors_tr <- electionstats %>%
  html_nodes(xpath='//*[@id="mw-content-text"]/div/table[3]') %>%
  html_nodes('tr')
trs <- bind_rows(lapply(xml_attrs(party_colors_tr), function(x) data.frame(as.list(x), stringsAsFactors=FALSE)))

# attributes of every td in the table, one data frame row per node
party_colors_td <- electionstats %>%
  html_nodes(xpath='//*[@id="mw-content-text"]/div/table[3]') %>%
  html_nodes('tr') %>%
  html_nodes('td')
tds <- bind_rows(lapply(xml_attrs(party_colors_td), function(x) data.frame(as.list(x), stringsAsFactors=FALSE)))

Write a function to extract the styles from a data frame:

library(stringi)

list_styles <- function(nodes_frame) {
  get_cols <- function(x) {
    stri_detect_fixed(x, 'background-color')
  }
  has_style <- which(lapply(nodes_frame$style, get_cols) == TRUE)
  res <- strsplit(nodes_frame[has_style,]$style, ':')
  return(res)
}

Create data frames of the extracted styles:

l_trs <- list_styles(trs)
df_trs <- data.frame(do.call('rbind', l_trs)[,1], do.call('rbind', l_trs)[,2])
names(df_trs) <- c('style', 'color')

l_tds <- list_styles(tds)
df_tds <- data.frame(do.call('rbind', l_tds)[,1], do.call('rbind', l_tds)[,2])
names(df_tds) <- c('style', 'color')

Combine the trs and tds frames:

final_style_frame <- do.call('rbind', list(df_trs, df_tds))

Here are the first 20 rows:

final_style_frame[1:20,]
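
Splitting on ':' works when a style attribute holds a single declaration, but may give uneven pieces when a cell combines several (for example text-align plus background-color). As a hedged alternative, the sketch below pulls out only the background-color value with a regular expression in base R. The helper name extract_bg_color and the sample strings are made up for illustration.

```r
# Hypothetical helper: return the background-color value from inline style
# strings, or NA where no background-color declaration is present.
extract_bg_color <- function(style) {
  style[is.na(style)] <- ""  # treat missing style attributes as empty
  pattern <- "background-color:[[:space:]]*([^;]+)"
  matches <- regmatches(style, regexec(pattern, style))
  vapply(
    matches,
    function(m) if (length(m) >= 2) trimws(m[2]) else NA_character_,
    character(1)
  )
}

# Sample style strings (illustrative only, not taken from the page).
styles <- c(
  "background-color:#E9E9E9",
  "text-align:left; background-color: #FF9933;",
  "text-align:center;",
  NA
)
colors <- extract_bg_color(styles)
colors
```

The same helper could be applied to trs$style or tds$style from the data frames built earlier.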

R: scrape an HTML table and extract background colors
