How to include attributes in a web-scraped HTML table

Date: 2017-08-26 18:33:38

Tags: r html-parsing rvest

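I am using rvest to scrape data from an HTML table on an internal website. The row colors are meaningful, so I would like to extract the BGCOLOR attribute as a column in the final table, but of course html_table() only extracts the cell contents.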

Here is what I have so far, followed by a snippet of the HTML table. How can I add a color column?

html_nodes(samplepage,"table")
tbl_content <- samplepage %>%
     html_nodes("table") %>%
     html_table(fill = TRUE, trim = TRUE)
tbl_content

<tr BGCOLOR = "#F8C0E0">
<td> BASOPHILS <td> microl     <td> 0.477 <td> 0.425 <td align="center"> 0.052 <td align="center"> 1.920 <td align="center">    51.5 <td align="center">    32
</tr>
<tr BGCOLOR = "#F8F0B0">
<td> CALCIUM <td > mg/dl        <td>  12.2 <td>   1.7 <td align="center">   7.6 <td align="center">  14.9 <td align="center">    71 <td align="center">    33
</tr>

1 Answer:

Answer 0: (score: 2)

You can build your own parser to replace html_table. purrr::map_df makes it convenient to iterate over the nodes (tr in this case) and combine the results into a data.frame:

library(rvest)
library(tidyverse)

html <- '<tr BGCOLOR = "#F8C0E0">
<td> BASOPHILS <td> microl     <td> 0.477 <td> 0.425 <td align="center"> 0.052 <td align="center"> 1.920 <td align="center">    51.5 <td align="center">    32
</tr>
<tr BGCOLOR = "#F8F0B0">
<td> CALCIUM <td > mg/dl        <td>  12.2 <td>   1.7 <td align="center">   7.6 <td align="center">  14.9 <td align="center">    71 <td align="center">    33
</tr>'

parsed_df <- html %>% 
    read_html() %>% 
    html_nodes('tr') %>% 
    map_df(~bind_cols(data_frame(bgcolor = html_attr(.x, 'bgcolor')),    # grab attribute
                      # extract each row's values to 1-row data.frame
                      html_nodes(.x, 'td') %>% 
                          html_text(trim = TRUE) %>% 
                          set_names(paste0('x', seq_along(.))) %>%    # or `%>% t() %>% as_data_frame()`
                          invoke(data_frame, .))) %>% 
    type_convert()    # clean up types

parsed_df
#> # A tibble: 2 x 9
#>   bgcolor        x1     x2     x3    x4    x5    x6    x7    x8
#>     <chr>     <chr>  <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 #F8C0E0 BASOPHILS microl  0.477 0.425 0.052  1.92  51.5    32
#> 2 #F8F0B0   CALCIUM  mg/dl 12.200 1.700 7.600 14.90  71.0    33
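
The answer above uses data_frame() and invoke(), which have been deprecated in later tibble and purrr releases. As a rough sketch of the same idea for a newer setup (assuming rvest 1.0 or later and purrr 1.0 or later are installed), the per-row step can be pulled into a small function. The name row_to_tibble() is a hypothetical helper invented for this illustration, and the sample HTML is shortened to four cells per row:

```r
library(rvest)
library(purrr)
library(tibble)
library(readr)

# Shortened copy of the question's rows, wrapped in <table> for the parser
html <- '<table>
<tr BGCOLOR = "#F8C0E0">
<td> BASOPHILS <td> microl <td> 0.477 <td> 0.425
</tr>
<tr BGCOLOR = "#F8F0B0">
<td> CALCIUM <td > mg/dl <td> 12.2 <td> 1.7
</tr>
</table>'

# Hypothetical helper: turn one <tr> node into a 1-row tibble
# holding the row's bgcolor attribute plus one column per cell.
row_to_tibble <- function(row) {
  cells <- row %>% html_elements("td") %>% html_text(trim = TRUE)
  names(cells) <- paste0("x", seq_along(cells))
  as_tibble(c(list(bgcolor = html_attr(row, "bgcolor")), as.list(cells)))
}

parsed_df <- read_html(html) %>%
  html_elements("tr") %>%
  map(row_to_tibble) %>%   # one tibble per row
  list_rbind() %>%         # stack them into a single tibble
  type_convert()           # guess column types from the text

parsed_df
```

Keeping the row logic in its own function makes it easier to pull further attributes (for example align on individual cells) without touching the iteration code.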

More simply, but less flexibly, you can pull out the attribute and then bind it to the result of html_table:

paste('<table>', html, '</table>') %>%    # `html_table` needs a <table> tag
    read_html() %>% 
    {
        data.frame(bgcolor = html_nodes(., 'tr') %>% html_attr('bgcolor'), 
                   html_table(.))
    }
#>   bgcolor        X1     X2     X3    X4    X5    X6   X7 X8
#> 1 #F8C0E0 BASOPHILS microl  0.477 0.425 0.052  1.92 51.5 32
#> 2 #F8F0B0   CALCIUM  mg/dl 12.200 1.700 7.600 14.90 71.0 33
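
One reason this shortcut is less flexible is that it only works when there is exactly one tr per row of the parsed table. If the real table has a header row, the attribute vector has one extra element that needs to be dropped. The following is a minimal sketch of that case, assuming rvest 1.0 or later; the header names test, unit, and mean are made up for illustration and do not come from the question:

```r
library(rvest)

# Made-up table with a header row (header names are assumptions)
html <- '<table>
<tr><th>test</th><th>unit</th><th>mean</th></tr>
<tr BGCOLOR="#F8C0E0"><td>BASOPHILS</td><td>microl</td><td>0.477</td></tr>
<tr BGCOLOR="#F8F0B0"><td>CALCIUM</td><td>mg/dl</td><td>12.2</td></tr>
</table>'

page <- read_html(html)

# Parse the table, treating the first row as column names
tbl <- page %>% html_element("table") %>% html_table(header = TRUE)

# One bgcolor per <tr>; the header row has none, so drop its NA
colors <- page %>% html_elements("tr") %>% html_attr("bgcolor")
colors <- colors[-1]

# Guard against a mismatch before binding the columns together
stopifnot(length(colors) == nrow(tbl))
result <- data.frame(bgcolor = colors, tbl)
result
```

The stopifnot() check is there so that a mismatch fails loudly instead of letting data.frame() recycle or misalign the colors.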