How do I clean and split HTML tags in R?

Date: 2016-12-04 08:38:56

Tags: html r split

My parser creates a data frame that looks like this:

    name          html
 1  John         <span class="incident-icon" data-minute="68" data-second="37" data-id="8028"></span><span class="name-meta-data">68</span>
 2 Steve         <span class="incident-icon" data-minute="69" data-second="4" data-id="132205"></span><span class="name-meta-data">69</span>

How can I extract the useful information from the HTML? For example, I would like to use some of the HTML attributes as features:

   name minute second     id
1  John     68     37   8028
2 Steve     69      4 132205

1 answer:

Answer 0: (score: 1)

An alternative approach using rvest, purrr, and dplyr:

library(rvest)
library(purrr)
library(purrrlyr)  # by_row() moved from purrr to purrrlyr as of purrr 0.2.3
library(dplyr)

df <- read.table(stringsAsFactors=FALSE, header=TRUE, sep=",", text='name,html
John,<span class="incident-icon" data-minute="68" data-second="37" data-id="8028"></span><span class="name-meta-data">68</span>
Steve,<span class="incident-icon" data-minute="69" data-second="4" data-id="132205"></span><span class="name-meta-data">69</span>')

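# For each row, parse the HTML fragment and turn the first <span>'s attributes
# into new columns. With .collate = "cols" the new column names get a numeric
# suffix (here "1"), which is stripped again in the last step.
by_row(df, .collate="cols", 
       ~read_html(.$html) %>% 
         html_nodes("span:first-of-type") %>%   # the span carrying the data-* attributes
         html_attrs() %>%                       # list of named character vectors
         flatten_chr() %>% 
         as.list() %>% 
         flatten_df()) %>% 
  select(-html, -class1) %>%                    # drop the raw HTML and the class attribute
  setNames(gsub("^data-|1$", "", colnames(.)))  # remove the "data-" prefix and "1" suffix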
## # A tibble: 2 × 4
##    name minute second     id
##   <chr>  <chr>  <chr>  <chr>
## 1  John     68     37   8028
## 2 Steve     69      4 132205
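
If installing rvest and purrrlyr is not an option, the same attributes can probably be pulled out with base R regular expressions. The following is only a minimal sketch, not part of the original answer: `extract_data_attr` is a hypothetical helper, and it assumes every attribute is written as `data-<name>="<value>"` with double quotes, as in the sample rows. Regex matching on HTML is fragile, so a real parser is the safer choice for anything less regular than this.

```r
# Sample data mirroring the data frame from the question.
df <- data.frame(
  name = c("John", "Steve"),
  html = c(
    '<span class="incident-icon" data-minute="68" data-second="37" data-id="8028"></span><span class="name-meta-data">68</span>',
    '<span class="incident-icon" data-minute="69" data-second="4" data-id="132205"></span><span class="name-meta-data">69</span>'
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each string, return the value of the first
# data-<attr>="..." occurrence, or NA if the attribute is absent.
extract_data_attr <- function(html, attr) {
  pattern <- sprintf('data-%s="([^"]*)"', attr)
  matches <- regmatches(html, regexec(pattern, html))
  # Each list element holds the full match followed by the captured value;
  # a string without a match yields character(0).
  vapply(
    matches,
    function(m) if (length(m) >= 2) m[2] else NA_character_,
    character(1)
  )
}

# Build one integer column per attribute of interest.
attrs <- c("minute", "second", "id")
cols <- lapply(attrs, function(a) as.integer(extract_data_attr(df$html, a)))
names(cols) <- attrs

result <- cbind(df["name"], as.data.frame(cols))
print(result)
```

Unlike the rvest pipeline above, this sketch converts the values to integers rather than leaving them as character columns; drop the `as.integer()` call if strings are preferred.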