R: Scraping nested HTML tables with links (tables inside cells)

Time: 2021-02-26 23:55:10

Tags: r web-scraping rvest

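For university research, I am trying to scrape an FDA table (robots.txt allows scraping this content).

The table has 19 rows and 2 columns: https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfpmn/pmn.cfm?ID=K203181

The format I am trying to extract is: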

col1                 col2                                                    url_of_col2                                                                   
  <chr>                <chr>                                                   <chr>                                                                         
1 Device Classificati~ distal transcutaneous electrical stimulator for treatm~ https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfpcd/classification.cfm?s~

What I have achieved:

I can easily extract the items of the first column:

#library
library(tidyverse)
library(xml2)
library(rvest)

#load html
html <- xml2::read_html("https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfpmn/pmn.cfm?ID=K203181")

# select table of interest
html %>% 
  html_nodes("table") -> tables
tables[[9]] -> table

# extract col 1 items
table %>%
  html_nodes("th") %>% 
  html_text() %>%
  gsub("\n|\t|\r","",.) %>% 
  trimws()
#>  [1] "Device Classification Name"   "510(k) Number"               
#>  [3] "Device Name"                  "Applicant"                   
#>  [5] "Applicant Contact"            "Correspondent"               
#>  [7] "Correspondent Contact"        "Regulation Number"           
#>  [9] "Classification Product Code"  "Date Received"               
#> [11] "Decision Date"                "Decision"                    
#> [13] "Regulation Medical Specialty" "510k Review Panel"           
#> [15] "summary"                      "Type"                        
#> [17] "Clinical Trials"              "Reviewed by Third Party"     
#> [19] "Combination Product"

Created on 2021-02-27 by the reprex package (v1.0.0)

Where I am stuck

  1. Because some cells in column 2 contain tables, the same approach does not return the same number of items (27 instead of 19):
# extract col 2 items
table %>%
  html_nodes("td") %>% 
  html_text()%>%
  gsub("\n|\t|\r","",.) %>% 
  trimws()
#>  [1] "distal transcutaneous electrical stimulator for treatment of acute migraine"       
#>  [2] "K203181"                                                                           
#>  [3] "Nerivio, FGD000075-4.7"                                                            
#>  [4] "Theranica Bioelectronics ltd4 Ha-Omanutst. Poleg Industrial Parknetanya, IL4250574"
#>  [5] "Theranica Bioelectronics ltd"                                                      
#>  [6] "4 Ha-Omanutst. Poleg Industrial Park"                                              
#>  [7] "netanya, IL4250574"                                                                
#>  [8] "alon  ironi"                                                                       
#>  [9] "Hogan Lovells US LLP1735 Market StreetSuite 2300philadelphia, PA 19103"            
#> [10] "Hogan Lovells US LLP"                                                              
#> [11] "1735 Market Street"                                                                
#> [12] "Suite 2300"                                                                        
#> [13] "philadelphia, PA 19103"                                                            
#> [14] "janice m. hogan"                                                                   
#> [15] "882.5899"                                                                          
#> [16] "QGT  "                                                                             
#> [17] "QGT  "                                                                             
#> [18] "10/26/2020"                                                                        
#> [19] "01/22/2021"                                                                        
#> [20] "substantially equivalent (SESE)"                                                   
#> [21] "Neurology"                                                                         
#> [22] "Neurology"                                                                         
#> [23] "summary"                                                                           
#> [24] "Traditional"                                                                       
#> [25] "NCT04089761"                                                                       
#> [26] "No"                                                                                
#> [27] "No"

Created on 2021-02-27 by the reprex package (v1.0.0)

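  2. In addition, I cannot find a way to extract the URLs of column 2.

I found a good manual for reading HTML tables whose cells span multiple rows. However, I do not think that approach works for nested data frames.

A similar question about nested tables without links (How to scrape older html with nested tables in R?) has not been answered. A comment there suggested this question, but unfortunately I could not apply it to my HTML table.

There is also the unpivotr package, which aims to read nested HTML tables, but I could not solve my problem with it.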

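1 Answer:

Answer 0: (score: 2)

Yes, the tables inside the rows of the parent table do make this more difficult. The key here is to find the table's 27 rows and then parse each row individually.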

library(rvest)
library(stringr)
library(dplyr)

#load html
html <- xml2::read_html("https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfpmn/pmn.cfm?ID=K203181")

# select table of interest
tables <- html %>%  html_nodes("table") 
table <- tables[[9]] 


#find all of the table's rows
trows <- table %>% html_nodes("tr")
#find the left column
leftside <- trows %>% html_node("th") %>%  html_text() %>% trimws()
#find the right column (str_squish() removes whitespace at both ends and collapses it in the middle)
rightside <- trows %>% html_node("td") %>% html_text() %>% str_squish()
#get the first link in each row (NA where the row has none)
links <- trows %>% html_node("td a") %>% html_attr("href")

answer <-data.frame(leftside, rightside, links)
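
The row-by-row approach works because html_node() (singular) returns exactly one result per row, with NA where nothing matches, whereas html_nodes() returns every match, including cells of nested tables. A minimal sketch on a made-up HTML fragment (the fragment and its values are hypothetical, not the FDA page) shows the difference:

```r
library(rvest)

# Hypothetical miniature of the page: the second row's <td> holds a nested table
doc <- xml2::read_html('
<table>
  <tr><th>Device Name</th><td>Nerivio</td></tr>
  <tr><th>Applicant</th><td>
    <table>
      <tr><td>Theranica</td></tr>
      <tr><td><a href="/x">Netanya</a></td></tr>
    </table>
  </td></tr>
</table>')

# html_nodes() descends into the nested table, so the counts disagree
n_th <- length(html_nodes(doc, "th"))
n_td <- length(html_nodes(doc, "td"))

# html_node() yields one value per <tr>, NA for rows without a <th> or a link
rows    <- html_nodes(doc, "tr")
th_text <- rows %>% html_node("th") %>% html_text()
hrefs   <- rows %>% html_node("td a") %>% html_attr("href")
```

Here the two nested rows have no th of their own, which is why the left column of the resulting data frame contains NA entries that need filling afterwards.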

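Some of the links are relative, so the host has to be prepended to get the full URL, for example with paste0("https://www.accessdata.fda.gov", answer$links), assuming those links start with "/". Use paste0() rather than paste(), which would insert a space between the two parts.
The final data frame does contain several cells with NA. These can be removed, and the table can be cleaned up further depending on the final requirements. tidyr::fill() is a good starting point.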
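
As an alternative to pasting the host by hand, xml2::url_absolute() resolves relative links against a base URL and leaves links that are already absolute unchanged. The following is only a sketch: the href values are hypothetical placeholders, and NA entries are skipped explicitly as a precaution.

```r
library(xml2)

base <- "https://www.accessdata.fda.gov/"

# Hypothetical hrefs: one relative, one already absolute, one missing
links <- c("/scripts/example.cfm?id=1",
           "https://example.org/page",
           NA)

# Resolve only the non-missing links; keep NA where a row had no link
full <- links
has_link <- !is.na(links)
full[has_link] <- url_absolute(links[has_link], base)
```

With the real data, `links` would be `answer$links` from the code above.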

Update
To reduce the answer to the desired 19 original rows:

library(tidyr)
#replace NA with blanks
answer$links <- replace_na(answer$links, "")
#fill in the blanks in the first column to allow for grouping
answer <- fill(answer, leftside, .direction = "down")

#Create the final results
finalanswer <- answer %>% group_by(leftside) %>% 
                summarize(info=paste(rightside, collapse = " "), link=first(links))
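
Two details of this summary may be worth adjusting, depending on the final requirements: summarize() returns the groups sorted alphabetically rather than in page order, and the text of an outer row already contains the text of its nested rows (compare items 4 to 7 in the question's output), so collapsing all rows repeats it. The sketch below is one possible variation, not part of the original answer. It uses a small hypothetical stand-in for `answer`, keeps page order via a factor, takes only the first text per group, and picks the first non-missing link.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for `answer` as produced by the row-by-row parse
answer <- data.frame(
  leftside  = c("Device Name", "Applicant", NA, NA),
  rightside = c("Nerivio", "Theranica Netanya", "Theranica", "Netanya"),
  links     = c(NA, NA, NA, "/x"),
  stringsAsFactors = FALSE
)

ordered <- answer %>%
  fill(leftside, .direction = "down") %>%
  # a factor with levels in order of appearance keeps the page order
  mutate(leftside = factor(leftside, levels = unique(leftside))) %>%
  group_by(leftside) %>%
  summarize(info = rightside[1],                # outer row text only
            link = links[!is.na(links)][1],     # first real link, else NA
            .groups = "drop")
```

If this variation is used, the replace_na() step is not needed, since missing links are handled inside summarize().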