对于大学研究,我尝试抓取 FDA 表格(robots.txt 允许抓取此内容)
该表包含 19 行和 2 列: https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfpmn/pmn.cfm?ID=K203181
我尝试提取的格式是:
col1 col2 url_of_col2
<chr> <chr> <chr>
1 Device Classificati~ distal transcutaneous electrical stimulator for treatm~ https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfpcd/classification.cfm?s~
我取得的成就:
我可以轻松提取第一列的项目:
#library
library(tidyverse)
library(xml2)
library(rvest)
#load html
html <- xml2::read_html("https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfpmn/pmn.cfm?ID=K203181")
# select table of interest
html %>%
html_nodes("table") -> tables
tables[[9]] -> table
# extract col 1 items
table %>%
html_nodes("th") %>%
html_text() %>%
gsub("\n|\t|\r","",.) %>%
trimws()
#> [1] "Device Classification Name" "510(k) Number"
#> [3] "Device Name" "Applicant"
#> [5] "Applicant Contact" "Correspondent"
#> [7] "Correspondent Contact" "Regulation Number"
#> [9] "Classification Product Code" "Date Received"
#> [11] "Decision Date" "Decision"
#> [13] "Regulation Medical Specialty" "510k Review Panel"
#> [15] "summary" "Type"
#> [17] "Clinical Trials" "Reviewed by Third Party"
#> [19] "Combination Product"
由 reprex package (v1.0.0) 于 2021 年 2 月 27 日创建
我被卡住的地方
# extract col 2 items
table %>%
html_nodes("td") %>%
html_text()%>%
gsub("\n|\t|\r","",.) %>%
trimws()
#> [1] "distal transcutaneous electrical stimulator for treatment of acute migraine"
#> [2] "K203181"
#> [3] "Nerivio, FGD000075-4.7"
#> [4] "Theranica Bioelectronics ltd4 Ha-Omanutst. Poleg Industrial Parknetanya, IL4250574"
#> [5] "Theranica Bioelectronics ltd"
#> [6] "4 Ha-Omanutst. Poleg Industrial Park"
#> [7] "netanya, IL4250574"
#> [8] "alon ironi"
#> [9] "Hogan Lovells US LLP1735 Market StreetSuite 2300philadelphia, PA 19103"
#> [10] "Hogan Lovells US LLP"
#> [11] "1735 Market Street"
#> [12] "Suite 2300"
#> [13] "philadelphia, PA 19103"
#> [14] "janice m. hogan"
#> [15] "882.5899"
#> [16] "QGT "
#> [17] "QGT "
#> [18] "10/26/2020"
#> [19] "01/22/2021"
#> [20] "substantially equivalent (SESE)"
#> [21] "Neurology"
#> [22] "Neurology"
#> [23] "summary"
#> [24] "Traditional"
#> [25] "NCT04089761"
#> [26] "No"
#> [27] "No"
由 reprex package (v1.0.0) 于 2021 年 2 月 27 日创建
我找到了一个很好的 manual 来读取跨多行单元格的 html 表格。但是,我认为这种方法不适用于嵌套数据框。
关于没有链接的嵌套表 (How to scrape older html with nested tables in R?) 的类似问题尚未得到解答。评论 suggested this question,不幸的是我无法将它应用到我的 html 表中。
有一个旨在读取嵌套 html 表的 unpivotr 包,但是,我无法用该包解决我的问题。
答案 0 :(得分:2)
是的,父表的行中的表确实使它变得更加困难。这个关键是找到表的27行,然后逐行解析。
library(rvest)
library(stringr)
library(dplyr)
#load html
html <- xml2::read_html("https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfpmn/pmn.cfm?ID=K203181")
# select table of interest
tables <- html %>% html_nodes("table")
table <- tables[[9]]
#find all of the table's rows
trows <- table %>% html_nodes("tr")
#find the left column
leftside <- trows %>% html_node("th") %>% html_text() %>% trimws()
#find the right column (remove white at the end and in the middle)
rightside <- trows %>% html_node("td") %>% html_text() %>% str_squish() %>% trimws()
#get links
links <-trows %>% html_node("td a") %>% html_attr("href")
answer <-data.frame(leftside, rightside, links)
需要在某些链接上使用 paste("https://www.accessdata.fda.gov/", answer$links)
才能获得完整的网址。
最终的数据帧确实有几个包含“NA”的单元格,这些单元格可以删除,并且可以根据最终要求对表格进行更多清理。将 tidyr::fill()
视为一个好的起点。
更新
要将答案减少到所需的 19 行原始行:
library(tidyr)
#replace NA with blanks
answer$links <- replace_na(answer$links, "")
#fill in the blank is the first column to allow for grouping
answer <-fill(answer, leftside, .direction = "down")
#Create the final results
finalanswer <- answer %>% group_by(leftside) %>%
summarize(info=paste(rightside, collapse = " "), link=first(links))