Extracting table data from a website using R

Time: 2018-12-09 01:58:41

Tags: r

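I want to use R to get information from (https://www.canada.ca/en/health-canada/services/drugs-health-products/drug-products/applications-submissions/register-innovative-drugs/register.html).

The data is not in .csv or Excel format, and I am not sure where to start. I only know very basic R, so any help is welcome. Thanks!

3 answers: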

Answer 0: (score: 1)

Assuming what you are looking for is the data table on that page:

library(tidyverse)
library(rvest)


page <- xml2::read_html("https://www.canada.ca/en/health-canada/services/drugs-health-products/drug-products/applications-submissions/register-innovative-drugs/register.html")

tbl <- html_table(page)[[1]]
tbl <- as_tibble(tbl)  # as.tibble() is deprecated in favour of as_tibble()
tbl

# A tibble: 260 x 9
   `Medicinal\r\n    … `Submission Numb… `Innovative Dru… Manufacturer `Drug(s) Containi… `Notice of Compl… `6 Year\r\n     … `Pediatric Exte… `Data Protectio…
   <chr>                           <int> <chr>            <chr>         <chr>              <chr>             <chr>             <chr>            <chr>           
 1 abiraterone    ace…            138343 Zytiga           Janssen   I… N/A                2011-07-27        2017-07-27        N/A              2019-07-27      
 2 aclidinium bromide             157598 Tudorza    Genu… AstraZeneca … Duaklir    Genuair 2013-07-29        2019-07-29        N/A              2021-07-29      
 3 afatinib dimaleate             158730 Giotrif          Boehringer  … N/A                2013-11-01        2019-11-01        N/A              2021-11-01      
 4 aflibercept                    149321 Eylea            Bayer    Inc. N/A                2013-11-08        2019-11-08        N/A              2021-11-08      
 5 albiglutide                    165145 Eperzan          GlaxoSmithKl… N/A                2015-07-15        2021-07-15        N/A              2023-07-15      
 6 alectinib hydrochl…            189442 Alecensaro       Hoffmann-La … N/A                2016-09-29        2022-09-29        N/A              2024-09-29      
 7 alirocumab                     183116 Praluent         Sanofi-avent… N/A                2016-04-11        2022-04-11        N/A              2024-04-11      
 8 alogliptin benzoate            158335 Nesina           Takeda    Ca… "Kazano\r\n      … 2013-11-27        2019-11-27        N/A              2021-11-27      
 9 anthrax immune glo…            200446 Anthrasil        Emergent    … N/A                2017-11-06        2023-11-06        Yes              2026-05-06      
10 antihemophilic fac…            163447 Eloctate         Bioverativ  … N/A                2014-08-22        2020-08-22        Yes              2023-02-22      
# ... with 250 more rows  

To read the second, third, or fourth table on the page, change the index in tbl <- html_table(page)[[1]] to the number of the table you want.
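
As a minimal sketch of that indexing, the snippet below parses a small hypothetical two-table page that is inlined as a string (so it runs without network access) and picks the second table. The table contents are made up for illustration; only read_html() and html_table() come from the answer above.

```r
library(rvest)

# Hypothetical two-table page, inlined so the example needs no network access
page <- read_html('
  <table><tr><th>id</th><th>name</th></tr><tr><td>1</td><td>first</td></tr></table>
  <table><tr><th>id</th><th>name</th></tr><tr><td>2</td><td>second</td></tr></table>
')

all_tables <- html_table(page)   # a list with one data frame per <table>
length(all_tables)

second <- all_tables[[2]]        # pick the second table by position
second
```

Keeping the whole list around (rather than indexing immediately) makes it easy to check how many tables the page contains before choosing one.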

Answer 1: (score: 0)

You can extract this data with web scraping.

Try something like:

library(rvest)
library(dplyr)

url <- "https://www.canada.ca/en/health-canada/services/drugs-health-products/drug-products/applications-submissions/register-innovative-drugs/register.html"
page_html <- read_html(url)
tables <- page_html %>% html_nodes("table")


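# Loop over the tables by position; seq_along() is safe even if none are found
for (i in seq_along(tables)) {

  tbl_node <- tables[i]

  # Column names come from the header cells, with line breaks stripped
  table_header <- tbl_node %>% html_nodes("thead th") %>% html_text(.) %>% trimws(.) %>% gsub("\r", "", .) %>% gsub("\n", "", .)
  table_data <- matrix(ncol=length(table_header), nrow=1) %>% as.data.frame(.)
  colnames(table_data) <- table_header

  rows <- tbl_node %>% html_nodes("tr")

  # Skip the first row (the header) and fill one data-frame row per table row
  for (j in seq_along(rows)[-1]) {
    table_data[j-1, ] <- rows[j] %>% html_nodes("td") %>% html_text(.) %>% trimws(.)
  }

  # Creates table_data1, table_data2, ... in the calling environment
  assign(paste0("table_data", i), table_data)

}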
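
The manual approach above can be tried on a small inline table first. This is only a sketch: the HTML, the column names, and the variable names (`doc`, `cells`, `manual_tbl`) are hypothetical and chosen for illustration. It collects the header from `thead th`, reads each body row's `td` cells, and binds them into a data frame.

```r
library(rvest)

# Hypothetical table, inlined so the example needs no network access
doc <- read_html('
  <table>
    <thead><tr><th>Drug</th><th>Manufacturer</th></tr></thead>
    <tbody>
      <tr><td>Alpha</td><td> Acme Inc. </td></tr>
      <tr><td>Beta</td><td>Example Ltd.</td></tr>
    </tbody>
  </table>
')

# Header cells become the column names
header <- doc %>% html_nodes("thead th") %>% html_text() %>% trimws()

# One character vector of trimmed cell texts per body row
rows  <- doc %>% html_nodes("tbody tr")
cells <- lapply(rows, function(r) r %>% html_nodes("td") %>% html_text() %>% trimws())

# Bind the rows into a character matrix, then convert to a data frame
manual_tbl <- as.data.frame(do.call(rbind, cells), stringsAsFactors = FALSE)
names(manual_tbl) <- header
manual_tbl
```

Binding all rows at once avoids growing the data frame one row at a time inside a loop.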

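Answer 2: (score: 0)

You can process all of them the same way without a for loop or assign() (shudder). We can also name each list element after the title shown above its table, which the code below reads from the table's <caption>, for reference: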

library(rvest)

xdf <- read_html("https://www.canada.ca/en/health-canada/services/drugs-health-products/drug-products/applications-submissions/register-innovative-drugs/register.html")

tbls <- html_table(xdf, trim = TRUE)

We use janitor::clean_names() to clean up the column names, then find the captions, clean them into suitable variable names, and assign them to the tables:

setNames(
  lapply(tbls, function(tbl) {
    janitor::clean_names(tbl) %>% # CLEAN UP TABLE COLUMN NAMES
      tibble::as_tibble() # solely for better printing
  }),
  html_nodes(xdf, "table > caption") %>% # ASSIGN THE TABLE HEADER TO THE LIST ELEMENT
    html_text() %>%                      # BUT WE NEED TO CLEAN THEM UP FIRST
    trimws() %>%
    tolower() %>%
    gsub("[[:punct:][:space:]]+", "_", .) %>%
    gsub("_+", "_", .) %>%
    make.unique(sep = "_")
) -> tbls
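
To see what the name-cleaning steps do in isolation, here is a small base R sketch that applies the same sequence to two caption strings. The captions are hypothetical stand-ins (the exact caption text on the page is an assumption), and the second one duplicates the first to show what make.unique() does.

```r
# Hypothetical caption strings; the real captions on the page may differ
captions <- c(
  "  Products for Human Use - Active Data Protection Period ",
  "Products for Human Use - Active Data Protection Period"
)

clean <- trimws(captions)                            # drop surrounding whitespace
clean <- tolower(clean)                              # lower-case everything
clean <- gsub("[[:punct:][:space:]]+", "_", clean)   # runs of punctuation/space -> "_"
clean <- gsub("_+", "_", clean)                      # collapse repeated underscores
clean <- make.unique(clean, sep = "_")               # de-duplicate by appending _1, _2, ...
clean
```

The duplicate caption ends up with a `_1` suffix, so every list element gets a distinct name.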

Now we can access them by name in the list, without using the nigh-never-recommended assign() (again, shudder):

tbls$products_for_human_use_active_data_protection_period
## # A tibble: 260 x 9
##    medicinal_ingre… submission_numb… innovative_drug manufacturer drug_s_containi… notice_of_compl… x6_year_no_file…
##    <chr>                       <int> <chr>           <chr>        <chr>            <chr>            <chr>           
##  1 abiraterone    …           138343 Zytiga          Janssen    … N/A              2011-07-27       2017-07-27      
##  2 aclidinium brom…           157598 Tudorza    Gen… AstraZeneca… Duaklir    Genu… 2013-07-29       2019-07-29      
##  3 afatinib dimale…           158730 Giotrif         Boehringer … N/A              2013-11-01       2019-11-01      
##  4 aflibercept                149321 Eylea           Bayer    In… N/A              2013-11-08       2019-11-08      
##  5 albiglutide                165145 Eperzan         GlaxoSmithK… N/A              2015-07-15       2021-07-15      
##  6 alectinib hydro…           189442 Alecensaro      Hoffmann-La… N/A              2016-09-29       2022-09-29      
##  7 alirocumab                 183116 Praluent        Sanofi-aven… N/A              2016-04-11       2022-04-11      
##  8 alogliptin benz…           158335 Nesina          Takeda    C… "Kazano\r\n    … 2013-11-27       2019-11-27      
##  9 anthrax immune …           200446 Anthrasil       Emergent   … N/A              2017-11-06       2023-11-06      
## 10 antihemophilic …           163447 Eloctate        Bioverativ … N/A              2014-08-22       2020-08-22      
## # ... with 250 more rows, and 2 more variables: pediatric_extension_yes_no <chr>, data_protection_ends <chr>

tbls$products_for_human_use_expired_data_protection_period
## # A tibble: 92 x 9
##    medicinal_ingre… submission_numb… innovative_drug manufacturer drug_s_containi… notice_of_compl… x6_year_no_file…
##    <chr>                       <int> <chr>           <chr>        <chr>            <chr>            <chr>           
##  1 abatacept                   98531 Orencia         Bristol-Mye… N/A              2006-06-29       2012-06-29      
##  2 acamprosate cal…           103287 Campral         Mylan Pharm… N/A              2007-03-16       2013-03-16      
##  3 alglucosidase a…           103381 Myozyme         Genzyme Can… N/A              2006-08-14       2012-08-14      
##  4 aliskiren hemif…           105388 Rasilez         Novartis Ph… "Rasilez HCT\r\… 2007-11-14       2013-11-14      
##  5 ambrisentan                113287 Volibris        GlaxoSmithK… N/A              2008-03-20       2014-03-20      
##  6 anidulafungin              110202 Eraxis          Pfizer Cana… N/A              2007-11-14       2013-11-14      
##  7 aprepitant                 108483 Emend           Merck Fross… "Emend Tri-Pack… 2007-08-24       2013-08-24      
##  8 aripiprazole               120192 Abilify         Bristol-Mye… Abilify Maintena 2009-07-09       2015-07-09      
##  9 azacitidine                127108 Vidaza          Celgene      N/A              2009-10-23       2015-10-23      
## 10 besifloxacin               123400 Besivance       Bausch &   … N/A              2009-10-23       2015-10-23      
## # ... with 82 more rows, and 2 more variables: pediatric_extension_yes_no <chr>, data_protection_ends <chr>

tbls$products_for_veterinary_use_active_data_protection_period
## # A tibble: 26 x 8
##    medicinal_ingre… submission_numb… innovative_drug manufacturer drug_s_containi… notice_of_compl… x6_year_no_file…
##    <chr>                       <int> <chr>           <chr>        <chr>            <chr>            <chr>           
##  1 afoxolaner                 163768 Nexgard         Merial Cana… Nexgard Spectra  2014-07-08       2020-07-08      
##  2 avilamycin                 156949 Surmax 100 Pre… Elanco Cana… Surmax 200 Prem… 2014-02-18       2020-02-18      
##  3 cefpodoxime pro…           149164 Simplicef       Zoetis Cana… N/A              2012-12-06       2018-12-06      
##  4 clodronate diso…           172789 Osphos Injecti… Dechra Ltd.  N/A              2015-05-06       2021-05-06      
##  5 closantel sodium           180678 Flukiver        Elanco Divi… N/A              2015-11-24       2021-11-24      
##  6 derquantel                 184844 Startect        Zoetis Cana… N/A              2016-04-27       2022-04-27      
##  7 dibotermin alfa…           148153 Truscient       Zoetis Cana… N/A              2012-11-20       2018-11-20      
##  8 fluralaner                 166320 Bravecto        Intervet Ca… N/A              2014-05-23       2020-05-23      
##  9 gonadotropin re…           140525 Improvest       Zoetis Cana… N/A              2011-06-22       2017-06-22      
## 10 insulin human (…           150211 Prozinc         Boehringer … N/A              2013-04-24       2019-04-24      
## # ... with 16 more rows, and 1 more variable: data_protection_ends <chr>

tbls$products_for_veterinary_use_expired_data_protection_period
## # A tibble: 26 x 8
##    medicinal_ingre… submission_numb… innovative_drug manufacturer drug_s_containi… notice_of_compl… x6_year_no_file…
##    <chr>            <chr>            <chr>           <chr>        <chr>            <chr>            <chr>           
##  1 acetaminophen    110139           Pracetam 20% O… Ceva Animal… N/A              2009-03-05       2015-03-05      
##  2 buprenorphine h… 126077           Vetergesic Mul… Sogeval UK … N/A              2010-02-03       2016-02-03      
##  3 cefovecin sodium 110061           Convenia        Zoetis Cana… N/A              2007-05-30       2013-05-30      
##  4 cephalexin mono… 126970           Vetolexin       Vétoquinol … Cefaseptin       2010-06-24       2016-06-24      
##  5 dirlotapide      110110           Slentrol        Zoetis Cana… N/A              2008-08-14       2014-08-14      
##  6 emamectin benzo… 109976           Slice           Intervet Ca… N/A              2009-06-29       2015-06-29      
##  7 emodepside       112103 / 112106… Profender       Bayer Healt… N/A              2008-08-28       2014-08-28      
##  8 firocoxib        110661 / 110379  Previcox        Merial Cana… N/A              2007-09-28       2013-09-28      
##  9 fluoxetine hydr… 109825 / 109826… Reconcile       Elanco, Div… N/A              2008-03-28       2014-03-28      
## 10 gamithromycin    125823           Zactran         Merial Cana… N/A              2010-03-29       2016-03-29      
## # ... with 16 more rows, and 1 more variable: data_protection_ends <chr>

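Each table has columns containing "N/A" values that can be converted to NA, and they all share a drug_s_containing_the_medicinal_ingredient_variations column which, when it is not N/A, holds one or more drugs separated by \r\n. We can therefore turn it into a list column and post-process it with tidyr::unnest():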

lapply(tbls, function(x) {

  # Make "N/A" into real NAs
  x[] <- lapply(x, function(.x) ifelse(.x == "N/A", NA_character_, .x))

  # The common `drug_s_containing_the_medicinal_ingredient_variations`
  # column - when not N/A - has one drug per-line so we can use that 
  # fact to turn it into a list column which you can use `tidyr::unnest()` on
  x$drug_s_containing_the_medicinal_ingredient_variations <- 
    lapply(x$drug_s_containing_the_medicinal_ingredient_variations, function(.x) {
      # strsplit() returns a list, so take [[1]] to store a plain character vector per row
      trimws(strsplit(trimws(.x), "[\r\n]+")[[1]])
    })

  x

}) -> tbls
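
As a rough sketch of the tidyr::unnest() step mentioned above, the example below builds a tiny table, converts "N/A" to NA, splits a multi-line column into a list column, and expands it to one row per drug. The data, the drug names, and the shortened column name `variations` are all hypothetical, and the `cols =` argument assumes tidyr 1.0.0 or later.

```r
library(tibble)
library(tidyr)

# Hypothetical data; `variations` stands in for the long column name above
demo <- tibble(
  innovative_drug = c("DrugX", "DrugY"),
  variations      = c("DrugA\r\n      DrugB", "N/A")
)

# Make "N/A" into a real NA
demo$variations[demo$variations == "N/A"] <- NA_character_

# One character vector per row, split on line breaks and trimmed
demo$variations <- lapply(strsplit(demo$variations, "[\r\n]+"), trimws)

# Expand the list column: one output row per listed drug
long <- unnest(demo, cols = c(variations))
long
```

Rows whose list element is NA are kept as a single row with NA, so no products are lost when expanding.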