数据不是.csv或excel格式。我不确定从哪里开始。我知道非常基础的R,欢迎您提供任何帮助!谢谢!
答案 0 :(得分:1)
假设它是您要查找的页面中的数据表
library(tidyverse)
library(rvest)
page <- xml2::read_html("https://www.canada.ca/en/health-canada/services/drugs-health-products/drug-products/applications-submissions/register-innovative-drugs/register.html")
tbl <- html_table(page)[[1]]
tbl <- as.tibble(tbl)
tbl
# A tibble: 260 x 9
`Medicinal\r\n … `Submission Numb… `Innovative Dru… Manufacturer `Drug(s) Containi… `Notice of Compl… `6 Year\r\n … `Pediatric Exte… `Data Protectio…
<chr> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 abiraterone ace… 138343 Zytiga Janssen I… N/A 2011-07-27 2017-07-27 N/A 2019-07-27
2 aclidinium bromide 157598 Tudorza Genu… AstraZeneca … Duaklir Genuair 2013-07-29 2019-07-29 N/A 2021-07-29
3 afatinib dimaleate 158730 Giotrif Boehringer … N/A 2013-11-01 2019-11-01 N/A 2021-11-01
4 aflibercept 149321 Eylea Bayer Inc. N/A 2013-11-08 2019-11-08 N/A 2021-11-08
5 albiglutide 165145 Eperzan GlaxoSmithKl… N/A 2015-07-15 2021-07-15 N/A 2023-07-15
6 alectinib hydrochl… 189442 Alecensaro Hoffmann-La … N/A 2016-09-29 2022-09-29 N/A 2024-09-29
7 alirocumab 183116 Praluent Sanofi-avent… N/A 2016-04-11 2022-04-11 N/A 2024-04-11
8 alogliptin benzoate 158335 Nesina Takeda Ca… "Kazano\r\n … 2013-11-27 2019-11-27 N/A 2021-11-27
9 anthrax immune glo… 200446 Anthrasil Emergent … N/A 2017-11-06 2023-11-06 Yes 2026-05-06
10 antihemophilic fac… 163447 Eloctate Bioverativ … N/A 2014-08-22 2020-08-22 Yes 2023-02-22
# ... with 250 more rows
要在页面上的第二/第三/第四表中进行读取,请将tbl <- html_table(page)[[1]]
中的数字更改为希望读取的数字表
答案 1 :(得分:0)
您将可以通过网络抓取来提取此数据。
尝试类似
library(rvest)
library(dplyr)
url <- "https://www.canada.ca/en/health-canada/services/drugs-health-products/drug-products/applications-submissions/register-innovative-drugs/register.html"
page_html <- read_html(url)
tables <- page_html %>% html_nodes("table")
for (i in 1:length(tables)) {
table <- tables[i]
table_header <- table %>% html_nodes("thead th") %>% html_text(.) %>% trimws(.) %>% gsub("\r", "", .) %>% gsub("\n", "", .)
table_data <- matrix(ncol=length(table_header), nrow=1) %>% as.data.frame(.)
colnames(table_data) <- table_header
rows <- table %>% html_nodes("tr")
for (j in 2:length(rows)) {
table_data[j-1, ] <- rows[j] %>% html_nodes("td") %>% html_text(.) %>% trimws(.)
}
assign(paste0("table_data", i), table_data)
}
答案 2 :(得分:0)
您可以以相同的方式处理它们,而无需进行for
循环和使用assign()
( shudder )。另外,我们可以为每个表分配表标题(每个表上方的<h2>
作为参考:
library(rvest)
xdf <- read_html("https://www.canada.ca/en/health-canada/services/drugs-health-products/drug-products/applications-submissions/register-innovative-drugs/register.html")
tbls <- html_table(xdf, trim = TRUE)
我们使用janitor::clean_names()
清理列名,然后找到标题,将其清理为合适的变量名,并将其分配给每个表:
setNames(
lapply(tbls, function(tbl) {
janitor::clean_names(tbl) %>% # CLEAN UP TABLE COLUMN NAMES
tibble::as_tibble() # solely for better printing
}),
html_nodes(xdf, "table > caption") %>% # ASSIGN THE TABLE HEADER TO THE LIST ELEMENT
html_text() %>% # BUT WE NEED TO CLEAN THEM UP FIRST
trimws() %>%
tolower() %>%
gsub("[[:punct:][:space:]]+", "_", .) %>%
gsub("_+", "_", .) %>%
make.unique(sep = "_")
) -> tbls
现在,我们可以按列表中的名称访问它们,而无需使用nigh-never-recommended assign()
(再次是 shudder ):
tbls$products_for_human_use_active_data_protection_period
## # A tibble: 260 x 9
## medicinal_ingre… submission_numb… innovative_drug manufacturer drug_s_containi… notice_of_compl… x6_year_no_file…
## <chr> <int> <chr> <chr> <chr> <chr> <chr>
## 1 abiraterone … 138343 Zytiga Janssen … N/A 2011-07-27 2017-07-27
## 2 aclidinium brom… 157598 Tudorza Gen… AstraZeneca… Duaklir Genu… 2013-07-29 2019-07-29
## 3 afatinib dimale… 158730 Giotrif Boehringer … N/A 2013-11-01 2019-11-01
## 4 aflibercept 149321 Eylea Bayer In… N/A 2013-11-08 2019-11-08
## 5 albiglutide 165145 Eperzan GlaxoSmithK… N/A 2015-07-15 2021-07-15
## 6 alectinib hydro… 189442 Alecensaro Hoffmann-La… N/A 2016-09-29 2022-09-29
## 7 alirocumab 183116 Praluent Sanofi-aven… N/A 2016-04-11 2022-04-11
## 8 alogliptin benz… 158335 Nesina Takeda C… "Kazano\r\n … 2013-11-27 2019-11-27
## 9 anthrax immune … 200446 Anthrasil Emergent … N/A 2017-11-06 2023-11-06
## 10 antihemophilic … 163447 Eloctate Bioverativ … N/A 2014-08-22 2020-08-22
## # ... with 250 more rows, and 2 more variables: pediatric_extension_yes_no <chr>, data_protection_ends <chr>
tbls$products_for_human_use_expired_data_protection_period
## # A tibble: 92 x 9
## medicinal_ingre… submission_numb… innovative_drug manufacturer drug_s_containi… notice_of_compl… x6_year_no_file…
## <chr> <int> <chr> <chr> <chr> <chr> <chr>
## 1 abatacept 98531 Orencia Bristol-Mye… N/A 2006-06-29 2012-06-29
## 2 acamprosate cal… 103287 Campral Mylan Pharm… N/A 2007-03-16 2013-03-16
## 3 alglucosidase a… 103381 Myozyme Genzyme Can… N/A 2006-08-14 2012-08-14
## 4 aliskiren hemif… 105388 Rasilez Novartis Ph… "Rasilez HCT\r\… 2007-11-14 2013-11-14
## 5 ambrisentan 113287 Volibris GlaxoSmithK… N/A 2008-03-20 2014-03-20
## 6 anidulafungin 110202 Eraxis Pfizer Cana… N/A 2007-11-14 2013-11-14
## 7 aprepitant 108483 Emend Merck Fross… "Emend Tri-Pack… 2007-08-24 2013-08-24
## 8 aripiprazole 120192 Abilify Bristol-Mye… Abilify Maintena 2009-07-09 2015-07-09
## 9 azacitidine 127108 Vidaza Celgene N/A 2009-10-23 2015-10-23
## 10 besifloxacin 123400 Besivance Bausch & … N/A 2009-10-23 2015-10-23
## # ... with 82 more rows, and 2 more variables: pediatric_extension_yes_no <chr>, data_protection_ends <chr>
tbls$products_for_veterinary_use_active_data_protection_period
## # A tibble: 26 x 8
## medicinal_ingre… submission_numb… innovative_drug manufacturer drug_s_containi… notice_of_compl… x6_year_no_file…
## <chr> <int> <chr> <chr> <chr> <chr> <chr>
## 1 afoxolaner 163768 Nexgard Merial Cana… Nexgard Spectra 2014-07-08 2020-07-08
## 2 avilamycin 156949 Surmax 100 Pre… Elanco Cana… Surmax 200 Prem… 2014-02-18 2020-02-18
## 3 cefpodoxime pro… 149164 Simplicef Zoetis Cana… N/A 2012-12-06 2018-12-06
## 4 clodronate diso… 172789 Osphos Injecti… Dechra Ltd. N/A 2015-05-06 2021-05-06
## 5 closantel sodium 180678 Flukiver Elanco Divi… N/A 2015-11-24 2021-11-24
## 6 derquantel 184844 Startect Zoetis Cana… N/A 2016-04-27 2022-04-27
## 7 dibotermin alfa… 148153 Truscient Zoetis Cana… N/A 2012-11-20 2018-11-20
## 8 fluralaner 166320 Bravecto Intervet Ca… N/A 2014-05-23 2020-05-23
## 9 gonadotropin re… 140525 Improvest Zoetis Cana… N/A 2011-06-22 2017-06-22
## 10 insulin human (… 150211 Prozinc Boehringer … N/A 2013-04-24 2019-04-24
## # ... with 16 more rows, and 1 more variable: data_protection_ends <chr>
tbls$products_for_veterinary_use_expired_data_protection_period
## # A tibble: 26 x 8
## medicinal_ingre… submission_numb… innovative_drug manufacturer drug_s_containi… notice_of_compl… x6_year_no_file…
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 acetaminophen 110139 Pracetam 20% O… Ceva Animal… N/A 2009-03-05 2015-03-05
## 2 buprenorphine h… 126077 Vetergesic Mul… Sogeval UK … N/A 2010-02-03 2016-02-03
## 3 cefovecin sodium 110061 Convenia Zoetis Cana… N/A 2007-05-30 2013-05-30
## 4 cephalexin mono… 126970 Vetolexin Vétoquinol … Cefaseptin 2010-06-24 2016-06-24
## 5 dirlotapide 110110 Slentrol Zoetis Cana… N/A 2008-08-14 2014-08-14
## 6 emamectin benzo… 109976 Slice Intervet Ca… N/A 2009-06-29 2015-06-29
## 7 emodepside 112103 / 112106… Profender Bayer Healt… N/A 2008-08-28 2014-08-28
## 8 firocoxib 110661 / 110379 Previcox Merial Cana… N/A 2007-09-28 2013-09-28
## 9 fluoxetine hydr… 109825 / 109826… Reconcile Elanco, Div… N/A 2008-03-28 2014-03-28
## 10 gamithromycin 125823 Zactran Merial Cana… N/A 2010-03-29 2016-03-29
## # ... with 16 more rows, and 1 more variable: data_protection_ends <chr>
每一个中都有N/A
个可以转换为NA
的列,并且每个列都有一个drug_s_containing_the_medicinal_ingredient_variations
共同的列,当观察值不是N/A
时,它就是一个或更多用\r\n
分隔的药物,因此我们可以将其转换为列表列,然后使用tidyr::unnest()
进行后处理:
lapply(tbls, function(x) {
# Make "N/A" into real NAs
x[] <- lapply(x, function(.x) ifelse(.x == "N/A", NA_character_, .x))
# The common `drug_s_containing_the_medicinal_ingredient_variations`
# column - when not N/A - has one drug per-line so we can use that
# fact to turn it into a list column which you can use `tidyr::unnest()` on
x$drug_s_containing_the_medicinal_ingredient_variations <-
lapply(x$drug_s_containing_the_medicinal_ingredient_variations, function(.x) {
strsplit(trimws(.x), "[\r\n]+")
})
x
}) -> tbls