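I want to extract both the link text and the hyperlinks from the 'Name' column of the following table: European Medicines Agency. My goal is to create a data frame with one column of names and another column of links. With the code below I can collect the hyperlinks, but I'm lost on how to match each link to its actual name.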
library(rvest)
library(dplyr)
page <- read_html('http://www.ema.europa.eu/ema/index.jsp?curl=pages/medicines/landing/smop_search.jsp&mid=WC0b01ac058001d127&startLetter=View%20all&applicationType=Initial%20authorisation&applicationType=Post%20authorisation&keyword=Enter%20keywords&keyword=Enter%20keywords&searchkwByEnter=false&searchType=Name&alreadyLoaded=true&status=Positive&status=Negative&jsenabled=false&orderBy=opinionDate&pageNo=1') %>%
html_nodes('tbody a') %>% html_attr('href')
dfpage <- data.frame(page)
Answer 0 (score: 1)
library(rvest)
library(tidyverse)
url_template <- "http://www.ema.europa.eu/ema/index.jsp?searchType=Name&applicationType=Initial+authorisation&applicationType=Post+authorisation&searchkwByEnter=false&mid=WC0b01ac058001d127&status=Positive&status=Negative&keyword=Enter+keywords&keyword=Enter+keywords&alreadyLoaded=true&curl=pages%%2Fmedicines%%2Flanding%%2Fsmop_search.jsp&startLetter=View+all&pageNo=%s"
Get the total number of pages:
first <- sprintf(url_template, 1)
pg <- read_html(first)
html_nodes(pg, "div.pagination > ul > li:not([class])") %>%
tail(1) %>%
html_text(trim = TRUE) %>%
as.numeric() -> total_pages
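A side note on url_template: because it is passed through sprintf(), every literal percent sign in the URL has to be written as %%, while the single %s is the slot for the page number. Here is a minimal sketch of that behaviour, using a shortened, made-up template (example.com and the query string are placeholders, not the real EMA URL):

```r
# Hypothetical shortened template (not the real EMA URL):
# "%%" produces a literal "%", "%s" is replaced by the page number.
template <- "http://example.com/index.jsp?curl=pages%%2Fsearch.jsp&pageNo=%s"

# sprintf() is vectorised, so one call builds the URL for every page.
urls <- sprintf(template, 1:3)
print(urls)
```

The same pattern is what lets sprintf(url_template, 1:total_pages) produce one URL per results page.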
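There are only 3 pages right now, but there could be many more in the future, so set up a progress bar to keep you entertained, scrape the table, then extract the links and add them to the table: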
pb <- progress_estimated(total_pages)
sprintf(url_template, 1:total_pages) %>%
map_df(function(URL) {
pb$tick()$print()
pg <- read_html(URL)
html_table(pg, trim = TRUE) %>%
.[[1]] %>%
set_names(c("name", "active_substance", "inn", "adopted", "outcome")) %>%
as_tibble() %>%
mutate(url = html_nodes(pg, "th[scope='row'] > a") %>% html_attr("href"))
}) -> pending_df
glimpse(pending_df)
## Observations: 67
## Variables: 6
## $ name <chr> "Lifmior", "Tamiflu", "Jylamvo", "Terrosa", "...
## $ active_substance <chr> "etanercept", "oseltamivir", "methotrexate", ...
## $ inn <chr> "etanercept", "oseltamivir", "methotrexate", ...
## $ adopted <chr> "2016-12-15", "2015-03-26", "2017-01-26", "20...
## $ outcome <chr> "Positive", "Positive", "Positive", "Positive...
## $ url <chr> "index.jsp?curl=pages/medicines/human/medicin...
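Two things may be worth illustrating here. First, the names and links stay matched because html_text() and html_attr() both return one value per node, in the same order, when called on the same node set. Second, the url column holds relative links. The sketch below shows both on an offline example. The HTML fragment is a simplified, hypothetical stand-in for the EMA table, and the base URL http://www.ema.europa.eu/ema/ is an assumption inferred from the page address, so check it against the live site before relying on it.

```r
library(rvest)
library(xml2)

# Hypothetical, simplified stand-in for the EMA results table.
html <- '<table><tbody>
  <tr><th scope="row"><a href="index.jsp?curl=pages/a.jsp">Lifmior</a></th><td>etanercept</td></tr>
  <tr><th scope="row"><a href="index.jsp?curl=pages/b.jsp">Tamiflu</a></th><td>oseltamivir</td></tr>
</tbody></table>'

pg <- read_html(html)

# Select the anchors once, then read text and href from the same node set
# so that row i of each vector refers to the same link.
anchors <- html_nodes(pg, "th[scope='row'] > a")

links_df <- data.frame(
  name = html_text(anchors, trim = TRUE),
  # Assumed base URL; url_absolute() resolves each relative href against it.
  url  = url_absolute(html_attr(anchors, "href"), "http://www.ema.europa.eu/ema/"),
  stringsAsFactors = FALSE
)

print(links_df)
```

Applied to pending_df, the same url_absolute() call could be wrapped in a mutate() on the url column, under the same assumption about the base URL.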
Answer 1 (score: 0)
I would use the following code:
library(rvest)
library(tidyverse)
page <- read_html('http://www.ema.europa.eu/ema/index.jsp?curl=pages/medicines/landing/smop_search.jsp&mid=WC0b01ac058001d127&startLetter=View%20all&applicationType=Initial%20authorisation&applicationType=Post%20authorisation&keyword=Enter%20keywords&keyword=Enter%20keywords&searchkwByEnter=false&searchType=Name&alreadyLoaded=true&status=Positive&status=Negative&jsenabled=false&orderBy=opinionDate&pageNo=1') %>%
html_nodes("table") %>%
rvest::html_table()
data <- as_data_frame(page[[1]])
page_link <- read_html('http://www.ema.europa.eu/ema/index.jsp?curl=pages/medicines/landing/smop_search.jsp&mid=WC0b01ac058001d127&startLetter=View%20all&applicationType=Initial%20authorisation&applicationType=Post%20authorisation&keyword=Enter%20keywords&keyword=Enter%20keywords&searchkwByEnter=false&searchType=Name&alreadyLoaded=true&status=Positive&status=Negative&jsenabled=false&orderBy=opinionDate&pageNo=1') %>%
html_nodes(".key-detail a , .alt~ .alt th") %>%
html_attr('href')
link <- as_data_frame(page_link)
links <- as_data_frame(link$value[-1])
result <- cbind(data, links)
final <- result[, c("Name", "value")]
final
The first row of final gives:
print(t(final[1,]))
...
1
Name "Natpar"
value "index.jsp?curl=pages/medicines/human/medicines/003861/smops/Positive/human_smop_001096.jsp&mid=WC0b01ac058001d127"
I hope that helps. By the way, I used the SelectorGadget add-on for Chrome to find the right selectors.