Question

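I'm new to web scraping with R and I'm stuck on this problem: I want to use R to submit a search query to PubMed and then download a CSV file from the results page. The CSV file is reached by clicking "Send to" (which opens a dropdown menu), selecting the "File" radio button, changing the "Format" option to "CSV" (option 6), and finally clicking the "Create File" button to start the download.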

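A few notes:
1. Yes, this kind of remote search and download complies with NCBI's policies.
2. Why not use the easyPubMed package? I have tried it and use it for another part of my work. However, retrieving search results with this package misses some of the article metadata that the CSV download includes.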

I have looked at these related questions: "Download csv file from webpage after submitting form from dropdown using rvest package in R", "R Download .csv file tied to input boxes and a "click" button", and "Using R to "click" a download file button on a webpage".

I think the earlier solutions provided by @hrbrmstr contain the answer, but I can't put all the pieces together to download the CSV file.

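I think the best way to solve this is in two steps: 1) POST the search request to PubMed and GET the results, and 2) submit a second POST request to the results page (or navigate within it) with the desired options selected, so that the CSV file is downloaded. I tried the following with a toy search query ("hello world", with quotes, which currently returns 6 results)...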

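library(httr)   # POST(), content(), write_disk()
library(rvest)  # html_session(), html_form(), html_nodes(), html_text(), and %>%

query <- '"hello world"'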
url <- 'https://www.ncbi.nlm.nih.gov/pubmed/'

html_form(html_session(url)) # enter query using 'term'
# post search and retrieve results
session <- POST(url,body = list(term=query),encode='form')

# scrape results to check that above worked
content(session) %>% html_nodes('#maincontent > div > div:nth-child(5)') %>% 
  html_text()
content(session) %>% html_nodes('#maincontent > div > div:nth-child(5)') %>% 
  html_nodes('p') %>% html_text()

# view html nodes of dropdown menu -- how to 'click' these via R?
content(session) %>% html_nodes('#sendto > a')
content(session) %>% html_nodes('#send_to_menu > fieldset > ul > li:nth-child(1) > label')
content(session) %>% html_nodes('#file_format')
content(session) %>% html_nodes('#submenu_File > button')

# submit request to download CSV file
POST(session$url, # I know this doesn't work, but I would hope something similar is possible
     encode='form',
     body=list('EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_DisplayBar.SendTo'='File',
               'EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_DisplayBar.FFormat'=6,
               'EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_DisplayBar.SendToSubmit'=1),
     write_disk('results.csv'))

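The last call above fails: a file is downloaded, but it contains the HTML of the POST response rather than CSV data. Ideally, how can I edit that last call to get the desired CSV file?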

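***A possible hack is to jump straight to the results page. In other words, I know that submitting the "hello world" search returns the following URL: https://www.ncbi.nlm.nih.gov/pubmed/?term=%22hello+world%22. So, if necessary, I can extrapolate from this and build the results URL from the search query.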
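To make that extrapolation concrete, here is a minimal sketch of building the results URL from a query string. The helper name build_results_url is made up for illustration, and the only thing it assumes about PubMed is the URL pattern quoted above (percent-encoded term, spaces written as "+").

```r
# Hypothetical helper (not part of httr or rvest): build the results-page URL
# for a query, following the pattern of the URL quoted above.
build_results_url <- function(query) {
  # Percent-encode everything that is not URL-safe, including the quotes.
  encoded <- utils::URLencode(query, reserved = TRUE)
  # The quoted URL writes spaces as "+", so swap the "%20" produced above.
  encoded <- gsub("%20", "+", encoded, fixed = TRUE)
  paste0("https://www.ncbi.nlm.nih.gov/pubmed/?term=", encoded)
}

results_url <- build_results_url('"hello world"')
print(results_url)
```

This only reproduces the URL; it does not by itself solve the CSV download.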

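I tried plugging this URL into the call above, but it still doesn't return the desired CSV file. I can view the form fields with the following command...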

# view form options on the results page
html_form(html_session('https://www.ncbi.nlm.nih.gov/pubmed/?term=%22hello+world%22'))

Alternatively, knowing the form options above, is it possible to extend the URL? Something like...

url2 <- 'https://www.ncbi.nlm.nih.gov/pubmed/?term=%22hello+world%22&EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_DisplayBar.SendTo=File&EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_DisplayBar.FFormat=6&EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_DisplayBar.SendToSubmit=1'
POST(url2,write_disk('results2.csv'))

I expected to download a CSV file with the 6 results and their article metadata; instead, I get the HTML of the results page.

Any help would be greatly appreciated! Thanks.

Answer 1

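If I rephrase your question as: "I want to use R to submit a search query to PubMed and then download the same information that the CSV download option on the results page provides,"

then I think you can skip the scraping and web UI automation and go straight to the API that NIH has provided for this purpose.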

The first part of this R code runs the same search ("hello world") and gets the same results in JSON format (you can paste the search_url link into a browser to verify).

library(httr)
library(jsonlite)
library(tidyverse)

# Search for "hello world"
search_url <- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=%22hello+world%22&format=json"

# Search for results
search_result <- GET(search_url)

# Extract the content
search_content <- content(search_result, 
                          type = "application/json",
                          simplifyVector = TRUE)

# search_content$esearchresult$idlist
# [1] "29725961" "28103545" "27567633" "25955529" "22999052" "19674957"

# Get a vector of the search result IDs
result_ids <- search_content$esearchresult$idlist

# Get a summary for id 29725961 (the first one).
summary_url <- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&version=2.0&id=29725961&format=json"

summary_result <- GET(summary_url)

# Extract the content
summary_content <- content(summary_result, 
                          type = "application/json")

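Presumably you can take it from here, since the list summary_content contains the information you need, just in a different format (I verified this by visual inspection).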
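The URLs above are hard-coded for one query and one ID. As a rough sketch, the two helpers below build them from arbitrary input instead. The helper names are made up for illustration. The retmax parameter is an addition not used in the code above (esearch returns 20 IDs by default), and esummary accepts a comma-separated list of IDs in a single request. Spaces are encoded as %20 here rather than "+", which the server should treat the same way.

```r
eutils_base <- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

# Hypothetical helper: esearch URL for any query string.
# retmax caps how many IDs come back (esearch defaults to 20).
build_esearch_url <- function(query, retmax = 20) {
  paste0(eutils_base, "/esearch.fcgi?db=pubmed",
         "&term=", utils::URLencode(query, reserved = TRUE),
         "&retmax=", sprintf("%d", as.integer(retmax)),
         "&format=json")
}

# Hypothetical helper: one esummary URL covering several IDs at once.
build_esummary_url <- function(ids) {
  paste0(eutils_base, "/esummary.fcgi?db=pubmed&version=2.0",
         "&id=", paste(ids, collapse = ","),
         "&format=json")
}

search_url_any  <- build_esearch_url('"hello world"')
summary_url_all <- build_esummary_url(c("29725961", "28103545"))
print(search_url_any)
print(summary_url_all)
```

With these, GET(build_esummary_url(result_ids)) should fetch the summaries for every search hit in one call instead of one request per ID.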

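However, in keeping with the spirit of the original question (using R to get a CSV by pulling the data from NCBI), here are some steps you could use to reproduce exactly the same CSV that you would get from the PubMed Web UI for humans.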

# Quickie cleanup (thanks to Tony ElHabr)
# https://www.r-bloggers.com/converting-nested-json-to-a-tidy-data-frame-with-r/
summary_untidy <- enframe(unlist(summary_content))

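# Get rid of *some* of the fluff...
# (fixed = TRUE matches the dots literally instead of as regex wildcards)
summary_tidy <- summary_untidy %>% 
  filter(grepl("result.29725961", name, fixed = TRUE)) %>% 
  mutate(name = sub("result.29725961.", "", name, fixed = TRUE))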

# Convert the multiple author records into a single comma-separated string.
authors <- summary_tidy %>% 
  filter(grepl("^authors.name$", name)) %>% 
  summarize(pasted = paste(value, collapse = ", "))

# Begin to construct a data frame that has the same information as the downloadable CSV
summary_csv <- tibble(
  Title = summary_tidy %>% filter(name == "title") %>% pull(value),
  URL = sprintf("/pubmed/%s", summary_tidy %>% filter(name == "uid") %>% pull(value)),
  Description = pull(authors, pasted),
  Details = "... and so on, and so on, and so on... "
)

# Write the sample data frame to a csv.
# (The file name is passed positionally; readr 1.4.0 deprecated `path` in favor of `file`.)
write_csv(summary_csv, "just_like_the_search_page_csv.csv")
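The steps above handle one hard-coded ID. Here is a minimal sketch that generalizes them to every ID in an esummary response, using base R only. It assumes the parsed JSON has the shape seen above: a result element holding a uids list plus one record per ID, each with uid, title and an authors list whose entries have a name. The helper names and the mock data are made up for illustration.

```r
# Hypothetical helper: turn one esummary record (a nested list) into one row.
record_to_row <- function(rec) {
  author_names <- vapply(rec$authors, function(a) a$name, character(1))
  data.frame(
    Title = rec$title,
    URL = sprintf("/pubmed/%s", rec$uid),
    Description = paste(author_names, collapse = ", "),
    stringsAsFactors = FALSE
  )
}

# Hypothetical helper: stack the rows for every ID listed in result$uids.
summaries_to_df <- function(summary_content) {
  ids <- unlist(summary_content$result$uids)
  rows <- lapply(ids, function(id) record_to_row(summary_content$result[[id]]))
  do.call(rbind, rows)
}

# Mock data standing in for content(GET(...)); not real PubMed records.
mock_content <- list(result = list(
  uids = list("111", "222"),
  `111` = list(uid = "111", title = "First paper",
               authors = list(list(name = "Doe J"), list(name = "Roe R"))),
  `222` = list(uid = "222", title = "Second paper",
               authors = list(list(name = "Poe E")))
))

summary_df <- summaries_to_df(mock_content)
print(summary_df)
```

The remaining CSV columns would be added to record_to_row in the same way, and the result can be written out with write.csv(summary_df, "results.csv", row.names = FALSE).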

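I'm not familiar with the easyPubMed package you mentioned, but digging through the easyPubMed code is what inspired me to use the NCBI API. It's entirely possible that you could fix or tweak some of the easyPubMed code to pull the additional metadata you were hoping to get from downloading a bunch of CSVs. (There isn't much there: only about 500 lines of code defining 8 functions.)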

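Heck, if you do manage to tweak the easyPubMed code to extract the additional metadata, I'd suggest sending your changes back to the authors so that they can improve their package!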

Answer 2

Using the easyPubMed package:

library(easyPubMed)
out <- batch_pubmed_download(pubmed_query_string = "hello world")
DF <- table_articles_byAuth(pubmed_data = out[1])
write.csv(DF, "helloworld.csv")

See the vignettes and help files in easyPubMed for details.

Other packages are pubmed.mineR, rentrez and RISmed on CRAN, annotate on Bioconductor, and Rcupcake on GitHub.

Download CSV file from results page with options from dropdown menu