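I'm trying to scrape PubMed, but I need to get to "page 2" and I'm not sure what code to use.
So I looked at the following link: Web Scraping - Get to Page 2
I'm fairly sure it would solve the problem, but I don't know how to apply it to my case. Which variables should I use, and what should I send?
All the other posts about web scraping and PubMed deal with different issues.
My code: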
import requests
from bs4 import BeautifulSoup
params = {
    'name': "EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_Pager.Page",
    'title': "Next page of results",
    'class': "active page_link next",
    'href': "#",
    'sid': 3,
    'page': 3,
    'accesskey': "k",
    'id': "EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_Pager.Page"
}
page_link = 'https://www.ncbi.nlm.nih.gov/pubmed/?term=emergency+nurse+AND+pain'
page_response = requests.get(page_link, timeout=5, params=params)
page_content = BeautifulSoup(page_response.content, "html.parser")
print(page_content)
The code behind the "Next" button (this is the markup as it appears on page 2):
<a name="EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_Pager.Page" title="Next page of results" class="active page_link next" href="#" sid="3" page="3" accesskey="k" id="EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_Pager.Page">Next ></a>
It is part of this larger block:
<div class="title_and_pager">
<div><h2>Search results</h2><h3 class="result_count left">Items: 201 to 400 of 367719</h3><span id="result_sel" class="nowrap"></span><input name="EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_ResultsController.ResultCount" sid="1" type="hidden" id="resultcount" value="367719" /><input name="EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_ResultsController.RunLastQuery" sid="1" type="hidden" /></div>
<div class="pagination"><a name="EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_Pager.Page" title="First page of results" class="active page_link" href="#" sid="1" page="1" id="EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_Pager.Page"><< First</a><a name="EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_Pager.Page" title="Previous page of results" class="active page_link prev" href="#" sid="2" page="1" accesskey="j" id="EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_Pager.Page">< Prev</a><h3 class="page"><label for="pageno">Page </label><input name="EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_Pager.cPage" id="pageno" type="text" class="num" sid="1" value="2" last="1839" /> of 1839</h3><a name="EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_Pager.Page" title="Next page of results" class="active page_link next" href="#" sid="3" page="3" accesskey="k" id="EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_Pager.Page">Next ></a><a name="EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_Pager.Page" title="Last page of results" class="active page_link" href="#" sid="4" page="1839" id="EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_Pager.Page">Last >></a><input name="EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_Pager.CurrPage" sid="1" type="hidden" value="2" /></div>
</div>
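I can obviously scrape everything from page 1, but I need to scrape all of the pages. I only need a hint on how to set this up, not complete working code. I know you have better things to do.
Answer 0 (score: 1)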
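I noticed a pattern in the URLs of the site you are trying to read: for each page, the end of the URL changes to page=NUMBER. So the first page has the URL:
"https://www.ncbi.nlm.nih.gov/pubmed/?term=emergency+nurse+AND+pain"
which I found to be equivalent to the following link:
"https://pubmed.ncbi.nlm.nih.gov/?term=emergency%20nurse%20AND%20pain&page=1"
The URL for page 2:
"https://pubmed.ncbi.nlm.nih.gov/?term=emergency%20nurse%20AND%20pain&page=2"
And so on. You can step through all 85 pages and scan each one with a simple for loop: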
import requests
# Pages are numbered from 1, so iterate over 1..85 inclusive.
for page in range(1, 86):
    url = "https://pubmed.ncbi.nlm.nih.gov/?term=emergency%20nurse%20AND%20pain&page=" + str(page)
    response = requests.get(url, timeout=5)
    # read page...
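If you would rather not hard-code the query string, the URLs can be built from the search term and page number. This is only a minimal sketch using the standard library: `build_page_url` and `iter_page_urls` are hypothetical helper names, and it assumes the site keeps accepting the `term` and `page` query parameters seen in the URLs above.

```python
from urllib.parse import quote, urlencode

# Base URL taken from the links above.
BASE_URL = "https://pubmed.ncbi.nlm.nih.gov/"


def build_page_url(term, page):
    """Return the results URL for one page (hypothetical helper)."""
    # quote_via=quote encodes spaces as %20, matching the URLs above.
    query = urlencode({"term": term, "page": page}, quote_via=quote)
    return BASE_URL + "?" + query


def iter_page_urls(term, last_page):
    """Yield the URLs for pages 1 through last_page inclusive (hypothetical helper)."""
    for page in range(1, last_page + 1):
        yield build_page_url(term, page)


if __name__ == "__main__":
    # Print the first three page URLs; no network request is made here.
    for url in iter_page_urls("emergency nurse AND pain", 3):
        print(url)
```

Each generated URL can then be passed to `requests.get` inside the loop, ideally with a timeout and a short pause between requests.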
Let me know if you have any questions. I hope this helps!