How to scrape deeply nested links with Python BeautifulSoup

Time: 2015-01-10 00:23:44

Tags: python html web-scraping beautifulsoup html-parsing

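I am trying to build a spider/web crawler for academic purposes that grabs text from academic publications and appends related links to a URL stack. I am trying to crawl a website called 'PubMed', but I can't seem to grab the links I need. Here is my code with an example page; this page should be representative of the others in their database: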

 website = 'http://www.ncbi.nlm.nih.gov/pubmed/?term=mtap+prmt'
 from bs4 import BeautifulSoup
 import requests
 r = requests.get(website)
 soup = BeautifulSoup(r.content)

For readability, I have split the walk down the HTML tree into several variables so that it fits within one screen width.

 key_text = soup.find('div', {'class':'grid'}).find('div',{'class':'col twelve_col nomargin shadow'}).find('form',{'id':'EntrezForm'})
 side_column = key_text.find('div', {'xmlns:xi':'http://www.w3.org/2001/XInclude'}).find('div', {'class':'supplemental col three_col last'})
 side_links = side_column.find('div').findAll('div')[1].find('div', {'id':'disc_col'}).findAll('div')[1]

 for link in side_links:
      print link

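If you view the HTML source with Chrome's "inspect element", there should be several more nested divs containing the links inside 'side_links'. However, the code above produces the following error: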

 Traceback (most recent call last):
 File "C:/Users/ballbag/Copy/web_scraping/google_search.py", line 22, in <module>
 side_links = side_column.find('div').findAll('div')[1].find('div',      {'id':'disc_col'}).findAll('div')[1]
 IndexError: list index out of range

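If you go to the URL, there is a column on the right called "Related links" containing the URLs I want to scrape, but I can't seem to get to them. There is a statement under the div I am trying to get into, and I suspect it has something to do with this. Can anyone help me grab these links? I would really appreciate any pointers.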

1 answer:

Answer 0 (score: 3)

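The problem is that the side bar is loaded with an additional asynchronous request, so its links are not in the HTML that requests receives for the main page.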
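One way to check for this situation, as a minimal sketch: parse the HTML exactly as it arrives and see whether the side bar container holds real links or only a placeholder anchor. The sample markup below is an assumption modeled on the selectors used in this answer (div#disc_col, a.disc_col_ph), not a copy of the PubMed page, and the helper name sidebar_is_placeholder is hypothetical. It is written for Python 3.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for what the server returns before any
# JavaScript runs: the side bar holds only a placeholder link.
STATIC_HTML = """
<div id="disc_col">
  <a class="disc_col_ph" href="/pubmed?sidebar=placeholder">Loading ...</a>
</div>
"""


def sidebar_is_placeholder(html):
    """Return True if div#disc_col contains anchors and all are placeholders."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.select_one("div#disc_col")
    if container is None:
        return False
    anchors = container.find_all("a")
    placeholders = container.select("a.disc_col_ph")
    return len(anchors) > 0 and len(anchors) == len(placeholders)


print(sidebar_is_placeholder(STATIC_HTML))
```

If the container holds only a placeholder, the nested divs seen in the browser inspector do not exist in the downloaded HTML, which would explain why indexing into findAll('div') fails.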

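The idea here would be to:

  • maintain a web-scraping session using requests.Session
  • parse the URL that is used for getting the side bar
  • follow that link and get the links from the div with class="portlet_content"

The code: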

# Python 3 (on Python 2, use: from urlparse import urljoin)
from urllib.parse import urljoin

from bs4 import BeautifulSoup
import requests


base_url = 'http://www.ncbi.nlm.nih.gov'
website = 'http://www.ncbi.nlm.nih.gov/pubmed/?term=mtap+prmt'

# parse the main page and grab the link to the side bar
session = requests.Session()
soup = BeautifulSoup(session.get(website).content, 'html.parser')

# this href is the URL used for getting the side bar
url = urljoin(base_url, soup.select('div#disc_col a.disc_col_ph')[0]['href'])

# fetch and parse the side bar with the same session
soup = BeautifulSoup(session.get(url).content, 'html.parser')

# print each link's text followed by its absolute URL
for a in soup.select('div.portlet_content ul li.brieflinkpopper a'):
    print(a.text, urljoin(base_url, a.get('href')))

Prints:

The metabolite 5'-methylthioadenosine signals through the adenosine receptor A2B in melanoma. http://www.ncbi.nlm.nih.gov/pubmed/25087184
Down-regulation of methylthioadenosine phosphorylase (MTAP) induces progression of hepatocellular carcinoma via accumulation of 5'-deoxy-5'-methylthioadenosine (MTA). http://www.ncbi.nlm.nih.gov/pubmed/21356366
Quantitative analysis of 5'-deoxy-5'-methylthioadenosine in melanoma cells by liquid chromatography-stable isotope ratio tandem mass spectrometry. http://www.ncbi.nlm.nih.gov/pubmed/18996776
...
Cited in PMC http://www.ncbi.nlm.nih.gov/pmc/articles/pmid/23265702/citedby/?tool=pubmed
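
Since the question mentions appending the related links to a URL stack, here is a minimal sketch of that last step, written for Python 3. The helper name push_links, the seen set used to skip duplicates, and the relative href values are assumptions for illustration; in a real crawler the hrefs would come from a.get('href') in the loop over the side bar anchors.

```python
from collections import deque
from urllib.parse import urljoin

BASE_URL = 'http://www.ncbi.nlm.nih.gov'


def push_links(stack, seen, hrefs, base_url=BASE_URL):
    """Resolve each href against base_url and push unseen URLs onto the stack."""
    for href in hrefs:
        url = urljoin(base_url, href)
        if url not in seen:
            seen.add(url)
            stack.append(url)


stack, seen = deque(), set()
# Assumed example hrefs; the repeated one is pushed only once.
push_links(stack, seen, ['/pubmed/25087184', '/pubmed/21356366', '/pubmed/25087184'])
print(list(stack))
```

Calling stack.pop() then yields the most recently added URL, so the crawl proceeds depth-first; popleft() would give breadth-first order instead.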