bs4 scraping in Python: get content up to a specific class name

Time: 2018-09-05 11:50:04

Tags: python class beautifulsoup screen-scraping

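I want to scrape this site: https://www.eduvision.edu.pk/institutions-detail.php?city=51I&institute=5_allama-iqbal-open-university-islamabad. From this URL I only want the bachelor's data found under class name = academicsList, and I do not want the data under MS (MASTERS). I want my scraper to stop before the MS data. My idea is to keep a temporary counter on class = academicsHead and stop when the second academicsHead is reached.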

import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
ua = UserAgent()
header = {'user-agent': ua.chrome}
response = requests.get('https://www.eduvision.edu.pk/institutions-detail.php?city=51I&institute=5_allama-iqbal-open-university-islamabad', headers=header)
soup = BeautifulSoup(response.content, 'html.parser')
disciplines = soup.findAll("ul", {"class": "academicsList"})
# temp = soup.findAll("ul", {"class": "academicsHead"})
# stop at second academicsHead
for d in disciplines:
    print(d.findAll('li')[0].text)

1 answer:

Answer 0: (score: 0)

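We can check whether an element's class is 'academicsHead' and, if it is, check whether its text is BACHELOR; if it is not, break out of the loop. Something like this would work: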
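A minimal sketch of that approach, not the answerer's own snippet. It assumes the academicsHead and academicsList elements appear in document order on the page and that the bachelor's heading contains the text BACHELOR. The sample markup and the helper name `bachelor_disciplines` are hypothetical stand-ins, since the real page structure is not shown here.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the real page; the actual
# structure of the eduvision.edu.pk page is an assumption.
SAMPLE_HTML = """
<div>
  <ul class="academicsHead"><li>BACHELOR</li></ul>
  <ul class="academicsList"><li>BS Computer Science</li><li>4 years</li></ul>
  <ul class="academicsList"><li>BBA</li><li>4 years</li></ul>
  <ul class="academicsHead"><li>MS (MASTERS)</li></ul>
  <ul class="academicsList"><li>MS Physics</li><li>2 years</li></ul>
</div>
"""


def bachelor_disciplines(soup):
    """Return the first <li> text of each academicsList under the BACHELOR head."""
    results = []
    in_bachelor = False
    # find_all returns matches in document order, so heads and lists
    # come back interleaved the same way they appear on the page.
    for tag in soup.find_all(class_=["academicsHead", "academicsList"]):
        if "academicsHead" in tag.get("class", []):
            if "BACHELOR" in tag.get_text().upper():
                in_bachelor = True   # entering the bachelor's section
                continue
            if in_bachelor:
                break                # next heading (e.g. MS) reached: stop
            continue                 # a heading before the bachelor's section
        if in_bachelor:
            first_item = tag.find("li")
            if first_item is not None:
                results.append(first_item.get_text(strip=True))
    return results


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(bachelor_disciplines(soup))
```

To use it on the live page, pass in the `soup` built from `response.content` in the question instead of the sample markup. If the headings on the real page use a different tag or wording, the class names and the BACHELOR check would need adjusting.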
