Using BeautifulSoup

Date: 2019-07-05 20:58:35

Tags: python html web-scraping beautifulsoup tags

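I'm using BeautifulSoup to scrape metadata for journal articles, and I need to retrieve each article's category. As an example, take this article. I've pasted the block of markup I'm trying to parse below.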

<div id="landingDetailPluginDiv" class="p20">
  <div class="article_category">CLINICAL</div>
  <div class="article_text">
    <div class="article_title"><a href="/journals/issue/2018/2018-vol24-n1/measuring-overuse-with-electronic-health-records-data">Measuring Overuse With Electronic Health Records Data</a></div>
    <div class="article_plus">Thomas Isaac, MD, MBA, MPH; Meredith B. Rosenthal, PhD; Carrie H. Colla, PhD; Nancy E. Morden, MD, MPH; Alexander J. Mainor, JD, MPH; Zhonghe Li, MS; Kevin H. Nguyen, MS; Elizabeth A. Kinsella, BA; and Thomas D. Sequist, MD, MPH</div>
    <div class="fc"></div>
  </div>
  <div class="borderBottom"></div>
  <div class="article_category">FROM THE EDITORS</div>
  <div class="article_text">
    <div class="article_title"><a href="/journals/issue/2018/2018-vol24-n1/the-health-information-technology-special-issue-has-it-become-a-mandatory-part-of-health-and-healthcare">The Health Information Technology Special Issue: Has IT Become a Mandatory Part of Health and Healthcare?</a></div>
    <div class="article_plus">Jacob Reider, MD</div>
    <div class="fc"></div>
  </div>
  <div class="borderBottom"></div>
  <div class="article_category">MANAGERIAL</div>
  <div class="article_text">
    <div class="article_title"><a href="/journals/issue/2018/2018-vol24-n1/bridging-the-digital-divide-mobile-access-to-personal-health-records-among-patients-with-diabetes">Bridging the Digital Divide: Mobile Access to Personal Health Records Among Patients With Diabetes</a></div>
    <div class="article_plus">Ilana Graetz, PhD; Jie Huang, PhD; Richard J. Brand, PhD; John Hsu, MD, MBA, MSCE; Cyrus K. Yamin, MD; and Mary E. Reed, DrPH</div>
    <div class="fc"></div>
  </div>
  <div class="borderBottom"></div>
  <div class="article_category">POLICY</div>
  <div class="article_text">
    <div class="article_title"><a href="/journals/issue/2018/2018-vol24-n1/electronic-health-record-superusers-and-underusers-in-ambulatory-care-practices">Electronic Health Record "Super-Users" and "Under-Users" in Ambulatory Care Practices</a></div>
    <div class="article_plus">Juliet Rumball-Smith, MBChB, PhD; Paul Shekelle, MD, PhD; and Cheryl L. Damberg, PhD</div>
    <div class="fc"></div>
  </div>
  <div class="borderBottom"></div>
  <div class="article_text">
    <div class="article_title"><a href="/journals/issue/2018/2018-vol24-n1/electronic-sharing-of-diagnostic-information-and-patient-outcomes">Electronic Sharing of Diagnostic Information and Patient Outcomes</a></div>
    <div class="article_plus">Darwyyn Deyo, PhD; Amir Khaliq, PhD; David Mitchell, PhD; and Danny R. Hughes, PhD</div>
    <div class="fc"></div>
  </div>
  <div class="borderBottom"></div>
  <div class="article_text">
    <div class="article_title"><a href="/journals/issue/2018/2018-vol24-n1/hospital-participation-in-meaningful-use-and-racial-disparities-in-readmissions">Hospital Participation in Meaningful Use and Racial Disparities in Readmissions</a></div>
    <div class="article_plus">Mark Aaron Unruh, PhD; Hye-Young Jung, PhD; Rainu Kaushal, MD, MPH; and Joshua R. Vest, PhD, MPH</div>
    <div class="fc"></div>
  </div>
  <div class="borderBottom"></div>
  <div class="article_category">WEB EXCLUSIVE</div>
  <div class="article_text">
    <div class="article_title"><a href="/journals/issue/2018/2018-vol24-n1/a-costeffectiveness-analysis-of-cardiology-econsults-for-medicaid-patients">A Cost-Effectiveness Analysis of Cardiology eConsults for Medicaid Patients</a></div>
    <div class="article_plus">Daren Anderson, MD; Victor Villagra, MD; Emil N. Coman, PhD; Ianita Zlateva, MPH; Alex Hutchinson, MBA; Jose Villagra, BS; and J. Nwando Olayiwola, MD, MPH</div>
    <div class="fc"></div>
  </div>
  <div class="borderBottom"></div>
  <div class="article_text">
    <div class="article_title"><a href="/journals/issue/2018/2018-vol24-n1/electronic-health-record-problem-lists-accurate-enough-for-risk-adjustment">Electronic Health Record Problem Lists: Accurate Enough for Risk Adjustment?</a></div>
    <div class="article_plus">Timothy J. Daskivich, MD, MSHPM; Garen Abedi, MD, MS; Sherrie H. Kaplan, PhD, MPH; Douglas Skarecky, BS; Thomas Ahlering, MD; Brennan Spiegel, MD, MSHS; Mark S. Litwin, MD, MPH; and Sheldon Greenfield, MD</div>
    <div class="fc"></div>
  </div>
  <div class="borderBottom"></div>
  <div class="article_text">
    <div class="article_title"><a href="/journals/issue/2018/2018-vol24-n1/racialethnic-variation-in-devices-used-to-access-patient-portals">Racial/Ethnic Variation in Devices Used to Access Patient Portals</a></div>
    <div class="article_plus">Eva Chang, PhD, MPH; Katherine Blondon, MD, PhD; Courtney R. Lyles, PhD; Luesa Jordan, BA; and James D. Ralston, MD, MPH</div>
    <div class="fc"></div>
  </div>
  <div class="borderBottom"></div>
  <div class="article_text">
    <div class="current_article fl">
      <div class="article_title">Currently Reading</div>
      <div class="article_title b">Hospitalized Patients' and Family Members' Preferences for Real-Time, Transparent Access to Their Hospital Records</div>
      <div class="article_plus b">Michael J. Waxman, MD, MPH; Kurt Lozier, MBA; Lana Vasiljevic, MS; Kira Novakofski, PhD; James Desemone, MD; John O'Kane, RRT-NPS, MBA; Elizabeth M. Dufort, MD; David Wood, MBA; Ashar Ata, MBBS, PhD; Louis Filhour, PhD, RN; & Richard J. Blinkhorn
        Jr, MD</div>

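As the snippet shows, there are multiple <div class="article_category"> elements, because the table of contents for the whole issue is listed in the side panel of every article page. I only want the category that applies to this particular article, which means I need the last <div class="article_category"> (in this case WEB EXCLUSIVE) that comes before <div class="article_title b"> (Hospitalized Patients' and Family Members' Preferences for Real-Time, Transparent Access to Their Hospital Records). I'm not sure whether these elements should be treated as siblings.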

2 answers:

Answer 0 (score: 0)

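To retrieve this article's category (WEB EXCLUSIVE) from the sidebar, you can try the following code. It first reads the article's title, then finds the div in the right-hand sidebar that contains that title; the nearest preceding article_category tag is the article's category: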

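import requests
from bs4 import BeautifulSoup

url = 'https://www.ajmc.com/journals/issue/2018/2018-vol24-n1/hospitalized-patients-and-family-members-preferences-for-realtime-transparent-access-to-their-hospital-records'

soup = BeautifulSoup(requests.get(url).text, 'lxml')

# The text of the page's <title> is used to locate the article in the sidebar.
title = soup.title.text
# Find the sidebar title div that contains that text.
# Note: Soup Sieve 2.1+ deprecates :contains() in favour of :-soup-contains().
d = soup.select_one('#rightTabContent div.article_title:contains("{}")'.format(title))
# Walk backwards through the document to the nearest category heading.
print(d.find_previous('div', class_='article_category').text)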

Prints:

WEB EXCLUSIVE
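The same backward search can be tried offline, without a network request. The sketch below is only an illustration: the HTML is a trimmed copy of the sidebar markup from the question (with the links replaced by "#"), it locates the current article through the current_article class seen in that markup rather than through the page title, and it uses the built-in html.parser so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sidebar markup from the question (not the live page).
html = """
<div id="landingDetailPluginDiv" class="p20">
  <div class="article_category">POLICY</div>
  <div class="article_text">
    <div class="article_title"><a href="#">Electronic Sharing of Diagnostic Information and Patient Outcomes</a></div>
  </div>
  <div class="borderBottom"></div>
  <div class="article_category">WEB EXCLUSIVE</div>
  <div class="article_text">
    <div class="article_title"><a href="#">A Cost-Effectiveness Analysis of Cardiology eConsults for Medicaid Patients</a></div>
  </div>
  <div class="borderBottom"></div>
  <div class="article_text">
    <div class="current_article fl">
      <div class="article_title">Currently Reading</div>
      <div class="article_title b">Hospitalized Patients' and Family Members' Preferences for Real-Time, Transparent Access to Their Hospital Records</div>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# The sidebar marks the article being read with the "current_article" class.
current = soup.select_one("div.current_article")

# find_previous() searches backwards in document order, so it is not limited
# to siblings and returns the closest category heading above the entry.
category = current.find_previous("div", class_="article_category")
print(category.get_text(strip=True))
```

With this trimmed markup the script should print WEB EXCLUSIVE, because that heading is the closest article_category div above the current article. Since find_previous() is not restricted to siblings, the question of whether the elements are siblings does not matter for this approach.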

Further reading:

CSS Selector Reference

Answer 1 (score: 0)

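You can use :has and :contains to specify the element to match by its title, and then get the preceding div. The + is the adjacent sibling combinator, so the selector asks for the .article_category element that is immediately followed by the element matched on the article title (.article_text:contains("A Cost-Effectiveness Analysis of Cardiology eConsults for Medicaid Patients")).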


import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://www.ajmc.com/journals/issue/2018/2018-vol24-n1/hospitalized-patients-and-family-members-preferences-for-realtime-transparent-access-to-their-hospital-records')
# Passing the raw bytes lets the parser work out the encoding itself.
soup = bs(r.content, 'lxml')
# Select the category div whose next sibling is the article_text div containing the given title.
category = soup.select_one('.article_category:has(+.article_text:contains("A Cost-Effectiveness Analysis of Cardiology eConsults for Medicaid Patients"))').text
print(category)
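One point worth checking with this selector: + only matches when the article_text div comes directly after the category div, so it finds the first article listed under a heading but not later ones in the same group. The sketch below illustrates this on a trimmed copy of the markup from the question. It makes a few assumptions: the helper name category_for is hypothetical, html.parser is used instead of lxml, and the :-soup-contains() spelling requires Soup Sieve 2.1 or newer (older versions only accept :contains()).

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sidebar markup from the question (not the live page).
html = """
<div id="landingDetailPluginDiv" class="p20">
  <div class="article_category">POLICY</div>
  <div class="article_text">
    <div class="article_title"><a href="#">Electronic Sharing of Diagnostic Information and Patient Outcomes</a></div>
  </div>
  <div class="borderBottom"></div>
  <div class="article_category">WEB EXCLUSIVE</div>
  <div class="article_text">
    <div class="article_title"><a href="#">A Cost-Effectiveness Analysis of Cardiology eConsults for Medicaid Patients</a></div>
  </div>
  <div class="borderBottom"></div>
  <div class="article_text">
    <div class="current_article fl">
      <div class="article_title">Currently Reading</div>
      <div class="article_title b">Hospitalized Patients' and Family Members' Preferences for Real-Time, Transparent Access to Their Hospital Records</div>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Adjacent-sibling form from the answer, with the title left as a placeholder.
adjacent = '.article_category:has(+ .article_text:-soup-contains("{}"))'

# First article under the heading: the category div is its direct neighbour.
first = soup.select_one(adjacent.format("A Cost-Effectiveness Analysis"))

# Later article in the same group: a borderBottom div sits in between,
# so no category div is immediately followed by it.
later = soup.select_one(adjacent.format("Hospitalized Patients"))


def category_for(soup, title_fragment):
    """Hypothetical helper: category heading for the sidebar entry containing title_fragment."""
    entry = soup.select_one(
        '.article_text:-soup-contains("{}")'.format(title_fragment)
    )
    if entry is None:
        return None
    # Search earlier siblings (not just the adjacent one) for a category div.
    heading = entry.find_previous_sibling("div", class_="article_category")
    return heading.get_text(strip=True) if heading else None


print(first.get_text(strip=True))
print(later)
print(category_for(soup, "Hospitalized Patients"))
```

Here the adjacent-sibling selector should find WEB EXCLUSIVE for the eConsults article and nothing for the article currently being read, while the find_previous_sibling() fallback should return WEB EXCLUSIVE for both. This relies on the category and article_text divs being siblings under the same parent, as they are in the snippet from the question.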