Question

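I've written a script in Python to scrape some haphazard content out of b tags and their next_sibling from a webpage. The problem is that my script fails when line breaks come in between. I'm trying to extract the titles and their corresponding descriptions from that page, starting from CHIEF COMPLAINT: Bright red blood per rectum up to just before Keywords:.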

Website address

This is what I've tried so far:

import requests
from bs4 import BeautifulSoup

url = 'https://www.mtsamples.com/site/pages/sample.asp?Type=24-Gastroenterology&Sample=941-BloodperRectum'

res = requests.get(url)
soup = BeautifulSoup(res.text,'lxml')
for item in soup.select_one("hr").find_next_siblings('b'):
    print(item.text,item.next_sibling)

The portion of the output that gives me unwanted results looks like this:

LABS:  <br/>
CBC:  <br/>
CHEM 7:  <br/>

How can I get the titles and their corresponding descriptions?

Answer 1

Here is a scraper that is more robust than yesterday's solutions.

How to loop through scraping multiple documents on multiple web pages using BeautifulSoup?
How can I grab the entire body text from a web page using BeautifulSoup?

It correctly extracts the title, the description, and all sections.

import re
import copy
import requests
from bs4 import BeautifulSoup, Tag, Comment, NavigableString
from urllib.parse import urljoin
from pprint import pprint
import itertools
import concurrent
from concurrent.futures import ThreadPoolExecutor

BASE_URL = 'https://www.mtsamples.com'


def make_soup(url: str) -> BeautifulSoup:
    res = requests.get(url)
    res.raise_for_status()
    html = res.text
    soup = BeautifulSoup(html, 'html.parser')
    return soup

def clean_soup(soup: BeautifulSoup) -> BeautifulSoup:
    soup = copy.copy(soup)
    h1 = soup.select_one('h1')
    kw_re = re.compile(r'.*Keywords.*', flags=re.IGNORECASE)
    # `string=` is the current name of the older `text=` argument
    kw = soup.find('b', string=kw_re)

    for el in (*h1.previous_siblings, *kw.next_siblings):
        el.extract()
    kw.extract()
    for ad in soup.select('[id*="ad"]'):
        ad.extract()
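    # remove every <script> tag (iterating soup.script would only walk
    # the children of the first one, and fails if there is none)
    for script in soup.select('script'):
        script.extract()
    # copy the children into a list so extracting does not disturb the iteration
    for c in list(h1.parent.children):
        if isinstance(c, Comment):
            c.extract()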
    return h1.parent

def extract_meta(soup: BeautifulSoup) -> dict:
    h1 = soup.select_one('h1')
    title = h1.text.strip()

    desc_parts = []
    desc_re = re.compile(r'.*Description.*', flags=re.IGNORECASE)
    desc = soup.find('b', string=desc_re)
    hr = soup.select_one('hr')
    for s in desc.next_siblings:
        if s is hr:
            break
        if isinstance(s, NavigableString):
            desc_parts.append(str(s).strip())
        elif isinstance(s, Tag):
            desc_parts.append(s.text.strip())
    description = '\n'.join(p.strip() for p in desc_parts if p.strip())

    return {
        'title': title,
        'description': description
    }

def extract_sections(soup: BeautifulSoup) -> list:
    titles = [b for b in soup.select('b') if b.text.isupper()]

    parts = []
    for t in titles:
        title = t.text.strip(': ').title()
        text_parts = []
        for s in t.next_siblings:
            # walk forward until we see another title
            if s in titles:
                break
            if isinstance(s, Comment):
                continue
            if isinstance(s, NavigableString):
                text_parts.append(str(s).strip())
            if isinstance(s, Tag):
                text_parts.append(s.text.strip())
        text = '\n'.join(p for p in text_parts if p.strip())
        p = {
            'title': title,
            'text': text
        }
        parts.append(p)
    return parts

def extract_page(url: str) -> dict:
    soup = make_soup(url)
    clean = clean_soup(soup)
    meta = extract_meta(clean)
    sections = extract_sections(clean)
    return {
        **meta,
        'sections': sections
    }


url = 'https://www.mtsamples.com/site/pages/sample.asp?Type=24-Gastroenterology&Sample=941-BloodperRectum'
page = extract_page(url)
pprint(page, width=2000)

Output:

{'description': 'Status post colonoscopy.  After discharge, experienced bloody bowel movements and returned to the emergency department for evaluation.\n(Medical Transcription Sample Report)',
 'sections': [{'text': 'Bright red blood per rectum', 'title': 'Chief Complaint'},
              # some elements removed for brevity
              {'text': '', 'title': 'Labs'},
              {'text': 'WBC count: 6,500 per mL\nHemoglobin: 10.3 g/dL\nHematocrit:31.8%\nPlatelet count: 248 per mL\nMean corpuscular volume: 86.5 fL\nRDW: 18%', 'title': 'Cbc'},
              {'text': 'Sodium: 131 mmol/L\nPotassium: 3.5 mmol/L\nChloride: 98 mmol/L\nBicarbonate: 23 mmol/L\nBUN: 11 mg/dL\nCreatinine: 1.1 mg/dL\nGlucose: 105 mg/dL', 'title': 'Chem 7'},
              {'text': 'PT 15.7 sec\nINR 1.6\nPTT 29.5 sec', 'title': 'Coagulation Studies'},
              {'text': 'The patient receive ... ula.', 'title': 'Hospital Course'}],
 'title': 'Sample Type / Medical Specialty:  Gastroenterology\nSample Name: Blood per Rectum'}
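
To run the scraper over several pages, the single-page function could be mapped over a thread pool. The following is only a sketch under assumptions: `extract_many` is a hypothetical helper that is not part of the answer, and its `extract` argument stands in for a function such as `extract_page` above.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def extract_many(urls: Iterable[str],
                 extract: Callable[[str], dict],
                 max_workers: int = 4) -> Dict[str, dict]:
    """Apply `extract` to every URL using a pool of threads (hypothetical helper)."""
    url_list = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps the input order; an exception raised inside
        # any call is re-raised here when the results are consumed
        results = list(pool.map(extract, url_list))
    return dict(zip(url_list, results))
```

With the functions above it might be called as `extract_many([url], extract_page)`. Threads suit this kind of work because each call spends most of its time waiting on the network.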

Answer 2

Code:

from urllib.request import urlopen
from bs4 import BeautifulSoup
url = 'https://www.mtsamples.com/site/pages/sample.asp?Type=24-Gastroenterology&Sample=941-BloodperRectum'
res = urlopen(url)
html = res.read()
soup = BeautifulSoup(html,'html.parser')

# Select the div that contains the required text (found with right click > Inspect element in the browser)
sampletext_div = soup.find('div', {'id': "sampletext"})
print(sampletext_div.find('h1').text)  # print the header

Output:

Sample Type / Medical Specialty:  Gastroenterology
Sample Name: Blood per Rectum

Code:

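# Find all the <b> tags (find_all is the current name of findAll)
b_all = sampletext_div.find_all('b')
# skip the first four <b> tags, which come before CHIEF COMPLAINT on this page
for b in b_all[4:]:
    print(b.text, b.next_sibling)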

Output:

CHIEF COMPLAINT:  Bright red blood per rectum 
HISTORY OF PRESENT ILLNESS:  This 73-year-old woman had a recent medical history significant for renal and bladder cancer, deep venous thrombosis of the right lower extremity, and anticoagulation therapy complicated by lower gastrointestinal bleeding. Colonoscopy during that admission showed internal hemorrhoids and diverticulosis, but a bleeding site was not identified. Five days after discharge to a nursing home, she again experienced bloody bowel movements and returned to the emergency department for evaluation. 
REVIEW OF SYMPTOMS:  No chest pain, palpitations, abdominal pain or cramping, nausea, vomiting, or lightheadedness. Positive for generalized weakness and diarrhea the day of admission. 
PRIOR MEDICAL HISTORY:  Long-standing hypertension, intermittent atrial fibrillation, and hypercholesterolemia. Renal cell carcinoma and transitional cell bladder cancer status post left nephrectomy, radical cystectomy, and ileal loop diversion 6 weeks prior to presentation, postoperative course complicated by pneumonia, urinary tract infection, and retroperitoneal bleed. Deep venous thrombosis 2 weeks prior to presentation, management complicated by lower gastrointestinal bleeding, status post inferior vena cava filter placement. 
MEDICATIONS:  Diltiazem 30 mg tid, pantoprazole 40 mg qd, epoetin alfa 40,000 units weekly, iron 325 mg bid, cholestyramine. Warfarin discontinued approximately 10 days earlier. 
ALLERGIES:  Celecoxib (rash).
SOCIAL HISTORY:  Resided at nursing home. Denied alcohol, tobacco, and drug use. 
FAMILY HISTORY:  Non-contributory.
PHYSICAL EXAM:  <br/>
LABS:  <br/>
CBC:  <br/>
CHEM 7:  <br/>
COAGULATION STUDIES:  <br/>
HOSPITAL COURSE:  The patient received 1 liter normal saline and diltiazem (a total of 20 mg intravenously and 30 mg orally) in the emergency department. Emergency department personnel made several attempts to place a nasogastric tube for gastric lavage, but were unsuccessful. During her evaluation, the patient was noted to desaturate to 80% on room air, with an increase in her respiratory rate to 34 breaths per minute. She was administered 50% oxygen by nonrebreadier mask, with improvement in her oxygen saturation to 89%. Computed tomographic angiography was negative for pulmonary embolism. 
Keywords:  
    gastroenterology, blood per rectum, bright red, bladder cancer, deep venous thrombosis, colonoscopy, gastrointestinal bleeding, diverticulosis, hospital course, lower gastrointestinal bleeding, nasogastric tube, oxygen saturation, emergency department, rectum, thrombosis, emergency, department, gastrointestinal, blood, bleeding, oxygen, 

 NOTE : These transcribed medical transcription sample reports and examples are provided by various users and
        are for reference purpose only. MTHelpLine does not certify accuracy and quality of sample reports.
        These transcribed medical transcription sample reports may include some uncommon or unusual formats;
        this would be due to the preference of the dictating physician. All names and dates have been
        changed (or removed) to keep confidentiality. Any resemblance of any type of name or date or
        place or anything else to real world is purely incidental.
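
In the output above, headings such as LABS and CBC are still followed by a bare <br/>, because `b.next_sibling` returns only the first node after the heading and the actual text comes after the line break. One way around this, sketched here under assumptions, is to walk all following siblings and stop at the next <b> tag. The HTML fragment below is invented to imitate the page layout (it is not the real page source), and `section_text` and `sections` are hypothetical helper names.

```python
from bs4 import BeautifulSoup, Tag

# Invented fragment that imitates the page layout; not the real page source.
HTML = (
    '<div id="sampletext">'
    '<b>CHIEF COMPLAINT:</b> Bright red blood per rectum<br/>'
    '<b>LABS:</b><br/>'
    '<b>CBC:</b><br/>WBC count: 6,500 per mL<br/>Hemoglobin: 10.3 g/dL<br/>'
    '<b>Keywords:</b> gastroenterology, blood per rectum'
    '</div>'
)


def section_text(heading: Tag) -> str:
    """Collect the text after a <b> heading, up to the next <b> tag."""
    parts = []
    for sib in heading.next_siblings:
        if isinstance(sib, Tag) and sib.name == 'b':
            break  # the next heading starts here
        # <br/> tags yield an empty string and are dropped below
        text = sib.get_text() if isinstance(sib, Tag) else str(sib)
        if text.strip():
            parts.append(text.strip())
    return '\n'.join(parts)


def sections(html: str) -> dict:
    """Map each <b> heading (without its trailing colon) to its text."""
    soup = BeautifulSoup(html, 'html.parser')
    return {b.get_text().strip(': '): section_text(b)
            for b in soup.find_all('b')}


result = sections(HTML)
for title, text in result.items():
    print(title, '->', repr(text))
```

A heading with no text of its own, like LABS here, ends up with an empty string instead of <br/>. On the real page the same loop could be applied to the <b> tags inside `sampletext_div`.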

The script produces wrong results when line breaks come into play
