Question

I'm using BeautifulSoup with Python 3.4 on a Windows 7 machine. I have the following content, which I'm trying to parse:

<bound method Tag.find of <div class="accordion">
<p> <span style="color:039; font-size:14px; font-weight:bold">Acetohydroxamic Acid (Lithostat) Tablets</span><br/><br/>



  <strong>Status: Currently in Shortage </strong><br/><br/>



         » <strong>Date first posted</strong>: 

        07/15/2014<br/>



 » <strong>Therapeutic Categories</strong>: Renal<br/>
</p><p style="padding:10px;">
</p>
<h3>

    Mission Pharmacal  (<em>Reverified  01/21/2015</em>)

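I'm trying to extract "07/15/2014", which comes after "Date first posted". I also need to extract "Renal". I can find all the strong tags with .findAll("strong"), but I can't figure out a way to get whatever comes after the </strong> and before the next <br/>.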

Answer 1

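Why not use the regular expression (?<=/strong>:)([^<]+)? The ?<= in the first group makes it a positive lookbehind, which means "require this string immediately before the match, but don't include it in the match". The second group means "match one or more characters other than <". Finally, strip removes any extra whitespace around the captured group.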

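import re
import requests

# url is the address of the page being scraped; it is not given in the question.
s = requests.get(url).text
# Capture the text after each "</strong>:" up to the next "<", then trim whitespace.
matches = [m.strip() for m in re.findall(r'(?<=/strong>:)([^<]+)', s)]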
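To try the pattern without fetching a page, here is a minimal sketch that runs it against a shortened copy of the markup from the question. The variable names html, pattern and values are illustrative and do not come from the original answer.

```python
import re

# Shortened copy of the markup from the question (an inline sample, not a live page).
html = """<p>
  <strong>Status: Currently in Shortage </strong><br/><br/>
         » <strong>Date first posted</strong>: 

        07/15/2014<br/>
 » <strong>Therapeutic Categories</strong>: Renal<br/>
</p>"""

# (?<=/strong>:) requires "/strong>:" immediately before the match without
# including it; ([^<]+) then captures everything up to the next "<".
pattern = re.compile(r'(?<=/strong>:)([^<]+)')

# Trim the surrounding whitespace and newlines from each captured value.
values = [m.strip() for m in pattern.findall(html)]
print(values)  # ['07/15/2014', 'Renal']
```

The "Status" label is skipped because its closing tag is followed by <br/> rather than a colon, so the lookbehind never matches there.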

Answer 2

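You need to use .next_sibling to get the node that follows each strong tag. The isinstance(el, bs4.Tag) check skips siblings that are themselves tags, so only text nodes are kept, and finally re.sub removes the whitespace and the ":".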

In [38]: import re

In [39]: import bs4

In [40]: from bs4 import BeautifulSoup

In [41]: soup = BeautifulSoup("""<bound method Tag.find of <div class="accordion">
   ....: <p> <span style="color:039; font-size:14px; font-weight:bold">Acetohydroxamic Acid (Lithostat) Tablets</span><br/><br/>
   ....: 
   ....: 
   ....: 
   ....:   <strong>Status: Currently in Shortage </strong><br/><br/>
   ....: 
   ....: 
   ....: 
   ....:         » <strong>Date first posted</strong>: 
   ....: 
   ....:                07/15/2014<br/>
   ....: 
   ....:     
   ....: 
   ....:  » <strong>Therapeutic Categories</strong>: Renal<br/>
   ....: </p><p style="padding:10px;">
   ....: </p>
   ....: <h3>
   ....: 
   ....:        Mission Pharmacal  (<em>Reverified  01/21/2015</em>)""")

In [42]: for strong_tag in soup.find_all('strong'):
   ....:     if not isinstance(strong_tag.next_sibling, bs4.Tag):
   ....:         print(re.sub(r'[:\s]+', '', strong_tag.next_sibling))
   ....:         
07/15/2014
Renal
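The same idea can be wrapped in a small helper that pairs each label with its value. This is only a sketch: the function name labelled_values and the sample variable html are hypothetical, and the explicit "html.parser" argument is an assumption about which parser to use. Because r'[:\s]+' deletes every whitespace character, a multi-word value would be glued together, so this version trims only the ends.

```python
from bs4 import BeautifulSoup, NavigableString

# Shortened copy of the markup from the question (inline sample).
html = """<p>
  <strong>Status: Currently in Shortage </strong><br/><br/>
         » <strong>Date first posted</strong>: 

        07/15/2014<br/>
 » <strong>Therapeutic Categories</strong>: Renal<br/>
</p>"""


def labelled_values(markup):
    """Map each <strong> label to the text node that directly follows it.

    Hypothetical helper, not part of BeautifulSoup.
    """
    soup = BeautifulSoup(markup, "html.parser")
    result = {}
    for strong in soup.find_all("strong"):
        sibling = strong.next_sibling
        # Skip labels followed by a tag (such as <br/>) or by nothing at all.
        if isinstance(sibling, NavigableString):
            # Drop the leading colon and the surrounding whitespace only.
            value = sibling.strip().lstrip(":").strip()
            if value:
                result[strong.get_text(strip=True)] = value
    return result


fields = labelled_values(html)
print(fields["Date first posted"])       # 07/15/2014
print(fields["Therapeutic Categories"])  # Renal
```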

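Edit

Is there a way to get that date without using a loop?

Yes, you can pass the text argument (called string since Beautiful Soup 4.4.0) to find.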

re.sub(r'[:\s]+', '', soup.find('strong', text=re.compile('Date')).next_sibling)
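The one-liner raises an error when no matching label exists, because find then returns None. A guarded variant might look like the sketch below. The helper name value_after and the sample variable html are hypothetical, and the sketch assumes Beautiful Soup 4.4.0 or later so that the string argument is available.

```python
import re
from bs4 import BeautifulSoup, NavigableString

# Shortened copy of the markup from the question (inline sample).
html = """<p> » <strong>Date first posted</strong>: 

        07/15/2014<br/>
 » <strong>Therapeutic Categories</strong>: Renal<br/></p>"""

soup = BeautifulSoup(html, "html.parser")


def value_after(soup, label):
    """Return the text after the first <strong> whose text matches label, else None.

    Hypothetical helper, not part of BeautifulSoup.
    """
    tag = soup.find("strong", string=re.compile(label))
    if tag is None:
        return None  # no such label in the document
    sibling = tag.next_sibling
    if not isinstance(sibling, NavigableString):
        return None  # the label is followed by a tag or by nothing
    # Strip the colon and whitespace from both ends of the text node.
    return sibling.strip(": \t\r\n")


print(value_after(soup, "Date"))         # 07/15/2014
print(value_after(soup, "Therapeutic"))  # Renal
print(value_after(soup, "Missing"))      # None
```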

Finding the content after a closing tag in HTML with BeautifulSoup
