BeautifulSoup - avoid findAll treating an element that contains <br/> as separate elements

Date: 2019-03-14 13:46:00

Tags: python html beautifulsoup

I have the following HTML snippet:

<div id="targetdown" class="content">
    <div class="alertbox">
        <div class="ym-wrapper">
            <div class="ym-wbox">

            </div>
        </div>
    </div>
    <div class="ym-wrapper">
        <div class="ym-wbox">
            <p style="text-align: center;">EXCEL Physical Therapy has been keeping our patients moving forward<br />
for nearly 30 years. In the process, we have built an unparalleled<br />
reputation&nbsp;by combining the highest quality of physical therapy<br />
with exceptional&nbsp;customer service to provide a genuinely<br />
&ldquo;patient first&rdquo; approach.&nbsp;It is this philosophy&nbsp;that has established<br />
EXCEL&nbsp;as&nbsp;a premier physical therapy provider in Northern New Jersey.</p>
        </div>
    </div>
</div>
<section class="parallaxone parallax">
    <div class="ym-wrapper">
        <div class="ym-wbox">
            <h2>Helping you navigate the road to recovery</h2>


        </div>
    </div>
</section>

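I want to get the text of each element that is present, without a line break (<br />) causing the text after it to be treated as a new element.

I am doing the following: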

In [19]: from bs4 import BeautifulSoup
    ...: html = '<div id="targetdown" class="content"><div class="alertbox"><div class="ym-wrapper"><div class="ym-wbox"></div></div></div><div class="ym-wrapper"><div class="ym-wbox"><p style="text-align: center;">EXCEL Physical Therapy has been keeping our patients moving forward<br />for nearly 30 years. In the process, we have built an unparalleled<br /> reputation&nbsp;by combining the highest quality of physical therapy<br /> with exceptional&nbsp;customer service to provide a genuinely<br /> &ldquo;patient first&rdquo; approach.&nbsp;It is this philosophy&nbsp;that has established<br /> EXCEL&nbsp;as&nbsp;a premier physical therapy provider in Northern New Jersey.</p></div></div></div><section class="parallaxone parallax"><div class="ym-wrapper"><div class="ym-wbox"><h2>Helping you navigate the road to recovery</h2> </div></div></section>'
    ...: soup = BeautifulSoup(html)
    ...: texts = soup.findAll(text=True)

The result is:

In [20]: texts
Out[20]:
['EXCEL Physical Therapy has been keeping our patients moving forward',
 'for nearly 30 years. In the process, we have built an unparalleled',
 ' reputation\xa0by combining the highest quality of physical therapy',
 ' with exceptional\xa0customer service to provide a genuinely',
 ' “patient first” approach.\xa0It is this philosophy\xa0that has established',
 ' EXCEL\xa0as\xa0a premier physical therapy provider in Northern New Jersey.',
 'Helping you navigate the road to recovery',
 ' ']

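How can I avoid splitting at the line breaks, so that the text

  EXCEL Physical Therapy has been keeping our patients moving forward for nearly 30 years. In the process, we have built an unparalleled reputation by combining the highest quality of physical therapy with exceptional customer service to provide a genuinely “patient first” approach. It is this philosophy that has established EXCEL as a premier physical therapy provider in Northern New Jersey.

is returned as a single element in the list?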

1 Answer:

Answer 0: (score: 1)

You can do this:

soup.find_all("div", class_="ym-wbox")[1].find("p").text
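If the goal is a list with one string per element, rather than a single string from one hard-coded index, a possible generalization is sketched below. It is not part of the original answer: the helper name `block_texts`, the choice of `p` and `h2` as the tags to collect, and the shortened sample HTML are all assumptions made for illustration. It relies on `get_text(separator=" ")` to join the text nodes that `<br />` splits apart, and on `str.split()` to collapse whitespace, including the `\xa0` characters produced by `&nbsp;`.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the HTML in the question (assumption: same structure).
html = (
    '<div class="ym-wbox"><p>EXCEL Physical Therapy has been keeping our '
    'patients moving forward<br />for nearly 30 years. In the process,<br /> '
    'we have built an unparalleled&nbsp;reputation.</p></div>'
    '<section><h2>Helping you navigate the road to recovery</h2> </section>'
)

soup = BeautifulSoup(html, "html.parser")


def block_texts(soup, tags=("p", "h2")):
    """Hypothetical helper: one whitespace-normalized string per matching element."""
    texts = []
    for element in soup.find_all(list(tags)):
        # get_text joins every text node inside the element; the separator
        # puts a space at each point where a <br /> split the text.
        raw = element.get_text(separator=" ")
        # str.split() with no argument also splits on \xa0 (from &nbsp;),
        # so re-joining the pieces collapses all runs of whitespace.
        texts.append(" ".join(raw.split()))
    return texts


texts = block_texts(soup)
for text in texts:
    print(text)
```

Selecting elements and calling `get_text` on each one keeps the paragraph together, whereas `findAll(text=True)` returns every individual text node, which is why each `<br />` produced a separate list entry in the question.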