Extracting text skipped by Beautiful Soup using find_next_siblings / text not enclosed in tags

Time: 2020-09-12 00:21:13

Tags: python beautifulsoup screen-scraping

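I am trying to use Beautiful Soup to extract information from an old classifieds page on the web. I mention its age specifically because I can imagine that HTML standards may have changed in ways that affect how this should be done. Part of the problem seems to be that the text is not enclosed in any tag.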

Here is an example of what the page's HTML looks like:

<h5>REAL ESTATE</h5>
<hr/><b>SANTA FE REALTOR</b> seeks culturally astute clients interested in relocation or second home. Contact Susan: <a href=“EMAIL”>EMAIL</a> or PHONE

<hr/>
<h5>RENTALS</h5>
<hr/><b>NYC. GREENWICH VILLAGE.</b> Bed. Breakfast. Historic building, charming, great location. Short and long stays. PHONE or <a href=“EMAIL”>EMAIL</a>.
<hr/><b>E. 71st &amp; PARK.</b> Quiet, beautiful, light-filled studio apartment. Available Wednesday-Sunday. Long-term. PHONE.
<hr/><b>BERKSHIRES—</b>extraordinary country home on swim pond with beach, 26 acres, 10 min. Tanglewood, large tiled hot tub, 4+BR, 4FPL, writer's cottage, AC, $10K/month, July–August; other months/year-round available. PHONE
<hr/><b>SPECTACULAR VIEW OVER MANHATTAN.</b> Furnished 1-bedroom apartment, quiet and secure, top floor upper East Side high-rise. $2,800 monthly, $800 weekly, minimum 2 weeks. PHONE or PHONE
<hr/><b>DEMOCRATIC CONVENTION—</b>Newly furnished ground floor one-plus bedrooms/one bath apartment on Beacon Hill; all conveniences, sleeps 1–4, easy walk to all central Boston. Photos available. $6K convention week, $9K month of July or best offer. <a href=“EMAIL”> EMAIL </a> or PHONE.

<hr/> <h5>INTERNATIONAL RENTALS</h5>
<hr/><b>SUPERB SABBATICALS</b> and vacation rentals: flats/houses, Paris, French countryside, Riviera, London, Tuscany, more; no exchanges. Two-week minimum. Over twenty years experience. <i> Abroad, Inc., Riverside Drive, New York, NY, tel . <a href=“website">website</a>.</i>
<hr/>
<b>CHARMING HOUSE—TODI, ITALY.</b> 4 bedrooms, fireplaces, garden, breathtaking views, parking. Tel:; fax: ; e-mail:
<a href=“EMAIL”>EMAIL</a>.
<hr/><b>PARIS-MARAIS</b> Musée Picasso. Archives Nationales. Very attractive one bedroom, large living room, den, bathroom, kitchen, all appliances. Nonsmokers. Biweekly/monthly/sabbaticals. PHONE

What I want to do is extract the text of each listing in the "RENTALS" section as a separate item in a list.

It seems this should be possible by parsing the sibling elements of the header in some combination.

However, when I run this code:

soup = BeautifulSoup(contents, 'html')
target=soup.find("h5",text="RENTALS")
listingtext=[]
for sib in target.find_next_siblings():
    if sib.name=="h5":
        break
    elif not sib.text:
        pass
    else:
        listingtext.append(sib.text)

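What I get is a list of the bold header text of each listing plus the email addresses, i.e. only the text that is enclosed in tags. That is, I get: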

["NYC. GREENWICH VILLAGE.","EMAIL",'E. 71st & PARK.', 'BERKSHIRES—','SPECTACULAR VIEW OVER MANHATTAN.','COLD SPRING, NEW YORK.', 'DEMOCRATIC CONVENTION—','EMAIL']

What I actually want is a list that looks like this:

['NYC. GREENWICH VILLAGE. Bed. Breakfast. Historic building, charming, great location. Short and long stays. PHONE or EMAIL','E. 71st & PARK. Quiet, beautiful, light-filled studio apartment. Available Wednesday-Sunday. Long-term. PHONE.' ... ]

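The problem seems to be that the text is not enclosed in a tag, which affects how BeautifulSoup handles it. I may also need to figure out how to use the <hr/> tag (used on the page to draw lines between listings) to separate the listings from one another.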
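A minimal sketch of why the bare text goes missing, assuming a shortened, made-up fragment of the page (the `html` string below is illustrative, not the real page). Called with no arguments, `find_next_siblings()` yields only `Tag` objects, whereas the `.next_siblings` generator also yields the loose text nodes as `NavigableString` objects:

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Illustrative fragment (an assumption, not the real page): text sits between tags.
html = "<h5>RENTALS</h5><hr/><b>NYC.</b> Bed. Breakfast. <a href='#'>EMAIL</a>.<hr/>"
soup = BeautifulSoup(html, "html.parser")
target = soup.find("h5", string="RENTALS")

# With no filter, find_next_siblings() returns Tag objects only.
tags_only = [sib.name for sib in target.find_next_siblings()]

# .next_siblings walks every following node, including untagged text.
text_nodes = [
    str(sib) for sib in target.next_siblings if isinstance(sib, NavigableString)
]

print(tags_only)   # names of the tag siblings
print(text_nodes)  # the text that the tag-only loop never sees
```

So the text is parsed; it is just never visited by a loop over `find_next_siblings()`.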

2 Answers:

Answer 0: (score: 1)

You can use this example to parse the information from the 'RENTALS' section only:

from bs4 import BeautifulSoup, Tag


txt = '''<h5>REAL ESTATE</h5>
<hr/><b>SANTA FE REALTOR</b> seeks culturally astute clients interested in relocation or second home. Contact Susan: <a href=“EMAIL”>EMAIL</a> or PHONE

<hr/>
<h5>RENTALS</h5>
<hr/><b>NYC. GREENWICH VILLAGE.</b> Bed. Breakfast. Historic building, charming, great location. Short and long stays. PHONE or <a href=“EMAIL”>EMAIL</a>.
<hr/><b>E. 71st &amp; PARK.</b> Quiet, beautiful, light-filled studio apartment. Available Wednesday-Sunday. Long-term. PHONE.
<hr/><b>BERKSHIRES—</b>extraordinary country home on swim pond with beach, 26 acres, 10 min. Tanglewood, large tiled hot tub, 4+BR, 4FPL, writer's cottage, AC, $10K/month, July–August; other months/year-round available. PHONE
<hr/><b>SPECTACULAR VIEW OVER MANHATTAN.</b> Furnished 1-bedroom apartment, quiet and secure, top floor upper East Side high-rise. $2,800 monthly, $800 weekly, minimum 2 weeks. PHONE or PHONE
<hr/><b>DEMOCRATIC CONVENTION—</b>Newly furnished ground floor one-plus bedrooms/one bath apartment on Beacon Hill; all conveniences, sleeps 1–4, easy walk to all central Boston. Photos available. $6K convention week, $9K month of July or best offer. <a href=“EMAIL”> EMAIL </a> or PHONE.

<hr/> <h5>INTERNATIONAL RENTALS</h5>
<hr/><b>SUPERB SABBATICALS</b> and vacation rentals: flats/houses, Paris, French countryside, Riviera, London, Tuscany, more; no exchanges. Two-week minimum. Over twenty years experience. <i> Abroad, Inc., Riverside Drive, New York, NY, tel . <a href=“website">website</a>.</i>
<hr/>
<b>CHARMING HOUSE—TODI, ITALY.</b> 4 bedrooms, fireplaces, garden, breathtaking views, parking. Tel:; fax: ; e-mail:
<a href=“EMAIL”>EMAIL</a>.
<hr/><b>PARIS-MARAIS</b> Musée Picasso. Archives Nationales. Very attractive one bedroom, large living room, den, bathroom, kitchen, all appliances. Nonsmokers. Biweekly/monthly/sabbaticals. PHONE'''

soup = BeautifulSoup(txt, 'html.parser')

for hr in soup.select('hr'):
    if hr.find_previous('h5') is None or hr.find_previous('h5').text != 'RENTALS':
        continue

    out, s = [], hr.next_sibling
    while s is not None and not (isinstance(s, Tag) and s.name in ('hr', 'h5')):
        if isinstance(s, Tag):
            out.append(s.get_text(strip=True))
        elif s.strip():
            out.append(s.strip())
        s = s.next_sibling

    if out:
        print(' '.join(out))
        print('-' * 80)

Prints:

NYC. GREENWICH VILLAGE. Bed. Breakfast. Historic building, charming, great location. Short and long stays. PHONE or EMAIL .
--------------------------------------------------------------------------------
E. 71st & PARK. Quiet, beautiful, light-filled studio apartment. Available Wednesday-Sunday. Long-term. PHONE.
--------------------------------------------------------------------------------
BERKSHIRES— extraordinary country home on swim pond with beach, 26 acres, 10 min. Tanglewood, large tiled hot tub, 4+BR, 4FPL, writer's cottage, AC, $10K/month, July–August; other months/year-round available. PHONE
--------------------------------------------------------------------------------
SPECTACULAR VIEW OVER MANHATTAN. Furnished 1-bedroom apartment, quiet and secure, top floor upper East Side high-rise. $2,800 monthly, $800 weekly, minimum 2 weeks. PHONE or PHONE
--------------------------------------------------------------------------------
DEMOCRATIC CONVENTION— Newly furnished ground floor one-plus bedrooms/one bath apartment on Beacon Hill; all conveniences, sleeps 1–4, easy walk to all central Boston. Photos available. $6K convention week, $9K month of July or best offer. EMAIL or PHONE.
--------------------------------------------------------------------------------
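The question asks for a list rather than printed output. A possible variation on the same idea is sketched below, assuming the listings are direct siblings of the `<h5>` header and are separated by `<hr/>` tags as in the sample. The function name `section_listings` and the shortened `sample` string are hypothetical, introduced only for illustration:

```python
from bs4 import BeautifulSoup, Tag


def section_listings(html, heading):
    """Return the text of each <hr/>-separated listing under the given <h5>."""
    soup = BeautifulSoup(html, "html.parser")
    h5 = soup.find("h5", string=heading)
    if h5 is None:
        return []

    listings, parts = [], []
    for node in h5.next_siblings:
        is_tag = isinstance(node, Tag)
        if is_tag and node.name == "h5":
            break  # the next section starts here
        if is_tag and node.name == "hr":
            # An <hr/> closes the listing collected so far, if any.
            if parts:
                listings.append(" ".join(parts))
                parts = []
            continue
        # Tags contribute their inner text; bare strings are used directly.
        text = node.get_text(" ", strip=True) if is_tag else node.strip()
        if text:
            parts.append(text)
    if parts:  # listing not followed by a closing <hr/>
        listings.append(" ".join(parts))
    return listings


# Shortened, made-up sample in the same shape as the page in the question.
sample = (
    "<h5>REAL ESTATE</h5><hr/><b>SANTA FE REALTOR</b> seeks clients. PHONE<hr/>"
    "<h5>RENTALS</h5>"
    "<hr/><b>NYC. GREENWICH VILLAGE.</b> Bed. Breakfast. PHONE or <a href='#'>EMAIL</a>."
    "<hr/><b>E. 71st &amp; PARK.</b> Quiet studio. PHONE."
    "<hr/> <h5>INTERNATIONAL RENTALS</h5><hr/><b>PARIS-MARAIS</b> One bedroom. PHONE"
)
rentals = section_listings(sample, "RENTALS")
print(rentals)
```

As in the printed output above, joining the pieces with a space leaves a space before punctuation that directly follows a tag (for example `EMAIL .`); a small post-processing step could tidy that if it matters.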

Answer 1: (score: 0)

Provide the URL you are trying to scrape, and I will edit this answer to show you the correct way to get the output.