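I'm trying to use Beautiful Soup to extract information from an old classifieds page on the web. I mention its age specifically because I imagine HTML standards may have changed since then, which might affect how this should be done. It looks like part of the problem may be that the text isn't enclosed in any tag.
Here is an example of what the page's HTML looks like: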
<h5>REAL ESTATE</h5>
<hr/><b>SANTA FE REALTOR</b> seeks culturally astute clients interested in relocation or second home. Contact Susan: <a href=“EMAIL”>EMAIL</a> or PHONE
<hr/>
<h5>RENTALS</h5>
<hr/><b>NYC. GREENWICH VILLAGE.</b> Bed. Breakfast. Historic building, charming, great location. Short and long stays. PHONE or <a href=“EMAIL”>EMAIL</a>.
<hr/><b>E. 71st & PARK.</b> Quiet, beautiful, light-filled studio apartment. Available Wednesday-Sunday. Long-term. PHONE.
<hr/><b>BERKSHIRES—</b>extraordinary country home on swim pond with beach, 26 acres, 10 min. Tanglewood, large tiled hot tub, 4+BR, 4FPL, writer's cottage, AC, $10K/month, July–August; other months/year-round available. PHONE
<hr/><b>SPECTACULAR VIEW OVER MANHATTAN.</b> Furnished 1-bedroom apartment, quiet and secure, top floor upper East Side high-rise. $2,800 monthly, $800 weekly, minimum 2 weeks. PHONE or PHONE
<hr/><b>DEMOCRATIC CONVENTION—</b>Newly furnished ground floor one-plus bedrooms/one bath apartment on Beacon Hill; all conveniences, sleeps 1–4, easy walk to all central Boston. Photos available. $6K convention week, $9K month of July or best offer. <a href=“EMAIL”> EMAIL </a> or PHONE.
<hr/> <h5>INTERNATIONAL RENTALS</h5>
<hr/><b>SUPERB SABBATICALS</b> and vacation rentals: flats/houses, Paris, French countryside, Riviera, London, Tuscany, more; no exchanges. Two-week minimum. Over twenty years experience. <i> Abroad, Inc., Riverside Drive, New York, NY, tel . <a href=“website">website</a>.</i>
<hr/>
<b>CHARMING HOUSE—TODI, ITALY.</b> 4 bedrooms, fireplaces, garden, breathtaking views, parking. Tel:; fax: ; e-mail:
<a href=“EMAIL”>EMAIL</a>.
<hr/><b>PARIS-MARAIS</b> Musée Picasso. Archives Nationales. Very attractive one bedroom, large living room, den, bathroom, kitchen, all appliances. Nonsmokers. Biweekly/monthly/sabbaticals. PHONE
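What I want to do is extract the text of each listing in the "RENTALS" section as a separate item in a list.
It seems like this could be done with some combination of parsing the siblings of the header.
However, when I run this code: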
soup = BeautifulSoup(contents, 'html')
target = soup.find("h5", text="RENTALS")
listingtext = []
for sib in target.find_next_siblings():
    if sib.name == "h5":
        break
    elif not sib.text:
        pass
    else:
        listingtext.append(sib.text)
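all I get is a list of the bold header text of each listing plus the email addresses, which is all the text that is enclosed in tags. That is, I get: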
["NYC. GREENWICH VILLAGE.","EMAIL",'E. 71st & PARK.', 'BERKSHIRES—','SPECTACULAR VIEW OVER MANHATTAN.','COLD SPRING, NEW YORK.', 'DEMOCRATIC CONVENTION—','EMAIL']
What I actually want is a list that looks like this:
['NYC. GREENWICH VILLAGE. Bed. Breakfast. Historic building, charming, great location. Short and long stays. PHONE or EMAIL','E. 71st & PARK. Quiet, beautiful, light-filled studio apartment. Available Wednesday-Sunday. Long-term. PHONE.' ... ]
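It seems the problem I'm running into is that the text is not enclosed in a tag, and this affects which parts of the page Beautiful Soup hands back when I walk the siblings. It also seems I may need to figure out how to use the <hr/> tag (which the page uses to draw a line between listings) to separate each listing.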
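To illustrate that suspicion, here is a minimal sketch on a made-up fragment (the `html` string and the variable names are hypothetical, not taken from the real page). It suggests that `find_next_siblings()` called without arguments yields only tags, while the `next_siblings` generator also yields the bare text nodes:

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical fragment shaped like one listing from the page.
html = "<h5>RENTALS</h5><hr/><b>NYC.</b> Bed. Breakfast. PHONE<hr/>"
soup = BeautifulSoup(html, "html.parser")
target = soup.find("h5")

# Tag siblings only: the loose text after </b> is skipped.
tags_only = [sib.name for sib in target.find_next_siblings()]

# Every sibling node, including text that is not wrapped in a tag.
all_nodes = [
    sib.name if isinstance(sib, Tag) else str(sib)
    for sib in target.next_siblings
]

print(tags_only)
print(all_nodes)
```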
Answer 0 (score: 1)
You can use this example to parse the information out of the 'RENTALS' section only:
from bs4 import BeautifulSoup, Tag
txt = '''<h5>REAL ESTATE</h5>
<hr/><b>SANTA FE REALTOR</b> seeks culturally astute clients interested in relocation or second home. Contact Susan: <a href=“EMAIL”>EMAIL</a> or PHONE
<hr/>
<h5>RENTALS</h5>
<hr/><b>NYC. GREENWICH VILLAGE.</b> Bed. Breakfast. Historic building, charming, great location. Short and long stays. PHONE or <a href=“EMAIL”>EMAIL</a>.
<hr/><b>E. 71st & PARK.</b> Quiet, beautiful, light-filled studio apartment. Available Wednesday-Sunday. Long-term. PHONE.
<hr/><b>BERKSHIRES—</b>extraordinary country home on swim pond with beach, 26 acres, 10 min. Tanglewood, large tiled hot tub, 4+BR, 4FPL, writer's cottage, AC, $10K/month, July–August; other months/year-round available. PHONE
<hr/><b>SPECTACULAR VIEW OVER MANHATTAN.</b> Furnished 1-bedroom apartment, quiet and secure, top floor upper East Side high-rise. $2,800 monthly, $800 weekly, minimum 2 weeks. PHONE or PHONE
<hr/><b>DEMOCRATIC CONVENTION—</b>Newly furnished ground floor one-plus bedrooms/one bath apartment on Beacon Hill; all conveniences, sleeps 1–4, easy walk to all central Boston. Photos available. $6K convention week, $9K month of July or best offer. <a href=“EMAIL”> EMAIL </a> or PHONE.
<hr/> <h5>INTERNATIONAL RENTALS</h5>
<hr/><b>SUPERB SABBATICALS</b> and vacation rentals: flats/houses, Paris, French countryside, Riviera, London, Tuscany, more; no exchanges. Two-week minimum. Over twenty years experience. <i> Abroad, Inc., Riverside Drive, New York, NY, tel . <a href=“website">website</a>.</i>
<hr/>
<b>CHARMING HOUSE—TODI, ITALY.</b> 4 bedrooms, fireplaces, garden, breathtaking views, parking. Tel:; fax: ; e-mail:
<a href=“EMAIL”>EMAIL</a>.
<hr/><b>PARIS-MARAIS</b> Musée Picasso. Archives Nationales. Very attractive one bedroom, large living room, den, bathroom, kitchen, all appliances. Nonsmokers. Biweekly/monthly/sabbaticals. PHONE'''
soup = BeautifulSoup(txt, 'html.parser')
for hr in soup.select('hr'):
    # Skip every <hr> that does not belong to the RENTALS section.
    if hr.find_previous('h5') is None or hr.find_previous('h5').text != 'RENTALS':
        continue
    out, s = [], hr.next_sibling
    # Collect tags and bare text nodes until the next <hr> or <h5>.
    while s is not None and not (isinstance(s, Tag) and s.name in ('hr', 'h5')):
        if isinstance(s, Tag):
            out.append(s.get_text(strip=True))
        elif s.strip():
            out.append(s.strip())
        s = s.next_sibling
    if out:
        print(' '.join(out))
        print('-' * 80)
Prints:
NYC. GREENWICH VILLAGE. Bed. Breakfast. Historic building, charming, great location. Short and long stays. PHONE or EMAIL .
--------------------------------------------------------------------------------
E. 71st & PARK. Quiet, beautiful, light-filled studio apartment. Available Wednesday-Sunday. Long-term. PHONE.
--------------------------------------------------------------------------------
BERKSHIRES— extraordinary country home on swim pond with beach, 26 acres, 10 min. Tanglewood, large tiled hot tub, 4+BR, 4FPL, writer's cottage, AC, $10K/month, July–August; other months/year-round available. PHONE
--------------------------------------------------------------------------------
SPECTACULAR VIEW OVER MANHATTAN. Furnished 1-bedroom apartment, quiet and secure, top floor upper East Side high-rise. $2,800 monthly, $800 weekly, minimum 2 weeks. PHONE or PHONE
--------------------------------------------------------------------------------
DEMOCRATIC CONVENTION— Newly furnished ground floor one-plus bedrooms/one bath apartment on Beacon Hill; all conveniences, sleeps 1–4, easy walk to all central Boston. Photos available. $6K convention week, $9K month of July or best offer. EMAIL or PHONE.
--------------------------------------------------------------------------------
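If a Python list is preferred over printed output, the same sibling walk can be wrapped in a function. This is only a sketch: `section_listings` and `sample` are hypothetical names, `sample` is a shortened stand-in for the real page, and the code assumes the listings are siblings of the `<h5>` header, as they are in the sample HTML.

```python
from bs4 import BeautifulSoup, Tag


def section_listings(html, section):
    """Return one string per listing under the <h5> whose text equals `section`."""
    soup = BeautifulSoup(html, "html.parser")
    header = soup.find("h5", string=section)
    if header is None:
        return []
    listings, parts = [], []
    for node in header.next_siblings:
        is_tag = isinstance(node, Tag)
        if is_tag and node.name in ("hr", "h5"):
            # A rule or the next header closes the listing collected so far.
            if parts:
                listings.append(" ".join(parts))
                parts = []
            if node.name == "h5":
                break
            continue
        # Tags contribute their text; bare strings are used directly.
        text = node.get_text(strip=True) if is_tag else node.strip()
        if text:
            parts.append(text)
    if parts:
        listings.append(" ".join(parts))
    return listings


# Shortened, hypothetical stand-in for the page.
sample = (
    "<h5>REAL ESTATE</h5><hr/><b>SANTA FE REALTOR</b> seeks clients. PHONE<hr/>"
    "<h5>RENTALS</h5>"
    "<hr/><b>NYC. GREENWICH VILLAGE.</b> Bed. Breakfast. PHONE"
    "<hr/><b>E. 71st &amp; PARK.</b> Quiet studio apartment. PHONE."
    "<hr/> <h5>INTERNATIONAL RENTALS</h5>"
    "<hr/><b>PARIS-MARAIS</b> One bedroom. PHONE"
)
rentals = section_listings(sample, "RENTALS")
print(rentals)
```

Matching the header with `string=section` is an exact comparison, so "INTERNATIONAL RENTALS" is not picked up when "RENTALS" is requested.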
Answer 1 (score: 0)
Provide the URL you are trying to scrape and I will edit this answer to give you the correct output.