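I've written a script in Python to parse an address out of some HTML elements. When I run the script, I get the title, address and phone number from the element, whereas my intention is to get only the address. If I use next_sibling, I only get the first part of the address, because the parts are separated by br tags, which is why I skipped that approach.
How can I get only the address from the snippet below, and nothing else?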
from bs4 import BeautifulSoup
htmldoc = """
<div class="search-article-title-description">
<div class="search-article-title">
<a href="https://www.pga.com/pgapro/info/999918438?atrack=pgapro%3Anone&seapos=result%3A1%3AJeff%20S%20Swangim%2C%20PGA&page=1">Jeff S Swangim, PGA</a>
<div class="search-article-protitle">
Assistant Professional
</div>
</div>
<div class="search-article-address">
<div class="search-instructor-course">
Lake Toxaway Country Club
</div>
4366 W Club Blvd<br>Lake Toxaway, NC 28747-8538<br>
<div class="spotlightphone_num">
(828) 966-4661
</div>
</div>
</div>
"""
soup = BeautifulSoup(htmldoc,"lxml")
address = soup.select_one(".search-article-address").get_text(strip=True)
print(address)
What I'm getting now:
Lake Toxaway Country Club4366 W Club BlvdLake Toxaway, NC 28747-8538(828) 966-4661
My expected output:
4366 W Club BlvdLake Toxaway, NC 28747-8538
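Answer 0 (score: 2)
The simplest way I can think of is to use .extract() to kick out the parts you're not interested in. If we drop the elements with the classes search-instructor-course and spotlightphone_num, whatever remains is the part you want.
The following script should give us the address: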
from bs4 import BeautifulSoup
htmldoc = """
<div class="search-article-title-description">
<div class="search-article-title">
<a href="https://www.pga.com/pgapro/info/999918438?atrack=pgapro%3Anone&seapos=result%3A1%3AJeff%20S%20Swangim%2C%20PGA&page=1">Jeff S Swangim, PGA</a>
<div class="search-article-protitle">
Assistant Professional
</div>
</div>
<div class="search-article-address">
<div class="search-instructor-course">
Lake Toxaway Country Club
</div>
4366 W Club Blvd<br>Lake Toxaway, NC 28747-8538<br>
<div class="spotlightphone_num">
(828) 966-4661
</div>
</div>
</div>
"""
soup = BeautifulSoup(htmldoc,"lxml")
# Remove the course-name and phone elements from the tree so only the address text remains
for item in soup.find_all(class_=["search-instructor-course","spotlightphone_num"]):
    item.extract()
address = soup.select_one(".search-article-address").get_text(strip=True)
print(address)
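Note that .extract() modifies the parsed tree, so the course name and phone number are gone from soup afterwards. If those are still needed later, a non-destructive variant is to read only the text nodes that are direct children of the address container. The following is a minimal sketch under that assumption; the helper name direct_text is hypothetical, and html.parser is used only to keep the example dependency-free.

```python
from bs4 import BeautifulSoup

html_doc = """
<div class="search-article-address">
<div class="search-instructor-course">
Lake Toxaway Country Club
</div>
4366 W Club Blvd<br>Lake Toxaway, NC 28747-8538<br>
<div class="spotlightphone_num">
(828) 966-4661
</div>
</div>
"""


def direct_text(tag, sep=" "):
    """Join the non-empty text nodes that are direct children of `tag`.

    Hypothetical helper: text inside nested elements (course name, phone)
    is skipped because recursive=False stops at the first level.
    """
    parts = (s.strip() for s in tag.find_all(string=True, recursive=False))
    return sep.join(p for p in parts if p)


soup = BeautifulSoup(html_doc, "html.parser")
container = soup.select_one(".search-article-address")
address = direct_text(container)
print(address)
```

Because nothing is removed, the phone number can still be read afterwards from the same soup with select_one(".spotlightphone_num").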
Answer 1 (score: 1)
You can use an XPath expression with lxml here. You can still pass the HTML content to it as a string.
from lxml import html
h = '''
<div class="search-article-title-description">
<div class="search-article-title">
<a href="https://www.pga.com/pgapro/info/999918438?atrack=pgapro%3Anone&seapos=result%3A1%3AJeff%20S%20Swangim%2C%20PGA&page=1">Jeff S Swangim, PGA</a>
<div class="search-article-protitle">
Assistant Professional
</div>
</div>
<div class="search-article-address">
<div class="search-instructor-course">
Lake Toxaway Country Club
</div>
4366 W Club Blvd<br>Lake Toxaway, NC 28747-8538<br>
<div class="spotlightphone_num">
(828) 966-4661
</div>
</div>
</div>
'''
tree = html.fromstring(h)
links = [link.strip() for link in tree.xpath("//div[@class='search-article-address']/br/preceding-sibling::text()[1]")]
print(' '.join(links))
Output:
4366 W Club Blvd Lake Toxaway, NC 28747-8538
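Or more simply, thanks to @SIM, select the container's direct text nodes and drop the whitespace-only ones:
print(' '.join(t.strip() for t in tree.xpath("//div[@class='search-article-address']/text()") if t.strip()))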
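A real results page will probably contain several such blocks, and a single document-wide XPath would then merge all addresses into one string. The sketch below evaluates the text() selection relative to each container instead. The function name extract_addresses is hypothetical, and the second record in the sample markup is an invented placeholder added only to show the per-block behaviour.

```python
from lxml import html

# The second block is made-up placeholder data, not part of the original page.
page = """
<div>
<div class="search-article-address">
<div class="search-instructor-course">Lake Toxaway Country Club</div>
4366 W Club Blvd<br>Lake Toxaway, NC 28747-8538<br>
<div class="spotlightphone_num">(828) 966-4661</div>
</div>
<div class="search-article-address">
<div class="search-instructor-course">Example Golf Club</div>
1 Example Rd<br>Sample City, NC 00000<br>
<div class="spotlightphone_num">(000) 000-0000</div>
</div>
</div>
"""


def extract_addresses(tree):
    """Return one address string per search-article-address container."""
    addresses = []
    for block in tree.xpath("//div[@class='search-article-address']"):
        # "./text()" selects only the block's own text nodes, not nested ones.
        parts = [t.strip() for t in block.xpath("./text()")]
        addresses.append(" ".join(p for p in parts if p))
    return addresses


tree = html.fromstring(page)
addresses = extract_addresses(tree)
for line in addresses:
    print(line)
```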
Answer 2 (score: 0)
There may be a more elegant way, but you were on the right track with .next_sibling:
from bs4 import BeautifulSoup
htmldoc = """
<div class="search-article-title-description">
<div class="search-article-title">
<a href="https://www.pga.com/pgapro/info/999918438?atrack=pgapro%3Anone&seapos=result%3A1%3AJeff%20S%20Swangim%2C%20PGA&page=1">Jeff S Swangim, PGA</a>
<div class="search-article-protitle">
Assistant Professional
</div>
</div>
<div class="search-article-address">
<div class="search-instructor-course">
Lake Toxaway Country Club
</div>
4366 W Club Blvd<br>Lake Toxaway, NC 28747-8538<br>
<div class="spotlightphone_num">
(828) 966-4661
</div>
</div>
</div>
"""
soup = BeautifulSoup(htmldoc,"html.parser")
# Look up the course-name div once, then walk its siblings: text, <br>, text
course = soup.find('div', {'class':'search-instructor-course'})
addr = course.next_sibling.strip()
state_zip = course.next_sibling.next_sibling.next_sibling.strip()
print(' '.join([addr, state_zip]))
Output:
4366 W Club Blvd Lake Toxaway, NC 28747-8538
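Chaining .next_sibling three times assumes the address always has exactly two lines. A slightly more general sketch is to iterate over .next_siblings, collecting text nodes and skipping br tags until some other element (here the phone div) is reached. The helper name address_after is hypothetical, and the stopping rule is an assumption about how these pages are laid out.

```python
from bs4 import BeautifulSoup, NavigableString

html_doc = """
<div class="search-article-address">
<div class="search-instructor-course">
Lake Toxaway Country Club
</div>
4366 W Club Blvd<br>Lake Toxaway, NC 28747-8538<br>
<div class="spotlightphone_num">
(828) 966-4661
</div>
</div>
"""


def address_after(course_div):
    """Collect the text that follows `course_div` up to the next non-<br> tag."""
    parts = []
    for node in course_div.next_siblings:
        if isinstance(node, NavigableString):
            text = node.strip()
            if text:  # ignore whitespace-only nodes between tags
                parts.append(text)
        elif node.name != "br":
            break  # reached another element, e.g. the phone number div
    return " ".join(parts)


soup = BeautifulSoup(html_doc, "html.parser")
course = soup.find("div", {"class": "search-instructor-course"})
address = address_after(course)
print(address)
```

This keeps working if an address has one line or three, as long as the lines are separated by br tags and followed by another element.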