How do I get data from a website using BeautifulSoup in Python?

Asked: 2017-11-24 22:48:41

Tags: python-3.x web-scraping beautifulsoup

I am having trouble extracting data from a page. This is part of my code:

for result in results:
    # .text joins every string inside the <p>, so the text after <br> is included
    street = result.find('p', attrs={'class': 'size16'}).text
    records.append(street)
    print(street)

The page's HTML:

    <div class="media-body pt5 pb10">
     <div class="mb15">
        <span class="map-item-city block mb0 colorgreen">City</span>
        <p class="small mb20">&nbsp;</p>
        <p class="size16">street 98<br>phone. 22 721-56-70</p>
     </div>
     <div class="colorblack"><strong>open</strong></div>
     <div class="mb20 size16">Mon.-Fr. 07.30-15.30</div>
     <div class="mb15 ">

Output of my code:

ul. Bema 2phone. (32) 745 72 66-69 Wroclaw None
ul. 1 Maja 22/Vphone. 537-943-969 Olawa <p class="small mb20 colorgreen">Placowka partnerska</p>

I want to split off or remove the text after the <br> tag. I only need the street:

    <p class="size16">street 98<br>phone. 22 721-56-70</p>

Can you help me?

1 answer:

Answer 0 (score: 1)

Use previous_sibling, like this:

from bs4 import BeautifulSoup

html = """
<div class="media-body pt5 pb10">
     <div class="mb15">
        <span class="map-item-city block mb0 colorgreen">Bronisze</span>
        <p class="small mb20">&nbsp;</p>
        <p class="size16">Poznańska 98<br>tel. 22 721-56-70</p>
     </div>
     <div class="colorblack"><strong>Godziny otwarcia</strong></div>
     <div class="mb20 size16">Pn.-Pt. 07.30-15.30</div>
<div class="mb15 ">
"""

result = BeautifulSoup(html, "lxml")

# The text node immediately before <br> is the street
br = result.find('br')
print(br.previous_sibling)

Or, if you want to narrow the search to the specific paragraph:

street = result.find('p', attrs={'class': 'size16'}).find('br').previous_sibling
print(street)

Output (in both cases):

Poznańska 98
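
The question loops over several results, and its output suggests that not every entry has the same markup, so a more defensive variant may be useful. The following is only a sketch under assumptions: the sample HTML, the helper name extract_street, and the choice of div.mb15 as the container are illustrative and not taken from the real page, and it uses the built-in "html.parser" instead of "lxml" so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical sample with three entries, modelled on the markup in the question.
html = """
<div class="mb15">
    <p class="size16">Poznańska 98<br>tel. 22 721-56-70</p>
</div>
<div class="mb15">
    <p class="size16">ul. Bema 2<br>tel. (32) 745 72 66-69</p>
</div>
<div class="mb15">
    <span>no address here</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")


def extract_street(block):
    """Return the text before the first <br> in p.size16, or None if absent."""
    p = block.find("p", attrs={"class": "size16"})
    if p is None:
        return None
    br = p.find("br")
    if br is None:
        # No <br>: fall back to the whole paragraph text.
        return p.get_text(strip=True)
    before = br.previous_sibling
    # Only accept a plain text node; skip the entry if a tag precedes <br>.
    if isinstance(before, NavigableString):
        return str(before).strip()
    return None


records = []
for result in soup.find_all("div", class_="mb15"):
    street = extract_street(result)
    if street is not None:
        records.append(street)

print(records)
```

The None checks matter because find returns None when nothing matches, and calling .find or .text on None raises an AttributeError.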

From the documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

  

.next_sibling and .previous_sibling

You can use .next_sibling and .previous_sibling to navigate between page elements that are on the same level of the parse tree:
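
As a small illustration of sibling navigation (a sketch that reuses the paragraph from the question, not the example from the documentation itself), the <br> tag sits between two text nodes that share the same parent. The variable names and the use of "html.parser" are assumptions made for this example.

```python
from bs4 import BeautifulSoup

# The paragraph from the question: text, then <br>, then more text.
soup = BeautifulSoup(
    '<p class="size16">street 98<br>phone. 22 721-56-70</p>',
    "html.parser",
)

br = soup.find("br")
before = br.previous_sibling  # text node in front of <br>
after = br.next_sibling       # text node following <br>

print(before)
print(after)

# Navigation works in both directions: from the first text node,
# .next_sibling leads back to the <br> tag.
print(before.next_sibling is br)
```

Note that in indented, multi-line HTML the nearest sibling is often a whitespace-only string rather than a tag, so it can be worth checking what a sibling actually is before using it.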