BeautifulSoup4 and HTML

Asked: 2018-07-18 19:26:03

Tags: html web-scraping beautifulsoup

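I want to use Python and bs4 to extract the following information from the HTML below: the place name in the h2 with class "placename", the text of the span with class "boldelement", and the text of the div with class="aithousaspec".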

<div class="results-list">  
    <div class="piatsaname">city center</div>     
        <table>
           <tr class="trspacer-up">
              <td>
                 <a href="hall.aspx?id=1001173">
                    <h2 class="placename">ARENA                         
                       <span class="boldelement"><img src="/images/sun.png" height="16" valign="bottom" style="padding:0px 3px 0px 10px" >Θερινός<br>
                                         25 Richmond Avenue st, Leeds</span>
                    </h2>
                 <p>
                    +4497XXXXXXX<br>
                    STEREO SOUND
                 </p>
                 Every Monday 2 tickets 8,00 pounds

               </a>
             </td>
           </tr>
           <tr class="trspacer-down">
             <td>        
               <p class="coloredelement"><a href="movie.aspx?id=10061364" target="_self">Italian Job</a></p>

                  <div class="aithousaspec">
                    <b></b> Thu.-Wed.: 20.50/ 23.00
                    <a href="https://www.something.co.uk/" target="_blank" title="Whatever you like"></a>
                      <b></b>
                  </div>

The code I am using is not very efficient:

from bs4 import BeautifulSoup

# parse the html using beautiful soup and store in variable `soup`
# (`page` holds the downloaded HTML)
soup = BeautifulSoup(page, 'html.parser')
print(soup.prettify())

mydivs = soup.select('div.results-list')
for info in mydivs:
    times = info.select('div.aithousaspec')
    print(times)
    # the attribute value contains a dot, so it has to be quoted
    listCinemas = info.select('a[href*="hall.aspx"]')
    print(listCinemas)
print(len(listCinemas))
for proj in times:
    # each item is already a div.aithousaspec element, so read its text directly
    print(proj.get_text(strip=True))
for names in listCinemas:
    theater = names.find('h2', class_='placename')
    print(theater.find(string=True).strip())   # place name
    print(theater.contents[1].text.strip())    # span text

Is there a better way to get the information mentioned above?

1 Answer:

Answer 0 (score: 0)

data = '''<div class="results-list">
    <div class="piatsaname">city center</div>
        <table>
           <tr class="trspacer-up">
              <td>
                 <a href="hall.aspx?id=1001173">
                    <h2 class="placename">ARENA
                       <span class="boldelement"><img src="/images/sun.png" height="16" valign="bottom" style="padding:0px 3px 0px 10px" >Θερινός<br>
                                         25 Richmond Avenue st, Leeds</span>
                    </h2>
                 <p>
                    +4497XXXXXXX<br>
                    STEREO SOUND
                 </p>
                 Every Monday 2 tickets 8,00 pounds

               </a>
             </td>
           </tr>
           <tr class="trspacer-down">
             <td>
               <p class="coloredelement"><a href="movie.aspx?id=10061364" target="_self">Italian Job</a></p>

                  <div class="aithousaspec">
                    <b></b> Thu.-Wed.: 20.50/ 23.00
                    <a href="https://www.something.co.uk/" target="_blank" title="Whatever you like"></a>
                      <b></b>
                  </div>'''

from bs4 import BeautifulSoup
import re

soup = BeautifulSoup(data, 'lxml')
# place name: the first text node inside the h2
print(soup.select('h2.placename')[0].contents[0].strip())
# span text, with the line break and indentation collapsed to one space
print(re.sub(r'\s{2,}', ' ', soup.select('span.boldelement')[0].text.strip()))
# showtimes
print(soup.select('div.aithousaspec')[0].text.strip())

This prints:

ARENA
Θερινός 25 Richmond Avenue st, Leeds
Thu.-Wed.: 20.50/ 23.00
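
If the real page lists more than one cinema, the same selectors can probably be applied in a loop instead of indexing with [0]. The following is only a sketch under assumptions: the two-cinema sample markup, the "ODEON" entry, and the helper names squeeze and extract_cinemas are made up for illustration, and it assumes that each h2.placename is followed by its own div.aithousaspec, as in the question's snippet. It uses 'html.parser' so that lxml is not required.

```python
from bs4 import BeautifulSoup
import re

# Hypothetical sample with two cinemas, modelled on the markup in the question.
html = '''
<div class="results-list">
  <table>
    <tr class="trspacer-up"><td>
      <a href="hall.aspx?id=1">
        <h2 class="placename">ARENA
          <span class="boldelement">Θερινός<br>
              25 Richmond Avenue st, Leeds</span>
        </h2>
      </a>
    </td></tr>
    <tr class="trspacer-down"><td>
      <div class="aithousaspec"><b></b> Thu.-Wed.: 20.50/ 23.00 <b></b></div>
    </td></tr>
    <tr class="trspacer-up"><td>
      <a href="hall.aspx?id=2">
        <h2 class="placename">ODEON
          <span class="boldelement">1 Example Road, Leeds</span>
        </h2>
      </a>
    </td></tr>
    <tr class="trspacer-down"><td>
      <div class="aithousaspec"><b></b> Fri.-Sun.: 18.00 <b></b></div>
    </td></tr>
  </table>
</div>
'''


def squeeze(text):
    """Collapse every run of whitespace into a single space."""
    return re.sub(r'\s+', ' ', text).strip()


def extract_cinemas(markup):
    """Return one dict per h2.placename found inside div.results-list."""
    soup = BeautifulSoup(markup, 'html.parser')
    cinemas = []
    for h2 in soup.select('div.results-list h2.placename'):
        span = h2.select_one('span.boldelement')
        # Assumption: the showtimes are in the first div.aithousaspec
        # that appears after this heading in document order.
        spec = h2.find_next('div', class_='aithousaspec')
        cinemas.append({
            # first non-empty text node of the h2, i.e. the place name
            'name': next(h2.stripped_strings, ''),
            'address': squeeze(span.get_text()) if span else '',
            'times': squeeze(spec.get_text()) if spec else '',
        })
    return cinemas


cinemas = extract_cinemas(html)
for cinema in cinemas:
    print(cinema)
```

Collecting the values into dictionaries keeps each cinema's name, address and showtimes together, and the `if span else ''` guards avoid an AttributeError when an entry is missing one of the elements.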