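I want to use Python and bs4 to extract the following information from the HTML below: the place name in the h2 with class "placename", the text of the span, and the text of the div with class="aithousaspec".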
<div class="results-list">
<div class="piatsaname">city center</div>
<table>
<tr class="trspacer-up">
<td>
<a href="hall.aspx?id=1001173">
<h2 class="placename">ARENA
<span class="boldelement"><img src="/images/sun.png" height="16" valign="bottom" style="padding:0px 3px 0px 10px" >Θερινός<br>
25 Richmond Avenue st, Leeds</span>
</h2>
<p>
+4497XXXXXXX<br>
STEREO SOUND
</p>
Every Monday 2 tickets 8,00 pounds
</a>
</td>
</tr>
<tr class="trspacer-down">
<td>
<p class="coloredelement"><a href="movie.aspx?id=10061364" target="_self">Italian Job</a></p>
<div class="aithousaspec">
<b></b> Thu.-Wed.: 20.50/ 23.00
<a href="https://www.something.co.uk/" target="_blank" title="Whatever you like"></a>
<b></b>
</div>
The code I am using is not very efficient:
from bs4 import BeautifulSoup

# parse the html using beautiful soup and store in variable `soup`
soup = BeautifulSoup(page, 'html.parser')
print(soup.prettify())
mydivs = soup.select('div.results-list')
for info in mydivs:
    time = info.select('div.aithousaspec')
    print(time)
    # the attribute value is quoted so the selector is valid CSS
    listCinemas = info.select('a[href*="hall.aspx"]')
    print(listCinemas)
    print(len(listCinemas))
    for times in time:
        # find() expects a tag name, not a CSS selector, so this returns None
        proj = times.find('div.aithousaspec')
        print(proj)
    for names in listCinemas:
        theater = names.find('h2', class_='placename')
        print(names.find('h2').find(text=True).strip())
        print(names.find('h2').contents[1].text.strip())
Is there a better way to get the information mentioned above?
Answer 0 (score: 0)
data = '''<div class="results-list">
<div class="piatsaname">city center</div>
<table>
<tr class="trspacer-up">
<td>
<a href="hall.aspx?id=1001173">
<h2 class="placename">ARENA
<span class="boldelement"><img src="/images/sun.png" height="16" valign="bottom" style="padding:0px 3px 0px 10px" >Θερινός<br>
25 Richmond Avenue st, Leeds</span>
</h2>
<p>
+4497XXXXXXX<br>
STEREO SOUND
</p>
Every Monday 2 tickets 8,00 pounds
</a>
</td>
</tr>
<tr class="trspacer-down">
<td>
<p class="coloredelement"><a href="movie.aspx?id=10061364" target="_self">Italian Job</a></p>
<div class="aithousaspec">
<b></b> Thu.-Wed.: 20.50/ 23.00
<a href="https://www.something.co.uk/" target="_blank" title="Whatever you like"></a>
<b></b>
</div>'''
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup(data, 'lxml')
print(soup.select('h2.placename')[0].contents[0].strip())
print(re.sub(r'\s{2,}', ' ', soup.select('span.boldelement')[0].text.strip()))
print(soup.select('div.aithousaspec')[0].text.strip())
This prints:
ARENA
Θερινός 25 Richmond Avenue st, Leeds
Thu.-Wed.: 20.50/ 23.00
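The snippet above reads only the first match of each selector. If the real page lists several cinemas, one possible extension is to loop over every h2.placename and look up the nearest following div.aithousaspec. This is a minimal sketch, not a definitive implementation: it assumes each cinema is followed by exactly one showtimes div, and the names squash, extract_cinemas and the dictionary keys are made up for illustration. It uses the built-in 'html.parser' so that lxml is not required, and a shortened copy of the question's HTML with the truncated tags closed.

```python
import re
from bs4 import BeautifulSoup, NavigableString

# Shortened copy of the question's markup, with the closing tags filled in.
html = '''<div class="results-list">
<table>
<tr class="trspacer-up"><td>
<a href="hall.aspx?id=1001173">
<h2 class="placename">ARENA
<span class="boldelement"><img src="/images/sun.png">Θερινός<br>
25 Richmond Avenue st, Leeds</span>
</h2>
</a>
</td></tr>
<tr class="trspacer-down"><td>
<p class="coloredelement"><a href="movie.aspx?id=10061364">Italian Job</a></p>
<div class="aithousaspec">
<b></b> Thu.-Wed.: 20.50/ 23.00
<b></b>
</div>
</td></tr>
</table>
</div>'''


def squash(text):
    """Collapse every run of whitespace into one space (hypothetical helper)."""
    return re.sub(r'\s+', ' ', text).strip()


def extract_cinemas(markup):
    """Return one dict per h2.placename found in the markup (hypothetical helper)."""
    soup = BeautifulSoup(markup, 'html.parser')
    cinemas = []
    for h2 in soup.select('h2.placename'):
        # The place name is the first bare text node directly inside the h2.
        name = next((c for c in h2.children if isinstance(c, NavigableString)), '')
        span = h2.select_one('span.boldelement')
        # Assumption: the showtimes div is the first one after this h2.
        times = h2.find_next('div', class_='aithousaspec')
        cinemas.append({
            'name': squash(str(name)),
            'details': squash(span.get_text()) if span else '',
            'times': squash(times.get_text()) if times else '',
        })
    return cinemas


for cinema in extract_cinemas(html):
    print(cinema['name'])
    print(cinema['details'])
    print(cinema['times'])
```

If a cinema can show several films, find_next would pick up only the first showtimes div, so the rows between one tr.trspacer-up and the next would have to be walked instead.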