我的网页看起来像这样
<td valign="top">
<table width="100%" border="0" cellspacing="2" cellpadding="1" class="main_tb3">
<tr>
<td colspan="2">
<div align="center">
<a href="/title/name.php" target="_blank">
<img src="./movie/image.jpg" alt="TitleName" border="0" height="100" width="225" />
</a>
</div>
</td>
</tr>
<tr>
<td colspan="2"><h1 align="center"><a href="./title.php?titleid=12">Title - secondname</a></h1></td>
</tr>
<tr>
<td><span class="style10">Cat1 :</span></td>
<td>1st name</td>
</tr>
<tr>
<td width="32%"><span class="style10">Cat2 :</span></td>
<td width="68%"><b><i><a href="./secondname.php" target="_blank">secondname</a></i></b></td>
</tr>
<tr>
<td><span class="style10">cat4 :</span></td>
<td>Bla bla</td>
</tr>
<tr>
<td><span class="style10">Cat3 :</span></td>
<td>thirdName2</td>
</tr>
</table>
</td>
<td valign="top">
<table width="100%" border="0" cellspacing="2" cellpadding="1" class="main_tb3">
<tr>
<td colspan="2">
<div align="center">
<a href="/title/name.php" target="_blank">
<img src="./movie/image.jpg" alt="TitleName" border="0" height="100" width="225" />
</a>
</div>
</td>
</tr>
<tr>
<td colspan="2"><h1 align="center"><a href="./title.php?titleid=12">Title - secondname</a></h1></td>
</tr>
<tr>
<td><span class="style10">Cat1 :</span></td>
<td>1st name</td>
</tr>
<tr>
<td width="32%"><span class="style10">Cat2 :</span></td>
<td width="68%"><b><i><a href="./secondname.php" target="_blank">secondname</a></i></b></td>
</tr>
<tr>
<td><span class="style10">cat4 :</span></td>
<td>Bla bla</td>
</tr>
<tr>
<td><span class="style10">Cat3 :</span></td>
<td>thirdName2</td>
</tr>
</table>
</td>
我想使用python regex从这个站点获取某些值。
在<div align="center">
之后我想获得href值:&#34; /title/name.php"和img src:&#34; ./ movie / image.jpg&#34;和标题 - 来自<h1 align="center"><a href="./title.php?titleid=12">Title - secondname</a></h1>
我试过这个:
regex = 'class="main_tb3"*\n<a href="(.+?)" target="_blank">\n<img src="(.+?)"'
请帮帮我
答案 0 :(得分:0)
你可以使用以下正则表达式
对于href值:
<a href="(.*?)"
对于Image src:
<img src="(.*?)"
标题:
titleid=12">(.*?)<
答案 1 :(得分:0)
你会发现安装像BeautifulSoup
这样的东西要简单得多:
from bs4 import BeautifulSoup
html = """
<td valign="top">
<table width="100%" border="0" cellspacing="2" cellpadding="1" class="main_tb3">
<tr>
<td colspan="2">
<div align="center">
<a href="/title/name.php" target="_blank">
<img src="./movie/image.jpg" alt="TitleName" border="0" height="100" width="225" />
</a>
</div>
</td>
</tr>
<tr>
<td colspan="2"><h1 align="center"><a href="./title.php?titleid=12">Title - secondname</a></h1></td>
</tr>
<tr>
<td><span class="style10">Cat1 :</span></td>
<td>1st name</td>
</tr>
<tr>
<td width="32%"><span class="style10">Cat2 :</span></td>
<td width="68%"><b><i><a href="./secondname.php" target="_blank">secondname</a></i></b></td>
</tr>
<tr>
<td><span class="style10">cat4 :</span></td>
<td>Bla bla</td>
</tr>
<tr>
<td><span class="style10">Cat3 :</span></td>
<td>thirdName2</td>
</tr>
</table>
</td>
<td valign="top">
<table width="100%" border="0" cellspacing="2" cellpadding="1" class="main_tb3">
<tr>
<td colspan="2">
<div align="center">
<a href="/title/name.php" target="_blank">
<img src="./movie/image.jpg" alt="TitleName" border="0" height="100" width="225" />
</a>
</div>
</td>
</tr>
<tr>
<td colspan="2"><h1 align="center"><a href="./title.php?titleid=12">Title - secondname</a></h1></td>
</tr>
<tr>
<td><span class="style10">Cat1 :</span></td>
<td>1st name</td>
</tr>
<tr>
<td width="32%"><span class="style10">Cat2 :</span></td>
<td width="68%"><b><i><a href="./secondname.php" target="_blank">secondname</a></i></b></td>
</tr>
<tr>
<td><span class="style10">cat4 :</span></td>
<td>Bla bla</td>
</tr>
<tr>
<td><span class="style10">Cat3 :</span></td>
<td>thirdName2</td>
</tr>
</table>
</td>"""
soup = BeautifulSoup(html)
for table in soup.find_all("table", class_="main_tb3"):
print table.find('a').get('href')
print table.find('h1').text
对于您提供的HTML,这将打印以下内容:
/title/name.php
Title - secondname
/title/name.php
Title - secondname