Python regex ignoring newlines

Date: 2016-03-08 05:13:10

Tags: php python html regex beautifulsoup

My web page looks like this:

<td valign="top">

    <table width="100%" border="0" cellspacing="2" cellpadding="1" class="main_tb3">
        <tr>
            <td colspan="2">
                <div align="center">
                <a href="/title/name.php" target="_blank">
                <img src="./movie/image.jpg" alt="TitleName" border="0" height="100" width="225" />
                </a>
                </div>
            </td>
        </tr>
        <tr>
            <td colspan="2"><h1 align="center"><a href="./title.php?titleid=12">Title - secondname</a></h1></td>
        </tr>
        <tr>
            <td><span class="style10">Cat1 :</span></td>
            <td>1st name</td>
        </tr>
        <tr>
            <td width="32%"><span class="style10">Cat2 :</span></td>
            <td width="68%"><b><i><a href="./secondname.php" target="_blank">secondname</a></i></b></td>
        </tr>
        <tr>
            <td><span class="style10">cat4 :</span></td>
            <td>Bla bla</td>
        </tr>
        <tr>
            <td><span class="style10">Cat3 :</span></td>
            <td>thirdName2</td>
        </tr>
    </table>

</td>
<td valign="top">

    <table width="100%" border="0" cellspacing="2" cellpadding="1" class="main_tb3">
        <tr>
            <td colspan="2">
                <div align="center">
                <a href="/title/name.php" target="_blank">
                <img src="./movie/image.jpg" alt="TitleName" border="0" height="100" width="225" />
                </a>
                </div>
            </td>
        </tr>
        <tr>
            <td colspan="2"><h1 align="center"><a href="./title.php?titleid=12">Title - secondname</a></h1></td>
        </tr>
        <tr>
            <td><span class="style10">Cat1 :</span></td>
            <td>1st name</td>
        </tr>
        <tr>
            <td width="32%"><span class="style10">Cat2 :</span></td>
            <td width="68%"><b><i><a href="./secondname.php" target="_blank">secondname</a></i></b></td>
        </tr>
        <tr>
            <td><span class="style10">cat4 :</span></td>
            <td>Bla bla</td>
        </tr>
        <tr>
            <td><span class="style10">Cat3 :</span></td>
            <td>thirdName2</td>
        </tr>
    </table>

</td>

I want to use a Python regex to get certain values from this page. After <div align="center"> I want to get the href value "/title/name.php" and the img src "./movie/image.jpg", as well as the text "Title - secondname" from <h1 align="center"><a href="./title.php?titleid=12">Title - secondname</a></h1>.

I tried this: regex = 'class="main_tb3"*\n<a href="(.+?)" target="_blank">\n<img src="(.+?)"'

Please help.

2 answers:

Answer 0 (score: 0)

You can use the following regular expressions.

For the href value: <a href="(.*?)"

For the image src: <img src="(.*?)"

For the title: titleid=12">(.*?)<
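As a rough sketch of how these patterns could be applied with Python's `re` module, the snippet below runs them against a shortened copy of the HTML from the question (one table only, which is an assumption made to keep the example small). It also shows one possible fix for the pattern in the question: that attempt expects `<a href` to come immediately after a single `\n`, but in the page each newline is followed by indentation and there are several lines in between, and `.` does not match a newline by default. Replacing the literal `\n` between the two tags with `\s*` lets the match cross the line break and the indentation. The variable names here are illustrative only.

```python
import re

# Shortened copy of the HTML from the question (assumption: a single table).
html = """
<table width="100%" border="0" cellspacing="2" cellpadding="1" class="main_tb3">
    <tr>
        <td colspan="2">
            <div align="center">
            <a href="/title/name.php" target="_blank">
            <img src="./movie/image.jpg" alt="TitleName" border="0" height="100" width="225" />
            </a>
            </div>
        </td>
    </tr>
    <tr>
        <td colspan="2"><h1 align="center"><a href="./title.php?titleid=12">Title - secondname</a></h1></td>
    </tr>
</table>
"""

# The three patterns from this answer; re.search returns the first match only.
href = re.search(r'<a href="(.*?)"', html).group(1)
src = re.search(r'<img src="(.*?)"', html).group(1)
title = re.search(r'titleid=12">(.*?)<', html).group(1)

# One pattern for the link/image pair: \s* absorbs the newline and the
# indentation that sit between the <a ...> tag and the <img ...> tag.
pair_pattern = re.compile(r'<a href="(.+?)" target="_blank">\s*<img src="(.+?)"')
pairs = pair_pattern.findall(html)  # list of (href, src) tuples, one per match

print(href)
print(src)
print(title)
print(pairs)
```

Note that the title pattern hard-codes `titleid=12`, so it only finds that one entry; on a page with several titles a more general pattern would be needed.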

Answer 1 (score: 0)

You will find it much simpler to install something like BeautifulSoup:

from bs4 import BeautifulSoup

html = """
<td valign="top">

    <table width="100%" border="0" cellspacing="2" cellpadding="1" class="main_tb3">
        <tr>
            <td colspan="2">
                <div align="center">
                <a href="/title/name.php" target="_blank">
                <img src="./movie/image.jpg" alt="TitleName" border="0" height="100" width="225" />
                </a>
                </div>
            </td>
        </tr>
        <tr>
            <td colspan="2"><h1 align="center"><a href="./title.php?titleid=12">Title - secondname</a></h1></td>
        </tr>
        <tr>
            <td><span class="style10">Cat1 :</span></td>
            <td>1st name</td>
        </tr>
        <tr>
            <td width="32%"><span class="style10">Cat2 :</span></td>
            <td width="68%"><b><i><a href="./secondname.php" target="_blank">secondname</a></i></b></td>
        </tr>
        <tr>
            <td><span class="style10">cat4 :</span></td>
            <td>Bla bla</td>
        </tr>
        <tr>
            <td><span class="style10">Cat3 :</span></td>
            <td>thirdName2</td>
        </tr>
    </table>

</td>
<td valign="top">

    <table width="100%" border="0" cellspacing="2" cellpadding="1" class="main_tb3">
        <tr>
            <td colspan="2">
                <div align="center">
                <a href="/title/name.php" target="_blank">
                <img src="./movie/image.jpg" alt="TitleName" border="0" height="100" width="225" />
                </a>
                </div>
            </td>
        </tr>
        <tr>
            <td colspan="2"><h1 align="center"><a href="./title.php?titleid=12">Title - secondname</a></h1></td>
        </tr>
        <tr>
            <td><span class="style10">Cat1 :</span></td>
            <td>1st name</td>
        </tr>
        <tr>
            <td width="32%"><span class="style10">Cat2 :</span></td>
            <td width="68%"><b><i><a href="./secondname.php" target="_blank">secondname</a></i></b></td>
        </tr>
        <tr>
            <td><span class="style10">cat4 :</span></td>
            <td>Bla bla</td>
        </tr>
        <tr>
            <td><span class="style10">Cat3 :</span></td>
            <td>thirdName2</td>
        </tr>
    </table>

</td>"""

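# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(html, "html.parser")

# Each matching table holds one entry; print its first link and its heading text.
for table in soup.find_all("table", class_="main_tb3"):
    print(table.find('a').get('href'))
    print(table.find('h1').text)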

For the HTML you provided, this prints the following:

/title/name.php
Title - secondname
/title/name.php
Title - secondname
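The question also asks for the img src, which the loop above does not print. A minimal sketch of one way to collect all three values per table follows. It assumes every `main_tb3` table contains an `a`, an `img`, and an `h1` element, as in the sample; the `results` list and its dictionary keys are names made up for this illustration, and the HTML is shortened to a single table.

```python
from bs4 import BeautifulSoup

# Shortened copy of the HTML from the question (assumption: a single table).
html = """
<table width="100%" border="0" cellspacing="2" cellpadding="1" class="main_tb3">
    <tr>
        <td colspan="2">
            <div align="center">
            <a href="/title/name.php" target="_blank">
            <img src="./movie/image.jpg" alt="TitleName" border="0" height="100" width="225" />
            </a>
            </div>
        </td>
    </tr>
    <tr>
        <td colspan="2"><h1 align="center"><a href="./title.php?titleid=12">Title - secondname</a></h1></td>
    </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

results = []
for table in soup.find_all("table", class_="main_tb3"):
    link = table.find("a")      # first link in the table
    image = table.find("img")   # first image in the table
    heading = table.find("h1")  # heading that carries the title text
    results.append({
        "href": link.get("href"),
        "src": image.get("src"),
        "title": heading.get_text(strip=True),
    })

for entry in results:
    print(entry["href"], entry["src"], entry["title"])
```

If some tables on the real page lack one of these elements, `find` returns `None` there, so a check before calling `.get` would be needed.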