I don't understand why this regex for matching URLs doesn't work

Asked: 2015-10-10 00:30:40

Tags: python regex

I'm trying to match the hrefs of the Linux kernel amd64 debs in a raw HTML string using this regex:

r'(?<=href=")linux-.*?_amd64\.deb(?=")'

The URLs I want to match look like this:

<a href="linux-headers-3.16.0-031600rc1-generic_3.16.0-031600rc1.201406160035_amd64.deb">

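However, I only want to extract what sits between the quotes of the href attribute. The regex above does match the first href correctly, but after that it matches a lot of extra text, including markup. Removing _amd64 from the regex makes it match only the URLs, but then of course it no longer filters out the i386 debs: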

r'(?<=href=")linux-.*?\.deb(?=")'

This is the raw HTML I'm applying the regex to:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<html>
    <head>
        <title>Index of /~kernel-ppa/mainline/v3.16-rc1-utopic</title>
    </head>
    <body>
        <h1>Index of /~kernel-ppa/mainline/v3.16-rc1-utopic</h1>
        <table><tr><th><img src="/icons/blank.gif" alt="[ICO]"></th><th><a href="?C=N;O=D">Name</a></th><th><a href="?C=M;O=A">Last modified</a></th><th><a href="?C=S;O=A">Size</a></th><th><a href="?C=D;O=A">Description</a></th></tr><tr><th colspan="5"><hr></th></tr>
            <tr><td valign="top"><img src="/icons/back.gif" alt="[DIR]"></td><td><a href="/~kernel-ppa/mainline/">Parent Directory</a></td><td>&nbsp;</td><td align="right">  - </td><td>&nbsp;</td></tr>
            <tr><td valign="top"><img src="/icons/text.gif" alt="[TXT]"></td><td><a href="0001-base-packaging.patch">0001-base-packaging.patch</a></td><td align="right">16-Jun-2014 04:35  </td><td align="right"> 14M</td><td>&nbsp;</td></tr>
            <tr><td valign="top"><img src="/icons/text.gif" alt="[TXT]"></td><td><a href="0002-debian-changelog.patch">0002-debian-changelog.patch</a></td><td align="right">16-Jun-2014 04:35  </td><td align="right">333K</td><td>&nbsp;</td></tr>
            <tr><td valign="top"><img src="/icons/text.gif" alt="[TXT]"></td><td><a href="0003-configs-based-on-Ubuntu-3.15.0-7.12.patch">0003-configs-based-on-Ubuntu-3.15.0-7.12.patch</a></td><td align="right">16-Jun-2014 04:35  </td><td align="right"> 51K</td><td>&nbsp;</td></tr>
            <tr><td valign="top"><img src="/icons/unknown.gif" alt="[   ]"></td><td><a href="BUILD.LOG">BUILD.LOG</a></td><td align="right">16-Jun-2014 05:25  </td><td align="right">7.1M</td><td>&nbsp;</td></tr>
            <tr><td valign="top"><img src="/icons/unknown.gif" alt="[   ]"></td><td><a href="BUILD.LOG.amd64">BUILD.LOG.amd64</a></td><td align="right">16-Jun-2014 05:25  </td><td align="right">2.3M</td><td>&nbsp;</td></tr>
            <tr><td valign="top"><img src="/icons/unknown.gif" alt="[   ]"></td><td><a href="BUILD.LOG.armhf">BUILD.LOG.armhf</a></td><td align="right">16-Jun-2014 05:25  </td><td align="right">597K</td><td>&nbsp;</td></tr>
            <tr><td valign="top"><img src="/icons/unknown.gif" alt="[   ]"></td><td><a href="BUILD.LOG.binary-headers">BUILD.LOG.binary-headers</a></td><td align="right">16-Jun-2014 05:25  </td><td align="right"> 22K</td><td>&nbsp;</td></tr>
            <tr><td valign="top"><img src="/icons/unknown.gif" alt="[   ]"></td><td><a href="BUILD.LOG.i386">BUILD.LOG.i386</a></td><td align="right">16-Jun-2014 05:25  </td><td align="right">2.3M</td><td>&nbsp;</td></tr>
            <tr><td valign="top"><img src="/icons/unknown.gif" alt="[   ]"></td><td><a href="BUILT">BUILT</a></td><td align="right">16-Jun-2014 05:25  </td><td align="right">108 </td><td>&nbsp;</td></tr>
            <tr><td valign="top"><img src="/icons/unknown.gif" alt="[   ]"></td><td><a href="CHANGES">CHANGES</a></td><td align="right">16-Jun-2014 04:35  </td><td align="right">744K</td><td>&nbsp;</td></tr>
            <tr><td valign="top"><img src="/icons/unknown.gif" alt="[   ]"></td><td><a href="CHECKSUMS">CHECKSUMS</a></td><td align="right">09-Jun-2015 11:36  </td><td align="right">3.1K</td><td>&nbsp;</td></tr>
            <tr><td valign="top"><img src="/icons/unknown.gif" alt="[   ]"></td><td><a href="CHECKSUMS.gpg">CHECKSUMS.gpg</a></td><td align="right">09-Jun-2015 11:36  </td><td align="right">490 </td><td>&nbsp;</td></tr>
            <tr><td valign="top"><img src="/icons/unknown.gif" alt="[   ]"></td><td><a href="COMMIT">COMMIT</a></td><td align="right">29-May-2015 11:09  </td><td align="right"> 51 </td><td>&nbsp;</td></tr>
            <tr><td valign="top"><img src="/icons/hand.right.gif" alt="[   ]"></td><td><a href="README">README</a></td><td align="right">12-Jun-2015 13:45  </td><td align="right">622 </td><td>&nbsp;</td></tr>
            <tr><td valign="top"><img src="/icons/unknown.gif" alt="[   ]"></td><td><a href="SOURCES">SOURCES</a></td><td align="right">12-Jun-2015 13:45  </td><td align="right">237 </td><td>&nbsp;</td></tr>
            <tr><td valign="top"><img src="/icons/unknown.gif" alt="[   ]"></td><td><a href="linux-headers-3.16.0-031600rc1-generic_3.16.0-031600rc1.201406160035_amd64.deb">linux-headers-3.16.0-031600rc1-generic_3.16.0-031600rc1.201406160035_amd64.deb</a></td><td align="right">16-Jun-2014 04:55  </td><td align="right">1.1M</td><td>&nbsp;</td></tr>
            <tr><td valign="top"><img src="/icons/unknown.gif" alt="[   ]"></td><td><a href="linux-headers-3.16.0-031600rc1-generic_3.16.0-031600rc1.201406160035_i386.deb">linux-headers-3.16.0-031600rc1-generic_3.16.0-031600rc1.201406160035_i386.deb</a></td><td align="right">16-Jun-2014 05:15  </td><td align="right">1.0M</td><td>&nbsp;</td></tr>
            <tr><td valign="top"><img src="/icons/unknown.gif" alt="[   ]"></td><td><a href="linux-headers-3.16.0-031600rc1-lowlatency_3.16.0-031600rc1.201406160035_amd64.deb">linux-headers-3.16.0-031600rc1-lowlatency_3.16.0-031600rc1.201406160035_amd64.deb</a></td><td align="right">16-Jun-2014 04:56  </td><td align="right">1.1M</td><td>&nbsp;</td></tr>
            <tr><td valign="top"><img src="/icons/unknown.gif" alt="[   ]"></td><td><a href="linux-headers-3.16.0-031600rc1-lowlatency_3.16.0-031600rc1.201406160035_i386.deb">linux-headers-3.16.0-031600rc1-lowlatency_3.16.0-031600rc1.201406160035_i386.deb</a></td><td align="right">16-Jun-2014 05:17  </td><td align="right">1.0M</td><td>&nbsp;</td></tr>
            <tr><td valign="top"><img src="/icons/unknown.gif" alt="[   ]"></td><td><a href="linux-headers-3.16.0-031600rc1_3.16.0-031600rc1.201406160035_all.deb">linux-headers-3.16.0-031600rc1_3.16.0-031600rc1.201406160035_all.deb</a></td><td align="right">16-Jun-2014 04:36  </td><td align="right"> 12M</td><td>&nbsp;</td></tr>
            <tr><td valign="top"><img src="/icons/unknown.gif" alt="[   ]"></td><td><a href="linux-image-3.16.0-031600rc1-generic_3.16.0-031600rc1.201406160035_amd64.deb">linux-image-3.16.0-031600rc1-generic_3.16.0-031600rc1.201406160035_amd64.deb</a></td><td align="right">16-Jun-2014 04:55  </td><td align="right"> 51M</td><td>&nbsp;</td></tr>
            <tr><td valign="top"><img src="/icons/unknown.gif" alt="[   ]"></td><td><a href="linux-image-3.16.0-031600rc1-generic_3.16.0-031600rc1.201406160035_i386.deb">linux-image-3.16.0-031600rc1-generic_3.16.0-031600rc1.201406160035_i386.deb</a></td><td align="right">16-Jun-2014 05:15  </td><td align="right"> 51M</td><td>&nbsp;</td></tr>
            <tr><td valign="top"><img src="/icons/unknown.gif" alt="[   ]"></td><td><a href="linux-image-3.16.0-031600rc1-lowlatency_3.16.0-031600rc1.201406160035_amd64.deb">linux-image-3.16.0-031600rc1-lowlatency_3.16.0-031600rc1.201406160035_amd64.deb</a></td><td align="right">16-Jun-2014 04:56  </td><td align="right"> 51M</td><td>&nbsp;</td></tr>
            <tr><td valign="top"><img src="/icons/unknown.gif" alt="[   ]"></td><td><a href="linux-image-3.16.0-031600rc1-lowlatency_3.16.0-031600rc1.201406160035_i386.deb">linux-image-3.16.0-031600rc1-lowlatency_3.16.0-031600rc1.201406160035_i386.deb</a></td><td align="right">16-Jun-2014 05:17  </td><td align="right"> 51M</td><td>&nbsp;</td></tr>
            <tr><th colspan="5"><hr></th></tr>
        </table>
        <address>Apache/2.2.22 (Ubuntu) Server at kernel.ubuntu.com Port 80</address>
    </body>
</html>

I'm using re.findall(pattern, rawHTMLString). What's wrong with the regex?

2 Answers:

Answer 0 (score: 1)

Try this:

(?<=href=")linux-[^"]*?_amd64\.deb(?=")

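Your .*? matches too much: it is lazy, but it is still allowed to run past the closing quote when that is the only way to reach _amd64.deb. Excluding quotes with [^"] keeps the match inside the quoted attribute value.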
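A minimal sketch of this pattern in use, assuming a shortened, hypothetical sample of the listing (the sample_html string and the variable names are illustrative, not the asker's actual data):

```python
import re

# Hypothetical, shortened sample of the directory listing (illustration only).
sample_html = (
    '<a href="linux-headers-3.16.0_amd64.deb">x</a>'
    '<a href="linux-headers-3.16.0_i386.deb">x</a>'
    '<a href="linux-image-3.16.0_amd64.deb">x</a>'
)

# [^"]*? cannot cross a double quote, so each match stays inside one href value.
pattern = r'(?<=href=")linux-[^"]*?_amd64\.deb(?=")'

amd64_debs = re.findall(pattern, sample_html)
print(amd64_debs)
```

The i386 link is skipped because the pattern cannot reach an _amd64.deb suffix without first crossing the quote that closes its href.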

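Answer 1 (score: 0)

What happens is that the regex starts matching at a URL that begins with linux- but does not end in _amd64.deb, and the match then keeps going until _amd64.deb is found in some later URL. The match therefore contains everything between those two URLs. You need to replace

.*?

to stop it from matching the markup between URLs. You can use

[^"]*

because you are matching text between quotes, so the match cannot contain a quote.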
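The behaviour described above can be reproduced with a small sketch. It assumes a hypothetical two-link sample on a single line, since by default `.` does not match a newline and the over-match only spans rows when they share a line (or when re.DOTALL is set). The names sample_html, lazy_dot and no_quotes are illustrative only.

```python
import re

# Hypothetical single-line sample: an i386 link followed by an amd64 link.
sample_html = (
    '<a href="linux-image-1.0_i386.deb">i386</a>'
    '<a href="linux-image-1.0_amd64.deb">amd64</a>'
)

# Original pattern: .*? may cross quotes and tags to reach "_amd64.deb".
lazy_dot = r'(?<=href=")linux-.*?_amd64\.deb(?=")'
# Suggested pattern: [^"]* is confined to a single quoted value.
no_quotes = r'(?<=href=")linux-[^"]*_amd64\.deb(?=")'

overmatch = re.findall(lazy_dot, sample_html)
clean = re.findall(no_quotes, sample_html)

print(overmatch)  # one match that starts in the i386 href and swallows markup
print(clean)      # only the amd64 file name
```

With the original pattern the single result begins at the i386 file name and runs through the closing tag into the next href; with [^"]* the i386 link simply fails to match and only the amd64 file name is returned.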
