I am trying to match the Linux kernel amd64 .deb hrefs in a raw HTML string with this regular expression:
r'(?<=href=")linux-.*?_amd64\.deb(?=")'
The links I want to match look like this:
<a href="linux-headers-3.16.0-031600rc1-generic_3.16.0-031600rc1.201406160035_amd64.deb">
However, I only want to extract what is between the double quotes (") of the href attribute.
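The regex above does match the first href, but after that it matches a whole run of text, markup included.
Removing _amd64 from the regex makes it match only the URLs, but then of course it no longer filters out the i386 debs: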
r'(?<=href=")linux-.*?\.deb(?=")'
This is the raw HTML I am applying the regex to:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<html>
<head>
<title>Index of /~kernel-ppa/mainline/v3.16-rc1-utopic</title>
</head>
<body>
<h1>Index of /~kernel-ppa/mainline/v3.16-rc1-utopic</h1>
<table><tr><th><img src="/icons/blank.gif" alt="[ICO]"></th><th><a href="?C=N;O=D">Name</a></th><th><a href="?C=M;O=A">Last modified</a></th><th><a href="?C=S;O=A">Size</a></th><th><a href="?C=D;O=A">Description</a></th></tr><tr><th colspan="5"><hr></th></tr>
<tr><td valign="top"><img src="/icons/back.gif" alt="[DIR]"></td><td><a href="/~kernel-ppa/mainline/">Parent Directory</a></td><td> </td><td align="right"> - </td><td> </td></tr>
<tr><td valign="top"><img src="/icons/text.gif" alt="[TXT]"></td><td><a href="0001-base-packaging.patch">0001-base-packaging.patch</a></td><td align="right">16-Jun-2014 04:35 </td><td align="right"> 14M</td><td> </td></tr>
<tr><td valign="top"><img src="/icons/text.gif" alt="[TXT]"></td><td><a href="0002-debian-changelog.patch">0002-debian-changelog.patch</a></td><td align="right">16-Jun-2014 04:35 </td><td align="right">333K</td><td> </td></tr>
<tr><td valign="top"><img src="/icons/text.gif" alt="[TXT]"></td><td><a href="0003-configs-based-on-Ubuntu-3.15.0-7.12.patch">0003-configs-based-on-Ubuntu-3.15.0-7.12.patch</a></td><td align="right">16-Jun-2014 04:35 </td><td align="right"> 51K</td><td> </td></tr>
<tr><td valign="top"><img src="/icons/unknown.gif" alt="[ ]"></td><td><a href="BUILD.LOG">BUILD.LOG</a></td><td align="right">16-Jun-2014 05:25 </td><td align="right">7.1M</td><td> </td></tr>
<tr><td valign="top"><img src="/icons/unknown.gif" alt="[ ]"></td><td><a href="BUILD.LOG.amd64">BUILD.LOG.amd64</a></td><td align="right">16-Jun-2014 05:25 </td><td align="right">2.3M</td><td> </td></tr>
<tr><td valign="top"><img src="/icons/unknown.gif" alt="[ ]"></td><td><a href="BUILD.LOG.armhf">BUILD.LOG.armhf</a></td><td align="right">16-Jun-2014 05:25 </td><td align="right">597K</td><td> </td></tr>
<tr><td valign="top"><img src="/icons/unknown.gif" alt="[ ]"></td><td><a href="BUILD.LOG.binary-headers">BUILD.LOG.binary-headers</a></td><td align="right">16-Jun-2014 05:25 </td><td align="right"> 22K</td><td> </td></tr>
<tr><td valign="top"><img src="/icons/unknown.gif" alt="[ ]"></td><td><a href="BUILD.LOG.i386">BUILD.LOG.i386</a></td><td align="right">16-Jun-2014 05:25 </td><td align="right">2.3M</td><td> </td></tr>
<tr><td valign="top"><img src="/icons/unknown.gif" alt="[ ]"></td><td><a href="BUILT">BUILT</a></td><td align="right">16-Jun-2014 05:25 </td><td align="right">108 </td><td> </td></tr>
<tr><td valign="top"><img src="/icons/unknown.gif" alt="[ ]"></td><td><a href="CHANGES">CHANGES</a></td><td align="right">16-Jun-2014 04:35 </td><td align="right">744K</td><td> </td></tr>
<tr><td valign="top"><img src="/icons/unknown.gif" alt="[ ]"></td><td><a href="CHECKSUMS">CHECKSUMS</a></td><td align="right">09-Jun-2015 11:36 </td><td align="right">3.1K</td><td> </td></tr>
<tr><td valign="top"><img src="/icons/unknown.gif" alt="[ ]"></td><td><a href="CHECKSUMS.gpg">CHECKSUMS.gpg</a></td><td align="right">09-Jun-2015 11:36 </td><td align="right">490 </td><td> </td></tr>
<tr><td valign="top"><img src="/icons/unknown.gif" alt="[ ]"></td><td><a href="COMMIT">COMMIT</a></td><td align="right">29-May-2015 11:09 </td><td align="right"> 51 </td><td> </td></tr>
<tr><td valign="top"><img src="/icons/hand.right.gif" alt="[ ]"></td><td><a href="README">README</a></td><td align="right">12-Jun-2015 13:45 </td><td align="right">622 </td><td> </td></tr>
<tr><td valign="top"><img src="/icons/unknown.gif" alt="[ ]"></td><td><a href="SOURCES">SOURCES</a></td><td align="right">12-Jun-2015 13:45 </td><td align="right">237 </td><td> </td></tr>
<tr><td valign="top"><img src="/icons/unknown.gif" alt="[ ]"></td><td><a href="linux-headers-3.16.0-031600rc1-generic_3.16.0-031600rc1.201406160035_amd64.deb">linux-headers-3.16.0-031600rc1-generic_3.16.0-031600rc1.201406160035_amd64.deb</a></td><td align="right">16-Jun-2014 04:55 </td><td align="right">1.1M</td><td> </td></tr>
<tr><td valign="top"><img src="/icons/unknown.gif" alt="[ ]"></td><td><a href="linux-headers-3.16.0-031600rc1-generic_3.16.0-031600rc1.201406160035_i386.deb">linux-headers-3.16.0-031600rc1-generic_3.16.0-031600rc1.201406160035_i386.deb</a></td><td align="right">16-Jun-2014 05:15 </td><td align="right">1.0M</td><td> </td></tr>
<tr><td valign="top"><img src="/icons/unknown.gif" alt="[ ]"></td><td><a href="linux-headers-3.16.0-031600rc1-lowlatency_3.16.0-031600rc1.201406160035_amd64.deb">linux-headers-3.16.0-031600rc1-lowlatency_3.16.0-031600rc1.201406160035_amd64.deb</a></td><td align="right">16-Jun-2014 04:56 </td><td align="right">1.1M</td><td> </td></tr>
<tr><td valign="top"><img src="/icons/unknown.gif" alt="[ ]"></td><td><a href="linux-headers-3.16.0-031600rc1-lowlatency_3.16.0-031600rc1.201406160035_i386.deb">linux-headers-3.16.0-031600rc1-lowlatency_3.16.0-031600rc1.201406160035_i386.deb</a></td><td align="right">16-Jun-2014 05:17 </td><td align="right">1.0M</td><td> </td></tr>
<tr><td valign="top"><img src="/icons/unknown.gif" alt="[ ]"></td><td><a href="linux-headers-3.16.0-031600rc1_3.16.0-031600rc1.201406160035_all.deb">linux-headers-3.16.0-031600rc1_3.16.0-031600rc1.201406160035_all.deb</a></td><td align="right">16-Jun-2014 04:36 </td><td align="right"> 12M</td><td> </td></tr>
<tr><td valign="top"><img src="/icons/unknown.gif" alt="[ ]"></td><td><a href="linux-image-3.16.0-031600rc1-generic_3.16.0-031600rc1.201406160035_amd64.deb">linux-image-3.16.0-031600rc1-generic_3.16.0-031600rc1.201406160035_amd64.deb</a></td><td align="right">16-Jun-2014 04:55 </td><td align="right"> 51M</td><td> </td></tr>
<tr><td valign="top"><img src="/icons/unknown.gif" alt="[ ]"></td><td><a href="linux-image-3.16.0-031600rc1-generic_3.16.0-031600rc1.201406160035_i386.deb">linux-image-3.16.0-031600rc1-generic_3.16.0-031600rc1.201406160035_i386.deb</a></td><td align="right">16-Jun-2014 05:15 </td><td align="right"> 51M</td><td> </td></tr>
<tr><td valign="top"><img src="/icons/unknown.gif" alt="[ ]"></td><td><a href="linux-image-3.16.0-031600rc1-lowlatency_3.16.0-031600rc1.201406160035_amd64.deb">linux-image-3.16.0-031600rc1-lowlatency_3.16.0-031600rc1.201406160035_amd64.deb</a></td><td align="right">16-Jun-2014 04:56 </td><td align="right"> 51M</td><td> </td></tr>
<tr><td valign="top"><img src="/icons/unknown.gif" alt="[ ]"></td><td><a href="linux-image-3.16.0-031600rc1-lowlatency_3.16.0-031600rc1.201406160035_i386.deb">linux-image-3.16.0-031600rc1-lowlatency_3.16.0-031600rc1.201406160035_i386.deb</a></td><td align="right">16-Jun-2014 05:17 </td><td align="right"> 51M</td><td> </td></tr>
<tr><th colspan="5"><hr></th></tr>
</table>
<address>Apache/2.2.22 (Ubuntu) Server at kernel.ubuntu.com Port 80</address>
</body>
</html>
I am using re.findall(pattern, rawHTMLString). What is wrong with the regex?
Answer 0 (score: 1)
Try this:
(?<=href=")linux-[^"]*?_amd64\.deb(?=")
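Your .*? is lazy, but the dot can still match a double quote, so the match is free to run past the closing quote of one href and into the markup that follows. Excluding quotes with [^"] keeps the match inside a single quoted value.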
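As a rough check of the pattern above, here is a minimal sketch that uses re.findall as in the question. The raw_html string is a shortened, made-up stand-in for the directory listing (the file names are abbreviated for illustration), not the real page.

```python
import re

# Hypothetical, shortened rows modeled on the listing in the question.
raw_html = (
    '<td><a href="linux-headers-3.16.0_amd64.deb">linux-headers-3.16.0_amd64.deb</a></td>\n'
    '<td><a href="linux-headers-3.16.0_i386.deb">linux-headers-3.16.0_i386.deb</a></td>\n'
    '<td><a href="linux-image-3.16.0_amd64.deb">linux-image-3.16.0_amd64.deb</a></td>\n'
)

# [^"]*? cannot cross a double quote, so each match stays inside one href value.
pattern = r'(?<=href=")linux-[^"]*?_amd64\.deb(?=")'

amd64_debs = re.findall(pattern, raw_html)
print(amd64_debs)
```

With this sample, only the two amd64 file names should come back; the i386 row is skipped because its href value never reaches _amd64.deb before the closing quote.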
Answer 1 (score: 0)
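What is happening is that the match starts at a URL that begins with linux- but does not end in _amd64.deb, and then keeps going until _amd64.deb turns up in some later URL. The match therefore contains everything between those two URLs. You need to replace .*? with something that cannot match the markup between URLs. You can use [^"]* because you are matching text between quotes, so the match can never contain a quote.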
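To make the overrun concrete, the sketch below runs both patterns against a tiny, hypothetical two-link string (the file names linux-a and linux-b are invented for illustration). It assumes nothing beyond the standard re module.

```python
import re

# Hypothetical input: an i386 link followed by an amd64 link.
raw_html = (
    '<a href="linux-a_i386.deb">linux-a_i386.deb</a>'
    '<a href="linux-b_amd64.deb">linux-b_amd64.deb</a>'
)

lazy = r'(?<=href=")linux-.*?_amd64\.deb(?=")'        # pattern from the question
bounded = r'(?<=href=")linux-[^"]*_amd64\.deb(?=")'   # pattern with [^"]*

lazy_matches = re.findall(lazy, raw_html)
bounded_matches = re.findall(bounded, raw_html)

# The lazy pattern starts at the i386 href and expands across the markup
# until it reaches the first _amd64.deb that is followed by a quote.
print(lazy_matches)
# The bounded pattern cannot leave the quoted value, so the i386 href fails
# and only the amd64 href matches.
print(bounded_matches)
```

Lazy quantifiers only prefer the shortest match from a given starting point; they do not stop the engine from extending the match when a shorter one fails, which is why restricting the character class is the more reliable fix here.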