Question

So, here is my problem:

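I have a crawler that downloads web pages and extracts the URLs from them (for future crawling). The crawler works from a whitelist of URLs specified as regular expressions, so the entries are along these lines: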

(http://www.example.com/subdirectory/)(.*?)

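...which allows any URL following that pattern to be crawled in future. The problem I'm having is that I'd like to exclude certain characters in URLs, so that (for example) addresses such as: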

(http://www.example.com/subdirectory/)(somepage?param=1&param=5#print)

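...are rejected. In the case above, I'd like to be able to exclude URLs that contain ?, # or = (to avoid crawling those pages). I've tried quite a few different approaches, but I can't seem to get it right: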

(http://www.example.com/)([^=\?#](.*?))

and so on. Any help would be really appreciated!

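EDIT: Sorry, I should have mentioned that this is written in Python, and I'm normally fairly proficient with regex (although this one has me stumped).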

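EDIT 2: VoDurden's answer (the accepted one below) almost yielded the correct result. All it needed was the $ character at the end of the expression, and then it worked perfectly. For example: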

(http://www.example.com/)([^=\?#]*)$
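
To see why the trailing $ matters, here is a minimal sketch. It assumes the crawler tests each URL with re.match, which the question does not actually state. Without the anchor, the character class is free to stop early, so the pattern succeeds on any URL that merely starts with the prefix.

```python
import re

# Assumption: the crawler checks whitelist patterns with re.match.
UNANCHORED = re.compile(r"(http://www.example.com/)([^=\?#]*)")
ANCHORED = re.compile(r"(http://www.example.com/)([^=\?#]*)$")

url = "http://www.example.com/subdirectory/somepage?param=1&param=5#print"

# Without "$" the match simply stops just before the first "?".
unanchored_match = UNANCHORED.match(url)
print(unanchored_match.group(0))  # http://www.example.com/subdirectory/somepage

# With "$" the excluded characters make the whole match fail.
print(ANCHORED.match(url))  # None
print(ANCHORED.match("http://www.example.com/subdirectory/") is not None)  # True
```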

Answer 1

(http://www.example.com/)([^=?#]*?)

This will do: it allows any URL that doesn't contain the characters you don't want.

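It might, however, be a little hard to extend this approach. A better option is to have the system work in two tiers, i.e. one set of matching regexes and one set of blocking regexes. Only URLs that pass both are then allowed. I think this solution would be more transparent and flexible.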
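
The two-tier idea could be sketched roughly as follows. The names ALLOW_PATTERNS, BLOCK_PATTERNS and is_crawlable are hypothetical, and both rule lists are illustrative assumptions rather than anything taken from the question.

```python
import re

# Hypothetical rule sets; neither list comes from the original post.
ALLOW_PATTERNS = [re.compile(r"http://www\.example\.com/subdirectory/")]
BLOCK_PATTERNS = [re.compile(r"[=?#]")]


def is_crawlable(url):
    """Return True if url matches an allow rule and no block rule."""
    # Allow rules are anchored at the start of the URL.
    allowed = any(p.match(url) for p in ALLOW_PATTERNS)
    # Block rules may hit anywhere in the URL.
    blocked = any(p.search(url) for p in BLOCK_PATTERNS)
    return allowed and not blocked


print(is_crawlable("http://www.example.com/subdirectory/index.php"))  # True
print(is_crawlable("http://www.example.com/subdirectory/index.php?param=1"))  # False
print(is_crawlable("http://www.example.org/subdirectory/"))  # False
```

Keeping the two lists separate means a new exclusion is just one more entry in the block list, and the whitelist patterns stay untouched.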

Answer 2

You will need to crawl pages up to and including the query string, e.g. ?param=1&param=5

because param=1 and param=2 can often give you completely different web pages.

Pick any WordPress site to confirm this.

Try something like this, which matches everything before the # character:

(http://www.example.com/)([^#]*?)
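
As a rough sketch of this suggestion, the helper below keeps the query string and drops only the fragment. It uses a greedy variant of the pattern with the dots escaped, because a lazy *? with nothing after it would match an empty string. The name strip_fragment is hypothetical.

```python
import re

# Greedy variant of the pattern above, with the literal dots escaped.
BEFORE_FRAGMENT = re.compile(r"(http://www\.example\.com/)([^#]*)")


def strip_fragment(url):
    """Return the part of url before "#", or None if the prefix does not match."""
    match = BEFORE_FRAGMENT.match(url)
    return match.group(0) if match else None


print(strip_fragment("http://www.example.com/subdirectory/somepage?param=1&param=5#print"))
# http://www.example.com/subdirectory/somepage?param=1&param=5
```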

Answer 3

This expression should be what you're looking for:

(http://www.example.com/subdirectory/)([^=?#]*)$

[^=?#] matches any character except the ones specified, and the trailing $ forces the match to run to the end of the URL.

For example:

http://www.example.com/subdirectory/ matches
http://www.example.com/subdirectory/index.php matches
http://www.example.com/subdirectory/somepage?param=1&param=5#print does not match
http://www.example.com/subdirectory/index.php?param=1 does not match
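
The four cases above can be checked with a short script. This is only a sketch: it assumes the whitelist prefix is a literal string and therefore builds the pattern with re.escape, so that the dots in the host name do not match arbitrary characters. PREFIX and PATTERN are illustrative names.

```python
import re

# Assumption: the whitelist prefix is a literal string, so it is escaped
# to keep "." from matching any character.
PREFIX = "http://www.example.com/subdirectory/"
PATTERN = re.compile("(" + re.escape(PREFIX) + ")([^=?#]*)$")

urls = [
    "http://www.example.com/subdirectory/",
    "http://www.example.com/subdirectory/index.php",
    "http://www.example.com/subdirectory/somepage?param=1&param=5#print",
    "http://www.example.com/subdirectory/index.php?param=1",
]

# Map each URL to whether the anchored pattern accepts it.
results = {url: PATTERN.match(url) is not None for url in urls}
for url, matched in results.items():
    print(url, "matches" if matched else "does not match")
```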

Answer 4

I'm not sure exactly what you want. If you want to match anything that doesn't contain ?, # or =, then the regex is:

([^=?#]*)

Answer 5

As an alternative, there's always the urlparse module (urllib.parse in Python 3), which is designed for parsing URLs.

# Python 3; in Python 2 this was: from urlparse import urlparse
from urllib.parse import urlparse

urls = [
    'http://www.example.com/subdirectory/',
    'http://www.example.com/subdirectory/index.php',
    'http://www.example.com/subdirectory/somepage?param=1&param=5#print',
    'http://www.example.com/subdirectory/index.php?param=1',
]

for url in urls:
    # Print only the URLs whose query string is empty.
    if not urlparse(url).query:
        print(url)

This gives the following:

http://www.example.com/subdirectory/
http://www.example.com/subdirectory/index.php
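
Note that the loop above only looks at the query string, so a URL carrying just a fragment (such as #print) would still be printed. A possible extension, sketched here with the hypothetical helper has_query_or_fragment, checks both parts:

```python
from urllib.parse import urlsplit


def has_query_or_fragment(url):
    """Return True if url carries a non-empty query string or fragment."""
    parts = urlsplit(url)
    return bool(parts.query or parts.fragment)


urls = [
    "http://www.example.com/subdirectory/index.php",
    "http://www.example.com/subdirectory/index.php?param=1",
    "http://www.example.com/subdirectory/index.php#print",
]

kept = [url for url in urls if not has_query_or_fragment(url)]
print(kept)  # ['http://www.example.com/subdirectory/index.php']
```

This still lets through a URL that ends in a bare ? or #, because the parsed query and fragment are then empty strings, and it does not catch an = inside the path. If those cases matter, the character-class regex is the stricter test.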

Regex to match a string only if certain characters are absent
