Question

I have a problem with a regular expression. I have 4 example URLs:

http://auto.com/index.php/car-news/12158-classicauto-cup-2016-photo 
http://auto.com/index.php/car-news/11654-battle-royale-2014
http://auto.com/index.php/tv-special-news/10480-new-film-4
http://auto.com/index.php/first/12234-new-volvo-xc60

I want to exclude URLs that contain 'tv-special-news' or that end with 'photo'.

I tried:

http://(www.)?auto.com/index.php/(?!(tv-special-news)).*/[a-zA-Z0-9\-]{1,}-(?!photo)

But it doesn't do exactly what I want.

Answer 1

http://(www.)?auto.com/index.php/(?!(tv-special-news)).*/[a-zA-Z0-9\-]{1,}-(?!photo)

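You're close. You just need to remove the dash before (?!photo), so that a line can end without a trailing dash, and add $ at the end to make sure the whole line has to match.

Then you also have to change the negative lookahead to a negative lookbehind, (?<!photo), so that the end of the line is not matched when it is preceded by photo:

http://(www.)?auto.com/index.php/(?!(tv-special-news)).*/[a-zA-Z0-9\-]{1,}(?<!photo)$

Also, you should escape all the dots properly:

http://(www\.)?auto\.com/index\.php/(?!(tv-special-news)).*/[a-zA-Z0-9\-]{1,}(?<!photo)$

Also, the quantifier {1,} is equivalent to +:

http://(www\.)?auto\.com/index\.php/(?!(tv-special-news)).*/[a-zA-Z0-9\-]+(?<!photo)$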
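To make the difference between the attempt and the final pattern concrete, here is a minimal Python sketch that runs both against the four URLs from the question. The names urls, attempt, fixed, attempt_kept and kept are illustrative, and applying the patterns with re.match is an assumption about how they would be used.

```python
import re

urls = [
    "http://auto.com/index.php/car-news/12158-classicauto-cup-2016-photo",
    "http://auto.com/index.php/car-news/11654-battle-royale-2014",
    "http://auto.com/index.php/tv-special-news/10480-new-film-4",
    "http://auto.com/index.php/first/12234-new-volvo-xc60",
]

# The attempt from the question: the trailing "-(?!photo)" lets the engine
# backtrack to an earlier dash, so the URL ending in "-photo" still matches.
attempt = re.compile(
    r"http://(www.)?auto.com/index.php/(?!(tv-special-news)).*/[a-zA-Z0-9\-]{1,}-(?!photo)"
)

# The final pattern from this answer: escaped dots, "+" quantifier,
# and a negative lookbehind checked right before the end of the line.
fixed = re.compile(
    r"http://(www\.)?auto\.com/index\.php/(?!(tv-special-news)).*/[a-zA-Z0-9\-]+(?<!photo)$"
)

attempt_kept = [u for u in urls if attempt.match(u)]
kept = [u for u in urls if fixed.match(u)]

print(attempt_kept)  # still contains the "-photo" URL
print(kept)          # only the two wanted URLs
```

With the attempt, only the tv-special-news URL is rejected; with the final pattern, the URL ending in photo is rejected as well.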

Answer 2

You can use this regex:

^(?!.*-photo$)http://(?:www\.)?auto\.com/index\.php/(?!tv-special-news)[^/]+/[\w-]+-

RegEx Demo 1

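(?!.*-photo$) is a negative lookahead that fails the match when the URL ends with -photo.

(?!tv-special-news) is a negative lookahead that fails the match when tv-special-news appears right after /index.php/.

It is also better to use a start anchor in your regex.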

Alternatively, using a lookbehind, you can use:

^http://(www\.)?auto\.com/index\.php/(?!tv-special-news).*/[a-zA-Z0-9-]+$(?<!photo)

RegEx Demo 2
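As a quick check that the two patterns in this answer behave the same on the question's data, the following sketch applies both to the four URLs. The variable names (urls, patterns, results) are illustrative only, and using search on compiled patterns is an assumption about how the regexes would be applied in Python.

```python
import re

urls = [
    "http://auto.com/index.php/car-news/12158-classicauto-cup-2016-photo",
    "http://auto.com/index.php/car-news/11654-battle-royale-2014",
    "http://auto.com/index.php/tv-special-news/10480-new-film-4",
    "http://auto.com/index.php/first/12234-new-volvo-xc60",
]

patterns = {
    # Lookahead at the start rejects any URL ending in "-photo".
    "lookahead": re.compile(
        r"^(?!.*-photo$)http://(?:www\.)?auto\.com/index\.php/(?!tv-special-news)[^/]+/[\w-]+-"
    ),
    # Lookbehind after "$" rejects a URL whose last five characters are "photo".
    "lookbehind": re.compile(
        r"^http://(www\.)?auto\.com/index\.php/(?!tv-special-news).*/[a-zA-Z0-9-]+$(?<!photo)"
    ),
}

# For each pattern, collect the URLs it accepts.
results = {
    name: [u for u in urls if pattern.search(u)]
    for name, pattern in patterns.items()
}

for name, accepted in results.items():
    print(name, accepted)
```

Both variants accept the battle-royale and new-volvo URLs and reject the other two.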

Answer 3

You can use this solution:

import re

list_of_urls = ["http://auto.com/index.php/car-news/12158-classicauto-cup-2016-photo",....]


new_list = [i for i in list_of_urls if len(re.findall("photo+", i.split()[-1])) == 0 and len(re.findall("tv-special-news+", i.split()[-1])) == 0]
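Note that "photo+" is not anchored, so the comprehension above also rejects URLs where photo appears in the middle. If only a trailing photo should be rejected, plain string methods may be enough. The sketch below is a swapped-in alternative, not this answer's regex approach; the keep helper and the extra URL are hypothetical and not taken from the question.

```python
import re

urls = [
    "http://auto.com/index.php/car-news/12158-classicauto-cup-2016-photo",
    "http://auto.com/index.php/car-news/11654-battle-royale-2014",
    "http://auto.com/index.php/tv-special-news/10480-new-film-4",
    "http://auto.com/index.php/first/12234-new-volvo-xc60",
]

# Hypothetical URL (not from the question): "photo" appears mid-path only.
extra = "http://auto.com/index.php/car-news/100-photo-gallery"


def keep(url):
    """Keep a URL unless it contains tv-special-news or ends with photo."""
    return "tv-special-news" not in url and not url.endswith("photo")


kept = [u for u in urls + [extra] if keep(u)]

# The unanchored regex from the comprehension above does hit the extra URL.
unanchored_hit = bool(re.findall("photo+", extra))

print(kept)
print(unanchored_hit)
```

Which behaviour is wanted depends on whether photo in the middle of a URL should count; the question only mentions photo at the end.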

Answer 4

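You can simply store the links in a list and then iterate over it with a regular expression:

re_pattern = r'\b(?:tv-special-news|photo)\b'

re.findall(re_pattern, link)

(where link is an item from the list)

If the pattern matches, findall returns the matches in a list, so you have to check whether that list is empty. If the list is empty, the link can be included; otherwise, exclude it.

Here is the sample code: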

import re

links = ['http://auto.com/index.php/car-news/12158-classicauto-cup-2016-photo', 'http://auto.com/index.php/car-news/11654-battle-royale-2014', 'http://auto.com/index.php/tv-special-news/10480-new-film-4', 'http://auto.com/index.php/first/12234-new-volvo-xc60']

new_list = []

re_pattern = r'\b(?:tv-special-news|photo)\b'

for link in links:
    # findall returns an empty list when neither word occurs in the link
    result = re.findall(re_pattern, link)
    if len(result) < 1:
        new_list.append(link)

print(new_list)
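To show what the \b word boundaries in this pattern do and do not match, here is a small sketch. The last two URLs are hypothetical and were added only for illustration; the matches dictionary is likewise an illustrative name.

```python
import re

re_pattern = r'\b(?:tv-special-news|photo)\b'

samples = [
    "http://auto.com/index.php/car-news/12158-classicauto-cup-2016-photo",
    "http://auto.com/index.php/tv-special-news/10480-new-film-4",
    "http://auto.com/index.php/first/12234-new-volvo-xc60",
    # Hypothetical URLs, not from the question:
    "http://auto.com/index.php/first/200-photography-tips",
    "http://auto.com/index.php/first/300-photo-gallery",
]

# True means the pattern found one of the two words, so the link would be excluded.
matches = {url: bool(re.search(re_pattern, url)) for url in samples}

for url, hit in matches.items():
    print(hit, url)
```

Because a dash is not a word character, photo is matched as a whole word anywhere in the URL (as in 300-photo-gallery), while longer words such as photography are not matched. If only a trailing photo should exclude a link, one of the anchored patterns from the other answers is a better fit.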

Python Regex - Exclude URLs containing a word
