Regexp for specific URLs

Time: 2016-05-02 01:04:23

Tags: python regex

I have a list of URLs like this:

http://www.toto.com/bags/handbags/test1/
http://www.toto.com/bags/handbags/smt1/
http://www.toto.com/bags/handbags/test1/test2/
http://www.toto.com/bags/handbags/blabla1/blabla2/
http://www.toto.com/bags/handbags/smt1/smt2/
http://www.toto.com/bags/handbags/smt1/smt2/testing/
http://www.toto.com/bags/handbags/smt1/smt2/testing.html

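What I want is to keep only URLs of this form:

http://www.toto.com/something/else/again/more

The match should stop at that depth: if a URL has anything more after it, it should not be taken.

Can you help me? :)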

2 answers:

Answer 0 (score: 2)

The appropriate regex is:

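^http://www\.toto\.com/(\w+/){4}$

Filtering example:

>>> import re
>>> # lines is assumed to hold the URLs from the question, one per element
>>> for line in lines:
...     if re.match(r'^http://www\.toto\.com/(\w+/){4}$', line):
...         print(line)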
... 
http://www.toto.com/bags/handbags/test1/test2/
http://www.toto.com/bags/handbags/blabla1/blabla2/
http://www.toto.com/bags/handbags/smt1/smt2/
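
For reference, the same filter can be sketched as a self-contained Python 3 script. The function name `filter_by_depth` and its `depth` and `host` parameters are assumptions made for illustration, not part of the answer. The sketch escapes the host name with `re.escape` and uses `fullmatch` in place of the explicit `^` and `$` anchors.

```python
import re

# Sample URLs copied from the question.
URLS = [
    "http://www.toto.com/bags/handbags/test1/",
    "http://www.toto.com/bags/handbags/smt1/",
    "http://www.toto.com/bags/handbags/test1/test2/",
    "http://www.toto.com/bags/handbags/blabla1/blabla2/",
    "http://www.toto.com/bags/handbags/smt1/smt2/",
    "http://www.toto.com/bags/handbags/smt1/smt2/testing/",
    "http://www.toto.com/bags/handbags/smt1/smt2/testing.html",
]


def filter_by_depth(urls, depth=4, host="www.toto.com"):
    """Keep URLs on `host` whose path is exactly `depth` segments, each followed by '/'.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    # re.escape makes the dots in the host name literal.
    pattern = re.compile(r"http://%s/(?:\w+/){%d}" % (re.escape(host), depth))
    # fullmatch requires the whole string to match, like ^...$ in the answer.
    return [u for u in urls if pattern.fullmatch(u)]


matches = filter_by_depth(URLS)
for url in matches:
    print(url)
```

Because `\w` matches only letters, digits and underscores, a segment containing a hyphen or a dot (such as `testing.html`) is rejected; widen the character class if such segments should count.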

Answer 1 (score: 0)

You can do it like this:

https://regex101.com/r/gK6hR3/1

but add $ at the end of

http:\/\/www\.[a-zA-Z.-]+\/[a-zA-Z-]+[\/]{0,1}[\.a-zA-Z-]{0,}

like this:

http:\/\/www\.[a-zA-Z.-]+\/[a-zA-Z-]+[\/]{0,1}[\.a-zA-Z-]{0,}$
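
To see what the trailing `$` changes, here is a minimal Python sketch comparing the two patterns. The helper `accepts` and the first sample URL are assumptions for illustration; the second URL comes from the question.

```python
import re

# Pattern from this answer, without and with the trailing '$'.
UNANCHORED = re.compile(r"http:\/\/www\.[a-zA-Z.-]+\/[a-zA-Z-]+[\/]{0,1}[\.a-zA-Z-]{0,}")
ANCHORED = re.compile(UNANCHORED.pattern + "$")

# Illustrative inputs: the first is hypothetical, the second is from the question.
short_url = "http://www.toto.com/bags/handbags"
deep_url = "http://www.toto.com/bags/handbags/test1/test2/"


def accepts(pattern, url):
    """Return True if `pattern` matches starting at the beginning of `url`."""
    return pattern.match(url) is not None


for url in (short_url, deep_url):
    print(url, accepts(UNANCHORED, url), accepts(ANCHORED, url))
```

Without `$`, `re.match` succeeds as soon as a prefix of the URL fits, so the deeper URL is accepted. With `$`, the whole string must fit the pattern. As written, the pattern allows at most two path components and no digits, so the anchored version rejects the deeper URL.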