I have a list of URLs like this:
http://www.toto.com/bags/handbags/test1/
http://www.toto.com/bags/handbags/smt1/
http://www.toto.com/bags/handbags/test1/test2/
http://www.toto.com/bags/handbags/blabla1/blabla2/
http://www.toto.com/bags/handbags/smt1/smt2/
http://www.toto.com/bags/handbags/smt1/smt2/testing/
http://www.toto.com/bags/handbags/smt1/smt2/testing.html
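What I want is to keep only URLs of the form
http://www.toto.com/something/else/again/more
limited to that; if there is anything more after it, the URL should not be taken.
Can you help me? :)
Answer 0 (score: 2)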
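A suitable regular expression is:
^http://www\.toto\.com/(\w+/){4}$
The dots in the host name are escaped so that they match a literal "." rather than any character.
Filtering example:
>>> import re
>>> for line in lines:
...     if re.match(r'^http://www\.toto\.com/(\w+/){4}$', line):
...         print(line)
...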
http://www.toto.com/bags/handbags/test1/test2/
http://www.toto.com/bags/handbags/blabla1/blabla2/
http://www.toto.com/bags/handbags/smt1/smt2/
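The same filter can be written as a small standalone script. This is only a sketch: the names `PATTERN` and `filter_urls` are made up for illustration, and it assumes Python 3 with the URLs already loaded into a list. It uses `re.fullmatch` instead of `^...$`, so the whole string has to match.

```python
import re

# Exactly four path segments, each followed by a slash, and nothing else.
# The dots in the host are escaped so they only match a literal ".".
PATTERN = re.compile(r'http://www\.toto\.com/(\w+/){4}')


def filter_urls(urls):
    """Return the URLs that consist of the host plus exactly four segments."""
    # strip() drops the trailing newline when the lines come from a file.
    return [u.strip() for u in urls if PATTERN.fullmatch(u.strip())]


lines = [
    "http://www.toto.com/bags/handbags/test1/",
    "http://www.toto.com/bags/handbags/smt1/",
    "http://www.toto.com/bags/handbags/test1/test2/",
    "http://www.toto.com/bags/handbags/blabla1/blabla2/",
    "http://www.toto.com/bags/handbags/smt1/smt2/",
    "http://www.toto.com/bags/handbags/smt1/smt2/testing/",
    "http://www.toto.com/bags/handbags/smt1/smt2/testing.html",
]

for url in filter_urls(lines):
    print(url)
```

To keep a different depth, change the `{4}` quantifier; note that `\w` does not match "-" or ".", so segments containing those characters would be rejected.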
Answer 1 (score: 0)
You can do it like this:
https://regex101.com/r/gK6hR3/1
but add $ at the end of the pattern, so that
http:\/\/www\.[a-zA-Z.-]+\/[a-zA-Z-]+[\/]{0,1}[\.a-zA-Z-]{0,}
becomes:
http:\/\/www\.[a-zA-Z.-]+\/[a-zA-Z-]+[\/]{0,1}[\.a-zA-Z-]{0,}$
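To see what the trailing $ changes, here is a quick check in Python 3. It is a sketch only: the names `UNANCHORED`, `ANCHORED` and `samples` are invented for illustration, and the first two sample URLs are not from the question. Without the anchor, `re.match` is satisfied by a matching prefix; with it, the pattern must reach the end of the string. As written, the anchored pattern appears to accept at most two path segments made of letters and hyphens.

```python
import re

# The pattern from this answer, without and with the end-of-string anchor.
UNANCHORED = re.compile(
    r'http:\/\/www\.[a-zA-Z.-]+\/[a-zA-Z-]+[\/]{0,1}[\.a-zA-Z-]{0,}'
)
ANCHORED = re.compile(UNANCHORED.pattern + '$')

samples = [
    "http://www.toto.com/bags/",                  # one segment
    "http://www.toto.com/bags/handbags",          # two segments
    "http://www.toto.com/bags/handbags/test1/",   # three segments
]

for url in samples:
    # re.match anchors at the start only, so a prefix match is enough
    # for the unanchored pattern.
    print(url, bool(UNANCHORED.match(url)), bool(ANCHORED.match(url)))
```

The character classes contain no digits, so segments such as test1 can only be matched by widening them, for example to [a-zA-Z0-9-].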