I am trying to extract domain.zz, domain.zzz, or domain.zz.zz, optionally followed by /something.
import re
the_string = """lalalla?url=http2F%2Fdomain.zz%slgkfgs0s"""
the_string = """lalalla?url=http2F%2Fdomain.zz.zz/something%slgkfgs0sf"""
the_string = """lalalla?url=randomh564domain.zzz/something%slgkfgs0sf"""
the_string = """lalalla?url=randomeefsdlk876%domain.zz/something%slgkfgs0sf"""
the_string = """p%3A%2F%2Fdummy_test.com/ratata%2F&"""
the_string = """p%3A%2F%2Fdum2test.co.uk/something%2F&-kj"""
This is what I have so far:
>>> print(re.findall(r'(?:www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4})(?:/[a-z0-9]+)', the_string))
domain.zzz/something
domain.zz/something
domain.zz.zz/something
>>> print(re.findall(r'www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}', the_string))
domain.zzz
domain.zz
domain.zz.zz
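I would like to capture both groups with a single expression.
Edit: this one is almost perfect: '([a-z0-9.\-]+[.][a-z]{2,4})|(?:/[a-z0-9]+)' but it also picks up some garbage from the beginning of the string.
The real strings are more random than these examples. I care about these 3 cases: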
domain.co.uk/something
^ ^ ^
domain.com/something
^ ^
domain.com
^
Answer 0 (score: 1)
How about this:
import re
the_string = """lalalla?url=http@domain.zz%slgkfgs0sf"""
the_string = """lalalla?url=http@domain.zz.zz/something%slgkfgs0sf"""
#the_string = """lalalla?url=http@domain.zzz/something%slgkfgs0sf"""
#the_string = """lalalla?url=ht%domain.zz/something%slgkfgs0sf"""
#the_string = """lalalla?url=httpsd%domain.zz.zz/something%slgkfgs0sf"""
#the_string = """lalalla?url=www.domain.zzz/something%slgkfgs0sf"""
test = re.compile(r'(?P<base>[a-zA-Z0-9_\-\.]*?[a-zA-Z0-9_\-]+\.[z\.]+)(?P<extra>/[a-zA-Z0-9_\-]+)')
for match in test.finditer(the_string):
    print(match.group('base'))
    print(match.group('extra'))
Output:
domain.zz.zz
/something
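This way you have the data in both 'base' and 'extra'; concatenate them to get the full string back.
Edit: updated the pattern for better domain matching and changed print to Python 3 syntax.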
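The pattern above only matches when a path is present and only accepts endings made of "z" characters. As a rough sketch, and not the answerer's own pattern, the variant below makes the path optional and accepts any 2-4 letter ending, which should cover the three cases from the question. The name `split_url` and the exact character classes are assumptions made for illustration.

```python
import re

# Hypothetical variant of the answer's pattern (an assumption, not the
# original): one or more "label." parts, a 2-4 letter ending that must not
# be followed by another label character, and an optional "/path" part.
PATTERN = re.compile(
    r'(?P<base>(?:[a-z0-9_\-]+\.)+[a-z]{2,4}(?![a-z0-9_\-]))'
    r'(?P<extra>/[a-z0-9_\-]+)?',
    re.IGNORECASE,
)

def split_url(text):
    """Return (base, extra) for the first match in text, or None."""
    match = PATTERN.search(text)
    if match is None:
        return None
    # 'extra' is None when the optional path group did not take part.
    return match.group('base'), match.group('extra') or ''

base, extra = split_url("lalalla?url=http@domain.zz.zz/something%slgkfgs0sf")
full = base + extra  # the two groups joined back together
```

This sketch does not decode percent-escapes, so for input such as `%2Fdomain.zz` the characters `2F` would still end up in `base`; that case would need extra handling.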
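Answer 1 (score: 1)
Try this. I don't know whether it matches your requirements exactly, but perhaps you can clarify the question, and the pattern can be refined further if something is off...
print(re.findall(r'=(?:[^@%/.]*(?:@|%(?:2F)?))?(?:www.)?(?P<domain>[^%@/]*)(?:/(?P<folder>[^%]*))?(?:[%@/].*)?$', the_string, re.MULTILINE))
If you prefer, you can work with match objects (from re.search or re.finditer) and use match.group('domain')
and match.group('folder')
Output: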
[('domain.zz', ''), ('domain.zz.zz', 'something'), ('randomh564domain.zzz', 'something'), ('domain.zz', 'something'), ('domain.zz.zz', 'something'), ('domain.zzz', 'something')]
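Since re.findall returns plain tuples, the named groups are only reachable through match objects. A minimal sketch of that, reusing the pattern from this answer unchanged and assuming each input string holds a single URL; the helper name `extract` is made up for illustration.

```python
import re

# Pattern copied from the answer above, split over two raw-string pieces.
URL_RE = re.compile(
    r'=(?:[^@%/.]*(?:@|%(?:2F)?))?(?:www.)?'
    r'(?P<domain>[^%@/]*)(?:/(?P<folder>[^%]*))?(?:[%@/].*)?$'
)

def extract(text):
    """Return (domain, folder) using the named groups, or None if no match."""
    match = URL_RE.search(text)
    if match is None:
        return None
    # group('folder') is None when the optional folder part is absent.
    return match.group('domain'), match.group('folder') or ''

samples = [
    "lalalla?url=http@domain.zz%slgkfgs0sf",
    "lalalla?url=http@domain.zz.zz/something%slgkfgs0sf",
    "lalalla?url=www.domain.zzz/something%slgkfgs0sf",
]
for sample in samples:
    print(extract(sample))
```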