Question

I'm looking for a regular expression that will find all of the following URLs:

hello.com hello1.com 1hello.com hello-1.com hello-hi1.com 1hello-hi.com h3ll0.com

I've tried a few different regexes, but none of them seems to work.

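import re
string = 'hello.com hello1.com 1hello.com hello-1.com hello-hi1.com 1hello-hi.com h3ll0.com'
regex = re.compile(r'\w+\.(com|org|net)')  # raw string keeps the backslashes literal
data = regex.search(string)                # search() returns only the first match
url = data.group(0)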

I want it to return all of the URLs above.

Answer 1

You can add (?:-\w+)* to your regex so that it allows optional hyphens in the domain part of the URL. You can use this regex:

\w+(?:-\w+)*\.(?:com|org|net)
   ^^^^^^^^^ this allows the URL to have optional hyphen

Unless you actually need the captured text, make the groups non-capturing, since that improves performance.
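As a minimal sketch of how this pattern could be applied to the sample input from the question: re.findall is used here instead of re.search, because search stops at the first match. The variable names text, pattern and urls are only illustrative.

```python
import re

# Sample input copied from the question.
text = "hello.com hello1.com 1hello.com hello-1.com hello-hi1.com 1hello-hi.com h3ll0.com"

# Non-capturing groups, so findall returns the whole match rather than a group.
pattern = re.compile(r"\w+(?:-\w+)*\.(?:com|org|net)")

urls = pattern.findall(text)
print(urls)
```

With this input, findall returns all seven names. If the groups were capturing, findall would return only the captured parts (for example 'com') instead of the full URL, which is another reason to keep them non-capturing here.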

Answer 2

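You could try splitting the string on the '.' separator and then checking whether the last part is in a whitelist such as ['com', 'org', 'net', 'io', ...].

For example: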

whitelist = {'com', 'org', 'net', 'io'}

def is_whitelisted(possible_url):
    # Compare the text after the last dot against the whitelist.
    return possible_url.split('.')[-1] in whitelist

print(is_whitelisted('hello.com'))
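To get every URL out of a longer string with this approach, one option is to split on whitespace first and filter the tokens. This is only a sketch: the function name filter_by_tld and the sample text are made up for illustration, and the extra check for a dot is an assumption added so that a bare word like "com" is not accepted.

```python
whitelist = {"com", "org", "net", "io"}

def filter_by_tld(text, allowed=whitelist):
    """Return whitespace-separated tokens whose last dot-separated part is allowed."""
    return [
        tok for tok in text.split()
        # Require a dot (assumption) and compare the final label case-insensitively.
        if "." in tok and tok.rsplit(".", 1)[-1].lower() in allowed
    ]

sample = "hello.com hello-1.com notaurl h3ll0.com example.xyz com"
print(filter_by_tld(sample))
```

Note that this checks only the ending; it does not validate the characters in the rest of the token.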

Answer 3

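Using a simple regex can cause you to match ordinary words by accident. For example, just using [\w-]+\.(com|org|net) (demo #1) does what you asked, but it misses every other top-level domain, drops subdomains, and matches inside ordinary words.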

This regex may be a better fit: \b\w[-.\w]+\.(com|org|net)\b (demo #2)
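The difference between the two patterns can be seen with a short comparison. The sample text below is invented for illustration; finditer with group(0) is used because both patterns contain a capturing group, so findall would return only the TLD.

```python
import re

# Hypothetical sample: two hosts with subdomains and one ordinary word pair.
text = "see www.hello-hi1.com and mail.example.org, not welcome.community"

simple_re = re.compile(r"[\w-]+\.(com|org|net)")
strict_re = re.compile(r"\b\w[-.\w]+\.(com|org|net)\b")

# group(0) is the whole match, regardless of capturing groups.
simple = [m.group(0) for m in simple_re.finditer(text)]
strict = [m.group(0) for m in strict_re.finditer(text)]

print(simple)  # ['hello-hi1.com', 'example.org', 'welcome.com']
print(strict)  # ['www.hello-hi1.com', 'mail.example.org']
```

The simple pattern loses the subdomains and picks "welcome.com" out of "welcome.community", while the trailing \b in the second pattern rejects it. One caveat of the second pattern: \w[-.\w]+ needs at least two characters before the final dot, so a one-letter name such as a.com would not match.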