Question

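I have a URL: http://200.73.81.212/.CREDIT-UNION/update.php. None of the regular expressions I have found or written myself match it. I am working on a phishing email dataset, and it contains a lot of odd hyperlinks. This is one of my attempts:
https?:\/\/([a-zA-z0-9]+.)+)|(www.[a-zA-Z0-9]+.([a-zA-Z0-9]+\.[a-zA-Z0-9]+)+)(((/[\.A-Za-z0-9]+))+/?
Needless to say, it did not work. I am working in Python.
Edit:
I need a regex that captures this kind of URL as well as any ordinary hyperlink, for example: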
https://cnn.com/
www.foxnews.com/story/122345678
Any ideas?

Answer 1

How about something like this?

import re

phish = re.compile(r'''(?P<http>http\://)
                        (?P<ipaddress>(([0-9]*(\.)?)[0-9]*)*)/\.
                        (?P<name>(\.)?([A-Za-z]*)(\-)?([A-Za-z]*))/
                        (?P<ending>(update\.php))''', re.VERBOSE)

example_string = 'http://200.73.81.212/.CREDIT-UNION/update.php'

found_matches = []
# check that matches actually exist in input string
if phish.search(example_string):
    # in case there are many matches, iterate over them
    for mtch in phish.finditer(example_string):
        # and append matches to master list
        found_matches.append(mtch.group(0))

print(found_matches)
# ['http://200.73.81.212/.CREDIT-UNION/update.php']

This is flexible enough that if you have endings other than update.php, you can include them in the named capture group by separating the alternatives with |, i.e.

(update\.php|remove\.php|...)
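
As a rough sketch of that idea, the pattern below lists two endings in the `ending` group. The name `phish_multi`, the `remove.php` ending, and the second and third test URLs are invented for illustration, and the `ipaddress` and `name` groups are simplified compared with the pattern above.

```python
import re

# Hypothetical variant: the `ending` group accepts several file names
# separated by `|`. `remove.php` is an invented example ending.
phish_multi = re.compile(r'''(?P<http>http://)
                             (?P<ipaddress>[0-9]+(\.[0-9]+)*)/\.
                             (?P<name>[A-Za-z]+(-[A-Za-z]+)?)/
                             (?P<ending>update\.php|remove\.php)''', re.VERBOSE)

urls = [
    'http://200.73.81.212/.CREDIT-UNION/update.php',
    'http://200.73.81.212/.CREDIT-UNION/remove.php',  # invented URL
    'http://200.73.81.212/.CREDIT-UNION/index.php',   # ending not in the group
]

# Record which ending was matched, or None when the URL is rejected.
endings = []
for url in urls:
    m = phish_multi.search(url)
    endings.append(m.group('ending') if m else None)

print(endings)
# ['update.php', 'remove.php', None]
```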

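Also, the ipaddress named capture group accepts any run of digits and periods, such as 123.23.123.12; it does not enforce a fixed number of digits or a fixed pattern of periods. As far as I know, each part of an IP address has at most 3 digits, so you can pin that down with braces to make sure the right number of digits is matched:

[0-9]{1,3}\. # minimum of 1 digit, maximum of 3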
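
To illustrate what the bounded repetition changes, here is a minimal sketch. The names `ip_pattern` and `spaced` and the candidate strings are assumptions made up for this example. Note that the braces only limit the number of digits, so a value above 255 such as 999 still passes, and that Python's `re` does not read braces containing a space as a quantifier.

```python
import re

# Assumed helper pattern: four dot-separated groups of 1 to 3 digits.
ip_pattern = re.compile(r'(?:[0-9]{1,3}\.){3}[0-9]{1,3}')

candidates = ['200.73.81.212', '8.8.8.8', '1234.5.6.7', '999.1.1.1']
results = {c: bool(ip_pattern.fullmatch(c)) for c in candidates}
print(results)
# {'200.73.81.212': True, '8.8.8.8': True, '1234.5.6.7': False, '999.1.1.1': True}

# With a space inside the braces the `{1, 3}` is treated as literal text,
# so this pattern does not match a plain two-digit string.
spaced = re.compile(r'[0-9]{1, 3}')
print(spaced.fullmatch('12'))
# None
```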

Answer 2

While @datawrestler's answer works for the original question, I had to extend it to capture more URLs (I have edited the question). This regex seems to do the job:
(r"(https?://www\.[a-zA-Z0-9]+(\.[a-zA-Z0-9]+)+(/[a-zA-Z0-9.@-]+){0,20})|"
 r"(https?://[a-zA-Z0-9]+(\.[a-zA-Z0-9]+)+(/[a-zA-Z0-9.@-]+){0,20})|"
 r"(www\.[a-zA-Z0-9]+(\.[a-zA-Z0-9]+)+(/[a-zA-Z0-9.@-]+){0,20})")
It has three alternatives: https?://www, https?://domain, and www.domain.
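
A short sketch of how this pattern might be applied to running text with `finditer`. The name `url_pattern` and the sample sentence are assumptions; the three alternatives are joined as adjacent raw strings, and the dot after `www` is escaped.

```python
import re

# The three alternatives, joined by `|` via adjacent raw string literals.
url_pattern = re.compile(
    r"(https?://www\.[a-zA-Z0-9]+(\.[a-zA-Z0-9]+)+(/[a-zA-Z0-9.@-]+){0,20})|"
    r"(https?://[a-zA-Z0-9]+(\.[a-zA-Z0-9]+)+(/[a-zA-Z0-9.@-]+){0,20})|"
    r"(www\.[a-zA-Z0-9]+(\.[a-zA-Z0-9]+)+(/[a-zA-Z0-9.@-]+){0,20})"
)

# Invented sample text containing the three URLs from the question.
sample = (
    "Verify at http://200.73.81.212/.CREDIT-UNION/update.php today, "
    "or read https://cnn.com/ and www.foxnews.com/story/122345678 instead."
)

# group(0) is the whole match, whichever alternative produced it.
found = [m.group(0) for m in url_pattern.finditer(sample)]
print(found)
# ['http://200.73.81.212/.CREDIT-UNION/update.php', 'https://cnn.com', 'www.foxnews.com/story/122345678']
```

The trailing slash of https://cnn.com/ is not part of the match, because each path segment needs at least one character after the slash. The path character class also includes the period, so a URL followed directly by a sentence-ending period would pick that period up as well.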

Regex to capture URLs
