Question

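I'm having trouble with this regular expression; I think I'm almost there.

m = re.findall(r'[a-z]{6}\.[a-z]{3}\.[a-z]{2} (?=\" target)', 'http://domain.com.uy " target')

This gives me the "exact" output I want, namely domain.com.uy, but obviously it only works for this example: [a-z]{6} matches exactly six letters, which is not what I want.

I want it to return domain.com.uy, so basically the instruction would be: match any character, searching backwards, until a "/" is encountered.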

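Edit:

m = re.findall(r'\w+\.[a-z]{3}\.[a-z]{2} (?=\" target)', 'http://domain.com.uy " target')

This is very close to what I want, but it does not match "_" or "-".

For completeness, I don't need the http://

I hope the question is clear; if I've left anything open to interpretation, please ask for clarification!

Thanks in advance!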

Answer 1

Another option is to use a positive lookbehind, such as (?<=//):

>>> re.search(r'(?<=//).+(?= \" target)', 
...           'http://domain.com.uy " target').group(0)
'domain.com.uy'

Note that this will also match slashes inside the URL, if that is what you want:

>>> re.search(r'(?<=//).+(?= \" target)',
...           'http://example.com/path/to/whatever " target').group(0)
'example.com/path/to/whatever'

If you only want the bare domain, without any path or query parameters, you can use r'(?<=//)([^/]+)(/.*)?(?= \" target)' and capture group 1:

>>> re.search(r'(?<=//)([^/]+)(/.*)?(?= \" target)',
...           'http://example.com/path/to/whatever " target').groups()
('example.com', '/path/to/whatever')
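
The group-1 pattern above could be wrapped in a small helper along these lines. This is only a sketch: the name extract_domain is made up for illustration, and it assumes the input always ends the URL with the literal ' " target' suffix from the question.

```python
import re

# Same pattern as above: lookbehind for "//", host in group 1,
# optional path in group 2, lookahead for the ' " target' suffix.
DOMAIN_RE = re.compile(r'(?<=//)([^/]+)(/.*)?(?= \" target)')


def extract_domain(text):
    """Return the bare domain found in text, or None if nothing matches.

    Hypothetical helper, not part of the original answer.
    """
    match = DOMAIN_RE.search(text)
    return match.group(1) if match else None


print(extract_domain('http://domain.com.uy " target'))
print(extract_domain('http://example.com/path/to/whatever " target'))
print(extract_domain('no url here'))
```

Returning None instead of calling .group() directly avoids an AttributeError when the input contains no match.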

Answer 2

If a regex is not a requirement and you simply want to extract the FQDN from a URL in Python, use urlparse and str.split():

>>> from urllib.parse import urlparse  # on Python 2: from urlparse import urlparse
>>> url = 'http://domain.com.uy " target'
>>> urlparse(url)
ParseResult(scheme='http', netloc='domain.com.uy " target', path='', params='', query='', fragment='')

This breaks the URL into its component parts. We want netloc:

>>> urlparse(url).netloc
'domain.com.uy " target'

Split on whitespace:

>>> urlparse(url).netloc.split()
['domain.com.uy', '"', 'target']

Take just the first part:

>>> urlparse(url).netloc.split()[0]
'domain.com.uy'

Answer 3

Try this (you may need to escape the / in Python):

/([^/]*)$
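
In Python's re module the forward slash has no special meaning, so it does not need escaping. A quick check of this pattern against the question's sample string (a sketch; the variable names are illustrative) suggests that the capture runs to the end of the string, so it still includes the trailing ' " target' and would have to be trimmed separately:

```python
import re

text = 'http://domain.com.uy " target'

# Capture everything after the last "/" up to the end of the string.
match = re.search(r'/([^/]*)$', text)
tail = match.group(1)
print(tail)    # domain.com.uy " target

# One possible way to drop the trailing text: keep the first token.
domain = tail.split()[0]
print(domain)  # domain.com.uy
```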

Answer 4

It's quite simple:

[^/]+(?= " target)

But note that for http://domain.com/folder/site.php this will not return the domain name. Also remember to escape the regex properly inside the string.
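
As a rough illustration of both points, the pattern can be written as a Python raw string so that it reaches re unchanged. Because [^/] accepts anything except a slash, hyphens and underscores are matched too (\w on its own covers the underscore but not the hyphen). The sample host my-site_1.com.uy is invented for this sketch.

```python
import re

# A raw string keeps the regex exactly as written.
PATTERN = re.compile(r'[^/]+(?= " target)')

# Hyphens and underscores are fine: [^/] accepts anything but "/".
print(PATTERN.findall('http://my-site_1.com.uy " target'))
# ['my-site_1.com.uy']

# The caveat: with a path, the last path segment comes back instead.
print(PATTERN.findall('http://domain.com/folder/site.php " target'))
# ['site.php']
```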

Regex to return all characters up to a "/", searching backwards
