Question

I need to extract URLs that are not wrapped in an <a> tag.

Example:

<a href="http://www.google.com">Google</a>, http://example.com, www.example.com

I need to get:

http://example.com
www.example.com

Thanks for your help. I am using Google Translate, so sorry for any mistakes.

Answer 1

You can use this regex:

(?<!href=(?:"|'))((https?|ftp):\/\/(\w+\.)+(\w{1,3})(\/[^\s]*)?)

Extract the data from the 1st capture group.

Live Demo on Regex101

How it works:

(?<!          # Negative Lookbehind to fail if inside a href of an <a> tag
  href=         # href=
  (?:"|')       # Opening ' OR "
)
(             # Capture URL (Capture Group #1)
  (https?|ftp)  # Protocol (HTTP, HTTPS or FTP)
  :\/\/         # ://
  (\w+\.)+      # Domain and Sub-Domains
  (\w{1,3})     # Final TLD (.com, .org, .uk)
  (             # Path after Website (Optional)
    \/          # /
    [^\s]*      # Any non-whitespace character, any number of times (including 0)
  )?
)
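
For reference, here is a minimal sketch of how the pattern above could be applied in Python, assuming the standard re module. The function name extract_bare_urls and the sample string are only for illustration. Note that the pattern requires a scheme (http, https or ftp), so the bare www.example.com from the question is not matched by it.

```python
import re

# Pattern from this answer, unchanged; capture group 1 holds the whole URL.
URL_PATTERN = re.compile(
    r"""(?<!href=(?:"|'))((https?|ftp):\/\/(\w+\.)+(\w{1,3})(\/[^\s]*)?)"""
)

def extract_bare_urls(text):
    """Return group 1 of every match, i.e. URLs not directly preceded by href=" or href='."""
    return [m.group(1) for m in URL_PATTERN.finditer(text)]

sample = '<a href="http://www.google.com">Google</a>, http://example.com, www.example.com'
print(extract_bare_urls(sample))  # ['http://example.com']
```

The lookbehind has a fixed width (href= plus one quote character), which is why it should also be accepted by engines that do not allow variable-length lookbehinds.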

Answer 2

This regex captures all URLs; the way I wrote it, it should not select the ones that are inside a tag:

((?:https?|ftp):\/\/(?:\S*?\.\S*?))(?:[\s)\[\]{},;"\':<]|\.\s|$)

However, it takes a lot of steps and is quite resource-intensive.
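
A quick way to check this pattern is to run it over the question's sample, sketched here in Python with the standard re module (the variable names are only illustrative). In this check the pattern also returned the URL inside the href attribute, because the closing quote counts as one of the allowed terminators, so it may need a lookbehind or a filtering step on top.

```python
import re

# Pattern from this answer, unchanged; group 1 is the URL without its trailing delimiter.
URL_PATTERN = re.compile(
    r"""((?:https?|ftp):\/\/(?:\S*?\.\S*?))(?:[\s)\[\]{},;"\':<]|\.\s|$)"""
)

sample = '<a href="http://www.google.com">Google</a>, http://example.com, www.example.com'

# With a single capturing group, findall returns the list of group 1 strings.
found = URL_PATTERN.findall(sample)
print(found)  # ['http://www.google.com', 'http://example.com']
```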

Answer 3

Using a regex for this is not recommended, but if you really need one, here it is:

\<a[^\>]+?[\'|\"](http[^\'|\"]+)[\'|\"]

It works with both http and https links.

Test it here: https://regex101.com/r/xM7mM0/1
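
As a rough illustration of what this first pattern does, here is a small Python sketch using the standard re module (the names are only illustrative). It picks up the URL inside the <a> tag, which is the opposite of what the question asks for.

```python
import re

# First pattern from this answer, unchanged; group 1 is the quoted http(s) URL inside an <a ...> tag.
ANCHOR_URL_PATTERN = re.compile(r'''\<a[^\>]+?[\'|\"](http[^\'|\"]+)[\'|\"]''')

sample = '<a href="http://www.google.com">Google</a>, http://example.com, www.example.com'

anchor_urls = ANCHOR_URL_PATTERN.findall(sample)
print(anchor_urls)  # ['http://www.google.com']
```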

Update:

This one captures every item in:

<a href="http://www.google.com">Google</a>, http://example.com, www.example.com

Regex:

((?:http\:\/\/)?(?:www\.)?\w+\.com)

Test it here:

https://regex101.com/r/xM7mM0/2

Update 2:

This one selects only the links that are not inside an <a> tag:

((?:http\:\/\/)?(?:www\.)?\w+\.com)[^\"\']

Test it here: https://regex101.com/r/xM7mM0/3
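
One caveat with the Update 2 pattern: the trailing [^\"\'] has to consume a character, so a URL at the very end of the input (such as the last www.example.com when nothing follows it) is not matched. A possible variation, sketched here in Python as an assumption and not as the pattern tested in the link above, swaps that character class for a negative lookahead. Like the original, it only handles .com domains.

```python
import re

# Variation on the Update 2 pattern (an assumption, not the tested original):
# the trailing [^"'] is replaced by a negative lookahead, so nothing after the URL is consumed.
BARE_COM_URL = re.compile(r'''((?:http://)?(?:www\.)?\w+\.com)(?!["'])''')

def extract_unquoted_com_urls(text):
    """Return .com URLs that are not immediately followed by a quote character."""
    return BARE_COM_URL.findall(text)

sample = '<a href="http://www.google.com">Google</a>, http://example.com, www.example.com'
print(extract_unquoted_com_urls(sample))  # ['http://example.com', 'www.example.com']
```

This relies on the href value being quoted, since the closing quote is the only thing that rules the linked URL out.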

Note that a regex can only be optimized once we know the exact scenario in which you want to use it.

Regex to extract URLs that are not inside a hyperlink
