Question

有没有人知道我可以用来查找字符串中的URL的正则表达式？我在Google上发现了很多正则表达式，用于确定整个字符串是否为URL，但我需要能够在整个字符串中搜索URL。例如，我希望能够在以下字符串中找到www.google.com和http://yahoo.com：

Hello www.google.com World http://yahoo.com

我不是在寻找字符串中的特定URL。我正在寻找字符串中的所有URL，这就是我需要正则表达式的原因。

Answer 1

这是我使用的那个

(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?

适合我，也适合你。

Answer 2

猜猜没有正则表达式适合这种用法。我发现了一个相当稳固的here

/(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)(?:\([-A-Z0-9+&@#\/%=~_|$?!:,.]*\)|[-A-Z0-9+&@#\/%=~_|$?!:,.])*(?:\([-A-Z0-9+&@#\/%=~_|$?!:,.]*\)|[A-Z0-9+&@#\/%=~_|$])/igm

与此处发布的其他差异/优势相比：

不匹配电子邮件地址
匹配localhost：12345
如果没有moo.com或http

www

有关示例，请参阅here

Answer 3

text = """The link of this question: https://stackoverflow.com/questions/6038061/regular-expression-to-find-urls-within-a-string
Also there are some urls: www.google.com, facebook.com, http://test.com/method?param=wasd
The code below catches all urls in text and returns urls in list."""

urls = re.findall('(?:(?:https?|ftp):\/\/)?[\w/\-?=%.]+\.[\w/\-?=%.]+', text)
print(urls)

输出：

[
    'https://stackoverflow.com/questions/6038061/regular-expression-to-find-urls-within-a-string', 
    'www.google.com', 
    'facebook.com',
    'http://test.com/method?param=wasd'
]

Answer 4

这里提供的解决方案都没有解决我遇到的问题/用例。

我在这里提供的是迄今为止我发现/制作的最好的。当我发现它无法处理的新边缘情况时，我会更新它。

\b
  #Word cannot begin with special characters
  (?<![@.,%&#-])
  #Protocols are optional, but take them with us if they are present
  (?<protocol>\w{2,10}:\/\/)?
  #Domains have to be of a length of 1 chars or greater
  ((?:\w|\&\#\d{1,5};)[.-]?)+
  #The domain ending has to be between 2 to 15 characters
  (\.([a-z]{2,15})
       #If no domain ending we want a port, only if a protocol is specified
       |(?(protocol)(?:\:\d{1,6})|(?!)))
\b
#Word cannot end with @ (made to catch emails)
(?![@])
#We accept any number of slugs, given we have a char after the slash
(\/)?
#If we have endings like ?=fds include the ending
(?:([\w\d\?\-=#:%@&.;])+(?:\/(?:([\w\d\?\-=#:%@&;.])+))*)?
#The last char cannot be one of these symbols .,?!,- exclude these
(?<![.,?!-])

Answer 5

我认为这个正则表达式模式正好能够处理你想要的东西

/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/

这是一个提取Url的摘录示例：

// The Regular Expression filter
$reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/";

// The Text you want to filter for urls
$text = "The text you want  https://stackoverflow.com/questions/6038061/regular-expression-to-find-urls-within-a-string to filter goes here.";

// Check if there is a url in the text
preg_match_all($reg_exUrl, $text, $url,$matches);
var_dump($matches);

Answer 6

以上所有答案都与网址中的Unicode字符不匹配，例如：http://google.com?query=đức+filan+đã+search

对于解决方案，这个应该可以工作：

(ftp:\/\/|www\.|https?:\/\/){1}[a-zA-Z0-9u00a1-\uffff0-]{2,}\.[a-zA-Z0-9u00a1-\uffff0-]{2,}(\S*)

Answer 7

这里有更多优化的正则表达式：

(?:(?:(https?|ftp|file):\/\/|www\.|ftp\.)|([\w\-_]+(?:\.|\s*\[dot\]\s*[A-Z\-_]+)+))([A-Z\-\.,@?^=%&amp;:\/~\+#]*[A-Z\-\@?^=%&amp;\/~\+#]){2,6}?

这里正在测试数据：https://regex101.com/r/sFzzpY/6

Answer 8

如果您必须严格选择链接，我会选择：

(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))

如需更多信息，请阅读：

An Improved Liberal, Accurate Regex Pattern for Matching URLs

Answer 9

如果你有url模式，你应该能够在你的字符串中搜索它。只需确保该模式没有^和$标记url字符串的开头和结尾。因此，如果P是URL的模式，请查找P的匹配。

Answer 10

我发现this涵盖了大多数示例链接，包括子目录部分。

正则表达式为：

(?:(?:https?|ftp):\/\/|\b(?:[a-z\d]+\.))(?:(?:[^\s()<>]+|\((?:[^\s()<>]+|(?:\([^\s()<>]+\)))?\))+(?:\((?:[^\s()<>]+|(?:\(?:[^\s()<>]+\)))?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))?

Answer 11

这个正则表达式非常适合我，也应该适合你

(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?

Answer 12

自己动手制作一个

let regex = /([\w+]+\:\/\/)?([\w\d-]+\.)*[\w-]+[\.\:]\w+([\/\?\=\&\#]?[\w-]+)*\/?/gm

它适用于以下所有域：

https://www.facebook.com
https://app-1.number123.com
http://facebook.com
ftp://facebook.com
http://localhost:3000
localhost:3000/
unitedkingdomurl.co.uk
this.is.a.url.com/its/still=going?wow
shop.facebook.org
app.number123.com
app1.number123.com
app-1.numbEr123.com
app.dashes-dash.com
www.facebook.com
facebook.com
fb.com/hello_123
fb.com/hel-lo
fb.com/hello/goodbye
fb.com/hello/goodbye?okay
fb.com/hello/goodbye?okay=alright
Hello www.google.com World http://yahoo.com
https://www.google.com.tr/admin/subPage?qs1=sss1&qs2=sss2&qs3=sss3#Services
https://google.com.tr/test/subPage?qs1=sss1&qs2=sss2&qs3=sss3#Services
http://google.com/test/subPage?qs1=sss1&qs2=sss2&qs3=sss3#Services
ftp://google.com/test/subPage?qs1=sss1&qs2=sss2&qs3=sss3#Services
www.google.com.tr/test/subPage?qs1=sss1&qs2=sss2&qs3=sss3#Services
www.google.com/test/subPage?qs1=sss1&qs2=sss2&qs3=sss3#Services
drive.google.com/test/subPage?qs1=sss1&qs2=sss2&qs3=sss3#Services
https://www.example.pl
http://www.example.com
www.example.pl
example.com
http://blog.example.com
http://www.example.com/product
http://www.example.com/products?id=1&page=2
http://www.example.com#up
http://255.255.255.255
255.255.255.255

您可以see how it performs here on regex101并根据需要进行调整

Answer 13

这很简单。

使用此模式：\b((ftp|https?)://)?([\w-\.]+\.(com|net|org|gov|mil|int|edu|info|me)|(\d+\.\d+\.\d+\.\d+))(:\d+)?(\/[\w-\/]*(\?\w*(=\w+)*[&\w-=]*)*(#[\w-]+)*)?

它与包含以下内容的任何链接匹配：

允许的协议：http，https和ftp

允许的域：* .com，*。net，*。org，*。gov，*。mil，*。int，*。edu，*。info和* .me OR IP

允许的端口：true

允许的参数：true

允许的哈希值：true

Answer 14

使用@JustinLevene提供的正则表达式在反斜杠上没有正确的转义序列。已更新为现在正确，并添加了条件以匹配FTP协议：将匹配所有带有或不带有协议，没有“ www”的url。

代码：null

示例：https://regex101.com/r/uQ9aL4/65

Answer 15

(?:vnc|s3|ssh|scp|sftp|ftp|http|https)\:\/\/[\w\.]+(?:\:?\d{0,5})|(?:mailto|)\:[\w\.]+\@[\w\.]+

如果您想对每个部分进行解释，请尝试regexr [。] com，您将对每个字符都有很好的解释。

这用“ |”分隔或“ OR”，因为并非所有可用的URI都带有“ //”，因此您可以在此处创建感兴趣的匹配方案或条件的列表。

Answer 16

我利用了c＃Uri类，它可以很好地与IP地址，本地主机配合使用

 public static bool CheckURLIsValid(string url)
    {
        Uri returnURL;

       return (Uri.TryCreate(url, UriKind.Absolute, out returnURL)
           && (returnURL.Scheme == Uri.UriSchemeHttp || returnURL.Scheme == Uri.UriSchemeHttps));


    }

Answer 17

我喜欢Stefan Henze的解决方案，但它的效率为34.56。它太笼统了，我还没有解析HTML。一个网址有4个锚点；

www

http：\（和co），

。然后是字母，然后是/，

或字母。其中之一：https://ftp.isc.org/www/survey/reports/current/bynum.txt。

我从该线程中使用了很多信息。谢谢大家。

"(((((http|ftp|https|gopher|telnet|file|localhost):\\/\\/)|(www\\.)|(xn--)){1}([\\w_-]+(?:(?:\\.[\\w_-]+)+))([\\w.,@?^=%&:\\/~+#-]*[\\w@?^=%&\\/~+#-])?)|(([\\w_-]{2,200}(?:(?:\\.[\\w_-]+)*))((\\.[\\w_-]+\\/([\\w.,@?^=%&:\\/~+#-]*[\\w@?^=%&\\/~+#-])?)|(\\.((org|com|net|edu|gov|mil|int|arpa|biz|info|unknown|one|ninja|network|host|coop|tech)|(jp|br|it|cn|mx|ar|nl|pl|ru|tr|tw|za|be|uk|eg|es|fi|pt|th|nz|cz|hu|gr|dk|il|sg|uy|lt|ua|ie|ir|ve|kz|ec|rs|sk|py|bg|hk|eu|ee|md|is|my|lv|gt|pk|ni|by|ae|kr|su|vn|cy|am|ke))))))(?!(((ttp|tp|ttps):\\/\\/)|(ww\\.)|(n--)))"

除字符串“ eurls：www.google.com，facebook.com，http：//test.com/”之类的字符串作为单个字符串返回外，以上内容几乎解决了所有问题。 TBH IDK为什么我添加了地鼠等。证明R代码

if(T){
  wierdurl<-vector()
  wierdurl[1]<-"https://JP納豆.例.jp/dir1/納豆 "
  wierdurl[2]<-"xn--jp-cd2fp15c.xn--fsq.jp "
  wierdurl[3]<-"http://52.221.161.242/2018/11/23/biofourmis-collab"
  wierdurl[4]<-"https://12000.org/ "
  wierdurl[5]<-"  https://vg-1.com/?page_id=1002 "
  wierdurl[6]<-"https://3dnews.ru/822878"
  wierdurl[7]<-"The link of this question: https://stackoverflow.com/questions/6038061/regular-expression-to-find-urls-within-a-string
  Also there are some urls: www.google.com, facebook.com, http://test.com/method?param=wasd
  The code below catches all urls in text and returns urls in list. "
  wierdurl[8]<-"Thelinkofthisquestion:https://stackoverflow.com/questions/6038061/regular-expression-to-find-urls-within-a-string
  Alsotherearesomeurls:www.google.com,facebook.com,http://test.com/method?param=wasd
  Thecodebelowcatchesallurlsintextandreturnsurlsinlist. "
  wierdurl[9]<-"Thelinkofthisquestion:https://stackoverflow.com/questions/6038061/regular-expression-to-find-urls-within-a-stringAlsotherearesomeurlsZwww.google.com,facebook.com,http://test.com/method?param=wasdThecodebelowcatchesallurlsintextandreturnsurlsinlist."
  wierdurl[10]<-"1facebook.com/1res"
  wierdurl[11]<-"1facebook.com/1res/wat.txt"
  wierdurl[12]<-"www.e "
  wierdurl[13]<-"is this the file.txt i need"
  wierdurl[14]<-"xn--jp-cd2fp15c.xn--fsq.jpinspiredby "
  wierdurl[15]<-"[xn--jp-cd2fp15c.xn--fsq.jp/inspiredby "
  wierdurl[16]<-"xnto--jpto-cd2fp15c.xnto--fsq.jpinspiredby "
  wierdurl[17]<-"fsety--fwdvg-gertu56.ffuoiw--ffwsx.3dinspiredby "
  wierdurl[18]<-"://3dnews.ru/822878 "
  wierdurl[19]<-" http://mywebsite.com/msn.co.uk "
  wierdurl[20]<-" 2.0http://www.abe.hip "
  wierdurl[21]<-"www.abe.hip"
  wierdurl[22]<-"hardware/software/data"
  regexstring<-vector()
  regexstring[2]<-"(http|ftp|https)://([\\w_-]+(?:(?:\\.[\\w_-]+)+))([\\w.,@?^=%&:/~+#-]*[\\w@?^=%&/~+#-])?"
  regexstring[3]<-"/(?:(?:https?|ftp|file):\\/\\/|www\\.|ftp\\.)(?:\\([-A-Z0-9+&@#\\/%=~_|$?!:,.]*\\)|[-A-Z0-9+&@#\\/%=~_|$?!:,.])*(?:\\([-A-Z0-9+&@#\\/%=~_|$?!:,.]*\\)|[A-Z0-9+&@#\\/%=~_|$])/igm"
  regexstring[4]<-"[a-zA-Z0-9\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF]?"
  regexstring[5]<-"((http|ftp|https)\\:\\/\\/)?([\\w_-]+(?:(?:\\.[\\w_-]+)+))([\\w.,@?^=%&:/~+#-]*[\\w@?^=%&/~+#-])?"
  regexstring[6]<-"((http|ftp|https):\\/\\/)?([\\w_-]+(?:(?:\\.[\\w_-]+)+))([\\w.,@?^=%&:\\/~+#-]*[\\w@?^=%&\\/~+#-])?"
  regexstring[7]<-"(http|ftp|https)(:\\/\\/)([\\w_-]+(?:(?:\\.[\\w_-]+)+))([\\w.,@?^=%&:/~+#-]*[\\w@?^=%&/~+#-])?"
  regexstring[8]<-"(?:(?:https?|ftp|file):\\/\\/|www\\.|ftp\\.)(?:\\([-A-Z0-9+&@#/%=~_|$?!:,.]*\\)|[-A-Z0-9+&@#/%=~_|$?!:,.])*(?:\\([-A-Z0-9+&@#/%=~_|$?!:,.]*\\)|[A-Z0-9+&@#/%=~_|$])"
  regexstring[10]<-"((http[s]?|ftp):\\/)?\\/?([^:\\/\\s]+)((\\/\\w+)*\\/)([\\w\\-\\.]+[^#?\\s]+)(.*)?(#[\\w\\-]+)?"
  regexstring[12]<-"http[s:/]+[[:alnum:]./]+"
  regexstring[9]<-"http[s:/]+[[:alnum:]./]+" #in DLpages 230
  regexstring[1]<-"[[:alnum:]-]+?[.][:alnum:]+?(?=[/ :])" #in link_graphs 50
  regexstring[13]<-"^(?!mailto:)(?:(?:http|https|ftp)://)(?:\\S+(?::\\S*)?@)?(?:(?:(?:[1-9]\\d?|1\\d\\d|2[01]\\d|22[0-3])(?:\\.(?:1?\\d{1,2}|2[0-4]\\d|25[0-5])){2}(?:\\.(?:[0-9]\\d?|1\\d\\d|2[0-4]\\d|25[0-4]))|(?:(?:[a-z\\u00a1-\\uffff0-9]+-?)*[a-z\\u00a1-\\uffff0-9]+)(?:\\.(?:[a-z\\u00a1-\\uffff0-9]+-?)*[a-z\\u00a1-\\uffff0-9]+)*(?:\\.(?:[a-z\\u00a1-\\uffff]{2,})))|localhost)(?::\\d{2,5})?(?:(/|\\?|#)[^\\s]*)?$"
  regexstring[14]<-"(((((http|ftp|https):\\/\\/)|(www\\.)|(xn--)){1}([\\w_-]+(?:(?:\\.[\\w_-]+)+))([\\w.,@?^=%&:\\/~+#-]*[\\w@?^=%&\\/~+#-])?)|(([\\w_-]+(?:(?:\\.[\\w_-]+)*))((\\.((org|com|net|edu|gov|mil|int)|(([:alpha:]{2})(?=[, ]))))|([\\/]([\\w.,@?^=%&:\\/~+#-]*[\\w@?^=%&\\/~+#-])?))))(?!(((ttp|tp|ttps):\\/\\/)|(ww\\.)|(n--)))"
  regexstring[15]<-"(((((http|ftp|https|gopher|telnet|file|localhost):\\/\\/)|(www\\.)|(xn--)){1}([\\w_-]+(?:(?:\\.[\\w_-]+)+))([\\w.,@?^=%&:\\/~+#-]*[\\w@?^=%&\\/~+#-])?)|(([\\w_-]{2,200}(?:(?:\\.[\\w_-]+)*))((\\.[\\w_-]+\\/([\\w.,@?^=%&:\\/~+#-]*[\\w@?^=%&\\/~+#-])?)|(\\.((org|com|net|edu|gov|mil|int|arpa|biz|info|unknown|one|ninja|network|host|coop|tech)|(jp|br|it|cn|mx|ar|nl|pl|ru|tr|tw|za|be|uk|eg|es|fi|pt|th|nz|cz|hu|gr|dk|il|sg|uy|lt|ua|ie|ir|ve|kz|ec|rs|sk|py|bg|hk|eu|ee|md|is|my|lv|gt|pk|ni|by|ae|kr|su|vn|cy|am|ke))))))(?!(((ttp|tp|ttps):\\/\\/)|(ww\\.)|(n--)))"
    }

for(i in wierdurl){#c(7,22)
  for(c in regexstring[c(15)]) {
    print(paste(i,which(regexstring==c)))
    print(str_extract_all(i,c))
  }
}

Answer 18

这个怎么样？

(http:\/\/|ftp:\/\/|https:\/\/|www\.)([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:\/~+#-]*[\w@?^=%&\/~+#-])?

它在问题中都匹配。

Answer 19

这个稍微简单的 GooDeeJAY 答案版本对我很有用（并且支持例如 # 和其他字符，但会增加“误报”）：

import re
text = """The link of this question: https://stackoverflow.com/questions/6038061/regular-expression-to-find-urls-within-a-string
Also there are some urls: www.google.com, facebook.com, http://test.com/method?param=wasd, http://test.com/method?param=wasd&params2=kjhdkjshd#changed
The code below catches all urls in text and returns urls in list."""

regex = r"(?i)(https?://|www.|\w+\.)[^\s]+"
urls = [match.group() for match in re.finditer(regex, text)]
print(urls)

和输出

[
'https://stackoverflow.com/questions/6038061/regular-expression-to-find-urls-within-a-string', 
'www.google.com,', 
'facebook.com,', 
'http://test.com/method?param=wasd,', 
'http://test.com/method?param=wasd&params2=kjhdkjshd#changed'
]

Answer 20

这是我创建的，它似乎工作得很好。如果您发现任何错误，请告诉我

((http|https|ftp):\/\/|www.)[^\"\'\,\* (\r|\n)]

支持多种语言

Answer 21

如果有人需要正则表达式来检测类似以下内容的Urls：

https://www.youtube.com/watch?v=38XmKNcgjSU
https://www.youtube.com/
www.youtube.com
youtube.com ...

我想到了这个正则表达式：

((http(s)?://)?([\w-]+\.)+[\w-]+[.com]+([\w\-\.,@?^=%&amp;:/~\+#]*[\w\-\@?^=%&amp;/~\+#])?)

Answer 22

我使用此正则表达式：

out T

它对许多URL都适用，例如： http://google.com，https://dev-site.io:8080/home?val=1&count=100，www.regexr.com，localhost：8080 / path，...

Answer 23

这是对Rajeev的回答略有改进/调整（取决于你的需要）：

([\w\-_]+(?:(?:\.|\s*\[dot\]\s*[A-Z\-_]+)+))([A-Z\-\.,@?^=%&amp;:/~\+#]*[A-Z\-\@?^=%&amp;/~\+#]){2,6}?

请参阅here，了解它的作用和不匹配的示例。

我摆脱了＃34; http＆＃34;因为我想在没有这个的情况下抓住网址。我稍微添加到正则表达式以捕获一些混淆的URL（即用户使用[dot]而不是＆＃34;。＆＃34;）。最后我更换了＃34; \ w＆＃34;用＆＃34; A-Z＆＃34;到和＆＃34; {2,3}＆＃34;减少误报，如v2.0和＆＃34; moo.0dd＆＃34;。

欢迎任何改进。

Answer 24

这是最简单的一个。这对我很好。

%(http|ftp|https|www)(://|\.)[A-Za-z0-9-_\.]*(\.)[a-z]*%

Answer 25

可能过于简单，但工作方法可能是：

.container {
    position: relative;
    left: -9999px;
}

.container:after {
    position: relative;
    left: 9999px;
}

我在Python上测试了它，只要字符串解析包含前后空格而且url中没有空格（我以前从未见过），它应该没问题。

Here is an online ide demonstrating it

然而，使用它有一些好处：

识别[localhost|http|https|ftp|file]+://[\w\S(\.|:|/)]+和file:以及IP地址
如果没有他们永远匹配
不介意localhost或#等异常字符（请参阅此帖的网址）

Answer 26

我用过这个

^(https?:\\/\\/([a-zA-z0-9]+)(\\.[a-zA-z0-9]+)(\\.[a-zA-z0-9\\/\\=\\-\\_\\?]+)?)$

Answer 27

简短而简单。我还没有在javascript代码中测试过，但它看起来会起作用：

((http|ftp|https):\/\/)?(([\w.-]*)\.([\w]*))

Code on regex101.com

Answer 28

我使用下面的正则表达式来查找字符串中的url：

/(http|https)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/

Answer 29

匹配文本中的网址不应该太复杂

(?:(?:(?:ftp|http)[s]*:\/\/|www\.)[^\.]+\.[^ \n]+)

https://regex101.com/r/wewpP1/2

Answer 30

这是最好的一个。

NSString *urlRegex="(http|ftp|https|www|gopher|telnet|file)(://|.)([\\w_-]+(?:(?:\\.[\\w_-]+)‌+))([\\w.,@?^=%&:/~+#-]*[\\w@?^=%&/~+#-])?";

Answer 31

我使用在两个点或句点之间查找文本的逻辑

下面的正则表达式适用于python

(?<=\.)[^}]*(?=\.)

正则表达式以查找字符串中的URL

31 个答案: