Question

Here is a string: http://news.ycombinator.com/page?vasya=pupkin&b=b news.ycombinator.com/page news.ycombinator.com/page.php news.ycombinator.com/page

I am extracting hosts together with their pages, so I wrote the following regular expression:

([a-zA-Z0-9\.]*[a-zA-Z0-9]+[^\/][\.][a-zA-Z0-9\/\.]+)

It matches news.ycombinator.com/page inside the first URL as well as the three bare entries that follow it:

http://news.ycombinator.com/page?vasya=pupkin&b=b news.ycombinator.com/page news.ycombinator.com/page.php news.ycombinator.com/page

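That is not what I need. For this string, the regexp should not pick up the host with page inside http://news.ycombinator.com/page?vasya=pupkin&b=b, because that is a link and should be treated differently.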

Should be rejected:

"http://news.ycombinator.com/page?vasya=pupkin&b=b", "http://news.ycombinator.com/page", "http://news.ycombinator.com/","http://news.ycombinator.com".

Should not be rejected:

"news.ycombinator.com/page","news.ycombinator.com/page.php", "news.ycombinator.com/page/index", "news.ycombinator.com/page/index.php"

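How can I improve this regular expression so that it only selects parts of the string that have no word characters directly next to them?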

Answer 1

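I'm not sure what you are using to run your regex, but you have actually already solved your own problem: you just need the regex to match whole words. The details will depend on the program you are using, but here is a guideline (POSIX-style regex):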

([[:space:]][a-zA-Z0-9\.]*[a-zA-Z0-9]+[^\/][\.][a-zA-Z0-9\/\.]+[[:space:]])

or maybe ([[:space:]]([a-zA-Z0-9]*[\.\/])+[a-zA-Z0-9]+[[:space:]])

In the second one, you have to make sure the inner group is a non-capturing group.
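As a rough illustration of the whole-word idea, here is a minimal sketch in Python. The choice of Python's re module is an assumption, since the question does not say which tool is in use. Instead of consuming whitespace on both sides, it wraps the asker's original pattern in the lookarounds (?<!\S) and (?!\S), so entries at the very start or end of the string still qualify. The name HOST_WITH_PAGE is made up for this example.

```python
import re

# The asker's pattern, wrapped in lookarounds (an assumption of this sketch):
# a match must begin and end at whitespace or at a string boundary.
HOST_WITH_PAGE = re.compile(
    r"(?<!\S)"  # nothing but whitespace (or start of string) before
    r"([a-zA-Z0-9.]*[a-zA-Z0-9]+[^/][.][a-zA-Z0-9/.]+)"
    r"(?!\S)"   # nothing but whitespace (or end of string) after
)

text = (
    "http://news.ycombinator.com/page?vasya=pupkin&b=b "
    "news.ycombinator.com/page news.ycombinator.com/page.php "
    "news.ycombinator.com/page"
)

matches = HOST_WITH_PAGE.findall(text)
print(matches)
# ['news.ycombinator.com/page', 'news.ycombinator.com/page.php',
#  'news.ycombinator.com/page']
```

The full URL is skipped because no position inside it is preceded by whitespace, and the pattern cannot get past "http:" from the start of the string. Engines without lookaround support would need the explicit whitespace version above, plus separate handling for the first and last entries.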
