Question

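I need to parse all URLs from a paragraph (string).
For example:

"Check out this site google.com and don't forget to see this too bing.com/maps"

It should return "google.com" and "bing.com/maps".

I'm currently using this, but it isn't perfect: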

reMatch("(^|\s)[^\s@]+\.[^\s@\?\/]{2,5}((\?|\/)\S*)?",mystring)

Thanks

Answer 1

You need to define more clearly what you consider a URL.

For example, I might use something like this:

(?:https?:)?(?://)?(?:[\w-]+\.)+[a-z]{2,6}(?::\d+)?(?:/[\w.,-]+)*(?:\?\S+)?

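(Use it with reMatchNoCase, or plonk (?i) at the front, to ignore case.)

Specifically, this only allows alphanumerics, underscores, and hyphens in the domain and path parts, requires the TLD to be letters only, and only looks for numeric ports.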

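This might be good enough, or you may need something that looks for more characters, or perhaps you want to trim things like quotes and brackets off the end of the URL. It depends on the context of what you're doing, and on whether you want to err towards missing URLs or towards detecting non-URLs. (I'd probably go for the latter, then potentially run a secondary filter to verify that something really is a URL, but that takes more work and may not be necessary for what you're doing.)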
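As a rough illustration of the "trim things off the end" idea, here is a minimal Java sketch that runs the expression above through java.util.regex and then strips trailing punctuation from each match. The class name UrlTrimSketch, the helper findUrls, and the particular set of characters treated as trailing punctuation are assumptions made for this example, not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlTrimSketch {
    // The example expression from this answer, compiled case-insensitively.
    static final Pattern URL = Pattern.compile(
        "(?:https?:)?(?://)?(?:[\\w-]+\\.)+[a-z]{2,6}(?::\\d+)?(?:/[\\w.,-]+)*(?:\\?\\S+)?",
        Pattern.CASE_INSENSITIVE);

    // Assumption: these characters at the very end of a match are sentence
    // punctuation rather than part of the URL.
    static final Pattern TRAILING = Pattern.compile("[.,;:'\")\\]]+$");

    // Returns every match of URL in text, with trailing punctuation removed.
    static List<String> findUrls(String text) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            urls.add(TRAILING.matcher(m.group()).replaceAll(""));
        }
        return urls;
    }

    public static void main(String[] args) {
        System.out.println(findUrls("See google.com/maps, then (bing.com/maps.)"));
    }
}
```

Without the trimming step the path part would keep the comma after "maps" and the dot before the closing bracket, because both characters are allowed inside a path segment.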

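Anyway, here is an explanation of the expression I suggested, hopefully with clear enough comments to help it make sense. :) (Note that all groups are non-capturing (?: ... ), since we don't need the individual parts.)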

# PROTOCOL
 (?:https?:)?    # optional group of "http:" or "https:"

# SERVER NAME / DOMAIN
 (?://)?         # optional double forward slash
 (?:[\w-]+\.)+   # one or more "word characters" or hyphens, followed by a literal .
                 # grouped together and repeated one or more times
 [a-z]{2,6}      # as many as 6 alphas, but at least 2

# PORT NUMBER
 (?::\d+)?       # an optional group made up of : and one or more digits

# PATH INFO
 (?:/[\w.,-]+)*  # a forward slash then multiple alphanumeric, underscores, or hyphens
                 # or dots or commas (add any other characters as required)
                 # in a group that might occur multiple times (or not at all)

# QUERY STRING
 (?:\?\S+)?      # an optional group containing ? then any non-whitespace
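The commented layout above can be kept as-is in engines that support a "comments" (free-spacing) mode. The following is a small Java sketch, assuming java.util.regex and its Pattern.COMMENTS flag; the class name VerboseUrlPattern and the helper findUrls are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class VerboseUrlPattern {
    // With Pattern.COMMENTS, whitespace in the pattern is ignored and
    // everything from "#" to the end of the line is treated as a comment.
    static final Pattern URL = Pattern.compile(
          "(?:https?:)?    # protocol: optional http: or https:\n"
        + "(?://)?         # optional double forward slash\n"
        + "(?:[\\w-]+\\.)+ # one or more name parts, each ending in a dot\n"
        + "[a-z]{2,6}      # TLD: 2 to 6 letters\n"
        + "(?::\\d+)?      # optional numeric port\n"
        + "(?:/[\\w.,-]+)* # zero or more path segments\n"
        + "(?:\\?\\S+)?    # optional query string\n",
        Pattern.COMMENTS | Pattern.CASE_INSENSITIVE);

    // Collects every match of URL in text, in order of appearance.
    static List<String> findUrls(String text) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        System.out.println(findUrls(
            "Check out this site google.com and don't forget to see this too bing.com/maps"));
    }
}
```

Keeping the comments inside the pattern means the explanation and the expression cannot drift apart, at the cost of having to escape any literal space or # that the pattern might need later.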

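Update: To prevent the end of email addresses being matched, we need to use a lookbehind. This ensures that there is no @ sign (or anything else unwanted) before the URL, without actually including that preceding character in the match.

CF's regex engine is Apache ORO, which doesn't support lookbehinds, but we can use java.util.regex nice and easily with a component I have created, which does support lookbehinds.

Using it is as simple as: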

<cfset jrex = createObject('component','jre-utils').init('CASE_INSENSITIVE') />
...
<cfset Urls = jrex.match( regex , input ) />

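After the createObject call, it should basically be like using the built-in re* functions, but with slight syntax differences and a different regex engine under the hood.

(If you have any problems or questions with the component, let me know.)

So, on to your problem of excluding emails from URL matching:

We can do either a (?<=positive) or a (?<!negative) lookbehind, depending on whether we want to say "we must have this" or "we must NOT have this", like so: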

(?<=\s) # there must be whitespace before the current position
(?<!@)  # there must NOT be an @ before current position

For this URL example, I would extend either of those examples to:

(?<=\s|^)   # look for whitespace OR start of string

or

(?<![@\w/]) # ensure there is not a @ or / or word character.

Both will work (and can be extended with more characters), but in different ways, so it simply depends on which method you want to use.

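Put whichever one you like at the start of your expression, and it should no longer match the end of abcd@gmail.com, unless I've screwed something up. :)

Update 2:

Here is some sample code which will exclude any email addresses from the matches: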
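To make the "different ways" concrete, here is a small Java sketch that puts each lookbehind in front of the same URL expression and runs both against one input. It uses java.util.regex directly; the class name LookbehindComparison, the constant names, and the sample sentence are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookbehindComparison {
    // The URL expression from this answer, without any lookbehind.
    static final String URL_BODY =
        "(?:https?:)?(?://)?(?:[\\w-]+\\.)+[a-z]{2,6}(?::\\d+)?(?:/[\\w.,-]+)*(?:\\?\\S+)?";

    // Positive lookbehind: the match must start after whitespace or at the start of the input.
    static final Pattern AFTER_SPACE =
        Pattern.compile("(?<=\\s|^)" + URL_BODY, Pattern.CASE_INSENSITIVE);

    // Negative lookbehind: the match must not start right after @, / or a word character.
    static final Pattern NOT_AFTER_AT =
        Pattern.compile("(?<![@\\w/])" + URL_BODY, Pattern.CASE_INSENSITIVE);

    // Collects every match of the given pattern in text.
    static List<String> findAll(Pattern pattern, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "mail abcd@gmail.com or visit (google.com) and bing.com/maps";
        System.out.println(findAll(AFTER_SPACE, text));   // stricter variant
        System.out.println(findAll(NOT_AFTER_AT, text));  // more permissive variant
    }
}
```

On this input both variants skip abcd@gmail.com. The positive lookbehind also skips the bracketed google.com, because it is preceded by "(" rather than whitespace, while the negative lookbehind still finds it.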

<cfset jrex = createObject('component','jre-utils').init('CASE_INSENSITIVE') />

<cfsavecontent variable="SampleInput">
check out this site google.com and don't forget to see this too bing.com/maps
this is an email@somewhere.com which should not be matched
</cfsavecontent>

<cfset FindUrlRegex = '(?<=\s|^)(?:https?:)?(?://)?(?:[\w-]+\.)+[a-z]{2,6}(?::\d+)?(?:/[\w.,-]+)*(?:\?\S+)?' />

<cfset MatchedUrls = jrex.match( FindUrlRegex , SampleInput ) />

<cfdump var="#MatchedUrls#" />

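Make sure you have downloaded jre-utils.cfc and put it in an appropriate place (e.g. the same directory as the script running this code).

This step is required because the (?<= ... ) construct does not work in CF regular expressions.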
