Question

How do I match URLs that meet all of these conditions?

The domain is example.com, but the subdomain is not blog.example.com
The first URL token is not "news", "archives", or "blog" (i.e. example.com/FIRST_URL_TOKEN)
None of the subsequent URL tokens is "blog" (i.e. example.com/FIRST_URL_TOKEN/SUBSEQUENT_URL_TOKEN/SUBSEQUENT_URL_TOKEN)

So:

http://example.com/test should match

http://blog.example.com/test should not match

http://example.com/test/blog/test should not match

http://example.com/test/test2 should match

This is what I have so far:

regex = /^http(s)?:\/\/(?!blog\.$)example.com(\.\w+)?\/(?!news$|archive$|blog$).*/

However, it does not behave as intended, so I am missing something.

Answer 1

%r{^https?://[^/]*(?<!blog\.)example\.com/(?!news/|archives/|blog/)(?!.*/blog(/|$)).*}


Your original regex has a few problems. Mainly, $ does not mean what I think you think it means, and the regex does not exclude blog/.
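To illustrate the point about $, here is a minimal sketch. The constant names and sample paths are made up for illustration. Inside a lookahead, news$ rejects the token only when it is immediately followed by the end of the line, so a path that continues past the token slips through. Anchoring on a slash or the end, as in the second pattern, is one way to reject the whole token.

```ruby
# Lookahead from the question: "news$" rejects the token only when it is
# immediately followed by the end of the line.
DOLLAR_LOOKAHEAD = %r{^(?!news$|archive$|blog$).*}

# Variant that rejects the token when it is followed by a slash or by the end.
SLASH_LOOKAHEAD = %r{^(?!(?:news|archives|blog)(?:/|$)).*}

# Hypothetical sample paths (the part of the URL after "example.com/").
%w[news news/today blog/post test].each do |path|
  puts "#{path}: dollar=#{DOLLAR_LOOKAHEAD.match?(path)} slash=#{SLASH_LOOKAHEAD.match?(path)}"
end
```

With the first pattern, only the bare "news" is rejected; "news/today" and "blog/post" still pass. With the second, all three are rejected and "test" passes.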

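So here is a breakdown:

There is an alternate syntax for creating regexes, %r{}; use it if you want to avoid escaping forward slashes

^ - start at the beginning of the line

https?:// - http:// or https://

[^/]* - any number of characters that are not forward slashes

(?<!blog\.) - negative lookbehind, to make sure the subdomain is not blog.example.com

example\.com - the example.com domain itself

/(?!news/|archives/|blog/) - after the first slash, the "URL token" is not news, archives, or blog

(?!.*/blog(/|$)) - none of the other "URL tokens" is blog

.* - match the rest of the characters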
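As a quick check, the regex above can be run against the four URLs from the question. This is a minimal sketch; the names URL_REGEX and CASES are introduced here for illustration only.

```ruby
# Regex from the answer above, unchanged.
URL_REGEX = %r{^https?://[^/]*(?<!blog\.)example\.com/(?!news/|archives/|blog/)(?!.*/blog(/|$)).*}

# URLs from the question, mapped to whether they should match.
CASES = {
  'http://example.com/test'           => true,
  'http://blog.example.com/test'      => false,
  'http://example.com/test/blog/test' => false,
  'http://example.com/test/test2'     => true
}.freeze

CASES.each do |url, expected|
  actual = URL_REGEX.match?(url)
  status = actual == expected ? 'ok' : 'MISMATCH'
  puts "#{status}: #{url} (expected #{expected}, got #{actual})"
end
```

Note that the first lookahead requires a trailing slash, so a bare http://example.com/news would still match. Writing the alternatives as news(/|$) and so on would be one way to tighten that, if it matters for your URLs.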

Answer 2

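Instead of using a complex regex (which usually grows more complex and harder to maintain over time), I would recommend writing a simple method that breaks the tests into smaller pieces and returns true/false depending on whether the URL is valid/usable.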

require 'uri'

def match_uri(url)
  uri = URI.parse(url)

  if uri.host != 'example.com' ||
    uri.path[%r!^/(?:news|archives|blog)/!i] ||
    uri.path[%r!/blog/!i]
    return false
  end

  true
end


# 'http://example.com/test' should match
match_uri('http://example.com/test') # => true

# 'http://blog.example.com/test' should not match
match_uri('http://blog.example.com/test') # => false

# 'http://example.com/test/blog/test' should not match
match_uri('http://example.com/test/blog/test') # => false

# 'http://example.com/test/test2' should match
match_uri('http://example.com/test/test2') # => true

This is what URI returns:

uri = URI.parse('http://example.com/path/to/file')
uri.host # => "example.com"
uri.path # => "/path/to/file"

The only problem I see with the logic you are using is that "path/to/file" could actually be "path/to/blog.ext", which would break the logic. If that is a possibility, use:

File.dirname(uri.path) # => "/path/to"

This removes the file name, so the tests only look at the actual directory path rather than the path plus the file:

def match_uri(url)
  uri = URI.parse(url)

  uri_dir = File.dirname(uri.path)

  if uri.host != 'example.com' ||
    uri_dir[%r!^/(?:news|archives|blog)!i] ||
    uri_dir[%r!/blog!i]
    return false
  end

  true
end
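
Since the conditions are stated in terms of URL tokens, another way to write the same test is to split the path into segments and compare whole tokens, which avoids partial hits such as /blogging. This is only a sketch under assumptions: the method name token_match? and both constants are hypothetical, every path segment (including a trailing file name) is treated as a token, and the comparison is case-insensitive like the regexes above.

```ruby
require 'uri'

# Hypothetical names; the token lists follow the conditions in the question.
BLOCKED_FIRST_TOKENS = %w[news archives blog].freeze
BLOCKED_LATER_TOKEN  = 'blog'

def token_match?(url)
  uri = URI.parse(url)
  return false unless uri.host == 'example.com'

  # "/test/blog/test" => ["test", "blog", "test"]
  tokens = uri.path.downcase.split('/').reject(&:empty?)
  first, *rest = tokens

  # The first token must not be news, archives, or blog.
  return false if BLOCKED_FIRST_TOKENS.include?(first)

  # No later token may be exactly "blog".
  !rest.include?(BLOCKED_LATER_TOKEN)
end

token_match?('http://example.com/test')           # => true
token_match?('http://blog.example.com/test')      # => false
token_match?('http://example.com/test/blog/test') # => false
token_match?('http://example.com/test/test2')     # => true
```

Because whole tokens are compared, a file name such as blog.ext is not mistaken for the blog token, so the File.dirname step is not needed in this variant.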

"Regular Expressions: Now You Have Two Problems" is a good read.

How do I write a regex for URLs that do not have a given word as a token?
