Ruby regex - extracting part of a URL

Time: 2014-07-22 17:37:15

Tags: ruby regex

I have a URL like this:
https://endpoint/v1.0/album/id/photo/id/

where endpoint is a variable. I want to extract "/v1.0/album/id/photo/id/".

How can I use a Ruby regex to extract everything after "endpoint"?

4 answers:

Answer 0: (score: 4)

Here we go:

2.0.0-p451 :001 > require 'uri'
 => true
2.0.0-p451 :002 > URI('https://endpoint/v1.0/album/id/photo/id/').path
 => "/v1.0/album/id/photo/id/"
2.0.0-p451 :003 >

See the "Basic example" section of the URI documentation.
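Since the endpoint in the question is a variable, the same URI call can be wrapped in a small helper. This is only a minimal sketch: the method name `path_after_host` and the sample host `api.example.com` are hypothetical, and it assumes the input is meant to be an absolute URL.

```ruby
require 'uri'

# Hypothetical helper: returns the path component of a URL, or nil when
# the string cannot be parsed as a URI at all.
def path_after_host(url)
  URI.parse(url).path
rescue URI::InvalidURIError
  nil
end

path_after_host('https://endpoint/v1.0/album/id/photo/id/')
# => "/v1.0/album/id/photo/id/"
path_after_host('https://api.example.com/v1.0/album/id/photo/id/?size=large')
# => "/v1.0/album/id/photo/id/"
```

Note that `path` drops any query string. If the query should be kept as well, `request_uri` on the parsed http/https URI returns the path together with the query.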

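Answer 1: (score: 1)

A complete regex solution is what the URI library does in the background. Doing this on your own is largely a futile exercise...

In any case, here is a simple regex that uses named capture groups (?<name>) and the trailing /x flag, which allows whitespace in the pattern for formatting.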

url = 'https://endpoint/v1.0/album/id/photo/id/'

re = /
              ^                    # beginning of string
  (?<scheme>  https?             ) # http or https
              :\/\/                # separator
  (?<domain>  [a-zA-Z0-9.-]+?    ) # alphanumerics, hyphens or dots
  (?<path>    \/.+               ) # everything from the first slash on is the path
/x

res = url.match re
res[:path] if res
# => "/v1.0/album/id/photo/id/"

This pales in comparison to URI.

Answer 2: (score: 0)

Here is a regex solution:

domain = 'endpoint'
link = "https://#{domain}/v1.0/album/id/photo/id/"
path = link.gsub("https://#{domain}", '')
# => "/v1.0/album/id/photo/id/"

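You can adjust the domain name by changing the "domain" variable. I use String#gsub to replace the first part of the link with an empty string (the pattern on line 3 is actually very simple: it literally means https://endpoint), which means path is the only part of the string that remains.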
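Because `gsub` with a string argument replaces every occurrence of that string, a variant anchored to the start of the link may be a little safer. The following is a sketch under the assumption that the endpoint is known in advance; the method name `strip_endpoint` and the host `api.example.com` are hypothetical.

```ruby
# Hypothetical helper: removes "https://<endpoint>" only when it appears
# at the very start of the link; otherwise the link is returned unchanged.
def strip_endpoint(link, endpoint)
  # Regexp.escape keeps characters such as "." in the endpoint literal,
  # and \A anchors the match to the beginning of the string.
  prefix = /\Ahttps:\/\/#{Regexp.escape(endpoint)}/
  link.sub(prefix, '')
end

strip_endpoint('https://endpoint/v1.0/album/id/photo/id/', 'endpoint')
# => "/v1.0/album/id/photo/id/"
```

Using `sub` rather than `gsub` makes the intent explicit: only the leading scheme and endpoint are removed.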

Answer 3: (score: 0)

The URI RFC (RFC 3986) documents the pattern used to parse a URL:

Appendix B.  Parsing a URI Reference with a Regular Expression

   As the "first-match-wins" algorithm is identical to the "greedy"
   disambiguation method used by POSIX regular expressions, it is
   natural and commonplace to use a regular expression for parsing the
   potential five components of a URI reference.

   The following line is the regular expression for breaking-down a
   well-formed URI reference into its components.



RFC 3986                   URI Generic Syntax               January 2005


      ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
       12            3  4          5       6  7        8 9

   The numbers in the second line above are only to assist readability;
   they indicate the reference points for each subexpression (i.e., each
   paired parenthesis).  We refer to the value matched for subexpression
   <n> as $<n>.  For example, matching the above expression to

      http://www.ics.uci.edu/pub/ietf/uri/#Related

   results in the following subexpression matches:

      $1 = http:
      $2 = http
      $3 = //www.ics.uci.edu
      $4 = www.ics.uci.edu
      $5 = /pub/ietf/uri/
      $6 = <undefined>
      $7 = <undefined>
      $8 = #Related
      $9 = Related

   where <undefined> indicates that the component is not present, as is
   the case for the query component in the above example.  Therefore, we
   can determine the value of the five components as

      scheme    = $2
      authority = $4
      path      = $5
      query     = $7
      fragment  = $9

Based on this:

URL_REGEX = %r!^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?!
'https://endpoint/v1.0/album/id/photo/id/'.match(URL_REGEX).captures
# => ["https:",
#     "https",
#     "//endpoint",
#     "endpoint",
#     "/v1.0/album/id/photo/id/",
#     nil,
#     nil,
#     nil,
#     nil]

'https://endpoint/v1.0/album/id/photo/id/'.match(URL_REGEX).captures[4]
# => "/v1.0/album/id/photo/id/"
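The mapping from numbered subexpressions to components that the RFC describes (scheme = $2, authority = $4, path = $5, query = $7, fragment = $9) can also be written out as a small helper. This is only a sketch: the constant name `URI_REFERENCE` and the method name `split_uri` are hypothetical, and the pattern is the same Appendix B expression used above.

```ruby
URI_REFERENCE = %r!^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?!

# Hypothetical helper: splits a URI reference into the five components
# named in RFC 3986 Appendix B. Components that are absent come back as nil.
def split_uri(reference)
  m = URI_REFERENCE.match(reference)
  {
    scheme:    m[2],
    authority: m[4],
    path:      m[5],
    query:     m[7],
    fragment:  m[9]
  }
end

parts = split_uri('http://www.ics.uci.edu/pub/ietf/uri/#Related')
parts[:path]
# => "/pub/ietf/uri/"
parts[:fragment]
# => "Related"
```

Every group in the pattern is optional, so the match itself always succeeds; only the individual components may be nil.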