Question

I'm trying to use preg_match() to extract the 10-character ASIN from an Amazon URL. The URL can be in any of the following basic formats:

http://www.amazon.com/gp/product/ASIN
http://www.amazon.com/gp/product/[text]/ASIN
http://www.amazon.com/o/ASIN
http://www.amazon.com/dp/ASIN
http://www.amazon.com/[text]/dp/ASIN
http://www.amazon.com/[text]/dp/[text]/ASIN

Note: the trouble I'm running into comes from the fact that the URL may or may not have a slash and parameters after the ASIN.

With help I received on a previous question, I came up with this:

\/([A-Za-z0-9]{10})

I thought this was working until I tried it on this URL:

http://www.amazon.com/PlayStation-2-Console-Slim-Black/dp/B000TLU67W/ref=sr_1_4?ie=UTF8&qid=1389314719&sr=8-4&keywords=playstation+1

The output of preg_match() is:

Array
(
    [0] => /PlayStatio
    [1] => PlayStatio
)

I then tried adding a slash to the end of the regex, like this:

\/([A-Za-z0-9]{10})\/

That fixed the problem and gave the following output for the URL above:

Array
(
    [0] => /B000TLU67W/
    [1] => B000TLU67W
)

However, the URL won't always have a slash at the end. For example, the URL above still works on Amazon when trimmed to the following:

http://www.amazon.com/PlayStation-2-Console-Slim-Black/dp/B000TLU67W

My modified regex doesn't work with this URL because there is no slash at the end.

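I think some kind of OR condition might work, one that checks whether the match is followed either by a slash or by nothing at all, but I don't know how to write it.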

Is there a way to make the regex work with both of the URLs above?

Answer 1

You can use this regex:

'~/([A-Z0-9]{10})(?=$|[/?#])~i'

i.e. 10 alphanumeric characters followed by a slash, a ?, a #, or the end of input.

Online demo: http://regex101.com/r/aE0jU8
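As a rough sketch of how that lookahead behaves on the two URLs from the question: the helper name extractAsin() is made up for illustration, and the pattern uses ~ as the delimiter so that the # inside the character class does not need escaping.

```php
<?php
// Hypothetical helper (not part of the answer): returns the ASIN or null.
function extractAsin(string $url): ?string
{
    // A slash, ten alphanumerics, then a lookahead that requires
    // end of input, "/", "?" or "#" without consuming it.
    if (preg_match('~/([A-Z0-9]{10})(?=$|[/?#])~i', $url, $m)) {
        return $m[1];
    }
    return null;
}

$urls = [
    'http://www.amazon.com/PlayStation-2-Console-Slim-Black/dp/B000TLU67W/ref=sr_1_4?ie=UTF8&qid=1389314719&sr=8-4&keywords=playstation+1',
    'http://www.amazon.com/PlayStation-2-Console-Slim-Black/dp/B000TLU67W',
];

foreach ($urls as $url) {
    // "/PlayStatio" is rejected because the next character is "n".
    echo extractAsin($url), "\n";
}
```

Because the terminator sits inside a lookahead, $m[0] is "/B000TLU67W" in both cases and the capture group holds the bare ASIN.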

Answer 2

Simple: just find the last possible ASIN value in the URL path, like so:

if (preg_match('%
    # Fetch ASIN value from Amazon URL.
    (?<=/)                  # ASIN value always preceded by slash.
    [A-Za-z0-9]{10}         # The ASIN value is exactly 10 alphanum.
    (?=                     # Assert no more ASIN values in path.
      (?:                   # Zero or more non-ASIN path segments.
        /                   # Path segment always begins with slash.
        (?!                 # Assert this path segment not ASIN.
          [A-Za-z0-9]{10}   # Is valid ASIN value if followed by
          (?:$|[/?\#])      # EOL/EOS or / or ? or # terminator.
        )                   # End assert this path segment not ASIN.
        (?:                 # Zero or more URI path characters.
          [A-Za-z0-9\-._~!$&\'()*+,;=:@]  # Either URI path char,
        | \%[0-9A-Fa-f]{2}  # or URI encoded value.
        )*                  # Zero or more URI path characters.
      )*                    # Zero or more non-ASIN path segments.
      (?=$|[?\#])           # Path ends on EOS, query or fragment.
    )                       # End assert no more ASIN values in path.
    %x', $subject, $matches)) {
    $ASIN = $matches[0];
} else {
    $ASIN = "";
}

Edited 20140110 12:30 MTT: the first version failed to correctly handle a single slash at the end of the path.
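The same "last ASIN-looking path segment" idea can also be sketched without one large pattern, by splitting the path and walking it backwards. This is a different technique from the regex above, not a drop-in equivalent (it does not validate URI path characters, for instance), and the function name lastAsinInPath() is hypothetical.

```php
<?php
// Hypothetical helper: return the last path segment that consists of
// exactly 10 ASCII alphanumerics, or "" when there is none.
function lastAsinInPath(string $url): string
{
    // parse_url() drops the query string and fragment for us.
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return '';
    }
    // Walk the segments from the end so the last candidate wins.
    foreach (array_reverse(explode('/', $path)) as $segment) {
        if (preg_match('/\A[A-Za-z0-9]{10}\z/', $segment)) {
            return $segment;
        }
    }
    return '';
}

echo lastAsinInPath('http://www.amazon.com/PlayStation-2-Console-Slim-Black/dp/B000TLU67W/ref=sr_1_4?ie=UTF8'), "\n";
echo lastAsinInPath('http://www.amazon.com/PlayStation-2-Console-Slim-Black/dp/B000TLU67W/'), "\n";
```

An empty segment produced by a trailing slash is simply skipped, which is the case the edit note above refers to.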

Answer 3

tl;dr: (?:o|dp(?:\/[^/]+)?|gp\/product(?:\/[^/]+)?)\/([A-Z0-9]{10})

I want to address a problem you will run into with products whose names are exactly 10 letters long.

Let's take a CoffeeCups product whose ASIN is C0FF33CUP5, along with the associated URLs:

http://www.amazon.com/gp/product/C0FF33CUP5
http://www.amazon.com/gp/product/CoffeeCups/C0FF33CUP5  [*]
http://www.amazon.com/o/C0FF33CUP5
http://www.amazon.com/dp/C0FF33CUP5
http://www.amazon.com/CoffeeCups/dp/C0FF33CUP5
http://www.amazon.com/CoffeeCups/dp/some-text/C0FF33CUP5

Our problem

The original regex \/([A-Za-z0-9]{10})\/ fails on the starred [*] URL:

// http://www.amazon.com/gp/product/CoffeeCups/C0FF33CUP5
Array
(
    [0] => /CoffeeCups/
    [1] => CoffeeCups
)

Whatever trailing slash or extra parameters there may be, this regex is already useless. You need a better one.

The case-sensitive way

I would first suggest a case-sensitive regex, \/([A-Z0-9]{10})\/, which gives us:

// http://www.amazon.com/gp/product/CoffeeCups/C0FF33CUP5
Array
(
    [0] => /C0FF33CUP5/
    [1] => C0FF33CUP5
)

But are we sure that ASINs will always be uppercase?

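A better way

A better way is to account for every possible URL pattern, so that we are sure to capture the ASIN (uppercase or not) rather than the product name.

Here is what you can try: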

(?:o|dp(?:\/[^/]+)?|gp\/product(?:\/[^/]+)?)\/([A-Z0-9]{10})

(?:
    o                        # o
  | dp(?:\/[^/]+)?           # dp, dp/some-text
  | gp\/product(?:\/[^/]+)?  # gp/product, gp/product/some-text
)
\/                           # /
([A-Z0-9]{10})               # ASIN
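
A minimal sketch of this pattern run against the six CoffeeCups URLs follows. The ~ delimiters, the i flag (so that a lowercase ASIN would also match) and the helper name asinFromKnownPaths() are assumptions added for illustration, not part of the pattern above.

```php
<?php
// Hypothetical helper: match only the known Amazon path prefixes,
// then capture the 10-character ASIN that follows.
function asinFromKnownPaths(string $url): ?string
{
    // With "~" as the delimiter the slashes need no escaping.
    $pattern = '~(?:o|dp(?:/[^/]+)?|gp/product(?:/[^/]+)?)/([A-Z0-9]{10})~i';
    return preg_match($pattern, $url, $m) ? $m[1] : null;
}

$urls = [
    'http://www.amazon.com/gp/product/C0FF33CUP5',
    'http://www.amazon.com/gp/product/CoffeeCups/C0FF33CUP5',
    'http://www.amazon.com/o/C0FF33CUP5',
    'http://www.amazon.com/dp/C0FF33CUP5',
    'http://www.amazon.com/CoffeeCups/dp/C0FF33CUP5',
    'http://www.amazon.com/CoffeeCups/dp/some-text/C0FF33CUP5',
];

foreach ($urls as $url) {
    // The optional "/some-text" group is tried first, so "CoffeeCups"
    // is consumed as text and the ASIN lands in the capture group.
    echo asinFromKnownPaths($url), "\n";
}
```

The pattern has no terminator after the ASIN, so it relies on the path prefix alone to tell a product name from an ASIN.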
