Question

I have this problem: I need to write a regular expression for URLs like these: http://www.amazon.it/TP-LINK-TL-WR841N-Wireless-300Mbps-Ethernet/dp/B001FWYGJS?ie=UTF8&redirect=true&ref_=s9_simh_gw_p147_d0_i2

http://www.amazon.it/gp/product/B014KMQWU0/

http://www.amazon.it/gp/product/glance/B014KMQWU0/

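I need a regex that matches the full URL up to and including the product's ASIN (the ASIN is a 10-character word in uppercase; in the examples above it also contains digits).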

I wrote this regex, but it does not do what I want:

    String regex="http:\\/\\/(?:www\\.|)amazon\\.com\\/(?:gp\\ product|| gp\\ product\\ glance || [^\\/]+\\/dp|dp)\\/([^\\/]{10})";
    Pattern pattern=Pattern.compile(regex);
    Matcher urlAmazonMatcher = pattern.matcher(url);

    while (urlAmazonMatcher.find()) {
        System.out.println("PROVA "+urlAmazonMatcher.group(0));
    }

Answer 1

Here is my solution. It finally works :D

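    // Note: in (http|www\\.) the "http" branch must be followed directly by "amazon",
    // so for "http://www.amazon.it/..." the match starts at "www." and omits "http://".
    String regex="(http|www\\.)amazon\\.(com|it|uk|fr|de)\\/(?:gp\\/product|gp\\/product\\/glance|[^\\/]+\\/dp|dp)\\/([^\\/]{10})";
    Pattern pattern=Pattern.compile(regex);
    Matcher urlAmazonMatcher = pattern.matcher(url);
    String toReturn = null;
    // If several matches are found, the last one wins.
    while (urlAmazonMatcher.find()) {
        toReturn=urlAmazonMatcher.group(0);
    }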
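
If the `http://` prefix should be part of the result, as the question asks, a variant along these lines might work. This is only a sketch under a few assumptions that are not in the code above: the class and method names are made up for illustration, the scheme and `www.` are treated as optional, the domain list uses `co.uk`, and the ASIN is restricted to `[A-Z0-9]` because that is what the example URLs show.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AmazonUrlPrefix {
    // Hypothetical pattern: optional scheme, optional "www.", a handful of
    // Amazon domains (assumed list), one of the path shapes from the question,
    // then a 10-character ASIN captured in group 1.
    private static final Pattern ASIN_URL = Pattern.compile(
            "(?:https?://)?(?:www\\.)?amazon\\.(?:com|it|co\\.uk|fr|de)/"
            + "(?:gp/product/glance|gp/product|[^/]+/dp|dp)/([A-Z0-9]{10})");

    /** Returns the URL up to and including the ASIN, or null if nothing matches. */
    static String urlUpToAsin(String url) {
        Matcher m = ASIN_URL.matcher(url);
        return m.find() ? m.group(0) : null;
    }

    /** Returns only the ASIN, or null if nothing matches. */
    static String asin(String url) {
        Matcher m = ASIN_URL.matcher(url);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String url = "http://www.amazon.it/gp/product/glance/B014KMQWU0/";
        System.out.println(urlUpToAsin(url));
        System.out.println(asin(url));
    }
}
```

In Java string literals the forward slash does not need escaping, so `\\/` can be written as plain `/`.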

Answer 2

How about

/[^/?]{10}(/$|\?)

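This matches 10 characters that are neither / nor ?, preceded by a slash, provided they are followed by either a final slash or a question mark.

You can get the part before or after the ASIN using one of the various Matcher functions.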
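
As a rough illustration of that last point, the sketch below adds a capturing group around the 10 characters and uses `Matcher.end(int)` to cut the URL right after the ASIN. The class name `AsinSplit` and the method `split` are hypothetical names chosen for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AsinSplit {
    // Same idea as the pattern above, with group 1 wrapped around the 10 characters.
    private static final Pattern ASIN = Pattern.compile("/([^/?]{10})(/$|\\?)");

    /**
     * Returns {urlUpToAsin, asin, rest}, or null if no ASIN-like segment is found.
     */
    static String[] split(String url) {
        Matcher m = ASIN.matcher(url);
        if (!m.find()) {
            return null;
        }
        int asinEnd = m.end(1);                      // index just past the ASIN
        String upToAsin = url.substring(0, asinEnd); // everything through the ASIN
        String rest = url.substring(asinEnd);        // trailing "/" or "?query..."
        return new String[] {upToAsin, m.group(1), rest};
    }

    public static void main(String[] args) {
        String[] parts = split("http://www.amazon.it/gp/product/B014KMQWU0/");
        System.out.println(parts[0]);
        System.out.println(parts[1]);
    }
}
```

Note that this pattern does not check for an Amazon host, so any 10-character path segment followed by a question mark or a final slash would match as well.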

Answer 3

Here is some work from an earlier project of mine in which I extracted URLs from text:

    private Pattern uriPattern; // lazily initialized by getUriPattern()

    private Pattern getUriPattern() {
        if (uriPattern == null) {
            // taken from http://labs.apache.org/webarch/uri/rfc/rfc3986.html

            //TODO implement the full URI syntax

            String genDelims  = "\\:\\/\\?\\#\\[\\]\\@";
            String subDelims  = "\\!\\$\\&\\'\\*\\+\\,\\;\\=";
            String reserved = genDelims + subDelims;
            String unreserved = "\\w\\-\\.\\~"; // i.e. ALPHA / DIGIT / "-" / "." / "_" / "~"
            String allowed = reserved + unreserved;

            // ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
            uriPattern = Pattern.compile("((?:[^\\:/\\?\\#]+:)?//[" + allowed + "&&[^\\?\\#]]*(?:\\?([" + allowed + "&&[^\\#]]*))?(?:\\#[" + allowed + "]*)?).*");
        }
        return uriPattern;
    }

You can use the method above as follows:

    Matcher uriMatcher = getUriPattern().matcher(text);
    if (uriMatcher.matches()) {
        String candidateUriString = uriMatcher.group(1);
        try {
            new URI(candidateUriString); // check once again if you matched a URL
            // your code here
        } catch (Exception e) {
            // error handling
        }
    }

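This captures the whole URL, including the parameters. You can then split it at the first occurrence of '?' (if there is one) and take the first part. Of course, you could also rewrite the regex.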
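
The splitting step could be sketched roughly like this, using only `String.indexOf` and `String.substring`. The class name `StripQuery` and the method `withoutQuery` are hypothetical helpers for this example, not part of the project code above.

```java
public class StripQuery {
    /** Returns the part of the URI before the first '?', or the URI unchanged if it has none. */
    static String withoutQuery(String uri) {
        int q = uri.indexOf('?');
        return q >= 0 ? uri.substring(0, q) : uri;
    }

    public static void main(String[] args) {
        String url = "http://www.amazon.it/TP-LINK-TL-WR841N-Wireless-300Mbps-Ethernet/dp/B001FWYGJS"
                + "?ie=UTF8&redirect=true&ref_=s9_simh_gw_p147_d0_i2";
        System.out.println(withoutQuery(url));
    }
}
```

For URLs without a query string, such as the `/gp/product/` examples, this leaves the trailing slash in place, so it would still have to be trimmed if the result should end exactly at the ASIN.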

Regular expression in Java that matches some URLs
