How do I normalize a URL in Java?

Time: 2010-06-07 22:33:12

Tags: java url-rewriting

  

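URL normalization (or URL canonicalization) is the process by which URLs are modified and standardized in a consistent manner. The goal of the normalization process is to transform a URL into a normalized, or canonical, URL so that it is possible to determine whether two syntactically different URLs are equivalent.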

Strategies include adding a trailing slash, https => http, etc. The Wikipedia page lists many.
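As a rough illustration of the two strategies named above, here is a minimal sketch using only java.net.URI. The class and method names (SimpleUrlStrategies, apply) are made up for this example, and it assumes an absolute http or https URL with a host. Any user-info part of the URL is dropped.

```java
import java.net.URI;
import java.util.Locale;

final class SimpleUrlStrategies {
    /** Maps https to http and uses "/" when the path is empty. Hypothetical helper. */
    static String apply(String url) {
        URI uri = URI.create(url).normalize(); // resolves "." and ".." path segments
        if (uri.getScheme() == null || uri.getHost() == null) {
            throw new IllegalArgumentException("Expected an absolute URL with a host: " + url);
        }
        String scheme = uri.getScheme().toLowerCase(Locale.ROOT);
        if (scheme.equals("https")) {
            scheme = "http"; // aggressive: only sensible for building comparison keys
        }
        String path = uri.getRawPath();
        if (path == null || path.isEmpty()) {
            path = "/"; // "add a trailing slash" for the bare-host case
        }
        StringBuilder sb = new StringBuilder();
        sb.append(scheme).append("://").append(uri.getHost().toLowerCase(Locale.ROOT));
        if (uri.getPort() != -1) {
            sb.append(':').append(uri.getPort());
        }
        sb.append(path);
        if (uri.getRawQuery() != null) {
            sb.append('?').append(uri.getRawQuery());
        }
        if (uri.getRawFragment() != null) {
            sb.append('#').append(uri.getRawFragment());
        }
        return sb.toString();
    }
}
```

With this sketch, SimpleUrlStrategies.apply("https://Example.com") yields "http://example.com/". The result is meant as a key for comparing URLs, not as a URL to fetch, since the http and https versions of a page are not guaranteed to be the same resource.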

Is there a favored way to do this in Java? Perhaps a library (Nutch?), but I'm open. Smaller and with fewer dependencies is better.

I'll hand-code something for now and keep an eye on this question.

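EDIT: I want to normalize aggressively, so that URLs count as the same if they refer to the same content. For example, I ignore the parameters utm_source, utm_medium, and utm_campaign. As another example, I ignore the subdomain if the title is the same.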
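Ignoring the three parameters named in the edit could look something like the sketch below. The names TrackingParamFilter and strip are invented for illustration, Set.of assumes Java 9 or later, and parameter names are compared in their raw (still percent-encoded) form. The subdomain rule is not covered, since it depends on fetching and comparing page titles.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.Set;
import java.util.stream.Collectors;

final class TrackingParamFilter {
    // The three parameters named in the question.
    private static final Set<String> IGNORED = Set.of("utm_source", "utm_medium", "utm_campaign");

    /** Returns the URL without the ignored query parameters. Hypothetical helper. */
    static String strip(String url) {
        URI uri = URI.create(url);
        String rawQuery = uri.getRawQuery();
        if (rawQuery == null) {
            return url; // no query string, nothing to remove
        }
        String kept = Arrays.stream(rawQuery.split("&"))
                .filter(pair -> !pair.isEmpty())
                .filter(pair -> !IGNORED.contains(pair.split("=", 2)[0]))
                .collect(Collectors.joining("&"));
        // Everything before the first '?' is scheme, authority and path.
        String base = url.substring(0, url.indexOf('?'));
        String fragment = uri.getRawFragment();
        return base
                + (kept.isEmpty() ? "" : "?" + kept)
                + (fragment == null ? "" : "#" + fragment);
    }
}
```

The remaining parameters keep their original order and encoding, so this step can be combined with sorting or re-encoding if a stricter canonical form is needed.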

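8 answers:

Answer 0: (score: 22)

Answer 1: (score: 19)

I found this question last night, but it didn't have the answer I was looking for, so I wrote my own. Here it is, in case somebody wants it in the future: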

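import java.io.UnsupportedEncodingException;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

/**
 * - Convert the scheme and host to lowercase.
 * - Normalize the path (done by java.net.URI).
 * - Keep the port number only if it is not the default.
 * - Remove the fragment (the part after the #).
 * - Remove trailing slash.
 * - Sort the query string params.
 * - Remove some query string params like "utm_*" and "*session*".
 */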
public class NormalizeURL
{
    public static String normalize(final String taintedURL) throws MalformedURLException
    {
        final URL url;
        try
        {
            url = new URI(taintedURL).normalize().toURL();
        }
        catch (URISyntaxException e) {
            throw new MalformedURLException(e.getMessage());
        }

        // replace() treats its argument as a literal; replaceAll() is needed for the "/$" regex.
        final String path = url.getPath().replaceAll("/$", "");
        final SortedMap<String, String> params = createParameterMap(url.getQuery());
        final int port = url.getPort();
        final String queryString;

        if (params != null)
        {
            // Some params are only relevant for user tracking, so remove the most commons ones.
            for (Iterator<String> i = params.keySet().iterator(); i.hasNext();)
            {
                final String key = i.next();
                if (key.startsWith("utm_") || key.contains("session"))
                {
                    i.remove();
                }
            }
            // Avoid a dangling "?" when every parameter was filtered out.
            queryString = params.isEmpty() ? "" : "?" + canonicalize(params);
        }
        else
        {
            queryString = "";
        }

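        // java.net.URL keeps the host as typed, so lowercase it explicitly.
        return url.getProtocol() + "://" + url.getHost().toLowerCase(java.util.Locale.ROOT)
            // getDefaultPort() covers 80 for http and 443 for https.
            + (port != -1 && port != url.getDefaultPort() ? ":" + port : "")
            + path + queryString;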
    }

    /**
     * Takes a query string, separates the constituent name-value pairs, and
     * stores them in a SortedMap ordered by lexicographical order.
     * @return Null if there is no query string.
     */
    private static SortedMap<String, String> createParameterMap(final String queryString)
    {
        if (queryString == null || queryString.isEmpty())
        {
            return null;
        }

        final String[] pairs = queryString.split("&");
        final Map<String, String> params = new HashMap<String, String>(pairs.length);

        for (final String pair : pairs)
        {
            if (pair.length() < 1)
            {
                continue;
            }

            String[] tokens = pair.split("=", 2);
            for (int j = 0; j < tokens.length; j++)
            {
                try
                {
                    tokens[j] = URLDecoder.decode(tokens[j], "UTF-8");
                }
                catch (UnsupportedEncodingException ex)
                {
                    ex.printStackTrace();
                }
            }
            switch (tokens.length)
            {
                case 1:
                {
                    if (pair.charAt(0) == '=')
                    {
                        params.put("", tokens[0]);
                    }
                    else
                    {
                        params.put(tokens[0], "");
                    }
                    break;
                }
                case 2:
                {
                    params.put(tokens[0], tokens[1]);
                    break;
                }
            }
        }

        return new TreeMap<String, String>(params);
    }

    /**
     * Canonicalize the query string.
     *
     * @param sortedParamMap Parameter name-value pairs in lexicographical order.
     * @return Canonical form of query string.
     */
    private static String canonicalize(final SortedMap<String, String> sortedParamMap)
    {
        if (sortedParamMap == null || sortedParamMap.isEmpty())
        {
            return "";
        }

        final StringBuilder sb = new StringBuilder(350);
        final Iterator<Map.Entry<String, String>> iter = sortedParamMap.entrySet().iterator();

        while (iter.hasNext())
        {
            final Map.Entry<String, String> pair = iter.next();
            sb.append(percentEncodeRfc3986(pair.getKey()));
            sb.append('=');
            sb.append(percentEncodeRfc3986(pair.getValue()));
            if (iter.hasNext())
            {
                sb.append('&');
            }
        }

        return sb.toString();
    }

    /**
     * Percent-encode values according to RFC 3986. The built-in Java URLEncoder does not encode
     * according to the RFC, so we make the extra replacements.
     *
     * @param string Decoded string.
     * @return Encoded string per RFC 3986.
     */
    private static String percentEncodeRfc3986(final String string)
    {
        try
        {
            return URLEncoder.encode(string, "UTF-8").replace("+", "%20").replace("*", "%2A").replace("%7E", "~");
        }
        catch (UnsupportedEncodingException e)
        {
            return string;
        }
    }
}

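Answer 2: (score: 3)

No, there is nothing in the standard library to do this. Canonicalization includes things like decoding unnecessarily encoded characters, converting hostnames to lowercase, and so on.

e.g. http://ACME.com/./foo%26bar becomes:

http://acme.com/foo&bar

URI's normalize() does not do this; it only resolves "." and ".." segments in the path.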
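To make the gap concrete, here is a small sketch of what java.net.URI.normalize() returns for the example above, next to a hand-written step that also lowercases the scheme and host. The class and method names are invented for this example, and the second method assumes an absolute URL with a host and drops any user-info part. Neither method decodes %26.

```java
import java.net.URI;
import java.util.Locale;

final class UriNormalizeDemo {
    /** What the JDK provides: only "." and ".." path segments are resolved. */
    static String jdkNormalize(String url) {
        return URI.create(url).normalize().toString();
    }

    /** Hypothetical extra step: also lowercase the scheme and the host. */
    static String lowerCaseSchemeAndHost(String url) {
        URI u = URI.create(url).normalize();
        if (u.getScheme() == null || u.getHost() == null) {
            throw new IllegalArgumentException("Expected an absolute URL with a host: " + url);
        }
        StringBuilder sb = new StringBuilder();
        sb.append(u.getScheme().toLowerCase(Locale.ROOT));
        sb.append("://").append(u.getHost().toLowerCase(Locale.ROOT));
        if (u.getPort() != -1) {
            sb.append(':').append(u.getPort());
        }
        sb.append(u.getRawPath()); // raw form: percent-escapes are left untouched
        if (u.getRawQuery() != null) {
            sb.append('?').append(u.getRawQuery());
        }
        if (u.getRawFragment() != null) {
            sb.append('#').append(u.getRawFragment());
        }
        return sb.toString();
    }
}
```

For http://ACME.com/./foo%26bar, jdkNormalize gives http://ACME.com/foo%26bar and lowerCaseSchemeAndHost gives http://acme.com/foo%26bar. Decoding escapes is left out on purpose: which characters are safe to decode depends on where they appear in the URL.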

Answer 3: (score: 3)

The RL library, https://github.com/backchatio/rl, goes quite a bit beyond java.net.URI.normalize(). It is written in Scala, but I imagine it should be usable from Java.

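Answer 4: (score: 2)

Because you also want to identify URLs that refer to the same content, I found this WWW2007 paper pretty interesting: Do Not Crawl in the DUST: Different URLs with Similar Text. It provides a nice theoretical approach.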

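Answer 5: (score: 1)

You can do this with the Restlet framework using Reference.normalize(). You should also be able to use this class to remove the elements you don't need fairly conveniently.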

Answer 6: (score: 1)

In Java, normalize the URL by hand:

String company_website = "http://www.foo.bar.com/whatever&stuff";

try {
    URL url = new URL(company_website);
    System.out.println(url.getProtocol() + "://" + url.getHost());
} catch (MalformedURLException e) {
    e.printStackTrace();
}

//prints `http://www.foo.bar.com`

The Java URL class has all sorts of methods for parsing out any part of a URL.
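For reference, the accessors in question can be collected like this. It is a minimal sketch, and the class name UrlParts and its method of are made up for the example. The values come straight from java.net.URL getters.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.util.LinkedHashMap;
import java.util.Map;

final class UrlParts {
    /** Splits a URL into its components using java.net.URL getters. Hypothetical helper. */
    static Map<String, String> of(String spec) {
        final URL url;
        try {
            url = URI.create(spec).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Unsupported URL: " + spec, e);
        }
        Map<String, String> parts = new LinkedHashMap<>();
        parts.put("protocol", url.getProtocol());
        parts.put("host", url.getHost());
        parts.put("port", String.valueOf(url.getPort())); // -1 when no port is given
        parts.put("path", url.getPath());
        parts.put("query", url.getQuery());               // null when there is no query
        parts.put("ref", url.getRef());                   // fragment; null when absent
        return parts;
    }
}
```

For the URL used above, http://www.foo.bar.com/whatever&stuff, the host is www.foo.bar.com, the path is /whatever&stuff, and the port is reported as -1 because none is given.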

Answer 7: (score: 0)

I have a simple way to solve it. Here is my code:

// you can use the isset() function to see if post values are set 
if (isset($_POST['submit'])){

    // empty() returns true when the value is unset or falsy (e.g. "", "0", null)
    if(empty($_POST['tiptitle'])) {
        $error .= "An Title is required<br>";
    } else{
        // extra validation, MySQL escaping, etc.
    }

    // empty() returns true when the value is unset or falsy (e.g. "", "0", null)
    if(empty($_POST['tiptext'])) {
        $error .= "Text is required<br>";
    } else{ 
        // extra validation, MySQL escaping, etc.
    }

    // ****  an empty string, "", is considered false in PHP
    // you had it in a 'string' format, which is always true.

    if($error) {
        $dangererror = "<div class='alert alert-danger'>";
        $dangererror .= $error;
        $dangererror .= "</div>";
    } else {

        /* Attempt MySQL server connection. Assuming you are running MySQL
        server with default setting (user 'root' with no password) */
        $mysqli = new mysqli("localhost", "paul", "pass", "yourcomp");

        // Check connection
        // new mysqli() never returns false; a failed connection is reported via connect_error
        if($mysqli->connect_error){
            die("ERROR: Could not connect. " . $mysqli->connect_error);
        }

        // Prepare an insert statement

        ............

        // Close statement
        $stmt->close();

        // Close connection
        $mysqli->close();
    }
}