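URL normalization (or URL canonicalization) is the process by which URLs are modified and standardized in a consistent manner. The goal of the normalization process is to transform a URL into a normalized or canonical URL so that it is possible to determine whether two syntactically different URLs are equivalent.
Strategies include adding trailing slashes, https => http, etc. The Wikipedia page lists many.
Is there a preferred way to do this in Java? Perhaps a library (Nutch?), but I'm open to anything. Smaller and with fewer dependencies is better.
I'll hand-code something for now and keep an eye on this question.
EDIT: I want to normalize aggressively so that URLs count as the same if they refer to the same content. For example, I ignore the parameters utm_source, utm_medium, utm_campaign. As another example, I ignore the subdomain if the title is the same.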
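To make the strategies mentioned above concrete, here is a minimal sketch that uses only java.net.URI. The class name AggressiveNormalizer and its ignore list are hypothetical, it assumes an absolute http or https URL, and it does not attempt the subdomain/title comparison:

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration; not part of any library.
final class AggressiveNormalizer {
    // Tracking parameters named in the question; extend as needed.
    private static final Set<String> IGNORED =
            Set.of("utm_source", "utm_medium", "utm_campaign");

    static String normalize(String raw) {
        // URI.normalize() only removes "." and ".." path segments.
        URI uri = URI.create(raw).normalize();

        String scheme = uri.getScheme().toLowerCase(Locale.ROOT);
        if (scheme.equals("https")) {
            scheme = "http"; // treat https and http as the same resource
        }
        String host = uri.getHost().toLowerCase(Locale.ROOT);
        String port = uri.getPort() == -1 ? "" : ":" + uri.getPort();

        // An empty path and "/" name the same resource, so always emit "/".
        String path = uri.getRawPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }

        // Keep every query pair whose name is not on the ignore list.
        List<String> kept = new ArrayList<>();
        String query = uri.getRawQuery();
        if (query != null) {
            for (String pair : query.split("&")) {
                String name = pair.split("=", 2)[0];
                if (!pair.isEmpty() && !IGNORED.contains(name)) {
                    kept.add(pair);
                }
            }
        }
        String queryPart = kept.isEmpty() ? "" : "?" + String.join("&", kept);

        // The fragment is dropped simply by not appending it.
        return scheme + "://" + host + port + path + queryPart;
    }

    public static void main(String[] args) {
        System.out.println(normalize("HTTPS://Example.com?utm_source=x&id=7&utm_medium=y"));
        // prints http://example.com/?id=7
    }
}
```

Collapsing https into http is only safe when the goal is counting URLs, not fetching them, which is why the sketch keeps that step separate and easy to remove.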
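Answer 0 (score: 22)
Answer 1 (score: 19)
I found this question last night, but there wasn't an answer I was looking for, so I made my own. Here it is, in case somebody wants it in the future: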
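import java.io.UnsupportedEncodingException;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Locale;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

/**
 * - Convert the scheme and host to lowercase.
 * - Normalize the path (done by java.net.URI).
 * - Omit the port number when it is the default for the scheme.
 * - Remove the fragment (the part after the #).
 * - Remove the trailing slash.
 * - Sort the query string params.
 * - Remove some query string params like "utm_*" and "*session*".
 */
public class NormalizeURL
{
    public static String normalize(final String taintedURL) throws MalformedURLException
    {
        final URL url;
        try
        {
            url = new URI(taintedURL).normalize().toURL();
        }
        catch (URISyntaxException | IllegalArgumentException e)
        {
            // toURL() throws IllegalArgumentException for URIs that are not absolute.
            throw new MalformedURLException(e.getMessage());
        }
        // replaceAll takes a regex, so "/$" matches only a trailing slash.
        final String path = url.getPath().replaceAll("/$", "");
        final SortedMap<String, String> params = createParameterMap(url.getQuery());
        final int port = url.getPort();
        final String queryString;
        if (params != null)
        {
            // Some params are only relevant for user tracking, so remove the most common ones.
            for (Iterator<String> i = params.keySet().iterator(); i.hasNext();)
            {
                final String key = i.next();
                if (key.startsWith("utm_") || key.contains("session"))
                {
                    i.remove();
                }
            }
            // Avoid a dangling "?" when every parameter was filtered out.
            queryString = params.isEmpty() ? "" : "?" + canonicalize(params);
        }
        else
        {
            queryString = "";
        }
        return url.getProtocol().toLowerCase(Locale.ROOT) + "://" + url.getHost().toLowerCase(Locale.ROOT)
            + (port != -1 && port != url.getDefaultPort() ? ":" + port : "")
            + path + queryString;
    }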
    /**
     * Takes a query string, separates the constituent name-value pairs, and
     * stores them in a SortedMap ordered lexicographically by name.
     * If a name occurs more than once, only its last value is kept.
     * @return Null if there is no query string.
     */
    private static SortedMap<String, String> createParameterMap(final String queryString)
    {
        if (queryString == null || queryString.isEmpty())
        {
            return null;
        }
        final String[] pairs = queryString.split("&");
        final Map<String, String> params = new HashMap<String, String>(pairs.length);
        for (final String pair : pairs)
        {
            if (pair.length() < 1)
            {
                continue;
            }
            // Limit 2 keeps any further '=' inside the value; a bare name yields one token.
            final String[] tokens = pair.split("=", 2);
            for (int j = 0; j < tokens.length; j++)
            {
                try
                {
                    tokens[j] = URLDecoder.decode(tokens[j], "UTF-8");
                }
                catch (UnsupportedEncodingException ex)
                {
                    // Not expected: every Java platform is required to support UTF-8.
                    ex.printStackTrace();
                }
            }
            // A pair without '=' is stored with an empty value.
            params.put(tokens[0], tokens.length == 2 ? tokens[1] : "");
        }
        return new TreeMap<String, String>(params);
    }
    /**
     * Canonicalize the query string.
     *
     * @param sortedParamMap Parameter name-value pairs in lexicographical order.
     * @return Canonical form of query string.
     */
    private static String canonicalize(final SortedMap<String, String> sortedParamMap)
    {
        if (sortedParamMap == null || sortedParamMap.isEmpty())
        {
            return "";
        }
        // The buffer never leaves this method, so the unsynchronized StringBuilder is enough.
        final StringBuilder sb = new StringBuilder(350);
        final Iterator<Map.Entry<String, String>> iter = sortedParamMap.entrySet().iterator();
        while (iter.hasNext())
        {
            final Map.Entry<String, String> pair = iter.next();
            sb.append(percentEncodeRfc3986(pair.getKey()));
            sb.append('=');
            sb.append(percentEncodeRfc3986(pair.getValue()));
            if (iter.hasNext())
            {
                sb.append('&');
            }
        }
        return sb.toString();
    }
    /**
     * Percent-encode values according to RFC 3986. The built-in Java URLEncoder does not encode
     * according to the RFC, so we make the extra replacements.
     *
     * @param string Decoded string.
     * @return Encoded string per RFC 3986.
     */
    private static String percentEncodeRfc3986(final String string)
    {
        try
        {
            return URLEncoder.encode(string, "UTF-8").replace("+", "%20").replace("*", "%2A").replace("%7E", "~");
        }
        catch (UnsupportedEncodingException e)
        {
            // Not expected for UTF-8; fall back to the unencoded input.
            return string;
        }
    }
}
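Answer 2 (score: 3)
No, there is nothing in the standard library that does this. Canonicalization includes things like decoding unnecessarily encoded characters, converting hostnames to lowercase, and so on.
e.g. http://ACME.com/./foo%26bar
becomes:
http://acme.com/foo&bar
URI's normalize() does not do this.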
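To make the gap concrete, this small sketch runs java.net.URI.normalize() on the example URL. The class and method names are made up for illustration:

```java
import java.net.URI;

// Illustrative only: shows what java.net.URI.normalize() changes and what it leaves alone.
final class UriNormalizeDemo {
    static String normalizeOnly(String raw) {
        return URI.create(raw).normalize().toString();
    }

    public static void main(String[] args) {
        // The "." segment is removed, but the host keeps its case and %26 stays encoded.
        System.out.println(normalizeOnly("http://ACME.com/./foo%26bar"));
        // prints http://ACME.com/foo%26bar
    }
}
```

Only the path segments change, so lowercasing the host and any percent-decoding still have to be done by hand or by a library.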
Answer 3 (score: 3)
The RL library: https://github.com/backchatio/rl goes quite a bit beyond java.net.URI.normalize(). It is written in Scala, but I imagine it should be usable from Java.
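Answer 4 (score: 2)
Because you also want to identify URLs that refer to the same content, I found this paper from WWW2007 quite interesting: Do Not Crawl in the DUST: Different URLs with Similar Text. It gives you a nice theoretical approach.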
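Answer 5 (score: 1)
You can do this with the Restlet framework using Reference.normalize(). You should also be able to use this class to remove the elements you don't need quite conveniently.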
Answer 6 (score: 1)
In Java, normalize a URL by hand:
String company_website = "http://www.foo.bar.com/whatever&stuff";
try {
    URL url = new URL(company_website);
    System.out.println(url.getProtocol() + "://" + url.getHost());
} catch (MalformedURLException e) {
    e.printStackTrace();
}
// prints `http://www.foo.bar.com`
The Java URL class has all sorts of methods for parsing out any part of the URL.
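As a short illustration of those accessors, the sketch below pulls a URL apart with java.net.URL. The sample URL and the class name UrlPartsDemo are hypothetical:

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

// Illustrative only: prints the individual components exposed by java.net.URL.
final class UrlPartsDemo {
    static URL parse(String raw) {
        try {
            return URI.create(raw).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        URL url = parse("http://www.foo.bar.com:8080/whatever/stuff?a=1&b=2#top");
        System.out.println(url.getProtocol());    // http
        System.out.println(url.getHost());        // www.foo.bar.com
        System.out.println(url.getPort());        // 8080 (-1 when no port is given)
        System.out.println(url.getDefaultPort()); // 80
        System.out.println(url.getPath());        // /whatever/stuff
        System.out.println(url.getQuery());       // a=1&b=2
        System.out.println(url.getRef());         // top
    }
}
```

With these pieces in hand, a normalizer can decide component by component what to keep, rewrite, or drop.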
Answer 7 (score: 0)
I have a simple way to solve it. Here is my code:
// You can use the isset() function to see if POST values are set.
if (isset($_POST['submit'])) {
    // Start with an empty message so the appends below never touch an undefined variable.
    $error = "";
    // empty() is true when the value is unset, null, or otherwise falsy (such as "").
    if (empty($_POST['tiptitle'])) {
        $error .= "A title is required<br>";
    } else {
        // extra validation.... MySQL escapes... etc.
    }
    // empty() is true when the value is unset, null, or otherwise falsy (such as "").
    if (empty($_POST['tiptext'])) {
        $error .= "Text is required<br>";
    } else {
        // extra validation.... MySQL escapes... etc.
    }
    // **** An empty string, "", is considered false in PHP.
    // You had it in a 'string' format, which is always true.
    if ($error) {
        $dangererror = "<div class='alert alert-danger'>";
        $dangererror .= $error;
        $dangererror .= "</div>";
    } else {
        /* Attempt MySQL server connection. */
        $mysqli = new mysqli("localhost", "paul", "pass", "yourcomp");
        // Check connection: connect_error is non-empty when the connection failed.
        if ($mysqli->connect_error) {
            die("ERROR: Could not connect. " . $mysqli->connect_error);
        }
        // Prepare an insert statement
        ............
        // Close statement
        $stmt->close();
        // Close connection
        $mysqli->close();
    }
}