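I scraped every site listed on several different navigation (web directory) sites, and some of the entries are duplicates. For example:
http://www.hao123.com/index.htm and http://www.hao123.com
These two URLs serve the same content, and there are other cases as well, such as a missing trailing slash. Comparing the URLs alone, I still treat them as two different sites.
My question: is there an effective way to recognize them as one site? Thanks!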
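Answer 0 (score: 2)
I know of no foolproof way to do this.
That said, one approach might be to load the content from each URL and then apply the Levenshtein distance algorithm to all pages under the same domain. You can then set a threshold for how "similar" two pages must be before they are considered the same (if the content changes slightly between loads, I would expect most of it to stay the same). Something like 10% of the page length might be a good starting point for this value.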
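A minimal sketch of that idea, assuming the page bodies have already been downloaded into strings. The class and method names (`ContentSimilarity`, `looksLikeSamePage`) and the sample pages are made up for illustration; the 10% ratio is just the starting point suggested above.

```java
final class ContentSimilarity {

    // Two-row dynamic-programming Levenshtein distance: O(a.length * b.length) time.
    public static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev; // reuse the two rows instead of allocating
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Treats two pages as the same if the edit distance is at most
    // maxRatio (e.g. 0.10) of the longer page's length.
    public static boolean looksLikeSamePage(String a, String b, double maxRatio) {
        int longer = Math.max(a.length(), b.length());
        if (longer == 0) {
            return true; // two empty pages
        }
        return levenshtein(a, b) <= maxRatio * longer;
    }

    public static void main(String[] args) {
        // Hypothetical page bodies that differ by a single character.
        String page1 = "<html><title>hao123</title><body>Welcome to hao123</body></html>";
        String page2 = "<html><title>hao123</title><body>Welcome to hao123!</body></html>";
        System.out.println(levenshtein("kitten", "sitting")); // prints 3
        System.out.println(looksLikeSamePage(page1, page2, 0.10)); // prints true
    }
}
```

The quadratic cost is what makes this slow on full pages, so in practice you would probably only compare pages that already share a domain.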
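This may be relatively slow, depending on how many sites you have, but it accounts for slight differences in the content on each load, which a simple hash or length comparison would not.
To make this more reliable, you could also check that certain things you expect to be the same (or different) between loads actually are, for example the page title.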
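The title check could be sketched roughly as follows. This assumes the HTML is already in memory and uses a regular expression to pull out the `<title>` element, which is good enough for a quick comparison but is not a real HTML parser; `TitleCheck` and its methods are hypothetical names.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TitleCheck {

    // Matches the first <title>...</title> pair, ignoring case and spanning line breaks.
    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the title with surrounding whitespace trimmed and inner runs collapsed.
    public static Optional<String> extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(m.group(1).trim().replaceAll("\\s+", " "));
    }

    // True only if both pages have a title and the titles are identical.
    public static boolean sameTitle(String htmlA, String htmlB) {
        Optional<String> a = extractTitle(htmlA);
        Optional<String> b = extractTitle(htmlB);
        return a.isPresent() && a.equals(b);
    }

    public static void main(String[] args) {
        String a = "<html><head><TITLE> hao123 </TITLE></head></html>";
        String b = "<html><head><title>hao123</title></head><body>x</body></html>";
        System.out.println(sameTitle(a, b)); // prints true
    }
}
```

A matching title on its own proves little, so it works best as an extra signal alongside the content comparison.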
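Answer 1 (score: 1)
Parse the domain name out of the URL. A regular expression would work; the snippet below uses plain string operations instead.
Example snippet: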
String a = "http://www.google.com";
String tempString = a.substring(a.indexOf(".")+1, a.length()); // gets rid of everything before the first dot
String domainString = tempString.substring(0, tempString.indexOf(".")); // grabs everything before the second dot
System.out.println(domainString);
Output: google
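One caveat with the snippet above: it relies on the URL containing at least two dots. For an input such as http://google.com the second `indexOf(".")` returns -1 and `substring` throws a `StringIndexOutOfBoundsException`. As a hedged alternative, here is a sketch that lets `java.net.URI` find the host. Prepending `http://` to scheme-less input and stripping a leading `www.` are assumptions of this sketch, and `HostExtractor` is a made-up name.

```java
import java.net.URI;
import java.util.Locale;

final class HostExtractor {

    // Returns the lower-cased host part of a URL.
    // Assumption: input without "://" is treated as if it started with "http://".
    public static String host(String url) {
        String withScheme = url.contains("://") ? url : "http://" + url;
        String host = URI.create(withScheme).getHost();
        if (host == null) {
            throw new IllegalArgumentException("No host found in: " + url);
        }
        return host.toLowerCase(Locale.ROOT);
    }

    // Assumption: "www.example.com" and "example.com" name the same site.
    public static String stripWww(String host) {
        return host.startsWith("www.") ? host.substring(4) : host;
    }

    public static void main(String[] args) {
        System.out.println(host("http://www.google.com/"));    // prints www.google.com
        System.out.println(stripWww(host("http://google.com"))); // prints google.com
        System.out.println(host("localhost:80"));               // prints localhost
    }
}
```

`URI.create` throws `IllegalArgumentException` on malformed input, so callers processing scraped data may want to catch that and skip the entry.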
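Edit:
Here is a standalone demo that handles more complex domain structures and extracts the individual components.
You can add more test cases to the main method in the source below to try other domains, but at the moment it tests the following: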
http://www.google.com/
ftp://www.google.com
http://google.com/
google.com
localhost:80
Here is the source (pardon my lazy spaghetti):
package domain.parser.test;

public class Parseromatic {

    public static void main(String[] args) {
        Parseromatic parser = new Parseromatic();
        parser.extract("http://www.google.com/");
        parser.extract("ftp://www.google.com");
        parser.extract("http://google.com/");
        parser.extract("google.com");
        parser.extract("localhost:80");
    }

    public void extract(String a){
        if(a.contains(".")){ // Guard against out-of-bounds errors for inputs without a dot (e.g. http://localhost:80)
            String leadingString = a.substring(0, a.indexOf(".")); // First portion of the URL
            boolean hasProto = protocol(leadingString);
            // Now let's grab the rest
            String trailingString = a.substring(a.indexOf(".")+1, a.length());
            // Check if it contains a forward slash
            if(trailingString.contains("/")){
                // Keep only what comes before the slash (drops the page path)
                String middleString = snipOffPages(trailingString);
                // Now we're only left with the domain-related parts
                // Check if a subdomain was left in the leadingString
                if(middleString.contains(".")){
                    // Yep, so let's deal with that
                    if(hasProto){ // If it had a protocol
                        System.out.println("Subdomain: "+leadingString.substring(leadingString.indexOf("://")+3, leadingString.length()));
                    } else { // If it didn't have a protocol
                        System.out.println("Subdomain: "+leadingString);
                    }
                    // Now let's split up the rest
                    String[] split1 = middleString.split("\\.");
                    System.out.println("Domain: "+split1[0]);
                    // Check for port
                    if (split1[1].contains(":")){
                        // Assuming port is specified
                        String[] split2 = split1[1].split(":");
                        System.out.println("Top-Domain: "+split2[0]);
                        System.out.println("Port: "+split2[1]);
                    } else {
                        // Assuming no port specified
                        System.out.println("Top-Domain: "+split1[1]);
                        System.out.println("Port: N/A");
                    }
                } else {
                    // No subdomain was present
                    System.out.println("Subdomain: N/A");
                    if(hasProto){ // If it had a protocol
                        System.out.println("Domain: "+leadingString.substring(leadingString.indexOf("://")+3, leadingString.length()));
                    } else { // If it didn't have a protocol
                        System.out.println("Domain: "+leadingString);
                    }
                    // Check for port
                    if (middleString.contains(":")){
                        // Assuming port is specified
                        String[] split2 = middleString.split(":");
                        System.out.println("Top-Domain: "+split2[0]);
                        System.out.println("Port: "+split2[1]);
                    } else {
                        // Assuming no port specified
                        System.out.println("Top-Domain: "+middleString);
                        System.out.println("Port: N/A");
                    }
                }
            } else { // We assume it only contains domain-related parts
                if(trailingString.contains(".")){
                    // Yep, so let's deal with that
                    if(hasProto){ // If it had a protocol
                        System.out.println("Subdomain: "+leadingString.substring(leadingString.indexOf("://")+3, leadingString.length()));
                    } else { // If it didn't have a protocol
                        System.out.println("Subdomain: "+leadingString);
                    }
                    // Now let's split up the rest
                    String[] split1 = trailingString.split("\\.");
                    System.out.println("Domain: "+split1[0]);
                    // Check for port
                    if (split1[1].contains(":")){
                        // Assuming port is specified
                        String[] split2 = split1[1].split(":");
                        System.out.println("Top-Domain: "+split2[0]);
                        System.out.println("Port: "+split2[1]);
                    } else {
                        // Assuming no port specified
                        System.out.println("Top-Domain: "+split1[1]);
                        System.out.println("Port: N/A");
                    }
                } else {
                    // No subdomain was present
                    System.out.println("Subdomain: N/A");
                    if(hasProto){ // If it had a protocol
                        System.out.println("Domain: "+leadingString.substring(leadingString.indexOf("://")+3, leadingString.length()));
                    } else { // If it didn't have a protocol
                        System.out.println("Domain: "+leadingString);
                    }
                    // Check for port
                    if (trailingString.contains(":")){
                        // Assuming port is specified
                        String[] split2 = trailingString.split(":");
                        System.out.println("Top-Domain: "+split2[0]);
                        System.out.println("Port: "+split2[1]);
                    } else {
                        // Assuming no port specified
                        System.out.println("Top-Domain: "+trailingString);
                        System.out.println("Port: N/A");
                    }
                }
            }
        } else {
            // Assuming only one level exists
            boolean hasProto = protocol(a);
            // Check if protocol was present
            if(hasProto){
                String noProto = a.substring(a.indexOf("://")+3, a.length());
                // If some pages or something is specified
                if(noProto.contains("/")){
                    noProto = snipOffPages(noProto);
                }
                // Check for port
                if(noProto.contains(":")){
                    String[] split1 = noProto.split(":");
                    System.out.println("Subdomain: N/A");
                    System.out.println("Domain: "+split1[0]);
                    System.out.println("Top-Domain: N/A");
                    System.out.println("Port: "+split1[1]);
                } else {
                    System.out.println("Subdomain: N/A");
                    System.out.println("Domain: "+noProto);
                    System.out.println("Top-Domain: N/A");
                    System.out.println("Port: N/A");
                }
            } else {
                // If some pages or something is specified
                if(a.contains("/")){
                    a = snipOffPages(a);
                }
                // Check for port
                if(a.contains(":")){
                    String[] split1 = a.split(":");
                    System.out.println("Subdomain: N/A");
                    System.out.println("Domain: "+split1[0]);
                    System.out.println("Top-Domain: N/A");
                    System.out.println("Port: "+split1[1]);
                } else {
                    System.out.println("Subdomain: N/A");
                    System.out.println("Domain: "+a);
                    System.out.println("Top-Domain: N/A");
                    System.out.println("Port: N/A");
                }
            }
        }
        System.out.println(); // Cosmetic empty line, can ignore
    }

    // Returns everything before the first slash, i.e. drops the page path
    public String snipOffPages(String a){
        return a.substring(0,a.indexOf("/"));
    }

    // Prints the protocol (if any) and reports whether one was present
    public boolean protocol(String a) {
        // Protocol extraction
        if(a.contains("://")){ // Check for existence of protocol declaration
            String protocolString = a.substring(0, a.indexOf("://"));
            System.out.println("Protocol: "+protocolString);
            return true;
        }
        else {
            System.out.println("Protocol: N/A");
            return false;
        }
    }
}
For the domains listed above, it outputs:
Protocol: http
Subdomain: www
Domain: google
Top-Domain: com
Port: N/A

Protocol: ftp
Subdomain: www
Domain: google
Top-Domain: com
Port: N/A

Protocol: http
Subdomain: N/A
Domain: google
Top-Domain: com
Port: N/A

Protocol: N/A
Subdomain: N/A
Domain: google
Top-Domain: com
Port: N/A

Protocol: N/A
Subdomain: N/A
Domain: localhost
Top-Domain: N/A
Port: 80
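Answer 2 (score: 0)
The best way is to use a regular expression to extract the domain name and keep a list of all the domains you have seen. Whenever you process a new URL, check it against that list of "visited" domains. Here is an older question about how to get the domain name: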
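A rough sketch of that approach, under a few assumptions that are not in the answer: the regular expression below is a simplified host matcher (it does not handle user info such as `user@host` or IPv6 literals), a leading `www.` is dropped so both spellings map to the same key, and `VisitedDomains` is a hypothetical class name.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class VisitedDomains {

    // Optional "scheme://", then the host: everything up to the first '/', ':', '?' or '#'.
    private static final Pattern HOST =
            Pattern.compile("^(?:[a-zA-Z][a-zA-Z0-9+.-]*://)?([^/:?#]+)");

    private final Set<String> seen = new HashSet<>();

    // Reduces a URL to a lower-cased host without a leading "www.".
    public static String domainKey(String url) {
        Matcher m = HOST.matcher(url.trim());
        if (!m.find()) {
            throw new IllegalArgumentException("No host found in: " + url);
        }
        String host = m.group(1).toLowerCase(Locale.ROOT);
        return host.startsWith("www.") ? host.substring(4) : host;
    }

    // Returns true the first time a domain is seen, false for duplicates.
    public boolean markVisited(String url) {
        return seen.add(domainKey(url));
    }

    public static void main(String[] args) {
        VisitedDomains visited = new VisitedDomains();
        System.out.println(visited.markVisited("http://www.hao123.com/index.htm")); // prints true
        System.out.println(visited.markVisited("http://www.hao123.com"));           // prints false
    }
}
```

Keying on the domain alone treats every page under a domain as one site, which fits the navigation-site case in the question but would also merge genuinely different sites hosted on subpaths of a shared domain.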