How can I identify duplicate websites using Java?

Time: 2014-01-07 13:58:14

Tags: java web-crawler

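I have crawled all the websites listed on several different navigation sites, and some of them are duplicates. For example:

http://www.hao123.com/index.htm

http://www.hao123.com

These two have the same content, and of course there are other cases, such as a missing trailing slash. Going by the URL alone, I still treat them as two different sites.

My question is: is there an efficient way to recognize them as one website? Thanks!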

3 answers:

Answer 0 (score: 2)

I know of no foolproof way to do this.

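That said, one approach might be to load the content from each URL and then apply the Levenshtein distance algorithm to all pages under the same domain. You can then set a threshold for how "similar" the content has to be before it is considered the same (if the content changes slightly, I would expect most of it to stay the same). Something like 10% of the page length might be a good starting point for this value.

This could be relatively slow, depending on how many sites you have, but it would account for slight differences in the content on each load, which a simple hash or length comparison would not.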
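The comparison described above could be sketched roughly as follows. This is only a minimal illustration: the class name `PageSimilarity`, the helper `looksLikeSamePage`, and the choice to measure the threshold against the longer of the two pages are assumptions, and the page content is assumed to have been downloaded already.

```java
class PageSimilarity {

    // Classic dynamic-programming Levenshtein distance, keeping only two rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev; // swap rows for the next iteration
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Hypothetical helper: two pages count as "the same" when the edit
    // distance is at most `ratio` of the longer page's length.
    static boolean looksLikeSamePage(String a, String b, double ratio) {
        int longer = Math.max(a.length(), b.length());
        if (longer == 0) {
            return true; // two empty pages
        }
        return levenshtein(a, b) <= ratio * longer;
    }

    public static void main(String[] args) {
        String pageA = "<html><title>hao123</title><body>Links for today</body></html>";
        String pageB = "<html><title>hao123</title><body>Links for Today</body></html>";
        System.out.println(levenshtein("kitten", "sitting"));      // prints 3
        System.out.println(looksLikeSamePage(pageA, pageB, 0.10)); // prints true
    }
}
```

The running time grows with the product of the two page lengths, which is why this gets slow when many large pages have to be compared pairwise.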

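To make this more reliable, you could also check that certain things you would expect to stay the same (or not) across loads actually do, for example the page title.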
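A title check along those lines might look like the sketch below. The class and method names are made up for illustration, and pulling the title out with a regular expression is a shortcut that assumes reasonably well-formed HTML; a real HTML parser would be more robust.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TitleCheck {

    // Crude <title> matcher; assumes the tag is not nested or malformed.
    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the title with whitespace collapsed, or "" if none is found.
    static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        if (!m.find()) {
            return "";
        }
        return m.group(1).trim().replaceAll("\\s+", " ");
    }

    // Hypothetical helper: both pages must have a non-empty title, and the
    // titles must match ignoring case.
    static boolean sameTitle(String htmlA, String htmlB) {
        String a = extractTitle(htmlA);
        return !a.isEmpty() && a.equalsIgnoreCase(extractTitle(htmlB));
    }

    public static void main(String[] args) {
        String a = "<html><head><title>hao123 Home</title></head></html>";
        String b = "<html><head><TITLE>\n hao123  home </TITLE></head></html>";
        System.out.println(sameTitle(a, b)); // prints true
    }
}
```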

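Answer 1 (score: 1)

Parse the domain name out of the URL, either with a regular expression or, as in the snippet below, with plain substring operations.

Example snippet: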

String a = "http://www.google.com";

String tempString = a.substring(a.indexOf(".")+1, a.length()); // gets rid of everything before the first dot

String domainString = tempString.substring(0, tempString.indexOf(".")); // grabs everything before the second dot

System.out.println(domainString);

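Output: google

Edit:

Here is a self-contained demo that can handle more complex domain structures and extract the individual components.

You can add more test cases to the main method in the source below to test individual domains, but at the moment it tests the following: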

http://www.google.com/

ftp://www.google.com

http://google.com/

google.com

localhost:80

Here is the source (pardon my lazy spaghetti):

package domain.parser.test;

public class Parseromatic {

    public static void main(String[] args) {

        Parseromatic parser = new Parseromatic();
        parser.extract("http://www.google.com/");
        parser.extract("ftp://www.google.com");
        parser.extract("http://google.com/");
        parser.extract("google.com");
        parser.extract("localhost:80");

    }

    public void extract(String a){

        if(a.contains(".")){ // Guards against out-of-bounds errors for dot-less inputs such as localhost:80
            String leadingString = a.substring(0, a.indexOf(".")); // First portion of the URL

            boolean hasProto = protocol(leadingString);

            // Now lets grab the rest
            String trailingString = a.substring(a.indexOf(".")+1, a.length());

            // Check if it contains a forward-slash
            if(trailingString.contains("/")){

                // Keep only the part before the first slash (this drops the path)

                String middleString = snipOffPages(trailingString);

                // Now we're only left with the domain related things

                // Check if subdomain was left in the leadingString

                if(middleString.contains(".")){
                    // Yep so lets deal with that

                    if(hasProto){ // If it had a protocol
                        System.out.println("Subdomain: "+leadingString.substring(leadingString.indexOf("://")+3, leadingString.length()));
                    } else { // If it didn't have a protocol
                        System.out.println("Subdomain: "+leadingString);
                    }

                    // Now let's split up the rest

                    String[] split1 = middleString.split("\\.");

                    System.out.println("Domain: "+split1[0]);

                    // Check for port
                    if (split1[1].contains(":")){

                        // Assuming port is specified

                        String[] split2 = split1[1].split(":");

                        System.out.println("Top-Domain: "+split2[0]);

                        System.out.println("Port: "+split2[1]);

                    } else {

                        // Assuming no port specified

                        System.out.println("Top-Domain: "+split1[1]);

                        System.out.println("Port: N/A");
                    }


                } else {

                    // No subdomain was present

                    System.out.println("Subdomain: N/A");

                    if(hasProto){ // If it had a protocol
                        System.out.println("Domain: "+leadingString.substring(leadingString.indexOf("://")+3, leadingString.length()));
                    } else { // If it didn't have a protocol
                        System.out.println("Domain: "+leadingString);
                    }

                    // Check for port
                    if (middleString.contains(":")){

                        // Assuming port is specified

                        String[] split2 = middleString.split(":");

                        System.out.println("Top-Domain: "+split2[0]);

                        System.out.println("Port: "+split2[1]);

                    } else {

                        // Assuming no port specified

                        System.out.println("Top-Domain: "+middleString);

                        System.out.println("Port: N/A");
                    }

                }


            } else { // We assume it only contains domain related things

                if(trailingString.contains(".")){
                    // Yep so lets deal with that

                    if(hasProto){ // If it had a protocol
                        System.out.println("Subdomain: "+leadingString.substring(leadingString.indexOf("://")+3, leadingString.length()));
                    } else { // If it didn't have a protocol
                        System.out.println("Subdomain: "+leadingString);
                    }

                    // Now let's split up the rest

                    String[] split1 = trailingString.split("\\.");

                    System.out.println("Domain: "+split1[0]);

                    // Check for port
                    if (split1[1].contains(":")){

                        // Assuming port is specified

                        String[] split2 = split1[1].split(":");

                        System.out.println("Top-Domain: "+split2[0]);

                        System.out.println("Port: "+split2[1]);

                    } else {

                        // Assuming no port specified

                        System.out.println("Top-Domain: "+split1[1]);

                        System.out.println("Port: N/A");
                    }


                } else {

                    // No subdomain was present

                    System.out.println("Subdomain: N/A");

                    if(hasProto){ // If it had a protocol
                        System.out.println("Domain: "+leadingString.substring(leadingString.indexOf("://")+3, leadingString.length()));
                    } else { // If it didn't have a protocol
                        System.out.println("Domain: "+leadingString);
                    }

                    // Check for port
                    if (trailingString.contains(":")){

                        // Assuming port is specified

                        String[] split2 = trailingString.split(":");

                        System.out.println("Top-Domain: "+split2[0]);

                        System.out.println("Port: "+split2[1]);

                    } else {

                        // Assuming no port specified

                        System.out.println("Top-Domain: "+trailingString);

                        System.out.println("Port: N/A");
                    }

                }

            }

        } else {

            // Assuming only one level exists

            boolean hasProto = protocol(a);

            // Check if protocol was present
            if(hasProto){
                String noProto = a.substring(a.indexOf("://")+3, a.length());

                // If some pages or something is specified
                if(noProto.contains("/")){
                    noProto = snipOffPages(noProto);
                }

                // Check for port
                if(noProto.contains(":")){

                    String[] split1 = noProto.split(":");

                    System.out.println("Subdomain: N/A");
                    System.out.println("Domain: "+split1[0]);
                    System.out.println("Top-Domain: N/A");
                    System.out.println("Port: "+split1[1]);

                } else {

                    System.out.println("Subdomain: N/A");
                    System.out.println("Domain: "+noProto);
                    System.out.println("Top-Domain: N/A");
                    System.out.println("Port: N/A");

                }

            } else {

                // If some pages or something is specified
                if(a.contains("/")){
                    a = snipOffPages(a);
                }

                // Check for port
                if(a.contains(":")){

                    String[] split1 = a.split(":");

                    System.out.println("Subdomain: N/A");
                    System.out.println("Domain: "+split1[0]);
                    System.out.println("Top-Domain: N/A");
                    System.out.println("Port: "+split1[1]);

                } else {

                    System.out.println("Subdomain: N/A");
                    System.out.println("Domain: "+a);
                    System.out.println("Top-Domain: N/A");
                    System.out.println("Port: N/A");

                }

            }



        }

        System.out.println(); // Cosmetic empty line, can ignore


    }

    public String snipOffPages(String a){
        return a.substring(0,a.indexOf("/"));
    }

    public boolean protocol(String a) {
        // Protocol extraction
        if(a.contains("://")){ // Check for existence of protocol declaration
            String protocolString = a.substring(0, a.indexOf("://"));
            System.out.println("Protocol: "+protocolString);
            return true;
        }
        else {
            System.out.println("Protocol: N/A");
            return false;
        }
    }

}

For the domains specified above, it outputs:

Protocol: http
Subdomain: www
Domain: google
Top-Domain: com
Port: N/A

Protocol: ftp
Subdomain: www
Domain: google
Top-Domain: com
Port: N/A

Protocol: http
Subdomain: N/A
Domain: google
Top-Domain: com
Port: N/A

Protocol: N/A
Subdomain: N/A
Domain: google
Top-Domain: com
Port: N/A

Protocol: N/A
Subdomain: N/A
Domain: localhost
Top-Domain: N/A
Port: 80

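Answer 2 (score: 0)

The best approach is to use a regular expression to extract the domain name and keep a list of all the domain names you have seen. Whenever you process a new URL, check it against that list of "visited" domains. Here is an older question on how to get the domain name: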

Get domain name from given url
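As a rough sketch of that idea, the following keeps a set of visited hosts and extracts the host with a regular expression. The class name `VisitedDomains`, the regular expression itself, and the decision to treat `www.example.com` and `example.com` as the same site are all assumptions made for illustration; the pattern does not handle IPv6 literals or other unusual URL forms.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class VisitedDomains {

    // Optional scheme, optional user info, then the host up to the first
    // port, path, query, or fragment delimiter.
    private static final Pattern HOST = Pattern.compile(
            "^(?:[a-zA-Z][a-zA-Z0-9+.-]*://)?(?:[^/?#@]*@)?([^/:?#]+)");

    private final Set<String> visited = new HashSet<>();

    // Returns a lower-cased host without a leading "www.", or null if no
    // host could be found.
    static String hostKey(String url) {
        Matcher m = HOST.matcher(url.trim());
        if (!m.find()) {
            return null;
        }
        String host = m.group(1).toLowerCase(Locale.ROOT);
        // Assumption: "www.example.com" and "example.com" are one site.
        return host.startsWith("www.") ? host.substring(4) : host;
    }

    // Returns true only the first time a given host is seen.
    boolean markVisited(String url) {
        String key = hostKey(url);
        return key != null && visited.add(key);
    }

    public static void main(String[] args) {
        VisitedDomains seen = new VisitedDomains();
        System.out.println(seen.markVisited("http://www.hao123.com/index.htm")); // prints true
        System.out.println(seen.markVisited("http://www.hao123.com"));           // prints false
        System.out.println(seen.markVisited("hao123.com/"));                     // prints false
    }
}
```

Keying on the host alone collapses every page of a site into one entry, which suits a list of navigation-site links but would be too coarse if distinct pages on the same host need to be kept apart.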