How to remove the subdomain part of a URL

Date: 2014-08-30 20:32:51

Tags: java

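I am trying to remove the subdomain and keep only the domain name followed by its extension.

Finding the subdomain is hard because I don't know how many dots a URL will contain. For example, some URLs end in .com and others in .co.uk.

How can I safely remove the subdomain so that foo.bar.com becomes bar.com and foo.bar.co.uk becomes bar.co.uk?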

if(!rawUrl.startsWith("http://")&&!rawUrl.startsWith("https://")){
    rawUrl = "http://"+rawUrl;
}
String url = new java.net.URL(rawUrl).getHost();
String urlWithoutSub = ???

2 answers:

Answer 0 (score: 2)

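What you need is the Public Suffix List, such as the one available at https://publicsuffix.org/. Fundamentally, no algorithm can tell you which suffixes are public, so you need a list, and you are best off using one that is public and well maintained.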
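To illustrate how such a list is used, here is a minimal sketch of a longest-suffix lookup. The class name `RegistrableDomain`, the method `of`, and the tiny in-memory suffix set are assumptions made for this example; a real program would load the full list from publicsuffix.org. The sketch also ignores the list's wildcard (`*.`) and exception (`!`) rules.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of any library.
final class RegistrableDomain {
    private RegistrableDomain() {}

    /**
     * Returns the longest matching public suffix plus one label to its left,
     * e.g. "foo.bar.co.uk" -> "bar.co.uk" when "co.uk" is in the set.
     */
    static String of(String host, Set<String> publicSuffixes) {
        String lower = host.toLowerCase(Locale.ROOT);
        String[] labels = lower.split("\\.");
        // Candidates go from the full host towards the TLD, so the first
        // candidate found in the set is the longest matching suffix.
        for (int i = 0; i < labels.length; i++) {
            String candidate = String.join(".", Arrays.copyOfRange(labels, i, labels.length));
            if (publicSuffixes.contains(candidate)) {
                if (i == 0) {
                    return lower; // the host itself is a public suffix
                }
                return labels[i - 1] + "." + candidate;
            }
        }
        // Unknown suffix: assume the last label is the suffix.
        if (labels.length < 2) {
            return lower;
        }
        return labels[labels.length - 2] + "." + labels[labels.length - 1];
    }
}
```

With a set containing `com`, `uk` and `co.uk`, `RegistrableDomain.of("foo.bar.com", suffixes)` yields `bar.com` and `RegistrableDomain.of("foo.bar.co.uk", suffixes)` yields `bar.co.uk`. If a dependency is acceptable, Guava's `InternetDomainName.topPrivateDomain()` performs this kind of lookup against a bundled copy of the list.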

Answer 1 (score: 1)

I just ran into this problem and decided to write the following function.

Sample input -> output:

http://example.com -> http://example.com
http://www.example.com -> http://example.com
ftp://www.a.example.com -> ftp://example.com
SFTP://www.a.example.com -> SFTP://example.com
http://www.a.b.example.com -> http://example.com
http://www.a.c.d.example.com -> http://example.com
http://example.com/ -> http://example.com/
https://example.com/aaa -> https://example.com/aaa
http://www.example.com/aa/bb../d -> http://example.com/aa/bb../d
FILE://www.a.example.com/ddd/dd/../ff -> FILE://example.com/ddd/dd/../ff
HTTPS://www.a.b.example.com/index.html?param=value -> HTTPS://example.com/index.html?param=value
http://www.a.c.d.example.com/#yeah../..! -> http://example.com/#yeah../..!

The same goes for second-level domains:
http://some.thing.co.uk/?ke -> http://thing.co.uk/?ke
something.co.uk/?ke -> something.co.uk/?ke
www.something.co.uk/?ke -> something.co.uk/?ke
www.something.co.uk -> something.co.uk
https://www.something.co.uk -> https://something.co.uk

Code:

public static String removeSubdomains(String url, ArrayList<String> secondLevelDomains) {
    // We need our URL in three parts, protocol - domain - path
    String protocol= getProtocol(url);      
    url = url.substring(protocol.length());
    String urlDomain=url;
    String path="";
    if(urlDomain.contains("/")) {
        int slashPos = urlDomain.indexOf("/");
        path=urlDomain.substring(slashPos);
        urlDomain=urlDomain.substring(0, slashPos);
    }
    // Done, now let us count the dots . .
    // Note: Strng is the answer author's own string helper class; its source was not posted
    int dotCount = Strng.countOccurrences(urlDomain, ".");
    // example.com <-- nothing to cut
    if(dotCount==1){
        return protocol+url;
    }
    int dotOffset=2; // subdomain.example.com <-- default case, we want to remove everything before the 2nd last dot
    // however, somebody had the glorious idea, to have second level domains, such as co.uk
    for (String secondLevelDomain : secondLevelDomains) {
        // we need to check if our domain ends with a second level domain
        // example: something.co.uk we don't want to cut away "something", since it isn't a subdomain, but the actual domain
        // match on a label boundary, so that e.g. "myco.uk" is not mistaken for a "co.uk" domain
        if(urlDomain.equals(secondLevelDomain) || urlDomain.endsWith("."+secondLevelDomain)) {
            // we increase the dot offset with the amount of dots in the second level domain (co.uk = +1);
            // keep the longest match instead of stopping at the first one
            dotOffset = Math.max(dotOffset, 2 + Strng.countOccurrences(secondLevelDomain, "."));
        }
    }
    // if we have something.co.uk, we have an offset of 3, but only 2 dots, hence nothing to remove
    if(dotOffset>dotCount) {
        return protocol+urlDomain+path;
    }
    // if we have sub.something.co.uk, we have an offset of 3 and 3 dots, so we remove "sub"
    int pos = Strng.nthLastIndexOf(dotOffset, ".", urlDomain)+1;
    urlDomain = urlDomain.substring(pos);   
    return protocol+urlDomain+path;
}

public static String getProtocol(String url) {
    String containsProtocolPattern = "^([a-zA-Z]*:\\/\\/)|^(\\/\\/)";
    Pattern pattern = Pattern.compile(containsProtocolPattern);
    Matcher m = pattern.matcher(url);
    if (m.find()) {       
        return m.group();
    }
    return "";
}

public static ArrayList<String> getPublicSuffixList(boolean loadFromPublicSuffixOrg) {
    ArrayList<String> secondLevelDomains = new ArrayList<String>();
    if(!loadFromPublicSuffixOrg) {
        // Small built-in fallback list, far from complete (needs java.util.Collections)
        Collections.addAll(secondLevelDomains,
            "co.uk", "ac.uk", "gov.uk", "ltd.uk",
            "co.at", "or.at", "ac.at", "gv.at",
            "fed.us", "isa.us", "nsn.us", "dni.us",
            "ac.ru", "com.ru", "edu.ru", "gov.ru", "int.ru", "mil.ru", "net.ru", "org.ru", "pp.ru",
            "com.au", "net.au", "org.au", "edu.au", "gov.au");
        // Return here, otherwise the list would be downloaded regardless of the flag
        return secondLevelDomains;
    }
    try {
        // URLHelpers.getHTTP is the answer author's own helper (source not posted); it returns the response body
        String a = URLHelpers.getHTTP("https://publicsuffix.org/list/public_suffix_list.dat", false, true);
        Scanner scanner = new Scanner(a);
        while (scanner.hasNextLine()) {
            String line = scanner.nextLine();
            // Skip comments and wildcard rules, and keep only multi-label suffixes
            if(!line.startsWith("//") && !line.startsWith("*") && line.contains(".")) {
                secondLevelDomains.add(line);
            }
        }
        scanner.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
    return secondLevelDomains;
}
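
The `Strng` helper class used by `removeSubdomains` was not posted with the answer. The sketch below is a guess at what its two methods might look like, inferred only from how they are called above (argument order and the `+1` applied to the result); the actual implementation may differ.

```java
// Hypothetical stand-in for the answer's unpublished Strng helper.
final class Strng {
    private Strng() {}

    /** Counts non-overlapping occurrences of needle in haystack. */
    static int countOccurrences(String haystack, String needle) {
        if (needle.isEmpty()) {
            return 0;
        }
        int count = 0;
        int from = 0;
        while ((from = haystack.indexOf(needle, from)) != -1) {
            count++;
            from += needle.length();
        }
        return count;
    }

    /**
     * Returns the index of the n-th occurrence of needle counted from the
     * end of haystack (n = 1 is the last one), or -1 if there are fewer
     * than n occurrences.
     */
    static int nthLastIndexOf(int n, String needle, String haystack) {
        if (n <= 0 || needle.isEmpty()) {
            return -1;
        }
        int pos = haystack.length();
        for (int i = 0; i < n; i++) {
            // Search backwards, starting just left of the previous match.
            pos = haystack.lastIndexOf(needle, pos - 1);
            if (pos == -1) {
                return -1;
            }
        }
        return pos;
    }
}
```

For "sub.something.co.uk", `countOccurrences` finds 3 dots and `nthLastIndexOf(3, ".", ...)` returns 3, so cutting at position 4 leaves "something.co.uk". Returning -1 when there are too few dots means the caller's `+1` yields 0 and the host is left unchanged.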