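How do I split a URL?

Time: 2016-07-28 10:39:51

Tags: java

This is my code for splitting URLs, but it has a problem. Every link comes out with part of its path repeated, for example www.utem.edu.my/portal/portal. The segment /portal/portal is always doubled in every link. Any suggestions on how I should extract the links from a web page?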

public String crawlURL(String strUrl) {
    String results = ""; // For return
    String protocol = "http://";

    // Assigns the input to the inURL variable and checks to add http
    String inURL = strUrl;
    if (!inURL.toLowerCase().contains("http://".toLowerCase()) && 
            !inURL.toLowerCase().contains("https://".toLowerCase())) {
        inURL = protocol + inURL;
    }

    // Pulls URL contents from the web
    String contectURL = pullURL(inURL);
    if ("".equals(contectURL)) { // If it fails, then try with https (compare content, not references)
        protocol = "https://";
        inURL = protocol + inURL.split("http://")[1];
        contectURL = pullURL(inURL);
    }

    // Declares some variables to be used inside loop
    String aTagAttr = "";
    String href = "";
    String msg = "";

    // Finds A tag and stores its href value into output var
    String bodyTag = contectURL.split("<body")[1]; // Find 1st <body>
    String[] aTags = bodyTag.split(">"); // Splits on every tag

    //To show link different from one another
    int index = 0;

    for (String s: aTags) {
    // Process only if it is A tag and contains href
    if (s.toLowerCase().contains("<a") && s.toLowerCase().contains("href")) {

        aTagAttr = s.split("href")[1]; // Split on href

        // Split on space if it contains it
        if (aTagAttr.toLowerCase().contains("\\s"))
            aTagAttr = aTagAttr.split("\\s")[2];

        // Splits on the link and deals with " or ' quotes
        href = aTagAttr.split(((aTagAttr.toLowerCase().contains("\""))? "\"" : "\'"))[1];

        if (!results.toLowerCase().contains(href)) 
            //results += "~~~ " + href + "\r\n";

        /*
        * Last touches to URl before display
        *      Adds http(s):// if not exist
        *      Adds base url if not exist
        */

        if(results.toLowerCase().indexOf("about") != -1) {
            //Contains 'about'
        }
        if (!href.toLowerCase().contains("http://") && !href.toLowerCase().contains("https://")) {

            // http:// + baseURL + href
            if (!href.toLowerCase().contains(inURL.split("://")[1]))
                href = protocol + inURL.split("://")[1] + href;
            else
                href = protocol + href;
        }

        System.out.println(href);//debug

1 answer:

Answer 0 (score: 4)

Consider using the URL class...

Use it the way the documentation suggests :)

import java.net.URL;

public class ParseURL {

    public static void main(String[] args) throws Exception {

        // Note: the URL(String) constructor is deprecated since Java 20;
        // URI.create(...).toURL() is the suggested replacement there.
        URL aURL = new URL("http://example.com:80/docs/books/tutorial"
                           + "/index.html?name=networking#DOWNLOADING");

        // Print each component of the parsed URL
        System.out.println("protocol = " + aURL.getProtocol());
        System.out.println("authority = " + aURL.getAuthority());
        System.out.println("host = " + aURL.getHost());
        System.out.println("port = " + aURL.getPort());
        System.out.println("path = " + aURL.getPath());
        System.out.println("query = " + aURL.getQuery());
        System.out.println("filename = " + aURL.getFile());
        System.out.println("ref = " + aURL.getRef());
    }
}

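Output:

protocol = http
authority = example.com:80
host = example.com
port = 80
path = /docs/books/tutorial/index.html
query = name=networking
filename = /docs/books/tutorial/index.html?name=networking
ref = DOWNLOADING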

After that, you can pick out the elements you need and build a new string/URL :)
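
As for the doubling itself, it most likely comes from gluing the whole page address (which already ends in /portal/portal) in front of an href that is already a full path. Below is a minimal sketch of letting the library resolve each href against the page address instead. The class name LinkResolver, its two helper methods, and the sample href values are made up for illustration, and it uses java.net.URI rather than URL because URI.resolve does the joining for you.

```java
import java.net.URI;

public class LinkResolver {

    // Resolves an href found on a page against the address of that page.
    // Absolute hrefs come back unchanged; relative ones are joined to the base.
    // Note: URI.create throws IllegalArgumentException for illegal characters
    // (for example unescaped spaces), so real-world hrefs may need cleaning first.
    static String resolve(String pageUrl, String href) {
        URI base = URI.create(pageUrl);
        return base.resolve(href.trim()).toString();
    }

    // Returns only scheme://authority, i.e. the page address without its path.
    static String siteRoot(String pageUrl) {
        URI u = URI.create(pageUrl);
        return u.getScheme() + "://" + u.getAuthority();
    }

    public static void main(String[] args) {
        // Page address taken from the question; the hrefs are hypothetical.
        String page = "http://www.utem.edu.my/portal/portal/";

        System.out.println(siteRoot(page));
        System.out.println(resolve(page, "/portal/portal/index.php")); // path from the site root
        System.out.println(resolve(page, "news.html"));                // relative to the page
        System.out.println(resolve(page, "https://example.com/a"));    // already absolute
    }
}
```

With this approach an href such as /portal/portal/index.php is joined to the host only, so the /portal/portal segment appears once rather than twice.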