Method that extracts links from HTML returns duplicate links

Time: 2014-04-17 04:16:48

Tags: java html methods hyperlink

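I'm building a web crawler. I fetch the HTML of the starting page and pass it to another method that extracts the links from it. I'm assuming I'll always be working with the same page, so I only built the method to handle that page's HTML. The problem is that the method returns duplicate links, and I can't figure out why. I've checked the HTML I'm pulling in and it is correct, so the problem is in this method. Here is the code: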

public static ArrayList<String> linkParser(String htmlContents) {

        ArrayList<String> links = new ArrayList<String>();
        int start = 0;
        boolean done = false;
        while (start < htmlContents.length() && !done) {

            int startIndex = htmlContents.indexOf("<A HREF", start);
            if (startIndex != -1) {
                startIndex += 9;
                String currentLink = "";
                int i = startIndex;

                while (htmlContents.charAt(i) != '"') {
                    currentLink += htmlContents.charAt(i);
                    start++;
                    i++;
                }

                links.add(currentLink);
            } else {
                done = true;
            }
        }

        return links;
    }

Here is the output when I call it:

[http://www.cs.uwec.edu/~stevende/cs145testpages/page1.htm, http://www.cs.uwec.edu/~stevende/cs145testpages/page1.htm, http://www.cs.uwec.edu/~stevende/cs145testpages/page1.htm, http://www.cs.uwec.edu/~stevende/cs145testpages/page1.htm, http://www.cs.uwec.edu/~stevende/cs145testpages/page1.htm, http://www.cs.uwec.edu/~stevende/cs145testpages/page1.htm, page2.htm, page2.htm, page2.htm, page2.htm, page2.htm, page2.htm, page2.htm, page2.htm, page2.htm, page2.htm, foo://www.cs.uwec.edu/~stevende/foo/default.htm, foo://www.cs.uwec.edu/~stevende/foo/default.htm, http://www.foo.cs.uwec.edu/~stevende/cs145testpages/default.htm, http://www.foo.cs.uwec.edu/~stevende/cs145testpages/default.htm, http://www.cs.uwec.edu/~stevende/cs145testpages/foo.htm, http://www.cs.uwec.edu/~stevende/cs145testpages/foo.htm, http://www.cs.uwec.edu/~stevende/cs145testpages/foo.htm, http://www.goduke.com/, http://www.goduke.com/, http://www.goduke.com/, http://www.goduke.com/, http://www.goduke.com/]

Here is the page I'm using. Any help is greatly appreciated!

1 answer:

Answer 0 (score: 1)

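In your loop, start only grows by one for each character of the link you read, so it never jumps past the tag you just parsed. indexOf("<A HREF", start) therefore keeps finding the same tag until start finally creeps past it, which is why each link is repeated. This code works: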

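// Requires: import java.util.ArrayList;
public static ArrayList<String> linkParser(String htmlContents) {

    ArrayList<String> links = new ArrayList<String>();
    int start = 0;

    boolean done = false;
    while (!done) {
        // Drop everything already scanned so the next search begins after the previous link.
        htmlContents = htmlContents.substring(start);
        int startIndex = htmlContents.indexOf("<A HREF");
        if (startIndex != -1) {
            // Skip the 9 characters of <A HREF=" to reach the first character of the URL.
            startIndex += 9;
            String currentLink = "";
            // Assumes the attribute value is closed by a double quote.
            while (htmlContents.charAt(startIndex) != '"') {
                currentLink += htmlContents.charAt(startIndex);
                startIndex++;
            }
            // Resume from the closing quote of this link on the next pass.
            start = startIndex;
            links.add(currentLink);
        } else {
            done = true;
        }
    }

    return links;
}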
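
The same fix can also be applied without re-slicing the string: keep htmlContents intact and move start to just past the closing quote of each link. What follows is only a minimal sketch, assuming every link appears exactly as <A HREF="..."> (uppercase, double-quoted) as on the test page. The class name LinkParserSketch, the MARKER constant, and the sample HTML in main are illustrative and not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;

public class LinkParserSketch {
    // The literal that precedes each URL; it is 9 characters long,
    // matching the "startIndex += 9" in the code above.
    private static final String MARKER = "<A HREF=\"";

    public static List<String> linkParser(String htmlContents) {
        List<String> links = new ArrayList<>();
        int start = 0;
        while (true) {
            int tagIndex = htmlContents.indexOf(MARKER, start);
            if (tagIndex == -1) {
                break; // no more links
            }
            int linkStart = tagIndex + MARKER.length();
            int linkEnd = htmlContents.indexOf('"', linkStart);
            if (linkEnd == -1) {
                break; // unterminated attribute: stop instead of running off the end
            }
            links.add(htmlContents.substring(linkStart, linkEnd));
            // Move past the closing quote so the same tag is never found twice.
            start = linkEnd + 1;
        }
        return links;
    }

    public static void main(String[] args) {
        // Hypothetical sample input, not the actual test page.
        String html = "<P><A HREF=\"page2.htm\">two</A> "
                + "<A HREF=\"http://www.goduke.com/\">duke</A></P>";
        System.out.println(linkParser(html)); // prints [page2.htm, http://www.goduke.com/]
    }
}
```

Using indexOf to find the closing quote avoids building the link one character at a time and avoids an exception when a quote is missing.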