Question

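I'm building a web crawler. I fetch the HTML from a starting page and then pass it to another method that extracts the links from it. I'm assuming I'll always use the same page, so I only built it to handle that page's HTML. The problem is that the method returns duplicate links, and I can't figure out why. I checked the HTML I'm pulling in and it is correct, so the problem is in this method. Here's the code: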

public static ArrayList<String> linkParser(String htmlContents) {

        ArrayList<String> links = new ArrayList<String>();
        int start = 0;
        boolean done = false;
        while (start < htmlContents.length() && !done) {

            int startIndex = htmlContents.indexOf("<A HREF", start);
            if (startIndex != -1) {
                startIndex += 9;
                String currentLink = "";
                int i = startIndex;

                while (htmlContents.charAt(i) != '"') {
                    currentLink += htmlContents.charAt(i);
                    start++;
                    i++;
                }

                links.add(currentLink);
            } else {
                done = true;
            }
        }

        return links;
    }

This is the output when I call it:

[http://www.cs.uwec.edu/~stevende/cs145testpages/page1.htm, http://www.cs.uwec.edu/~stevende/cs145testpages/page1.htm, http://www.cs.uwec.edu/~stevende/cs145testpages/page1.htm, http://www.cs.uwec.edu/~stevende/cs145testpages/page1.htm, http://www.cs.uwec.edu/~stevende/cs145testpages/page1.htm, http://www.cs.uwec.edu/~stevende/cs145testpages/page1.htm, page2.htm, page2.htm, page2.htm, page2.htm, page2.htm, page2.htm, page2.htm, page2.htm, page2.htm, page2.htm, foo://www.cs.uwec.edu/~stevende/foo/default.htm, foo://www.cs.uwec.edu/~stevende/foo/default.htm, http://www.foo.cs.uwec.edu/~stevende/cs145testpages/default.htm, http://www.foo.cs.uwec.edu/~stevende/cs145testpages/default.htm, http://www.cs.uwec.edu/~stevende/cs145testpages/foo.htm, http://www.cs.uwec.edu/~stevende/cs145testpages/foo.htm, http://www.cs.uwec.edu/~stevende/cs145testpages/foo.htm, http://www.goduke.com/, http://www.goduke.com/, http://www.goduke.com/, http://www.goduke.com/, http://www.goduke.com/]

Here is the page I'm using. Any help is greatly appreciated!

Answer 1

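This code should work. It cuts off the part of the HTML that has already been scanned on each pass, so the same tag is never matched twice: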

    public static ArrayList<String> linkParser(String htmlContents) {

        ArrayList<String> links = new ArrayList<String>();
        int start = 0;
        boolean done = false;
        while (!done) {
            // Drop everything already scanned so indexOf cannot find the previous tag again.
            htmlContents = htmlContents.substring(start);
            int startIndex = htmlContents.indexOf("<A HREF");
            if (startIndex != -1) {
                startIndex += 9; // skip past <A HREF=" to the first character of the link
                String currentLink = "";
                while (htmlContents.charAt(startIndex) != '"') {
                    currentLink += htmlContents.charAt(startIndex);
                    startIndex++;
                }
                start = startIndex; // index of the closing quote; the next pass resumes here
                links.add(currentLink);
            } else {
                done = true;
            }
        }

        return links;
    }
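
For what it's worth, the duplicates in the question come from `start++` inside the inner loop: `start` only grows by the length of the link that was just read, so the next `indexOf("<A HREF", start)` often begins before the tag it has already handled and finds it again. Below is a minimal sketch of the same idea that keeps the original string and moves `start` past the closing quote instead. The class name `LinkParserSketch` and the `MARKER` constant are made up for illustration, and it assumes every link is written exactly as `<A HREF="...">`, as in the question's test page.

```java
import java.util.ArrayList;
import java.util.List;

class LinkParserSketch {

    // Assumption: links always appear in exactly this form (uppercase, double quotes).
    private static final String MARKER = "<A HREF=\"";

    static List<String> linkParser(String htmlContents) {
        List<String> links = new ArrayList<>();
        int start = 0;
        while (true) {
            int tagIndex = htmlContents.indexOf(MARKER, start);
            if (tagIndex == -1) {
                break; // no more links
            }
            int linkStart = tagIndex + MARKER.length();
            int linkEnd = htmlContents.indexOf('"', linkStart);
            if (linkEnd == -1) {
                break; // unterminated attribute: stop instead of running off the end
            }
            links.add(htmlContents.substring(linkStart, linkEnd));
            // Resume after the closing quote so the same tag is never found twice.
            start = linkEnd + 1;
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<A HREF=\"page1.htm\">one</A> <A HREF=\"page2.htm\">two</A>";
        System.out.println(linkParser(html)); // prints [page1.htm, page2.htm]
    }
}
```

Using `indexOf('"', linkStart)` to find the end of the link also avoids building the string one character at a time, and it stops cleanly if a closing quote is missing.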
