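I'm building a web crawler. I fetch the HTML from a starting page and pass it to another method that extracts the links from it. I'm assuming I'll always be working with the same page, so I built the method just to handle that page's HTML. The problem is that the method returns duplicate links, and I can't figure out why. I've checked the HTML I'm pulling in and it is correct, so the problem is in this method. Here is the code: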
    public static ArrayList<String> linkParser(String htmlContents) {
        ArrayList<String> links = new ArrayList<String>();
        int start = 0;
        boolean done = false;
        while (start < htmlContents.length() && !done) {
            int startIndex = htmlContents.indexOf("<A HREF", start);
            if (startIndex != -1) {
                startIndex += 9;
                String currentLink = "";
                int i = startIndex;
                while (htmlContents.charAt(i) != '"') {
                    currentLink += htmlContents.charAt(i);
                    start++;
                    i++;
                }
                links.add(currentLink);
            } else {
                done = true;
            }
        }
        return links;
    }
Here is the output when I call it:
[http://www.cs.uwec.edu/~stevende/cs145testpages/page1.htm, http://www.cs.uwec.edu/~stevende/cs145testpages/page1.htm, http://www.cs.uwec.edu/~stevende/cs145testpages/page1.htm, http://www.cs.uwec.edu/~stevende/cs145testpages/page1.htm, http://www.cs.uwec.edu/~stevende/cs145testpages/page1.htm, http://www.cs.uwec.edu/~stevende/cs145testpages/page1.htm, page2.htm, page2.htm, page2.htm, page2.htm, page2.htm, page2.htm, page2.htm, page2.htm, page2.htm, page2.htm, foo://www.cs.uwec.edu/~stevende/foo/default.htm, foo://www.cs.uwec.edu/~stevende/foo/default.htm, http://www.foo.cs.uwec.edu/~stevende/cs145testpages/default.htm, http://www.foo.cs.uwec.edu/~stevende/cs145testpages/default.htm, http://www.cs.uwec.edu/~stevende/cs145testpages/foo.htm, http://www.cs.uwec.edu/~stevende/cs145testpages/foo.htm, http://www.cs.uwec.edu/~stevende/cs145testpages/foo.htm, http://www.goduke.com/, http://www.goduke.com/, http://www.goduke.com/, http://www.goduke.com/, http://www.goduke.com/]
Here is the page I'm using. Any help is greatly appreciated!
Answer 0 (score: 1)
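This code will work. In your version, `start` only grows by one for each character copied from the link, so the next `indexOf` search usually begins before the tag that was just parsed and finds the same link again. Here the search resumes from the end of the link that was just read: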
    public static ArrayList<String> linkParser(String htmlContents) {
        ArrayList<String> links = new ArrayList<String>();
        int start = 0;
        boolean done = false;
        while (!done) {
            // Drop everything already scanned so indexOf only sees unparsed HTML.
            htmlContents = htmlContents.substring(start);
            int startIndex = htmlContents.indexOf("<A HREF");
            if (startIndex != -1) {
                // Skip the 9 characters of <A HREF=" to reach the first character of the link.
                startIndex += 9;
                String currentLink = "";
                while (htmlContents.charAt(startIndex) != '"') {
                    currentLink += htmlContents.charAt(startIndex);
                    startIndex++;
                }
                // Resume the next search from the end of this link.
                start = startIndex;
                links.add(currentLink);
            } else {
                done = true;
            }
        }
        return links;
    }
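
If you would rather not re-slice the string on every pass, the same idea can be sketched with `indexOf` doing both searches. This is only an illustration, not the code from the question: the class name `LinkParserSketch` and the `MARKER` constant are made up for the example, and it assumes every link is written exactly as `<A HREF="...">` (uppercase, no extra spaces), as the question's code implicitly does. Unlike the versions above, it stops quietly if a closing quote is missing instead of throwing.

```java
import java.util.ArrayList;
import java.util.List;

class LinkParserSketch {
    // Assumed link prefix: the tag name, attribute name, '=' and opening quote.
    private static final String MARKER = "<A HREF=\"";

    static List<String> linkParser(String htmlContents) {
        List<String> links = new ArrayList<>();
        int start = 0;
        while (true) {
            // Look for the next link at or after the current search position.
            int tagIndex = htmlContents.indexOf(MARKER, start);
            if (tagIndex == -1) {
                break; // no more links
            }
            int linkStart = tagIndex + MARKER.length();
            int linkEnd = htmlContents.indexOf('"', linkStart);
            if (linkEnd == -1) {
                break; // unterminated attribute: stop instead of running off the end
            }
            links.add(htmlContents.substring(linkStart, linkEnd));
            // Move the search position past the link just read,
            // so the same tag cannot be matched twice.
            start = linkEnd + 1;
        }
        return links;
    }
}
```

Calling `LinkParserSketch.linkParser(html)` on a page with two such links should give each link once, in document order. The important line is `start = linkEnd + 1`: the search position jumps past the whole link rather than creeping forward one character at a time.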