Extract all URLs from any web page; indexOf isn't working [homework]

Time: 2014-04-11 22:46:40

Tags: java indexof

indexOf always returns negative 7, no matter what I pass it. I will be using the website http://www.columbusstate.edu
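
indexOf itself can only return a non-negative index or -1, so the -7 is most likely the message of the StringIndexOutOfBoundsException thrown by substring(start, end) when end is -1. A minimal sketch with a made-up token that shows this failure mode (in older JDKs the exception message is literally "String index out of range: -7", i.e. end - start):

public class IndexOfDemo {
    public static void main(String[] args) {
        // A whitespace-delimited token such as in.next() would produce
        String token = "href=\"http://www.columbusstate.edu\"";
        int start = token.indexOf("ht");  // 6: the first "ht" is inside "http"
        int end = token.indexOf("/\"");   // -1: the token contains no /" sequence
        System.out.println(start + ", " + end);  // prints: 6, -1
        token.substring(start, end);  // throws StringIndexOutOfBoundsException
                                      // with end - start = -7 in its message
    }
}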

import java.io.IOException;
import java.io.PrintWriter;
import java.net.URL;
import java.util.ArrayList;
import java.util.Scanner;
public class WebCrawler
{
    private static int linkCount = 0;
    public static void main(String[] args) throws IOException
    {

        // Instance variables

        ArrayList<String> links = new ArrayList<String>();
        System.out.println("Enter the website you would like to web crawl");
        Scanner input = new Scanner(System.in);
        String address=input.next();

        // Go to the website

        URL locator = new URL(address);
        Scanner in=new Scanner(locator.openStream());

        String str="";
        PrintWriter out=new PrintWriter("links.txt");

        // Search the page and pull out the links, or at least that's what it should do.

        while(in.hasNextLine())
        {
            str = in.next(); // reads the next whitespace-delimited token, not a whole line
            if(str.contains("href=\"http://"))
            {   
                linkCount++;
                int start = str.indexOf("ht");
                int end = str.indexOf("/\"");
                // substring(start, end) throws StringIndexOutOfBoundsException
                // when the end marker is missing and indexOf returned -1
                if (!links.contains(str.substring(start, end))) {
                    links.add("Line Number " + linkCount + str.substring(start, end));
                }
            }
            else if(str.contains("href=\"https://")){
                linkCount++;
                int start = str.indexOf("ht");
                int end = str.indexOf("://")+15;
                if (!links.contains(str.substring(start, end))) {
                    links.add("Line Number " + linkCount + str.substring(start, end));
                }
            }
        }
        int num = links.size();
        System.out.println(num);
        out.println("Number of links on this webpage is "+linkCount);
        out.println("Links are:");
        for (int i = links.size() - 1; i >= 0; i--) {
           out.println(links.get(i)); 
        }
        out.close();
    }
}

1 Answer:

Answer 0 (score: 0)

If you really want to extract the links from a web page, you are better off using a proper HTML parser than trying to do it by hand. Here is an example using jsoup:

import java.io.IOException;
import java.util.List;
import java.util.ArrayList;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class HTMLUtils {
  private HTMLUtils() {}  // utility class; not meant to be instantiated

  public static List<String> extractLinks(String url) throws IOException {
    final ArrayList<String> result = new ArrayList<String>();

    Document doc = Jsoup.connect(url).get();

    Elements links = doc.select("a[href]");

    // Collect the absolute URL of every anchor on the page;
    // "abs:href" resolves relative links against the base URL
    for (Element link : links) {
      result.add(link.attr("abs:href"));
      // result.add(link.text());  // or collect the link text instead
    }
    }
    return result;
  }


  public static void main(String[] args) throws Exception {
    String site = "http://www.columbusstate.edu";
    List<String> links = HTMLUtils.extractLinks(site);
    for (String link : links) {
      System.out.println(link);
    }
  }
}
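
This assumes the jsoup library (org.jsoup:jsoup) is on the classpath. To reproduce the original program's output format on top of it, a rough sketch (the class name LinkDump and the filtering choices are illustrative, not part of the answer above):

import java.io.PrintWriter;
import java.util.LinkedHashSet;
import java.util.Set;

public class LinkDump {
  public static void main(String[] args) throws Exception {
    // Keep only absolute http/https links, dropping duplicates but preserving order
    Set<String> unique = new LinkedHashSet<String>();
    for (String link : HTMLUtils.extractLinks("http://www.columbusstate.edu")) {
      if (link.startsWith("http://") || link.startsWith("https://")) {
        unique.add(link);
      }
    }

    // Write the same report the original program was aiming for
    PrintWriter out = new PrintWriter("links.txt");
    out.println("Number of links on this webpage is " + unique.size());
    out.println("Links are:");
    for (String link : unique) {
      out.println(link);
    }
    out.close();
  }
}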