I followed a tutorial on writing a basic web crawler in Java and have the basic functionality working.
At the moment it just retrieves the HTML from a website and prints it to the console. I'd like to extend it so that it can filter out details such as the HTML page title and the HTTP status code. 
I found this library: http://htmlparser.sourceforge.net/ ... which I think could do the job for me, but could I do this without an external library?
Here's what I have so far:
public static void main(String[] args) {
    // String representing the URL
    String input = "";
    // Check if an argument was added at the command line
    if (args.length >= 1) {
        input = args[0];
    }
    // If no argument at the command line, use a default
    else {
        input = "http://www.my_site.com/";
        System.out.println("\nNo argument entered so default of " + input
                + " used: \n");
    }
    // Open the test URL and read from its input stream
    try {
        URL testURL = new URL(input);
        BufferedReader reader = new BufferedReader(new InputStreamReader(
                testURL.openStream()));
        // String variable to hold the returned content
        String line = "";
        // Print content to the console until there are no new lines of content
        while ((line = reader.readLine()) != null) {
            System.out.println(line);
        }
        reader.close();
    } catch (Exception e) {
        e.printStackTrace();
        System.out.println("Exception thrown");
    }
}
Answer 0 (score: 1)
There are certainly tools available for HTTP communication. However, if you prefer to implement it yourself, take a look at java.net.HttpURLConnection. It gives you finer-grained control over the HTTP communication. Here is a small sample:
public static void main(String[] args) throws IOException
{
    URL url = new URL("http://www.google.com");
    HttpURLConnection connection = (HttpURLConnection) url.openConnection();
    connection.setRequestMethod("GET");

    String resp = getResponseBody(connection);
    // The HTTP status code of the response, e.g. 200 or 404
    System.out.println("RESPONSE CODE: " + connection.getResponseCode());
    System.out.println(resp);
}

// Reads the full response body from the connection into a String
private static String getResponseBody(HttpURLConnection connection)
        throws IOException
{
    try
    {
        BufferedReader reader = new BufferedReader(new InputStreamReader(
                connection.getInputStream()));
        StringBuilder responseBody = new StringBuilder();
        String line = "";
        while ((line = reader.readLine()) != null)
        {
            responseBody.append(line + "\n");
        }
        reader.close();
        return responseBody.toString();
    }
    catch (IOException e)
    {
        e.printStackTrace();
        return "";
    }
}
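
For the page title, one option (not part of the sample above, and assuming a reasonably well-formed &lt;title&gt; element) is to scan the response body you already have with a case-insensitive regex from java.util.regex, so no external library is needed. A rough sketch:

// Extracts the contents of the first <title> element from an HTML string.
// This is a simplistic, regex-based sketch: it assumes the title element is
// written in a straightforward way; a real HTML parser would be more robust.
private static String getPageTitle(String html)
{
    java.util.regex.Pattern titlePattern = java.util.regex.Pattern.compile(
            "<title[^>]*>(.*?)</title>",
            java.util.regex.Pattern.CASE_INSENSITIVE | java.util.regex.Pattern.DOTALL);
    java.util.regex.Matcher matcher = titlePattern.matcher(html);
    if (matcher.find())
    {
        return matcher.group(1).trim();
    }
    return ""; // no <title> element found
}

Called from the sample above, something like System.out.println("TITLE: " + getPageTitle(resp)); would print the title alongside the status code.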