Question

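I'm new to programming and know very little about HTTP, but I've written Java code that scrapes a website. My code works for pages fetched with an HTTP GET request (based on the input URL), but I don't know how to scrape data that is returned by an HTTP POST request.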

After a brief overview of HTTP, I believe I need to simulate a browser, but I don't know how to do that in Java. This is the website I have been trying to use.

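I need to scrape the source of every page, but the URL does not change when I click the Next button. I have used Firebug in Firefox to see what happens when the button is clicked, but I don't know what I should be looking for.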

My current code for scraping the data is:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public class Scraper {
  private static String month = "11";
  private static String day = "4";
  private static String url = "http://cpdocket.cp.cuyahogacounty.us/SheriffSearch/results.aspx?q=searchType%3dSaleDate%26searchString%3d"+month+"%2f"+day+"%2f2013%26foreclosureType%3d%27NONT%27%2c+%27PAR%27%2c+%27COMM%27%2c+%27TXLN%27"; // the input website to be scraped

  public static String sourcetext; // The source code that has been scraped

  // scrapeWebsite fetches the input URL with an HTTP GET and stores the page source in sourcetext to be parsed.
  public static void scrapeWebsite() throws IOException {

    URL urlconnect = new URL(url); // creates the URL from the variable
    URLConnection connection = urlconnect.openConnection();
    StringBuilder sourcecode = new StringBuilder(); // collects the page source

    // try-with-resources closes the reader even if reading fails
    try (BufferedReader in = new BufferedReader(new InputStreamReader(
                                                                 connection.getInputStream(), "UTF-8"))) {
      String inputLine;
      while ((inputLine = in.readLine()) != null) {
        sourcecode.append(inputLine).append('\n'); // readLine strips line breaks, so add them back
      }
    }
    sourcetext = sourcecode.toString();
  }
}

What is the best way to scrape all of the pages that are loaded through POST requests?

Answer 1

Take a look at the Jersey client interface.
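If you would rather not add a library, the same kind of request can be sketched with the JDK's own HttpURLConnection instead of Jersey. This is only a minimal sketch: the class name PostSketch and the field names you pass in are assumptions for illustration, and the real field names have to be read from the request that Firebug shows when the Next button is clicked.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Hypothetical helper class, not part of the original question.
public class PostSketch {

  // URL-encodes one name or value as UTF-8.
  private static String enc(String s) {
    try {
      return URLEncoder.encode(s, "UTF-8");
    } catch (UnsupportedEncodingException e) {
      throw new IllegalStateException(e); // UTF-8 is always available
    }
  }

  // Builds an application/x-www-form-urlencoded body: name=value&name=value
  public static String encodeForm(Map<String, String> fields) {
    StringBuilder body = new StringBuilder();
    for (Map.Entry<String, String> field : fields.entrySet()) {
      if (body.length() > 0) {
        body.append('&');
      }
      body.append(enc(field.getKey())).append('=').append(enc(field.getValue()));
    }
    return body.toString();
  }

  // Sends the fields as an HTTP POST and returns the response body as text.
  public static String post(String url, Map<String, String> fields) throws IOException {
    byte[] body = encodeForm(fields).getBytes(StandardCharsets.UTF_8);
    HttpURLConnection connection = (HttpURLConnection) new URL(url).openConnection();
    connection.setRequestMethod("POST");
    connection.setDoOutput(true); // we are going to write a request body
    connection.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
    try (OutputStream out = connection.getOutputStream()) {
      out.write(body);
    }
    StringBuilder response = new StringBuilder();
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
      String line;
      while ((line = in.readLine()) != null) {
        response.append(line).append('\n');
      }
    }
    return response.toString();
  }
}
```

The only real difference from a GET is that the parameters travel in the request body rather than in the URL, which is why the address bar does not change between pages.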

Look at the source of each page, determine the URL format for the next page, and then loop.
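When the URL itself stays the same, the information for the next page usually sits in the page source as hidden form fields. The results.aspx address suggests an ASP.NET page, and such pages typically carry state in hidden inputs like __VIEWSTATE, but that is an assumption about this site that should be checked in Firebug. The sketch below (the class name HiddenFieldSketch is hypothetical) collects hidden inputs from a page so they can be sent back with the next request. It assumes double-quoted attributes in the order type, name, value and does not decode HTML entities; a real HTML parser would be more robust.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
public class HiddenFieldSketch {

  // Matches <input ... type="hidden" ... name="..." ... value="..."
  // within a single tag, with the attributes in that order.
  private static final Pattern HIDDEN_INPUT = Pattern.compile(
      "<input[^>]*type=\"hidden\"[^>]*name=\"([^\"]*)\"[^>]*value=\"([^\"]*)\"",
      Pattern.CASE_INSENSITIVE);

  // Returns the hidden field names and values found in the page, in page order.
  public static Map<String, String> hiddenFields(String html) {
    Map<String, String> fields = new LinkedHashMap<>();
    Matcher matcher = HIDDEN_INPUT.matcher(html);
    while (matcher.find()) {
      fields.put(matcher.group(1), matcher.group(2));
    }
    return fields;
  }
}
```

The loop would then be: fetch a page, collect its hidden fields, add whatever field the Next button sets, post them back to the same URL, and repeat until no further page is returned.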

How to scrape a website: HTTP GET vs. HTTP POST?
