I am new to programming and know very little about HTTP, but I have written Java code that scrapes a website. My code works for a "GET" HTTP call (based on the input URL), but I don't know how to scrape the data behind a "POST" HTTP call.
After reading a brief overview of HTTP, I believe I need to simulate a browser, but I don't know how to do that in Java. The website I have been trying to use is the one in the URL in my code below.
I need to scrape the source code of all of the result pages, but the URL does not change when I click each "next" button. I have used Firefox Firebug to look at what happens when the button is clicked, but I don't know what I should be looking for.
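From what little I understand, I am guessing the "next" button sends a POST request with hidden form fields, and that I would have to replicate that request myself. The sketch below is only my best guess, not working code for this site: the field names such as __EVENTTARGET and __VIEWSTATE are placeholders for whatever Firebug actually shows in the POST (I read that .aspx pages usually carry those hidden fields), and I am not sure whether plain HttpURLConnection is even the right approach.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class PostSketch {

    // Sends one form-encoded POST request and returns the response body as a string.
    static String post(String pageUrl, String formBody) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(pageUrl).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true); // needed so we can write a request body
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");

        try (OutputStream out = conn.getOutputStream()) {
            out.write(formBody.getBytes("UTF-8"));
        }

        StringBuilder response = new StringBuilder();
        BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"));
        String line;
        while ((line = in.readLine()) != null) {
            response.append(line);
        }
        in.close();
        return response.toString();
    }

    public static void main(String[] args) throws IOException {
        // Placeholder field names/values -- the real ones would have to be copied
        // from the POST that Firebug shows when the "next" button is clicked.
        String body = "__EVENTTARGET=" + URLEncoder.encode("nextButtonId", "UTF-8")
                + "&__VIEWSTATE=" + URLEncoder.encode("valueCopiedFromThePage", "UTF-8");
        System.out.println(post("http://cpdocket.cp.cuyahogacounty.us/SheriffSearch/results.aspx", body).length());
    }
}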
My current code for scraping the data is:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public class Scraper {

    private static String month = "11";
    private static String day = "4";
    // The input website to be scraped (a GET request built from the sale date)
    private static String url = "http://cpdocket.cp.cuyahogacounty.us/SheriffSearch/results.aspx?q=searchType%3dSaleDate%26searchString%3d" + month + "%2f" + day + "%2f2013%26foreclosureType%3d%27NONT%27%2c+%27PAR%27%2c+%27COMM%27%2c+%27TXLN%27";

    public static String sourcetext; // the source code that has been scraped

    // scrapeWebsite scrapes the input URL and stores the page source as a string to be parsed
    public static void scrapeWebsite() throws IOException {
        URL urlconnect = new URL(url); // creates the URL from the variable
        URLConnection connection = urlconnect.openConnection();
        BufferedReader in = new BufferedReader(new InputStreamReader(
                connection.getInputStream(), "UTF-8"));
        String inputLine;
        StringBuilder sourcecode = new StringBuilder(); // accumulates the source code
        while ((inputLine = in.readLine()) != null) {
            sourcecode.append(inputLine);
        }
        in.close();
        sourcetext = sourcecode.toString();
    }
}
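For reference, the way I use it is essentially just this, and I parse the resulting string afterwards:

    public static void main(String[] args) throws IOException {
        Scraper.scrapeWebsite();                         // fetches the first results page via GET
        System.out.println(Scraper.sourcetext.length()); // the raw HTML that I then parse
    }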
What is the best way to scrape all of the pages for each "POST" call?