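I am writing Java code to retrieve and parse the source code of a web page. The website I am trying to access is: http://cpdocket.cp.cuyahogacounty.us/SheriffSearch/results.aspx?q=searchType%3dSaleDate%26searchString%3d9%2f30%2f2013%26foreclosureType%3d%27NONT%27%2c+%27PAR%27%2c+%27COMM%27%2c+%27TXLN%27
The source code I retrieve covers only the current page, even though there are 11 pages in total. To get the source of the next page I have to click the Next button, which reloads the page with new source code. I need to implement this in my code so that it retrieves the source of every page.
I have read that PhantomJS or CasperJS might be able to do this, but I don't know how to use them.
My code is as follows: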
    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.io.PrintWriter;
    import java.net.URL;
    import java.net.URLConnection;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Scraper class takes a URL string as input and retrieves the source code of the website. It also picks out the needed data.
    public class Scraper {
        private static String url; // the input website to be scraped
        public static String sourcetext; // the source code that has been scraped

        // constructor which allows for the input of a URL
        public Scraper(String url) {
            Scraper.url = url; // url is a static field, so it is assigned through the class name
        }

        // scrapeWebsite scrapes the input URL and stores the result in sourcetext so it can be parsed.
        public static void scrapeWebsite() throws IOException {
            URL urlconnect = new URL(url); // creates the URL from the variable
            URLConnection connection = urlconnect.openConnection(); // connects to the created URL
            BufferedReader in = new BufferedReader(new InputStreamReader(
                    connection.getInputStream(), "UTF-8")); // wraps the response stream in a buffered UTF-8 reader
            String inputLine; // holds the line that was just read
            StringBuilder sourcecode = new StringBuilder(); // collects the source code
            // loop appends to the string builder as long as there is information
            while ((inputLine = in.readLine()) != null)
                sourcecode.append(inputLine); // appends the line to the collected source
            in.close();
            sourcetext = sourcecode.toString(); // converts the collected text to a string
            sourcetext = sourcetext.replace('"', '*'); // replaces the quotes (") with * so the text can be parsed
        }

        // This method parses the data and writes the necessary information to a CSV file
        public static void getPlaintiff() throws IOException {
            PrintWriter docketFile = new PrintWriter("tester.csv", "UTF-8"); // creates the CSV file (change the name between runs; an existing file is overwritten)
            int i = 0;
            // While loop runs through all the data in the source code. There are 14 entries per page.
            while (i < 14) {
                String plaintiffAtty = "PlaintiffAtty_" + i + "*>"; // creates the search string for the plaintiff attorney
                Pattern plaintiffPattern = Pattern.compile("(?<=" + Pattern.quote(plaintiffAtty) + ").*?(?=</span>)"); // creates the pattern for the attorney
                Matcher plaintiffMatcher = plaintiffPattern.matcher(sourcetext); // looks for a match for the attorney
                while (plaintiffMatcher.find()) {
                    docketFile.write(plaintiffMatcher.group() + ", "); // writes the found attorney to the file
                }
                String appraisedValue = "Appraised_" + i + "*>"; // creates the search string for the appraised value
                Pattern appraisedPattern = Pattern.compile("(?<=" + Pattern.quote(appraisedValue) + ").*?(?=</span>)"); // creates the pattern for the value
                Matcher appraisedMatcher = appraisedPattern.matcher(sourcetext); // looks for a match for the appraised value
                while (appraisedMatcher.find()) {
                    docketFile.write(appraisedMatcher.group() + "\n"); // writes the found value to the file
                }
                i++;
            }
            docketFile.close(); // closes the file
        }
    }
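Answer 0 (score: 0)
Here is your new, reformatted, redesigned and revised code; now that it is actually understandable, you may be able to solve your problem yourself. (If you are using Java 1.6 or earlier, you may need to revert the try-with-resources parts, since they were only added in 1.7.)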
    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.io.PrintWriter;
    import java.net.URL;
    import java.net.URLConnection;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    /**
     * This class contains methods for picking
     * out needed data from the source of a website.
     */
    public class Scraper {
        /**
         * This method scrapes the input URL.
         * @param url The address of the page to download.
         * @return A string containing the data from the webpage.
         * @throws IOException if there was a problem with accessing the website.
         */
        public static String scrapeWebsite(String url) throws IOException {
            String inputLine;
            StringBuilder sourceText = new StringBuilder();
            URL urlconnect = new URL(url);
            URLConnection connection = urlconnect.openConnection();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(connection.getInputStream(), "UTF-8"))) {
                while ((inputLine = in.readLine()) != null)
                    sourceText.append(inputLine);
            }
            // Double quotes become '*' so the patterns below need no quote escaping.
            return sourceText.toString().replace('"', '*');
        }

        /**
         * This method parses through the data and adds the necessary information to
         * a specified .csv file.
         * @param source The data source, for example that returned by
         * {@link #scrapeWebsite(String)}.
         * @param targetFile The file path for the destination .csv file.
         * @throws IOException if there was a problem with accessing the file.
         */
        public static void getPlaintiff(CharSequence source, String targetFile)
                throws IOException {
            try (PrintWriter docketFile = new PrintWriter(targetFile, "UTF-8")) {
                for (int i = 0; i < 14; i++) {
                    // The regex contains the index, so each matcher is built inside the loop.
                    Matcher plaintiffMatcher = Pattern.compile(
                            "(?<=PlaintiffAtty_" + i + "\\*>).*?(?=</span>)")
                            .matcher(source);
                    while (plaintiffMatcher.find())
                        docketFile.println(plaintiffMatcher.group());
                    Matcher appraisedMatcher = Pattern.compile(
                            "(?<=Appraised_" + i + "\\*>).*?(?=</span>)")
                            .matcher(source);
                    while (appraisedMatcher.find())
                        docketFile.println(appraisedMatcher.group());
                }
            }
        }
    }
(Note that new errors may have been introduced; just fix them, it's no big deal.)
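For anyone on Java 1.6 or earlier, the try-with-resources part could probably be reverted along the lines below. This is only a sketch: the class name `LegacyRead` and the helper `readAll` are made up for illustration, and a `StringReader` stands in for the `InputStreamReader` wrapped around the URL connection so the example runs without network access.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class LegacyRead {
    /**
     * Reads everything from the given reader line by line and closes it
     * in a finally block instead of relying on try-with-resources.
     * (Hypothetical helper, not part of the Scraper class above.)
     */
    public static String readAll(Reader reader) throws IOException {
        StringBuilder text = new StringBuilder();
        BufferedReader in = new BufferedReader(reader);
        try {
            String line;
            while ((line = in.readLine()) != null) {
                text.append(line); // line terminators are dropped, as in scrapeWebsite
            }
        } finally {
            in.close(); // runs whether or not readLine() threw
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in input; the real code would pass the reader built from the connection.
        System.out.println(readAll(new StringReader("first\nsecond")));
    }
}
```

The same try/finally shape would apply to the `PrintWriter` in `getPlaintiff`.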
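Edit: I realized that the matchers really do have to be created inside the loop, because the index is needed to build the regex; I also replaced the docketFile.write calls with the simpler docketFile.println calls.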
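To show why the index has to be part of the regex, here is a small sketch of the same lookbehind/lookahead pattern run against a made-up snippet. The class `FieldPatternDemo`, the helper `extract` and the sample markup are all assumptions for illustration; the real page's markup may look different.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FieldPatternDemo {
    /**
     * Returns every value that directly follows prefix + index + "*>"
     * and ends just before the next "</span>". (Hypothetical helper.)
     */
    public static List<String> extract(CharSequence source, String prefix, int index) {
        // The pattern depends on the index, so it is compiled once per index.
        Pattern pattern = Pattern.compile(
                "(?<=" + Pattern.quote(prefix + index + "*>") + ").*?(?=</span>)");
        Matcher matcher = pattern.matcher(source);
        List<String> values = new ArrayList<String>();
        while (matcher.find()) {
            values.add(matcher.group());
        }
        return values;
    }

    public static void main(String[] args) {
        // Invented markup, with quotes already replaced by '*' as scrapeWebsite does.
        String sample = "<span id=*PlaintiffAtty_0*>Smith</span>"
                + "<span id=*Appraised_0*>$50,000.00</span>"
                + "<span id=*PlaintiffAtty_1*>Jones</span>";
        for (int i = 0; i < 2; i++) {
            System.out.println(extract(sample, "PlaintiffAtty_", i));
            System.out.println(extract(sample, "Appraised_", i));
        }
    }
}
```

Returning a list instead of writing straight to the file is just a choice made here so the matching can be checked on its own.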