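I am writing a program that caches every web page it can find. It works by caching a site to a file and then looking for all valid URLs in that file. It then recursively scans every valid URL it found. The problem is that I can't find a regular expression, or any other way, to find valid URLs. Here is my code so far: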
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;

public static void findAllPages(String baseURL) throws Exception {
    URL url = new URL(baseURL);
    StringBuilder cnt = new StringBuilder(); // HTML content read from URL
    try (BufferedReader bf = new BufferedReader(new InputStreamReader(url.openStream()))) {
        String ln; // Line
        while ((ln = bf.readLine()) != null) { // Read content
            cnt.append(ln).append("\n");
        }
    }
    // Search the downloaded page content (not the URL itself) for links
    ArrayList<String> val = findUrlsInString(cnt.toString());
    int count = val.size();
    for (int i = 0; i < count; i++) { // Find content of links on page
        try {
            findAllPages(val.get(i));
        } catch (Exception e) {
            // Invalid URL
        }
    }
}

public static ArrayList<String> findUrlsInString(String content) {
    // Need to filter out URLs here and put them in an ArrayList
    return new ArrayList<>(); // Placeholder so the code compiles
}
Note: the code above does not include any file reading or writing.
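
To make the missing piece concrete, here is a minimal sketch of one way findUrlsInString might be filled in. It is not a definitive answer and rests on several assumptions: links are taken only from quoted href attributes, the method gets an extra baseURL parameter (not in the code above) so that relative links can be resolved, and only http and https results are kept. The class name LinkExtractor is hypothetical. A regular expression like this will miss or misread some real-world HTML, so a proper HTML parser may be a better fit.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is an assumption for this sketch.
class LinkExtractor {
    // Matches href="..." or href='...' and captures the attribute value.
    // Assumption: unquoted attribute values are ignored.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*([\"'])(.*?)\\1", Pattern.CASE_INSENSITIVE);

    // baseURL is an extra parameter (an assumption) used to resolve relative links.
    static ArrayList<String> findUrlsInString(String baseURL, String content) {
        Set<String> found = new LinkedHashSet<>(); // keeps order, drops duplicates
        URI base = URI.create(baseURL);
        Matcher m = HREF.matcher(content);
        while (m.find()) {
            String raw = m.group(2).trim();
            try {
                URI resolved = base.resolve(raw); // turns "/b" or "c.html" into an absolute URI
                String scheme = resolved.getScheme();
                // Keep only web links; this skips mailto:, javascript:, and similar
                if ("http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme)) {
                    found.add(resolved.toString());
                }
            } catch (IllegalArgumentException e) {
                // Link text is not a syntactically valid URI; skip it
            }
        }
        return new ArrayList<>(found);
    }
}
```

With this shape, the call in findAllPages would become something like LinkExtractor.findUrlsInString(baseURL, cnt.toString()). Because pages often link back to each other, the recursion would probably also need a set of already visited URLs so that it terminates.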