Question

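I am writing a program that caches every web page it can find. It works by caching a site to a file and then finding all valid URLs in that file. It then recursively scans each of those URLs. The problem is that I cannot find a regular expression, or any other way, to find the valid URLs. Here is my code so far: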

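import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;

public static void findAllPages(String baseURL) throws Exception {
    URL url = new URL(baseURL);
    StringBuilder cnt = new StringBuilder(); // HTML content read from URL

    // try-with-resources closes the reader even if reading fails
    try (BufferedReader bf = new BufferedReader(new InputStreamReader(url.openStream()))) {
        String ln; // Line
        while ((ln = bf.readLine()) != null) { // Read content
            cnt.append(ln).append("\n");
        }
    }

    // Search the downloaded HTML (not the base URL string) for links
    ArrayList<String> val = findUrlsInString(cnt.toString());

    for (String link : val) { // Find content of links on page
        try {
            findAllPages(link);
        } catch (Exception e) {
            // Invalid URL
        }
    }
}

public static ArrayList<String> findUrlsInString(String html) {
    // Need to filter out URLs here and put them in an ArrayList
    return new ArrayList<>(); // placeholder so the code compiles
}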

Note: reading from and writing to the file is omitted from the code above.

Answer 1

You should use an HTML parser rather than a regular expression. One example of such a parser is jsoup.
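
To make the suggestion concrete, here is a minimal sketch of how the question's findUrlsInString stub could be filled in with jsoup. It assumes jsoup is on the classpath. The class name LinkExtractor and the extra baseUrl parameter are assumptions added for illustration (the stub in the question takes only one argument); the base URL is needed so that relative links such as /about can be resolved to absolute ones.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.Set;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class; the name is not from the original post.
class LinkExtractor {

    // Returns the absolute http(s) URLs of all <a href="..."> links in the HTML.
    // baseUrl (an assumed extra parameter) is the address the HTML was fetched
    // from and is used to resolve relative links.
    static ArrayList<String> findUrlsInString(String html, String baseUrl) {
        Document doc = Jsoup.parse(html, baseUrl);
        Set<String> urls = new LinkedHashSet<>(); // keeps order, drops duplicates

        for (Element link : doc.select("a[href]")) {
            String abs = link.absUrl("href"); // empty string if it cannot be resolved
            // Skip mailto:, javascript: and other non-web schemes
            if (abs.startsWith("http://") || abs.startsWith("https://")) {
                urls.add(abs);
            }
        }
        return new ArrayList<>(urls);
    }
}
```

With this version, the call in findAllPages would become something like LinkExtractor.findUrlsInString(cnt.toString(), baseURL). Since pages commonly link back to each other, the recursive crawl would probably also need a set of already visited URLs to avoid looping forever.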
