I am trying to parse an HTML page using jsoup. I check the Content-Type behind each link and want to print every link whose type is not text/html. After retrieving the content type I use pattern matching. With the code below, links of type text/html are still being printed:
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.commons.validator.routines.UrlValidator;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class HelloWorld {
    public static void main(String[] args) throws IOException {
        UrlValidator urlValidator = new UrlValidator();
        String url = "https://www.google.com";
        Document doc = Jsoup.connect(url).get(); // fetch and parse the HTML page at url
        Elements links = doc.select("a[href]");
        for (Element link : links) {
            if (urlValidator.isValid(link.attr("href"))) { // skip relative or malformed hrefs
                URL portfolio_url = new URL(link.attr("href"));
                URLConnection c = portfolio_url.openConnection();
                String link_type = c.getContentType();
                System.out.println(link_type);
                if (link_type != null) {
                    Pattern pattern = Pattern.compile(link_type, Pattern.CASE_INSENSITIVE); // case-insensitive matching
                    Matcher matcher = pattern.matcher("text/html");
                    if (!matcher.find()) {
                        System.out.println(link.attr("href"));
                    }
                }
            }
        }
    }
}
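A likely cause, sketched under the assumption that the servers behave as most do: getContentType() usually returns the full header value, e.g. "text/html; charset=UTF-8", while the code above compiles that whole header as the regex and searches the short literal "text/html" for it, so the match fails and HTML links are printed anyway. Swapping the pattern and the input makes the comparison work; isHtml is a hypothetical helper name, not part of the original code:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ContentTypeCheck {
    // Hypothetical helper: true when a Content-Type header denotes HTML.
    static boolean isHtml(String contentType) {
        if (contentType == null) {
            return false;
        }
        // Compile the fixed literal "text/html" and search the header,
        // not the other way around, so a trailing "; charset=..." still matches.
        Matcher m = Pattern.compile("text/html", Pattern.CASE_INSENSITIVE).matcher(contentType);
        return m.find();
    }

    public static void main(String[] args) {
        System.out.println(isHtml("text/html; charset=UTF-8")); // true
        System.out.println(isHtml("image/png"));                // false
    }
}
```

With this direction of matching, the original loop would print a link only when isHtml(link_type) is false.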
Answer 0 (score: 0)
You can use the Linux utility wget to achieve this:
wget -r www.mytargetsite.com
Then run the command below to list all the URLs:
find www.mytargetsite.com
Here is sample output:
$ wget -r www.blackorange.biz
$ find www.blackorange.biz/
www.blackorange.biz/
www.blackorange.biz/services.html
www.blackorange.biz/contact.html
www.blackorange.biz/images
www.blackorange.biz/images/projectimg1.jpg
Note: this will also download all of the pages for you.
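If you only want the list of URLs and not a local copy, wget's --spider mode visits pages without saving them; a rough sketch (www.mytargetsite.com is the same placeholder as above, and the grep pattern is one simple way to pull URLs out of wget's log output):

```shell
# --spider: check pages but do not save them; -r: recurse; -nv: terse log to stderr.
# Extract the visited URLs from the log and de-duplicate them.
wget --spider -r -nv www.mytargetsite.com 2>&1 | grep -o 'https\?://[^ ]*' | sort -u
```

This trades disk usage for an extra pass over each page, since nothing is kept locally.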