Question

I only want to download pages whose content type is "text/html", and skip files such as pdf / mp4 / rar ...

Right now my code looks like this:

 Connection connection = Jsoup.connect(linkInfo.getLink()).followRedirects(false).validateTLSCertificates(false).userAgent(USER_AGENT);
 Document htmlDocument = connection.get();
 if (!connection.response().contentType().contains("text/html")) {
     return;
 }

Is there something like this?

Jsoup.connect(linkInfo.getLink()).contentTypeOnly("text/html");

Answer 1

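If you mean that you need to know whether a file is HTML before actually downloading it, you can use a HEAD request. This requests only the headers, so you can check whether the content type is text/html before downloading the body. The approach you are using does not really work, because it downloads the file and parses it as HTML before the check runs, which throws an exception on non-HTML files.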

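 // Send a HEAD request first so that only the headers are transferred.
 Connection connection = Jsoup.connect(linkInfo.getLink())
         .method(Connection.Method.HEAD)
         .validateTLSCertificates(false)
         .followRedirects(false)
         .userAgent(USER_AGENT);
 Connection.Response head = connection.execute();

 // contentType() can be null when the server sends no Content-Type header.
 String contentType = head.contentType();
 if (contentType == null || !contentType.contains("text/html"))
     return;

 // The resource is HTML, so fetch and parse it with a normal GET.
 // head.url() returns a java.net.URL, so convert it to a String for Jsoup.connect.
 Document html = Jsoup.connect(head.url().toExternalForm())
         .validateTLSCertificates(false)
         .followRedirects(false)
         .userAgent(USER_AGENT)
         .get();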
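The `contains("text/html")` check can be made a little stricter. Below is a minimal sketch, using only the Java standard library, of a helper that tolerates a missing header, ignores parameters such as `; charset=UTF-8`, and compares case-insensitively. The class name `ContentTypes` and the method `isHtml` are hypothetical names chosen for illustration; they are not part of jsoup.

```java
import java.util.Locale;

// Hypothetical helper, not part of jsoup: decides whether a Content-Type
// header value names the text/html media type.
final class ContentTypes {
    private ContentTypes() {}

    static boolean isHtml(String contentType) {
        if (contentType == null) {
            return false; // the server may omit the header entirely
        }
        // Drop parameters such as "; charset=UTF-8" before comparing.
        int semicolon = contentType.indexOf(';');
        String mediaType = semicolon >= 0
                ? contentType.substring(0, semicolon)
                : contentType;
        // Media type names are case-insensitive.
        return mediaType.trim().toLowerCase(Locale.ROOT).equals("text/html");
    }
}
```

With this helper, the check on the HEAD response could be written as `if (!ContentTypes.isHtml(head.contentType())) return;`. Unlike a substring match, it would not accept a value that merely contains "text/html" somewhere inside a different media type.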
