Question

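I am trying to check the content type of a URL using the code below.

Interestingly, the content type for the given URL (http://www.jbssinc.com/inv_pr_pdf/2007-05-08.pdf) comes back as text/html; charset=iso-8859-1, even though it is a PDF document. I would like to understand why.

Here is my code: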

import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;

public class ContentTypeCheck {
    public static void main(String[] args) throws MalformedURLException {
        URLConnection urlConnection = null;
        URL url = new URL("http://www.jbssinc.com/inv_pr_pdf/2007-05-08.pdf");
        try {
            urlConnection = url.openConnection();
            urlConnection.setConnectTimeout(10 * 1000);
            urlConnection.setReadTimeout(10 * 1000);
            urlConnection.connect();
        } catch (IOException e) {
            System.out.println("Error in establishing connection.\n");
        }
        String contentType = "";
        /* If we were able to get a connection ---> */
        if (urlConnection != null) {
            // Reads the Content-Type header of whatever response the server sent.
            contentType = urlConnection.getContentType();
        }
        System.out.println(contentType);
    }
}

Answer 1

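When I access this page from Java and actually try to load it, I get a 403 Forbidden error. Such error pages are HTML pages, not PDF files, which is why you get the content type you are seeing.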
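To see this for yourself, it can help to print the HTTP status code next to the content type. The following is only a minimal sketch: the class name StatusAwareContentType and the helpers isSuccess and describe are hypothetical names made up for illustration, and it assumes the URL is served over HTTP so the connection can be cast to HttpURLConnection.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper class, not part of the original question's code.
public class StatusAwareContentType {

    /** Returns true only for 2xx status codes. */
    static boolean isSuccess(int statusCode) {
        return statusCode >= 200 && statusCode < 300;
    }

    /** Builds a one-line summary from the status code and the Content-Type header. */
    static String describe(int statusCode, String contentType) {
        if (isSuccess(statusCode)) {
            return "OK " + statusCode + ": " + contentType;
        }
        // On an error status the header describes the error page, not the requested file.
        return "HTTP error " + statusCode + ": Content-Type of the error page is " + contentType;
    }

    public static void main(String[] args) throws IOException {
        URL url = new URL("http://www.jbssinc.com/inv_pr_pdf/2007-05-08.pdf");
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        connection.setConnectTimeout(10 * 1000);
        connection.setReadTimeout(10 * 1000);
        try {
            // getResponseCode() sends the request if it has not been sent yet.
            int status = connection.getResponseCode();
            System.out.println(describe(status, connection.getContentType()));
        } finally {
            connection.disconnect();
        }
    }
}
```

If the server answers with 403, the summary makes it clear that text/html belongs to the error page rather than to the PDF.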

This site probably detects your browser or uses some other mechanism to block automated downloads, which is why it works in Chrome, Firefox, and IE but not from Java.

Your code works with other URLs, such as https://partners.adobe.com/public/developer/en/xml/AdobeXMLFormsSamples.pdf.

For this web server, you can connect normally if you set the User-Agent to a typical browser value.

Try adding this line immediately before urlConnection.connect():

urlConnection.setRequestProperty("User-Agent", "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2");

See this answer for more information about setting the User-Agent. Before doing this, however, make sure you are not somehow violating the site's terms of service.
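Put together, the change might look roughly like the sketch below. The class name BrowserUserAgentRequest and the prepare helper are hypothetical; the User-Agent string is the one quoted above, and whether the server accepts it today is an assumption I have not re-verified.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical wrapper class for illustration only.
public class BrowserUserAgentRequest {

    // The browser-like User-Agent value quoted in this answer.
    static final String BROWSER_USER_AGENT =
        "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2";

    /** Creates a configured HTTP connection without opening it yet. */
    static HttpURLConnection prepare(URL url, String userAgent) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        connection.setConnectTimeout(10 * 1000);
        connection.setReadTimeout(10 * 1000);
        // Request headers have to be set before the connection is opened.
        connection.setRequestProperty("User-Agent", userAgent);
        return connection;
    }

    public static void main(String[] args) throws IOException {
        URL url = new URL("http://www.jbssinc.com/inv_pr_pdf/2007-05-08.pdf");
        HttpURLConnection connection = prepare(url, BROWSER_USER_AGENT);
        try {
            connection.connect();
            // Print the status too, so an error page is not mistaken for the PDF.
            System.out.println(connection.getResponseCode() + " " + connection.getContentType());
        } finally {
            connection.disconnect();
        }
    }
}
```

Calling setRequestProperty after the connection has been opened throws an IllegalStateException, which is why the header is set in prepare and not after connect().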

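The usual way to check whether a website explicitly prohibits applications from downloading its content is the http://example.com/robots.txt file. Here, that would be http://www.jbssinc.com/robots.txt. This file does not disallow robots (your program) from downloading this particular file, so I think you are fine to spoof your user agent. The fact that Java is being blocked is more likely than not a user error.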
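If you want to do that robots.txt check in code rather than by eye, a very simplified sketch could look like this. Everything here is an assumption for illustration: the class RobotsTxtCheck and its methods are hypothetical, the sample rules are made up and are not the contents of the real jbssinc.com file, and the matching only handles plain Disallow prefixes in the "User-agent: *" group (no Allow rules, wildcards, or agent-specific groups).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, deliberately simplified robots.txt check.
public class RobotsTxtCheck {

    /** Collects the Disallow prefixes listed for all agents ("User-agent: *"). */
    static List<String> disallowedPrefixes(String robotsTxt) {
        List<String> prefixes = new ArrayList<>();
        boolean inWildcardGroup = false;
        boolean previousWasUserAgent = false;
        for (String rawLine : robotsTxt.split("\\r?\\n")) {
            String line = rawLine;
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash); // strip comments
            }
            line = line.trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or malformed line
            }
            String field = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            if (field.equalsIgnoreCase("User-agent")) {
                // Consecutive User-agent lines share one group of rules.
                if (!previousWasUserAgent) {
                    inWildcardGroup = false;
                }
                if (value.equals("*")) {
                    inWildcardGroup = true;
                }
                previousWasUserAgent = true;
            } else {
                previousWasUserAgent = false;
                // An empty Disallow value means nothing is disallowed.
                if (field.equalsIgnoreCase("Disallow") && inWildcardGroup && !value.isEmpty()) {
                    prefixes.add(value);
                }
            }
        }
        return prefixes;
    }

    /** Returns true if the path starts with any wildcard-group Disallow prefix. */
    static boolean isDisallowed(String robotsTxt, String path) {
        for (String prefix : disallowedPrefixes(robotsTxt)) {
            if (path.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Made-up sample rules, not the real file from the site.
        String sample = "User-agent: *\nDisallow: /private/\n";
        System.out.println(isDisallowed(sample, "/inv_pr_pdf/2007-05-08.pdf"));
        System.out.println(isDisallowed(sample, "/private/report.pdf"));
    }
}
```

With the made-up sample rules, the first path is not disallowed and the second is. For anything beyond a quick sanity check, a proper robots.txt parser is a better choice than this prefix matching.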

Further reading: Is using a faked user agent allowed?

Why am I getting the content type of a PDF back as HTML?
