Question

How can I use HtmlUnit to download a PDF link from a website? By default, WebClient.getPage() returns an HtmlPage, which does not handle PDF files.

Answer 1

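The answer is that WebClient.getPage() returns an UnexpectedPage when the response is not HTML (or another type it knows how to parse), as is the case for a PDF. You can then read the PDF as an input stream and save it.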

private void grabPdf(String urlNow)
{
    OutputStream outStream =null;
    InputStream is = null;
    try
    {
        if(urlNow.endsWith(".pdf"))
        {
            final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_45);
            try
            {
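                // setWebClientOptions is the author's own helper (not shown here).
                setWebClientOptions(webClient);
                // A non-HTML response such as a PDF comes back as an UnexpectedPage.
                final UnexpectedPage pdfPage = webClient.getPage(urlNow);
                is = pdfPage.getWebResponse().getContentAsStream();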

                String fileName = "myfilename";
                fileName = fileName.replaceAll("[^A-Za-z0-9]", "");

                File targetFile = new File(outputPath + File.separator + fileName  + ".pdf");
                outStream = new FileOutputStream(targetFile);
                byte[] buffer = new byte[8 * 1024];
                int bytesRead;
                while ((bytesRead = is.read(buffer)) != -1)
                {
                    outStream.write(buffer, 0, bytesRead);
                }


            }
            catch (Exception e)
            {
                NioLog.getLogger().error(e.getMessage(), e);
            }
            finally
            {
                webClient.close();
                if(null!=is)
                {
                    is.close();
                }
                if(null!=outStream)
                {
                    outStream.close();
                }
            }
        }
    }
    catch (Exception e)
    {
        NioLog.getLogger().error(e.getMessage(), e);
    }

}

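Side note: I did not use try-with-resources because the output stream can only be initialized inside the try block. I could have split this into two methods, but that would be considerably slower for a programmer to read and understand.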
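
For comparison, the copy step on its own can be written with try-with-resources. This is only a sketch under a few assumptions: the class name StreamSaver and the method save are made up for illustration, and the block deliberately uses only the JDK so it runs without HtmlUnit on the classpath. In the answers here, the stream passed in would be the one returned by pdfPage.getWebResponse().getContentAsStream().

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of HtmlUnit or of the original answer.
final class StreamSaver {
    private StreamSaver() {}

    /**
     * Copies everything from source into target, replacing any existing file.
     * The source stream is closed when the copy finishes or fails.
     * Returns the number of bytes written.
     */
    static long save(InputStream source, Path target) throws IOException {
        // Declaring the stream in the try header closes it automatically.
        try (InputStream in = source) {
            // Files.copy opens and closes the output file itself,
            // so no FileOutputStream needs to be managed here.
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```

With the code above, the call would look something like StreamSaver.save(pdfPage.getWebResponse().getContentAsStream(), targetFile.toPath()). Whether this reads better than the explicit finally block is a matter of taste.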

Answer 2

private boolean grabPdf(String url, File output) {
    FileOutputStream outStream = null;
    InputStream is = null;
    try {
        final WebClient webClient = new WebClient(BrowserVersion.BEST_SUPPORTED);
        try {
            final UnexpectedPage pdfPage = webClient.getPage(url);
            is = pdfPage.getWebResponse().getContentAsStream();
            outStream = new FileOutputStream(output);
            byte[] buffer = new byte[8 * 1024];
            int bytesRead;
            while ((bytesRead = is.read(buffer)) != -1) {
                outStream.write(buffer, 0, bytesRead);
            }
            return true;
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            if(webClient != null)
                webClient.close();
            if(is != null)
                is.close();
            if(outStream != null)
                outStream.close();
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    return false;
}

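This was suggested as an edit but was rejected. It improves on the original answer in the following ways:

Returns a boolean indicating whether the file was downloaded
Works with links that do not end in .pdf
Takes a File parameter for the destination instead of hard-coding it in the method
Changes FIREFOX to BEST_SUPPORTED, since that is a more general recommendation (though users may want to change it to suit their needs)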
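
Once the .pdf suffix check is gone, nothing in the method confirms that the response really is a PDF. One possible safeguard, which is an addition and not part of either answer, is to look at the first bytes of the content: PDF files normally start with the header %PDF-. The class name PdfSniffer and the method looksLikePdf below are hypothetical, and the sketch uses only the JDK.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of HtmlUnit or of the original answers.
final class PdfSniffer {
    // A PDF file normally begins with these five ASCII bytes.
    private static final byte[] MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    private PdfSniffer() {}

    /**
     * Returns true if the given leading bytes start with the PDF header.
     * Returns false for null or for input shorter than the header.
     */
    static boolean looksLikePdf(byte[] head) {
        if (head == null || head.length < MAGIC.length) {
            return false;
        }
        // Compare only the first MAGIC.length bytes against the header.
        return Arrays.equals(Arrays.copyOf(head, MAGIC.length), MAGIC);
    }
}
```

A caller could read the first few bytes of the saved file and pass them to looksLikePdf before treating the download as successful. This is a heuristic only; some PDF readers also accept files with a few bytes of junk before the header.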
