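How do I use HtmlUnit to download a PDF link from a website? By default, WebClient.getPage() returns an HtmlPage, which does not handle PDF files.
Answer 0 (score: 1)
The answer is that WebClient.getPage() returns an UnexpectedPage when the response is not HTML (or another type HtmlUnit has a dedicated page class for, such as plain text or XML). You can then read the PDF as an input stream and save it.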
private void grabPdf(String urlNow)
{
    OutputStream outStream = null;
    InputStream is = null;
    try
    {
        // Only URLs that end in ".pdf" are handled.
        if (urlNow.endsWith(".pdf"))
        {
            final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_45);
            try
            {
                // setWebClientOptions, outputPath and NioLog are defined elsewhere (not shown in this answer).
                setWebClientOptions(webClient);
                // A PDF response comes back as an UnexpectedPage rather than an HtmlPage.
                final UnexpectedPage pdfPage = webClient.getPage(urlNow);
                is = pdfPage.getWebResponse().getContentAsStream();
                String fileName = "myfilename";
                // Strip every character that is not a letter or digit.
                fileName = fileName.replaceAll("[^A-Za-z0-9]", "");
                File targetFile = new File(outputPath + File.separator + fileName + ".pdf");
                outStream = new FileOutputStream(targetFile);
                // Copy the response body to the file in 8 KB chunks.
                byte[] buffer = new byte[8 * 1024];
                int bytesRead;
                while ((bytesRead = is.read(buffer)) != -1)
                {
                    outStream.write(buffer, 0, bytesRead);
                }
            }
            catch (Exception e)
            {
                NioLog.getLogger().error(e.getMessage(), e);
            }
            finally
            {
                webClient.close();
                if (null != is)
                {
                    is.close();
                }
                if (null != outStream)
                {
                    outStream.close();
                }
            }
        }
    }
    catch (Exception e)
    {
        // Catches exceptions thrown while closing the streams in the finally block.
        NioLog.getLogger().error(e.getMessage(), e);
    }
}
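Side note: I did not use try-with-resources because the output stream can only be initialized inside the try block. I could have split this into two methods, but that would make it slower for a programmer to read and follow.
Answer 1 (score: 1)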
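For comparison with the side note above, here is a minimal sketch of how the copy step could be written with try-with-resources after all, assuming the stream copy is pulled into its own helper. The names StreamSaver and save are hypothetical and do not appear in either answer. The helper uses only the JDK, so it does not depend on HtmlUnit itself.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper, not part of the original answers.
final class StreamSaver {

    private StreamSaver() {
    }

    /**
     * Copies everything from source into target.
     * Returns true on success and false if an I/O error occurred.
     */
    static boolean save(InputStream source, File target) {
        // Both resources are declared in the try header, so they are closed
        // automatically, in reverse order, even if the copy loop throws.
        try (InputStream in = source;
             OutputStream out = new FileOutputStream(target)) {
            byte[] buffer = new byte[8 * 1024];
            int bytesRead;
            while ((bytesRead = in.read(buffer)) != -1) {
                out.write(buffer, 0, bytesRead);
            }
            return true;
        } catch (IOException e) {
            e.printStackTrace();
            return false;
        }
    }
}
```

With HtmlUnit, the stream passed in would presumably be the one from pdfPage.getWebResponse().getContentAsStream(), as in the answer's code, while the WebClient would still be closed by the caller.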
private boolean grabPdf(String url, File output) {
    FileOutputStream outStream = null;
    InputStream is = null;
    try {
        final WebClient webClient = new WebClient(BrowserVersion.BEST_SUPPORTED);
        try {
            // A PDF response comes back as an UnexpectedPage rather than an HtmlPage.
            final UnexpectedPage pdfPage = webClient.getPage(url);
            is = pdfPage.getWebResponse().getContentAsStream();
            outStream = new FileOutputStream(output);
            // Copy the response body to the file in 8 KB chunks.
            byte[] buffer = new byte[8 * 1024];
            int bytesRead;
            while ((bytesRead = is.read(buffer)) != -1) {
                outStream.write(buffer, 0, bytesRead);
            }
            return true;
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            if (webClient != null)
                webClient.close();
            if (is != null)
                is.close();
            if (outStream != null)
                outStream.close();
        }
    } catch (Exception e) {
        // Catches exceptions thrown while closing the streams in the finally block.
        e.printStackTrace();
    }
    // Reached only if the download or the save failed.
    return false;
}
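This was suggested as an edit to the original answer, but the edit was rejected. It improves on the original answer in two ways: it returns a boolean indicating whether the file was downloaded, and it takes a File parameter for the destination instead of hard-coding it in the method.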