I am using iText to convert HTML to PDF. The HTML can be in any language, so for testing I have the following HTML:
<html>
<head/>
<body>
</div>
<div dir="ltr">
<br />
<div style="color:rgb(34,34,34);font-family:arial,sans-serif;font-size:12.8px;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial">
<div>
ů Ů
</div>
<div>
<br />
</div>
<div>
š č ť ý á í é ú Ě Ň
</div>
<div>
<br />
</div>
<div>
δ ζ ψ φ ξ λ μ π ρ ε ς α σ
</div>
<div>
<br />
</div>
<div>
ΔΣΨΦΓΩΞΛΠΘ
</div>
<br />ı ğ ç ö ü ş İ Ğ Ç Ö Ü Ş
<br />
<br />Č Ć Š Ž Đ
<br />
</div>
</div>
</body>
</html>
I use "Tidy" (JTidy) to clean up and process this HTML:
public String convertToXHTML1(String htmlText) throws FileNotFoundException {
    InputStream is = new ByteArrayInputStream(htmlText.getBytes());
    OutputStream os = new OutputStream() {
        private StringBuilder string = new StringBuilder();
        @Override
        public void write(int b) throws IOException {
            this.string.append((char) b);
        }
        public String toString() {
            return this.string.toString();
        }
    };
    Tidy tidy = new Tidy();
    tidy.setXHTML(true);
    tidy.setQuiet(false);
    tidy.setShowWarnings(true);
    tidy.setShowErrors(1);
    tidy.setMakeClean(true);
    tidy.setForceOutput(true);
    tidy.setNumEntities(false);
    tidy.setInputEncoding("utf8");
    tidy.setOutputEncoding("raw");
    //tidy.setRawOut(true);
    org.w3c.dom.Document doc = tidy.parseDOM(is, os);//in,out);
    String xhtmlText = os.toString();
    return xhtmlText;
}
And the result is:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator"
content="HTML Tidy for Java (vers. 2009-12-01), see jtidy.sourceforge.net" />
<meta charset="utf-8" />
<title></title>
</head>
<body>
<div dir="ltr"><br />
<div
style="color:rgb(34,34,34);font-family:arial,sans-serif;font-size:12.8px;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial">
<div>ů Ů</div>
<div><br />
</div>
<div>š č ť ý á í é
ú Ě Ň</div>
<div><br />
</div>
<div>δ ζ ψ φ ξ λ μ π ρ
ε ς α σ </div>
<div><br />
</div>
<div>
ΔΣΨΦΓΩΞΛΠΘ</div>
</div>
<br />
ı ğ ç ö ü ş İ Ğ Ç
Ö Ü Ş<br />
<br />
Č Ć Š Ž Đ<br />
</div>
</body>
</html>
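Since this Tidy output is what gets handed to the next step, one way to narrow down where characters are lost is to check programmatically that each test character still occurs in the returned string, rather than relying on how a console or log renders it. The following is only a minimal sketch using the standard library; the class name `SurvivalCheck`, the method `missingFrom`, and the sample string in `main` are hypothetical and stand in for the real return value of `convertToXHTML1`.

```java
public class SurvivalCheck {

    // Returns the non-whitespace characters of `expected` that do not occur in `text`.
    static String missingFrom(String text, String expected) {
        StringBuilder missing = new StringBuilder();
        for (int i = 0; i < expected.length(); i++) {
            char c = expected.charAt(i);
            if (!Character.isWhitespace(c) && text.indexOf(c) < 0) {
                missing.append(c);
            }
        }
        return missing.toString();
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the string returned by convertToXHTML1(...).
        // Unicode escapes keep the check independent of the source file encoding.
        String tidyOutput = "<div>\u0131 \u011F \u00E7 \u00F6 \u00FC \u015F</div>";
        // The lowercase Turkish test characters from the HTML above.
        String expected = "\u0131\u011F\u00E7\u00F6\u00FC\u015F";

        String missing = missingFrom(tidyOutput, expected);
        System.out.println(missing.isEmpty()
                ? "all characters survived"
                : "missing code points: " + missing.codePoints()
                        .mapToObj(cp -> String.format("U+%04X", cp))
                        .collect(java.util.stream.Collectors.joining(" ")));
    }
}
```

If nothing is reported as missing for the real Tidy output, the characters are presumably being lost at a later stage.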
Then I use iText to render the PDF into a ByteArrayOutputStream, which is saved as a PDF file on Google Drive.
public static ByteArrayOutputStream createPdfFromHtml(String htmlBody) throws DocumentException {
    ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
    try {
        Document document = new Document();//(PageSize.A4);
        PdfWriter writer = PdfWriter.getInstance(document, outputStream);
        document.open();
        log.info("convert the email message body (given as XHTML) to PDF stream");
        // CSS
        CSSResolver cssResolver = XMLWorkerHelper.getInstance().getDefaultCssResolver(true);
        // HTML
        HtmlPipelineContext htmlContext = new MySpecialImageProviderAwareHtmlPipelineContext();
        htmlContext.setTagFactory(Tags.getHtmlTagProcessorFactory());
        htmlContext.setImageProvider(new MyImageProvider());
        // Pipelines
        PdfWriterPipeline pdf = new PdfWriterPipeline(document, writer);
        HtmlPipeline html = new HtmlPipeline(htmlContext, pdf);
        CssResolverPipeline css = new CssResolverPipeline(cssResolver, html);
        // XML Worker
        XMLWorker worker = new XMLWorker(css, true);
        XMLParser p = new XMLParser(worker);
        p.parse(new ByteArrayInputStream(htmlBody.getBytes("UTF-8")), Charset.forName("UTF-8"));
        document.close();
    } catch (DocumentException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    } catch (Exception e) {
        e.printStackTrace();
    }
    return outputStream;
}
The output should look like this:
ůŮ
ščťááéúĚŇ
δζψφξλμπρεςασ
ΔΣΨΦΓΩΞΛΠΘ
ığçöşİĞÇÖÜŞ
ČĆŠŽĐ
But I get the following result:
ůŮ
ščťááéúĚŇ
δζψφξλμπρεςασ
ΔΣΨΦΓΩΞΛΠΘ
çöüÇÖÜ
ŠŽ
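Only the Turkish characters are not converted correctly! I don't know whether I should add a new font to iText, or whether the problem is that Tidy does not convert these characters correctly.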
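To help tell the two suspected causes apart, here is a small standard-library check. It rests on an assumption that is not confirmed anywhere in this post: that text which does not pick up the `arial` style is rendered with iText's default font in a Windows-1252 (WinAnsi) encoding. Under that assumption, any character that `windows-1252` cannot represent would be dropped from the PDF. The class name `Cp1252Check` and the method `notInCp1252` are hypothetical names used only for this sketch.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class Cp1252Check {

    // Returns the characters of `text` that windows-1252 cannot represent.
    static String notInCp1252(String text) {
        CharsetEncoder encoder = Charset.forName("windows-1252").newEncoder();
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (!encoder.canEncode(c)) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The Turkish test line from the HTML, written with Unicode escapes
        // so the check does not depend on the source file encoding.
        String turkish = "\u0131\u011F\u00E7\u00F6\u00FC\u015F"
                       + "\u0130\u011E\u00C7\u00D6\u00DC\u015E";
        // The last test line from the HTML (C-caron, C-acute, S-caron, Z-caron, D-stroke).
        String lastLine = "\u010C\u0106\u0160\u017D\u0110";

        System.out.println("not in windows-1252: " + notInCp1252(turkish));
        System.out.println("not in windows-1252: " + notInCp1252(lastLine));
    }
}
```

For the two test lines this reports ı ğ ş İ Ğ Ş and Č Ć Đ, which are exactly the characters missing from my PDF, while ç ö ü Ç Ö Ü and Š Ž are representable and do appear. That pattern is consistent with a font or encoding limitation on the iText side rather than with Tidy, although it does not prove it.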