Using iText

Time: 2018-02-08 18:53:26

Tags: html pdf itext tidy

I am using iText to convert HTML to PDF. The HTML can be in any language, so for testing I have the following HTML:

<html>
<head/>

<body>
  </div>
  <div dir="ltr">
    <br />
    <div style="color:rgb(34,34,34);font-family:arial,sans-serif;font-size:12.8px;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial">
      <div>
        ů&nbsp; Ů
      </div>
      <div>
        <br />
      </div>
      <div>
        š č ť ý á í é ú Ě Ň
      </div>
      <div>
        <br />
      </div>
      <div>
        δ ζ ψ φ ξ λ μ π ρ ε ς α σ&nbsp;
      </div>
      <div>
        <br />
      </div>
      <div>
        ΔΣΨΦΓΩΞΛΠΘ
      </div>
      <br />ı ğ ç ö ü ş İ Ğ Ç Ö Ü Ş
      <br />
      <br />Č Ć Š Ž Đ
      <br />
    </div>
  </div>
</body>

</html>

I use "Tidy" to clean up and process this HTML:

public String convertToXHTML1(String htmlText) throws FileNotFoundException {
    InputStream is = new ByteArrayInputStream(htmlText.getBytes());
    OutputStream os = new OutputStream() {
        private StringBuilder string = new StringBuilder();

        @Override
        public void write(int b) throws IOException {
            this.string.append((char) b);
        }

        public String toString() {
            return this.string.toString();
        }
    };
    Tidy tidy = new Tidy();
    tidy.setXHTML(true);

    tidy.setQuiet(false);
    tidy.setShowWarnings(true);
    tidy.setShowErrors(1);
    tidy.setMakeClean(true);
    tidy.setForceOutput(true);
    tidy.setNumEntities(false);
    tidy.setInputEncoding("utf8");
    tidy.setOutputEncoding("raw");
    //tidy.setRawOut(true);

    org.w3c.dom.Document doc = tidy.parseDOM(is, os);//in,out);
    String xhtmlText = os.toString();
    return xhtmlText;
}
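
One detail of the method above may be worth checking separately: the anonymous OutputStream appends each byte as its own char, and `htmlText.getBytes()` uses the platform default charset. The sketch below is not part of the original code; the class and method names are hypothetical, and it uses only the JDK. It compares per-byte casting with buffering the bytes and decoding them together as UTF-8, on the assumption that Tidy could at some point write multi-byte UTF-8 output.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class ByteDecodingSketch {

    // Roughly what the anonymous OutputStream does: every byte becomes one char.
    static String decodeByteByByte(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append((char) b);
        }
        return sb.toString();
    }

    // Alternative: collect all bytes first, then decode them together as UTF-8.
    static String decodeAsUtf8(byte[] bytes) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        buffer.write(bytes, 0, bytes.length);
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Dotless i, g with breve, s with cedilla, written as Unicode escapes.
        String text = "\u0131 \u011F \u015F";
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);

        System.out.println("buffered decode round-trips: " + decodeAsUtf8(utf8).equals(text));
        System.out.println("per-byte cast round-trips:   " + decodeByteByByte(utf8).equals(text));
    }
}
```

If Tidy only ever emits ASCII plus entities, the per-byte cast is harmless, so this is something to rule out rather than a confirmed cause.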

And the result is:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator"
content="HTML Tidy for Java (vers. 2009-12-01), see jtidy.sourceforge.net" />
<meta charset="utf-8" />
<title></title>
</head>
<body>
<div dir="ltr"><br />
 
<div
style="color:rgb(34,34,34);font-family:arial,sans-serif;font-size:12.8px;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial">
<div>&#367;&nbsp; &#366;</div>
<div><br />
</div>
<div>&scaron; &#269; &#357; &yacute; &aacute; &iacute; &eacute;
&uacute; &#282; &#327;</div>
<div><br />
</div>
<div>&delta; &zeta; &psi; &phi; &xi; &lambda; &mu; &pi; &rho;
&epsilon; &sigmaf; &alpha; &sigma;&nbsp;</div>
<div><br />
</div>
<div>
&Delta;&Sigma;&Psi;&Phi;&Gamma;&Omega;&Xi;&Lambda;&Pi;&Theta;</div>
</div>
<br />
&#305; &#287; &ccedil; &ouml; &uuml; &#351; &#304; &#286; &Ccedil;
&Ouml; &Uuml; &#350;<br />
<br />
&#268; &#262; &Scaron; &#381; &#272;<br />
</div>
</body>
</html>
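
To judge whether Tidy is at fault, one option is to decode the numeric references in this output and compare them with the characters in the input. The following is a minimal sketch with hypothetical names, using only the JDK; it handles decimal references such as `&#305;` and leaves named entities alone.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class NumericEntitySketch {

    private static final Pattern DECIMAL_ENTITY = Pattern.compile("&#(\\d+);");

    // Replaces decimal references such as &#305; with the character they denote.
    // Named entities (for example &ccedil;) are deliberately left untouched.
    static String decodeDecimalEntities(String input) {
        Matcher m = DECIMAL_ENTITY.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = Integer.parseInt(m.group(1));
            String ch = new String(Character.toChars(codePoint));
            m.appendReplacement(out, Matcher.quoteReplacement(ch));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // The decimal references from the Turkish line of the Tidy output.
        String decoded = decodeDecimalEntities("&#305; &#287; &#351; &#304; &#286; &#350;");
        // Print code points in hex so the result does not depend on the console encoding.
        decoded.codePoints()
               .filter(cp -> cp != ' ')
               .forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```

The references decode to U+0131, U+011F, U+015F, U+0130, U+011E and U+015E, which are the Turkish letters from the input, so the Tidy output shown here appears to carry the right characters.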

Then I use iText to build the PDF as a ByteArrayOutputStream, which is saved as a PDF file on Google Drive.

public static ByteArrayOutputStream createPdfFromHtml(String htmlBody) throws DocumentException {

    ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
    try {

        Document document = new Document();//(PageSize.A4);
        PdfWriter writer = PdfWriter.getInstance(document, outputStream);
        document.open();
        log.info("convert the email message body (given as XHTML) to PDF stream");

        // CSS
        CSSResolver cssResolver = XMLWorkerHelper.getInstance().getDefaultCssResolver(true);

        // HTML
        HtmlPipelineContext htmlContext = new MySpecialImageProviderAwareHtmlPipelineContext();

        htmlContext.setTagFactory(Tags.getHtmlTagProcessorFactory());
        htmlContext.setImageProvider(new MyImageProvider());

        // Pipelines
        PdfWriterPipeline pdf = new PdfWriterPipeline(document, writer);
        HtmlPipeline html = new HtmlPipeline(htmlContext, pdf);
        CssResolverPipeline css = new CssResolverPipeline(cssResolver, html);

        // XML Worker
        XMLWorker worker = new XMLWorker(css, true);
        XMLParser p = new XMLParser(worker);

        p.parse(new ByteArrayInputStream(htmlBody.getBytes("UTF-8")), Charset.forName("UTF-8"));

        document.close();

    } catch (DocumentException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }catch (Exception e){
        e.printStackTrace();
    }
    return  outputStream;
}

The output should look like this:

ůŮ

ščťááéúĚŇ

δζψφξλμπρεςασ

ΔΣΨΦΓΩΞΛΠΘ

ığçöşİĞÇÖÜŞ

ČĆŠŽĐ

But I get the following result:

ůŮ

ščťááéúĚŇ

δζψφξλμπρεςασ

ΔΣΨΦΓΩΞΛΠΘ

çöüÇÖÜ

ŠŽ

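Only the Turkish characters are not converted correctly! I don't know whether I should add a new font to iText, or whether the problem is that Tidy does not convert these characters correctly.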
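
One way to narrow this down is to check which of the affected characters exist in windows-1252 (WinAnsi). The sketch below uses hypothetical names and only the JDK. It rests on an assumption not confirmed by the post: that the text in question is drawn with a font limited to the WinAnsi encoding, as iText's default Helvetica is believed to be. In the Tidy output, the two affected lines come after the `</div>` that closes the `font-family:arial` block, which would be consistent with a different font being applied to them.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical helper class, for illustration only.
final class Cp1252CoverageSketch {

    // Keeps only the non-space characters that windows-1252 can encode.
    static String keepEncodable(String text) {
        CharsetEncoder encoder = Charset.forName("windows-1252").newEncoder();
        StringBuilder kept = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (c != ' ' && encoder.canEncode(c)) {
                kept.append(c);
            }
        }
        return kept.toString();
    }

    // Prints each character as a hex code point, independent of console encoding.
    static void printCodePoints(String label, String text) {
        StringBuilder line = new StringBuilder(label);
        text.codePoints().forEach(cp -> line.append(String.format(" U+%04X", cp)));
        System.out.println(line);
    }

    public static void main(String[] args) {
        // The two test lines from the input HTML, written as Unicode escapes.
        String turkish = "\u0131 \u011F \u00E7 \u00F6 \u00FC \u015F "
                       + "\u0130 \u011E \u00C7 \u00D6 \u00DC \u015E";
        String southSlavic = "\u010C \u0106 \u0160 \u017D \u0110";

        printCodePoints("Turkish line, kept:", keepEncodable(turkish));
        printCodePoints("Last line, kept:   ", keepEncodable(southSlavic));
    }
}
```

The characters that survive this filter are the same ones that appear in the PDF for these two lines (çöüÇÖÜ and ŠŽ). That matches a font or encoding limitation on the iText side, in which case a font covering these glyphs would be the thing to try, but it does not prove it.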

0 answers:

No answers