Cannot count the characters in a docx file generated from non-empty XHTML

Time: 2015-01-28 14:27:41

Tags: java jaxb xhtml docx docx4j

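I implemented an XHTML-to-DocX converter using DocX4J. It creates the DocX file without any problems.

To round off the task, I decided to implement a simple test. The test counts the characters in the generated DocX and compares that number with the known character count of the XHTML (see the source code below).

My test code is based on an example from the DocX4J website, but it does not work for me. Although I can see that the content of the DocX produced by my converter matches the content of the XHTML file, my test code always returns 0 as the character count of the DocX file. :-\

Can anyone help me find the cause of this unexpected result?

Thanks in advance!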

package main;

import java.io.File;
import java.io.IOException;
import java.io.StringWriter;

import org.docx4j.TextUtils;
import org.docx4j.jaxb.Context;
import org.docx4j.openpackaging.contenttype.ContentType;
import org.docx4j.openpackaging.exceptions.Docx4JException;
import org.docx4j.openpackaging.exceptions.InvalidFormatException;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
import org.docx4j.openpackaging.parts.PartName;
import org.docx4j.openpackaging.parts.WordprocessingML.AlternativeFormatInputPart;
import org.docx4j.openpackaging.parts.WordprocessingML.MainDocumentPart;
import org.docx4j.relationships.Relationship;
import org.docx4j.wml.CTAltChunk;
import org.docx4j.wml.Document;

/**
 * Counts the characters in a DocX file generated from XHTML using Docx4J
 * 
 * @author Cláudio
 */
public class CountChars {

    public static void main(String[] args) {
	String xhtml = "<html><body><table border=\"1\"><tr><td>Propriedade</td><td>Amostra 1</td><td>Amostra 2</td></tr><tr><td>Prop1</td><td>10.0</td><td>111.0</td></tr><tr><td>Prop2</td><td>20.0</td><td>222.0</td></tr></table></body></html>";
	int expectedNChars = 57;

	WordprocessingMLPackage docx = export(xhtml);
	try {
	    docx.save(new File("test.docx")); // Proves that docx is
		                              // successfully created
	} catch (Docx4JException e) {
	    // TODO Auto-generated catch block
	    e.printStackTrace();
	}

	if (countCharacters(docx) == expectedNChars) {
	    System.out.println("Success");
	} else {
	    System.out.println("Fail");
	}
    }

    private static WordprocessingMLPackage export(String xhtml) {
	WordprocessingMLPackage wordMLPackage = null;
	AlternativeFormatInputPart afiPart = null;
	Relationship altChunkRel = null;

	try {
	    wordMLPackage = WordprocessingMLPackage.createPackage();
	    afiPart = new AlternativeFormatInputPart(new PartName("/hw.html"));
	} catch (InvalidFormatException e) {
	    // TODO Auto-generated catch block
	    e.printStackTrace();
	}

	afiPart.setBinaryData(xhtml.getBytes());
	afiPart.setContentType(new ContentType("text/html"));

	try {
	    altChunkRel = wordMLPackage.getMainDocumentPart().addTargetPart(
		    afiPart);
	} catch (InvalidFormatException e) {
	    // TODO Auto-generated catch block
	    e.printStackTrace();
	}

	// .. the bit in document body
	CTAltChunk ac = Context.getWmlObjectFactory().createCTAltChunk();
	ac.setId(altChunkRel.getId());
	wordMLPackage.getMainDocumentPart().addObject(ac);

	// .. content type
	wordMLPackage.getContentTypeManager().addDefaultContentType("html",
	        "text/html");

	return wordMLPackage;
    }

    /**
     * Counts chars (even whitespaces) in a docx.
     * 
     * Reference:
     * http://www.docx4java.org/forums/docx-java-f6/how-to-count-number-of-characters-in-a-docx-file-t767.html
     * 
     * @param docx
     *            Document
     * 
     * @return Number of chars in the document
     */
    private static int countCharacters(WordprocessingMLPackage docx) {
	String strString = null;

	MainDocumentPart documentPart = docx.getMainDocumentPart();
	Document wmlDocument = documentPart.getJaxbElement();

	StringWriter strWriter = null;
	try {
	    strWriter = new StringWriter();
	    TextUtils.extractText(wmlDocument, strWriter);
	    strString = strWriter.toString();
	} catch (Exception e) {
	    // TODO Auto-generated catch block
	    e.printStackTrace();
	} finally {
	    if (strWriter != null) {
		try {
		    strWriter.close();
		} catch (IOException e) {
		    // TODO Auto-generated catch block
		    e.printStackTrace();
		}
	    }
	}

	if (strString == null) {
	    throw new NullPointerException();
	}

	return strString.length();
    }
}

2 answers:

Answer 0 (score: 1)

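You are adding the XHTML as an AlternativeFormatInputPart (AFIP), which generally leaves it to Word to convert the XHTML into real docx content when the file is opened.

Until that happens, the XHTML content is not in the MainDocumentPart (documentPart) but in the AFIP. The main document part only holds an altChunk reference to it, so extracting and counting the text of documentPart of course won't give you what you want...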
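To make this concrete, here is a minimal sketch that uses only the JDK (no docx4j). It counts the text held in `w:t` elements of two hand-written `document.xml` fragments: one containing just an `altChunk` reference, roughly what the question's export method would produce, and one containing a real text run. The fragments, the class name `AltChunkTextCount` and the relationship id `rId1` are illustrative assumptions, not output captured from docx4j.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class AltChunkTextCount {

    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    static final String NS_DECLS = "xmlns:w=\"" + W_NS + "\" xmlns:r=\""
            + "http://schemas.openxmlformats.org/officeDocument/2006/relationships\"";

    // Hypothetical body that only references the imported HTML part.
    static final String ALT_CHUNK_DOC = "<w:document " + NS_DECLS + "><w:body>"
            + "<w:altChunk r:id=\"rId1\"/></w:body></w:document>";

    // Hypothetical body in which the text is stored as a real run.
    static final String RUN_DOC = "<w:document " + NS_DECLS + "><w:body>"
            + "<w:p><w:r><w:t>Prop1</w:t></w:r></w:p></w:body></w:document>";

    /** Sums the length of the text in all w:t elements of the given XML. */
    static int countRunText(String documentXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document dom = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(documentXml)));
            NodeList texts = dom.getElementsByTagNameNS(W_NS, "t");
            int total = 0;
            for (int i = 0; i < texts.getLength(); i++) {
                total += texts.item(i).getTextContent().length();
            }
            return total;
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countRunText(ALT_CHUNK_DOC)); // prints 0
        System.out.println(countRunText(RUN_DOC));       // prints 5
    }
}
```

The first document has no text runs at all, which matches the count of 0 reported in the question: the table text only exists inside the separate HTML part.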

Answer 1 (score: 0)

With docx4j 2.8.1, a correct implementation of the export method would look like this:

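// Requires: import java.util.List;
//           import org.docx4j.convert.in.xhtml.XHTMLImporter;
private static WordprocessingMLPackage export(String xhtml) {
    WordprocessingMLPackage wordMLPackage = null;

    try {
        wordMLPackage = WordprocessingMLPackage.createPackage();
        // Convert the XHTML into real WordprocessingML objects ...
        List<Object> content = XHTMLImporter.convert(xhtml, null,
                wordMLPackage);
        // ... and add them to the main document part, where the text
        // can be extracted.
        wordMLPackage.getMainDocumentPart().getContent().addAll(content);
    } catch (Docx4JException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }

    return wordMLPackage;
}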
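For the other half of the test, the expected count does not have to be hard-coded as 57. The sketch below derives it from the XHTML with the JDK's DOM parser, assuming the input is well-formed XML without a DOCTYPE or named entities. The class name `ExpectedCharCount` and the method `textLength` are made up for illustration. Whether the text extracted by docx4j matches this number exactly depends on how paragraph and cell boundaries are written out, so treat it as a starting point.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class ExpectedCharCount {

    // The XHTML string used in the question.
    static final String SAMPLE_XHTML = "<html><body><table border=\"1\">"
            + "<tr><td>Propriedade</td><td>Amostra 1</td><td>Amostra 2</td></tr>"
            + "<tr><td>Prop1</td><td>10.0</td><td>111.0</td></tr>"
            + "<tr><td>Prop2</td><td>20.0</td><td>222.0</td></tr>"
            + "</table></body></html>";

    /** Returns the number of characters in all text nodes of the XHTML. */
    static int textLength(String xhtml) {
        try {
            Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            // getTextContent() concatenates every descendant text node,
            // including whitespace between tags if there is any.
            return dom.getDocumentElement().getTextContent().length();
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(textLength(SAMPLE_XHTML)); // prints 57
    }
}
```

In the question's main method, `expectedNChars` could then be computed from `xhtml` in this way instead of being typed in by hand.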