How do I use Tika's XWPFWordExtractorDecorator class?

Time: 2012-01-29 05:35:28

Tags: java apache-poi

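Someone told me that Tika's XWPFWordExtractorDecorator class is used to convert docx to HTML, but I'm not sure how to use this class to get HTML from a docx file. Are there any other libraries that can do the same job?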

1 answer:

Answer 0 (score: 4)

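You shouldn't use it directly.

Instead, call Tika in the usual way, and it will invoke the appropriate code for you.

If you want XHTML from parsing a file, the code looks something like: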

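    import java.io.File;
    import java.io.InputStream;
    import java.io.StringWriter;
    import javax.xml.transform.OutputKeys;
    import javax.xml.transform.sax.SAXTransformerFactory;
    import javax.xml.transform.sax.TransformerHandler;
    import javax.xml.transform.stream.StreamResult;
    import org.apache.tika.io.TikaInputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.Parser;

    // Either of these will work; the latter is recommended
    //InputStream input = new FileInputStream("test.docx");
    InputStream input = TikaInputStream.get(new File("test.docx"));

    // AutoDetectParser is normally best, unless you already know the right parser for the type
    Parser parser = new AutoDetectParser();

    // Handler that serializes the parser's SAX events as indented XHTML
    StringWriter sw = new StringWriter();
    SAXTransformerFactory factory = (SAXTransformerFactory)
             SAXTransformerFactory.newInstance();
    TransformerHandler handler = factory.newTransformerHandler();
    handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
    handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "yes");
    handler.setResult(new StreamResult(sw));

    // Call the Tika parser; the stream is closed whether or not parsing succeeds
    String xml;
    try {
        Metadata metadata = new Metadata();
        parser.parse(input, handler, metadata, new ParseContext());
        xml = sw.toString();  // the parsed document as XHTML
    } finally {
        input.close();
    }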
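
If the handler part looks opaque, it may help to see it working without Tika at all. The sketch below is only an illustration, using nothing but the JDK: it builds the same kind of `TransformerHandler` and feeds it a few hand-written SAX events, which is roughly the role the parser plays during `parse(...)`. The class name `SaxToStringDemo` and the `render` helper are made up for this example and are not part of Tika or POI.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

public class SaxToStringDemo {

    // Hypothetical helper: emits a tiny html/body/p document as SAX events
    // and returns whatever the TransformerHandler serialized.
    public static String render(String paragraphText)
            throws TransformerConfigurationException, SAXException {
        StringWriter sw = new StringWriter();
        SAXTransformerFactory factory =
                (SAXTransformerFactory) SAXTransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
        handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "yes");
        handler.setResult(new StreamResult(sw));

        AttributesImpl noAttrs = new AttributesImpl();
        char[] text = paragraphText.toCharArray();

        // These calls stand in for the events a parser would fire.
        handler.startDocument();
        handler.startElement("", "html", "html", noAttrs);
        handler.startElement("", "body", "body", noAttrs);
        handler.startElement("", "p", "p", noAttrs);
        handler.characters(text, 0, text.length);
        handler.endElement("", "p", "p");
        handler.endElement("", "body", "body");
        handler.endElement("", "html", "html");
        handler.endDocument();

        return sw.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(render("Hello from SAX"));
    }
}
```

Running `main` prints a small XML document containing the paragraph text. Special characters such as `<` are escaped by the serializer, so the string you get back is well-formed markup rather than raw text.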