根据需要,我尝试将doc或docx(Microsoft word)文件转换为html格式Apache tika
我最终得到以下代码,工作正常, 但它没有添加任何样式表来生成html。
import javax.xml.transform.OutputKeys;
import java.io.*;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.detect.DefaultDetector;
public class DocxConvert
{
public static void main(String []args)
{
InputStream input=null;
try
{
StringWriter sw = new StringWriter();
SAXTransformerFactory factory = (SAXTransformerFactory)
SAXTransformerFactory.newInstance();
TransformerHandler handler = factory.newTransformerHandler();
handler.getTransformer().setOutputProperty(OutputKeys.METHOD,"html");
handler.getTransformer().setOutputProperty(OutputKeys.INDENT,"yes");
handler.setResult(new StreamResult(sw));
input = new FileInputStream("f:\\file.doc");
DefaultDetector detector = new DefaultDetector();
Metadata metadata = new Metadata();
org.apache.tika.parser.Parser parser = new AutoDetectParser(detector);
parser.parse(input, handler, metadata, new ParseContext());
System.out.print(sw.toString());
}
catch (Exception ex)
{
ex.printStackTrace();
}
finally {
try {
input.close();
}
catch (IOException e)
{
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
}
有没有办法添加/生成样式表到输出?请帮助!
答案 0 :(得分:0)
你可以使用unoconv,它需要Openoffice或Libreoffice。从here下载,它提供doc,docx,xls等来从服务器命令行进行pdf转换。如果你想显示使用apache或apache tomcat嵌入pdf文件,我认为pdf.js是很好的解决方案。
答案 1 :(得分:0)
我使用了Tika的1.6版,这对我来说很好。这是我使用的pom依赖。
http://tika.apache.org/download.html
<dependencies>
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-core</artifactId>
<version>1.6</version>
</dependency>
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-parsers</artifactId>
<version>1.6</version>
</dependency>
</dependencies>