How do I convert .doc and .docx files with Apache Tika?

Asked: 2014-06-19 06:51:55

Tags: java apache docx apache-tika

I want to convert .doc and .docx files to plain text using the code below:

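    import java.io.ByteArrayOutputStream;
    import java.io.File;
    import java.io.InputStream;
    import java.net.URL;

    import org.apache.tika.detect.DefaultDetector;
    import org.apache.tika.detect.Detector;
    import org.apache.tika.io.TikaInputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.Parser;
    import org.apache.tika.sax.BodyContentHandler;
    import org.xml.sax.ContentHandler;

    public class DokumenExtractor {
        // The field declarations were not part of the posted snippet;
        // their types are inferred from how the fields are used.
        private ParseContext context;
        private Detector detector;
        private Parser parser;
        private ByteArrayOutputStream outputstream;
        private Metadata metadata;
        private InputStream input;
        private String PathFile;

        public DokumenExtractor(String filename) {
            context = new ParseContext();
            detector = new DefaultDetector();
            parser = new AutoDetectParser(detector);
            context.set(Parser.class, parser);
            outputstream = new ByteArrayOutputStream();
            metadata = new Metadata();

            try {
                process(filename);
            } catch (Exception e) {
                // Parsing failed; the output stream may be empty or incomplete.
                e.printStackTrace();
            }
        }

        public void process(String filename) throws Exception {
            URL url;
            File file = new File(filename);
            if (file.isFile()) {
                // Local file: remember its path and build a file: URL.
                url = file.toURI().toURL();
                this.PathFile = file.getPath();
            } else {
                // Otherwise treat the argument as a URL string.
                url = new URL(filename);
            }
            this.input = TikaInputStream.get(url, metadata);
            try {
                // The extracted body text is written to outputstream.
                ContentHandler handler = new BodyContentHandler(outputstream);
                parser.parse(input, handler, metadata, context);
            } finally {
                // Close the stream even if parsing throws.
                input.close();
            }
        }
    }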

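But the output contains text like this: PAGE * MERGEFORMAT 36. The extracted content is not clean. How can I remove this page-number formatting after getting the string from the document?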
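For illustration, one possible workaround is to post-process the extracted string with a regular expression. The sketch below is not a Tika feature. It assumes the residue always looks like the Word page-number field instruction (PAGE or NUMPAGES, an optional backslash, an asterisk, MERGEFORMAT) optionally followed by the rendered page number. The class name FieldCodeCleaner and the method stripFieldCodes are hypothetical names introduced here. Note that a number that legitimately follows the field code in the body text would also be removed.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Apache Tika: removes leftover Word
// page-number field codes from text that has already been extracted.
final class FieldCodeCleaner {

    // Matches e.g. "PAGE \* MERGEFORMAT 36" or "PAGE * MERGEFORMAT 36".
    // Assumption: the field result, if present, is a plain integer.
    private static final Pattern PAGE_FIELD = Pattern.compile(
            "\\b(?:PAGE|NUMPAGES)\\s+\\\\?\\*\\s*MERGEFORMAT(?:\\s+\\d+)?");

    // Runs of two or more spaces/tabs left behind after removal.
    private static final Pattern EXTRA_BLANKS = Pattern.compile("[ \\t]{2,}");

    private FieldCodeCleaner() {
    }

    static String stripFieldCodes(String text) {
        // Drop every field-code occurrence, then tidy the whitespace.
        String withoutFields = PAGE_FIELD.matcher(text).replaceAll("");
        return EXTRA_BLANKS.matcher(withoutFields).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String raw = "Hello PAGE \\* MERGEFORMAT 36 world";
        System.out.println(stripFieldCodes(raw));
    }
}
```

With the extractor from the question, this could be applied to the decoded buffer, for example stripFieldCodes(outputstream.toString("UTF-8")), assuming the text was written as UTF-8.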

0 answers:

No answers yet