I want to convert doc/docx files to plain text using the code below:
    public DokumenExtractor(String filename) {
        context = new ParseContext();
        detector = new DefaultDetector();
        parser = new AutoDetectParser(detector);
        // Register the parser so embedded documents are parsed as well
        context.set(Parser.class, parser);
        outputstream = new ByteArrayOutputStream();
        metadata = new Metadata();
        try {
            process(filename);
        } catch (Exception e) {
            // Extraction failed; report the error and leave the output buffer as it is
            e.printStackTrace();
        }
    }

    public void process(String filename) throws Exception {
        URL url;
        File file = new File(filename);
        if (file.isFile()) {
            // Local file: convert the path to a URL and remember the path
            url = file.toURI().toURL();
            this.PathFile = file.getPath();
        } else {
            // Otherwise treat the argument as a URL string
            url = new URL(filename);
        }
        this.input = TikaInputStream.get(url, metadata);
        ContentHandler handler = new BodyContentHandler(outputstream);
        try {
            parser.parse(input, handler, metadata, context);
        } finally {
            // Close the stream even if parsing throws
            input.close();
        }
    }
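But the output looks like this: PAGE * MERGEFORMAT 36. The extracted content is not clean. How can I remove page-formatting field codes like this after getting the string from the document?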
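
One possible post-processing step, sketched under assumptions: the `PAGE * MERGEFORMAT 36` fragment looks like a Word field instruction (`PAGE \* MERGEFORMAT`) followed by its rendered result (the page number), which usually comes from a header or footer. The class `FieldCodeCleaner` and the method `stripFieldCodes` below are hypothetical helpers, not part of Tika, and the pattern only covers `PAGE` and `NUMPAGES` fields. It would be applied to the string read back from `outputstream`.

```java
import java.util.regex.Pattern;

public class FieldCodeCleaner {
    // Matches a PAGE or NUMPAGES field instruction, with or without the
    // backslash before '*', plus any digits rendered right after it.
    // Assumption: the digits are the field result (the page number).
    private static final Pattern PAGE_FIELD = Pattern.compile(
            "\\b(?:PAGE|NUMPAGES)\\s*\\\\?\\*\\s*MERGEFORMAT[ \\t]*\\d*");

    // Collapses runs of spaces/tabs left behind after a field is removed.
    private static final Pattern EXTRA_SPACES = Pattern.compile("[ \\t]{2,}");

    // Hypothetical helper: removes the field codes and tidies the spacing.
    public static String stripFieldCodes(String text) {
        String withoutFields = PAGE_FIELD.matcher(text).replaceAll("");
        return EXTRA_SPACES.matcher(withoutFields).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String raw = "Chapter 1 PAGE \\* MERGEFORMAT 36 Introduction";
        System.out.println(stripFieldCodes(raw));
    }
}
```

Note that the pattern also drops any digits that directly follow the field, so a genuine number placed immediately after a page field would be lost. If the fields only appear in headers and footers, excluding those parts at parse time may be a cleaner option than a regex over the extracted text.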