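I'm trying to convert a .doc file to plain text by following the Apache Tika example, but I'm running into problems with apostrophes and bullet points. Is there a way to specify how these characters are handled?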
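    import java.io.IOException;
    import java.io.InputStream;
    import org.apache.tika.exception.TikaException;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.sax.BodyContentHandler;
    import org.xml.sax.SAXException;

    public String parseToPlainText() throws IOException, SAXException, TikaException {
        // Collects the text of the document body as plain text
        BodyContentHandler handler = new BodyContentHandler();
        AutoDetectParser parser = new AutoDetectParser();
        Metadata metadata = new Metadata();
        // try-with-resources closes the stream even if parsing fails
        try (InputStream stream = ContentHandlerExample.class.getResourceAsStream("test.doc")) {
            parser.parse(stream, handler, metadata);
            return handler.toString();
        }
    }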
Thanks!
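To make the question concrete: one possible workaround, sketched here under assumptions, is to post-process the string returned by `handler.toString()`. This assumes the troublesome characters are Word's typographic ones, such as the right single quotation mark (U+2019) and the bullet (U+2022), and that replacing them with ASCII is acceptable. `TextNormalizer` and `normalize` are hypothetical names, not part of Tika, and the code points that actually appear depend on the document, so the mappings may need adjusting. If the characters only look wrong when printed or written to a file, the cause may instead be the output encoding rather than the extraction.

```java
public class TextNormalizer {

    // Hypothetical helper (not a Tika API): maps common typographic
    // characters to plain ASCII equivalents.
    public static String normalize(String text) {
        return text
                .replace('\u2018', '\'')  // left single quotation mark
                .replace('\u2019', '\'')  // right single quotation mark (apostrophe)
                .replace('\u201C', '"')   // left double quotation mark
                .replace('\u201D', '"')   // right double quotation mark
                .replace('\u2022', '*');  // bullet
    }

    public static void main(String[] args) {
        // Stand-in for the text returned by handler.toString()
        String extracted = "It\u2019s \u2022 item";
        System.out.println(normalize(extracted)); // prints: It's * item
    }
}
```

In the method above, this would amount to returning `TextNormalizer.normalize(handler.toString())` instead of the raw string.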