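I am new to Apache Tika.
I have two PDF files that use different fonts, and Tika reads only one of them correctly. The file that uses the Shruti font is read correctly, but the file that uses the LMG-Rupen font is not. Does Tika support only specific fonts?
Below is my code snippet: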
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
FileInputStream inputstream = new FileInputStream(file);
ByteArrayOutputStream out = new ByteArrayOutputStream();
IOUtils.copy(inputstream, out);
byte[] textBytes = out.toByteArray();
ByteArrayInputStream stream = new ByteArrayInputStream(textBytes);
ParseContext pcontext = new ParseContext();
Parser parser = new AutoDetectParser();
parser.parse(stream, handler, metadata, pcontext);
LanguageDetector lDetector = new OptimaizeLangDetector().loadModels();
LanguageResult detect = lDetector.detect(handler.toString());
System.out.println("Language: " + detect); // Detected language is 'de', but the document language is 'gu'
System.out.println(handler.toString()); // With the Shruti font the content prints correctly, but with the LMG-RUPE font the output is wrong
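
A possible way to narrow this down, offered as a sketch rather than a confirmed fix: check whether the text returned by `handler.toString()` actually contains Gujarati Unicode characters. The assumption here, which the snippet above does not confirm, is that the second PDF stores its text as Latin code points that only look like Gujarati when drawn with that font. If so, a language detector would see Latin letters and could report something like 'de'. The class `GujaratiScriptCheck` and the method `gujaratiRatio` are hypothetical helper names and are not part of Tika; only the standard `java.lang.Character` API is used.

```java
// Hypothetical helper (not part of Tika): measures how much of a string
// is written in the Gujarati Unicode script.
final class GujaratiScriptCheck {

    /**
     * Returns the fraction of letters in the text whose Unicode script is
     * GUJARATI. Digits, punctuation, whitespace and combining marks are
     * ignored. Returns 0.0 when the text contains no letters at all.
     */
    public static double gujaratiRatio(String text) {
        long letters = text.codePoints()
                .filter(Character::isLetter)
                .count();
        if (letters == 0) {
            return 0.0;
        }
        long gujarati = text.codePoints()
                .filter(Character::isLetter)
                .filter(cp -> Character.UnicodeScript.of(cp)
                        == Character.UnicodeScript.GUJARATI)
                .count();
        return (double) gujarati / letters;
    }

    public static void main(String[] args) {
        // The Gujarati word for "Gujarati", written with Unicode escapes
        // so the source file does not depend on its encoding.
        String unicodeGujarati = "\u0A97\u0AC1\u0A9C\u0AB0\u0ABE\u0AA4\u0AC0";
        // Latin letters, standing in for text from a non-Unicode font
        // (an assumed illustration, not real output from the PDF).
        String latinText = "Hello";

        System.out.println("Unicode sample ratio: " + gujaratiRatio(unicodeGujarati));
        System.out.println("Latin sample ratio:   " + gujaratiRatio(latinText));
    }
}
```

In the snippet above, this would be called as `GujaratiScriptCheck.gujaratiRatio(handler.toString())` for each of the two files. A ratio near 1 for the Shruti file and near 0 for the other file would suggest that the extracted characters themselves are not Gujarati, which points at the encoding inside the PDF rather than at the language detector.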