Question

I am new to Apache Tika.

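I have two PDF files that use different fonts, and Tika does not read both of them properly. The file that uses the Shruti font is read correctly, but the file that uses the lmg-rupen font is not. Does Tika only support specific fonts when reading PDFs?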

Below is my code snippet:

BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata(); 
FileInputStream inputstream = new FileInputStream(file);

ByteArrayOutputStream out = new ByteArrayOutputStream(); 
IOUtils.copy(inputstream, out); 
byte[] textBytes = out.toByteArray(); 
ByteArrayInputStream stream = new ByteArrayInputStream(textBytes);
ParseContext pcontext = new ParseContext();
Parser parser = new AutoDetectParser();
parser.parse(stream, handler, metadata, pcontext);

LanguageDetector lDetector = new OptimaizeLangDetector().loadModels();
LanguageResult detect = lDetector.detect(handler.toString());
System.out.println("Language: " + detect); // Detected language is 'de', but the document language is 'gu'

System.out.println(handler.toString()); // With the Shruti font the content prints correctly, but with LMG-RUPE the output is wrong
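
To narrow the problem down, it may help to measure how much of the extracted text actually falls in the Gujarati Unicode block. The following is a minimal sketch, not part of the original code; the class name `ScriptCheck` and the method `gujaratiRatio` are hypothetical names chosen for illustration. If the ratio is close to zero for the lmg-rupen file, one possible explanation (an assumption, not confirmed here) is that the font in that PDF maps its glyphs to non-Gujarati code points, which would also be consistent with the detector reporting 'de'.

```java
import java.lang.Character.UnicodeBlock;

public class ScriptCheck {

    /**
     * Returns the fraction of letters in the text that belong to the
     * Gujarati Unicode block (U+0A80..U+0AFF). Combining vowel signs are
     * not counted, because Character.isLetter is false for them.
     * Returns 0.0 when the text contains no letters at all.
     */
    public static double gujaratiRatio(String text) {
        long letters = text.codePoints()
                .filter(Character::isLetter)
                .count();
        if (letters == 0) {
            return 0.0;
        }
        long gujarati = text.codePoints()
                .filter(Character::isLetter)
                .filter(cp -> UnicodeBlock.of(cp) == UnicodeBlock.GUJARATI)
                .count();
        return (double) gujarati / letters;
    }

    public static void main(String[] args) {
        // Hypothetical input: in the snippet above this would be handler.toString().
        String extracted = args.length > 0 ? args[0] : "ગુજરાતી";
        System.out.println("Gujarati letter ratio: " + gujaratiRatio(extracted));
    }
}
```

In the snippet above, this would be called as `ScriptCheck.gujaratiRatio(handler.toString())` for each of the two files, so the Shruti and lmg-rupen results can be compared directly.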

Tika cannot read text correctly from a PDF file

0 answers: