Tika cannot correctly read text from a PDF file

Time: 2016-07-30 16:56:53

Tags: java pdf lucene pdfbox apache-tika

I am new to Apache Tika.

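I have two PDF files that use different fonts, and Tika does not read both of them correctly. The file that uses the Shruti font is read correctly, but the file that uses the LMG-Rupen font is not. Does Tika only support reading specific fonts?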

Below is my code snippet:

BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata(); 
FileInputStream inputstream = new FileInputStream(file);

ByteArrayOutputStream out = new ByteArrayOutputStream(); 
IOUtils.copy(inputstream, out); 
byte[] textBytes = out.toByteArray(); 
ByteArrayInputStream stream = new ByteArrayInputStream(textBytes);
ParseContext pcontext = new ParseContext();
Parser parser = new AutoDetectParser();
parser.parse(stream, handler, metadata, pcontext);

LanguageDetector lDetector = new OptimaizeLangDetector().loadModels();
LanguageResult detect = lDetector.detect(handler.toString());
System.out.println("Language: " + detect); // Detected language is 'de', but the document language is 'gu'

System.out.println(handler.toString()); // With the Shruti font the content prints correctly, but with the LMG-RUPE font the output is wrong
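
One way to narrow the problem down is to check whether the extracted string contains Gujarati code points at all. If the LMG font is a legacy font that maps Gujarati glyphs onto Latin character codes (an assumption; nothing in the question confirms it), the extracted text would consist mostly of Latin letters, which would also be consistent with the detector reporting 'de'. The following is a minimal sketch using only the Java standard library. The class name `ScriptCheck` and the method `gujaratiRatio` are hypothetical names introduced for illustration and are not part of Tika.

```java
public class ScriptCheck {

    /**
     * Returns the fraction of letters in the text that belong to the
     * Gujarati Unicode script. Combining vowel signs, digits, punctuation
     * and whitespace are ignored because they are not letters.
     * Hypothetical helper, not a Tika API.
     */
    public static double gujaratiRatio(String text) {
        int letters = 0;
        int gujarati = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (!Character.isLetter(cp)) {
                continue; // skip marks, digits, punctuation, whitespace
            }
            letters++;
            if (Character.UnicodeScript.of(cp) == Character.UnicodeScript.GUJARATI) {
                gujarati++;
            }
        }
        return letters == 0 ? 0.0 : (double) gujarati / letters;
    }

    public static void main(String[] args) {
        // Pass the extracted text as the first argument; the fallback is the
        // word "Gujarati" written in Gujarati script, given as Unicode escapes.
        String extracted = args.length > 0
                ? args[0]
                : "\u0A97\u0AC1\u0A9C\u0AB0\u0ABE\u0AA4\u0AC0";
        System.out.println("Gujarati letter ratio: " + gujaratiRatio(extracted));
    }
}
```

In the snippet above, the check could be applied to `handler.toString()` for both files. A ratio near 1 for the Shruti file and near 0 for the LMG file would suggest that the difference lies in how the font encodes its characters rather than in the parsing code itself.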

0 answers:

No answers