Reading Bengali text from a PDF

Date: 2018-10-12 03:49:28

Tags: java python unicode pdfbox apache-tika

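I am trying to read Bengali text from a PDF using PDFBox (Java) and Tika (Python). However, the characters do not come out correctly; most of the text looks like this: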

িপতা িঠকানা

But it should look like this:

পিতা ঠিকানা

I am using the following code:

(Java pdfbox)

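 // Package names below are those of PDFBox 2.x
 import java.io.File;
 import java.io.PrintStream;
 import org.apache.pdfbox.pdmodel.PDDocument;
 import org.apache.pdfbox.text.PDFTextStripper;

 String fileName = "input.pdf";
 // try-with-resources closes the document once extraction is done
 try (PDDocument document = PDDocument.load(new File(fileName))) {
     PDFTextStripper stripper = new PDFTextStripper();
     stripper.setStartPage(2);  // start extracting at page 2
     String pdfText = stripper.getText(document);

     // Write as UTF-8 so the console encoding does not mangle the Bengali text
     PrintStream p = new PrintStream(System.out, false, "UTF-8");
     p.println(pdfText);
     p.flush();  // autoflush is off, so flush explicitly
 }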

(Python Tika)

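from tika import parser

# xmlContent=True makes Tika return XHTML markup instead of plain text
parsed = parser.from_file('input.pdf', xmlContent=True)
content = parsed["content"] or ""  # "content" is None when nothing is extracted
print(content)
with open('filename.txt', encoding='utf-8', mode='w') as out:
    out.write(content)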

0 answers:

No answers