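I am trying to read Bengali text from a PDF using PDFBox (Java) and Tika (Python). However, the characters are not extracted correctly; most of them come out like this: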
িপতা িঠকানা
But it should look like this:
পিতা ঠিকানা
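To make the difference concrete, here is a small sketch (separate from my extraction code) that dumps the code points of the first word in both forms. The helper name `describe` is only for illustration. As far as I can tell, the vowel sign ি (U+09BF) ends up before the consonant it belongs to instead of after it, so the two strings contain the same characters in a different order.

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in text]


# First word of each sample above, written with escapes to avoid copy/paste issues.
extracted = "\u09bf\u09aa\u09a4\u09be"  # what the extraction gives me
expected = "\u09aa\u09bf\u09a4\u09be"   # what it should be

for label, word in (("extracted", extracted), ("expected", expected)):
    print(label)
    for code, name in describe(word):
        print("  ", code, name)
```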
I am using the following code:
(Java, PDFBox)
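String fileName = "input.pdf";
// try-with-resources closes the document when extraction is done
try (PDDocument document = PDDocument.load(new File(fileName))) {
    PDFTextStripper stripper = new PDFTextStripper();
    stripper.setStartPage(2); // start extracting from page 2
    String pdfText = stripper.getText(document);
    // Print as UTF-8 so the console encoding does not mangle the Bengali text
    java.io.PrintStream p = new java.io.PrintStream(System.out, false, "UTF-8");
    p.println(pdfText);
    p.flush(); // auto-flush is off, so flush explicitly
}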
(Python Tika)
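import tika
from tika import parser

# xmlContent=True returns the content as XHTML rather than plain text
parsed = parser.from_file('input.pdf', xmlContent=True)
print(parsed["content"])

# Save the extracted content as UTF-8
with open('filename.txt', encoding='utf-8', mode='w') as file:
    file.write(parsed["content"])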