此代码返回大量\ 0 \ 0s,并从PDF中仅提取一些英语短语。不返回任何日文文本。
我正在使用Unicode编码,所以我不确定这里发生了什么。
StringBuilder text = new StringBuilder(2000);
string fullFileName = @"c:\my_japanaese_pdf.pdf";
PdfReader pdfReader = new iTextSharp.text.pdf.PdfReader(fullFileName);
for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
currentText = Encoding.Unicode.GetString(UnicodeEncoding.Convert(Encoding.Unicode, Encoding.Unicode, Encoding.Unicode.GetBytes(currentText)));
text.Append(currentText);
}
pdfReader.Close();
(Windows 7 x64,iTextSharp 5.0.2.0)
由于
赖安