Question

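We have been using the iTextSharp library in an SSIS process for several years to read some values from a set of PDF exam documents. Everything ran fine until this week, when calls to the PdfTextExtractor.GetTextFromPage method suddenly started returning an empty string. I'll include the code here: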

    // Read the data from the blob column where the PDF exists
    byte[] byteBuffer = Row.FileData.GetBlobData(0, (int)Row.FileData.Length);

    using (var pdfReader = new PdfReader(byteBuffer))
    {

        // Here is the important stuff
        var extractStrategy = new LocationTextExtractionStrategy();

        // This call will extract the page with the proper data on it depending on the exam type
        // 1-page exams = NBOME - need to read first page for exam result data
        // 2-page exams = NBME - need to read second page for exam result data
        // The next two statements utilize this construct.
        var vendor = pdfReader.NumberOfPages == 1 ? "NBOME" : "NBME";

        // *** THIS NEXT LINE GIVES THE EMPTY STRING
        var newText = PdfTextExtractor.GetTextFromPage(pdfReader, pdfReader.NumberOfPages == 1 ? 1 : 2, extractStrategy);

        var stringList = newText.Split(new string[] { "\r\n", "\n" }, StringSplitOptions.None);

        var fileParser = FileParseFactory.GetFileParse(stringList, vendor);

        // Populate our output variables
        Row.ParsedExamName = fileParser.GetExamName(stringList);
        Row.DateParsed = DateTime.Now;
        Row.ParsedId = fileParser.GetStudentId(stringList);
        Row.ParsedTestDate = fileParser.GetTestDate(stringList);
        Row.ParsedTestDateString = fileParser.GetTestDateAsString(stringList);
        Row.ParsedName = fileParser.GetStudentName(stringList);
        Row.ParsedTotalScore = fileParser.GetTestScore(stringList);
        Row.ParsedVendor = vendor;
    }

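By the way, this does not affect all PDFs. To explain further: we are reading exam files. One exam type (NBME) seems to read fine, but the other type (NBOME) does not. Before this week, however, NBOME files read correctly.

This makes me think the internal format of the PDF files themselves has changed.

One more piece of information: the pdfReader itself does have data (I can get the byte[] array of the data), but any call to extract text just comes back empty.

Sorry, I can't show any exam data or files; this information is sensitive.

Has anyone seen something like this? If so, are there any possible solutions?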

Answer 1

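Well, we found the answer. Originally, the user went to the NBOME website and downloaded the PDF exam result files to import into my parsing system. As I said, this had worked for some time. Recently (this week), however, the user stopped downloading the files and instead used the print function to print each PDF to a new PDF. The problem appeared when she did that.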

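The bottom line: printing a PDF to PDF appears to inject some characters or otherwise alter the file, so that reading it with iTextSharp does not fail but instead returns an empty string. She should go back to downloading the files directly.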
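Since the failure here was silent (no exception, just an empty string), a small guard before parsing can make the next occurrence easier to spot. The following is only a sketch, not part of our actual package: `ExtractionGuard`, `IsEmptyExtraction`, and `Describe` are hypothetical names, and the idea of reporting the PDF's producer string assumes that a print-to-PDF driver leaves its own value there, which I have not verified for these files.

```csharp
using System;

// Hypothetical helper: flags pages where text extraction produced nothing usable.
public static class ExtractionGuard
{
    // True when the extracted page text is null, empty, or only whitespace.
    public static bool IsEmptyExtraction(string pageText)
    {
        return string.IsNullOrWhiteSpace(pageText);
    }

    // Returns "OK" when text was extracted; otherwise a diagnostic message
    // that includes the producer string (if one was supplied by the caller).
    public static string Describe(string pageText, string producer)
    {
        if (!IsEmptyExtraction(pageText))
        {
            return "OK";
        }

        var source = string.IsNullOrEmpty(producer) ? "unknown producer" : producer;
        return "No text extracted; PDF producer: " + source;
    }
}
```

In the script above, this would be called right after `PdfTextExtractor.GetTextFromPage`, passing `newText` and whatever producer value can be read from the document's info dictionary, and the row could be redirected or logged instead of being handed to the file parser.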

Thanks to those who offered comments!

PdfTextExtractor.GetTextFromPage suddenly returns an empty string
