Question

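I am running into a problem counting the pages of a PDF in VB.NET. My code works and counts the pages of most PDFs, but for some PDFs it fails to get the count. Does the PDF need any particular setting for this to work?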

Here is the sample code I am currently using:

Dim SR As New StreamReader("C:\Users\lee_chun_yong\Desktop\New folder\abc.pdf")
Dim PDFData As String = SR.ReadToEnd
Dim StartIndex As Integer   'Starting index of the Pages Object
Dim EndIndex As Integer     'Ending index of the Pages Object
Dim CountIndex As Int16     'Starting index of "/Count"
Dim chars() As Char = {"/", ">"}
Dim tmp As String
Dim CountEndIndex As Int16  'Index of next "/" after "/Count"
Dim tmpIndex1, tmpIndex2 As Integer
Dim PageCount As Integer
Dim TypePagesIndex As Integer

Do
    'Get an Object of type 'Pages' from PDF file
    'It can be "/Type /Pages" or "/Type/Pages"
    tmpIndex1 = PDFData.IndexOf("/Type /Pages")
    tmpIndex2 = PDFData.IndexOf("/Type/Pages")
    'Different possibilities of 2 indices
    If tmpIndex1 > -1 And tmpIndex1 < tmpIndex2 Then
        TypePagesIndex = tmpIndex1
    ElseIf tmpIndex2 > -1 And tmpIndex2 < tmpIndex1 Then
        TypePagesIndex = tmpIndex2
    ElseIf tmpIndex1 = -1 And tmpIndex2 > -1 Then
        TypePagesIndex = tmpIndex2
    ElseIf tmpIndex2 = -1 And tmpIndex1 > -1 Then
        TypePagesIndex = tmpIndex1
    Else  'tmpIndex1 = -1 And tmpIndex2 = -1
        Exit Do
    End If

    tmp = PDFData.Substring(0, TypePagesIndex)
    StartIndex = tmp.LastIndexOf("<<")
    tmp = PDFData.Substring(TypePagesIndex)
    EndIndex = TypePagesIndex + tmp.IndexOf(">>") + 1
    tmp = PDFData.Substring(StartIndex, EndIndex - StartIndex + 1)
    'Now tmp="<< /Kids, /Count etc >>"
    'the pagecount is just after "/Count " in tmp
    CountIndex = tmp.IndexOf("/Count")
    CountIndex += 7  'Move index to the end of "/Count "

    tmp = tmp.Substring(CountIndex)
    'now tmp="Pagecount ....>>"
    'Pagecount is followed by a newline-like char and then "/" or ">>"
    CountEndIndex = tmp.IndexOfAny(chars)
    tmp = tmp.Substring(0, CountEndIndex) 'Get the PageCount
    If PageCount < Val(tmp) Then
        PageCount = Val(tmp)
    End If
    PDFData = PDFData.Substring(EndIndex + 1)
Loop

Answer 1

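Your code makes a number of assumptions that are not necessarily true:

- You expect the page tree nodes (especially the page tree root node) to be readable in clear text. This is not necessarily the case: these nodes can be stored in object streams, which in turn can be compressed. This can make you miss some or all page tree nodes.
- You expect /Type and /Pages in a page tree node either to follow each other directly or to be separated by a single space. This is not necessarily the case: there can be any kind and number of whitespace characters in between, and possibly even comments. Here again you can miss nodes.
- You expect the Count value to be an immediate integer; it might also be a reference to an indirect object containing that integer. In that case your code takes the object number as the page count.
- You assume that /Type /Pages can only occur in page tree nodes that are currently in use. This is wrong. This character sequence can also occur
  - in nodes no longer referenced from the page tree; when manipulating a PDF, some PDF processors do not remove old objects but merely stop referencing them. If they removed pages, your code still sees the former, higher count and therefore assumes the higher page count;
  - in private application data; PDF allows private application data to be inserted, and this data may contain dictionaries with /Type /Pages and a Count entry whose value has nothing to do with the actual page count;
  - in arbitrary PDF strings; a PDF that explains the structure of PDF files may contain /Type /Pages in its page content (which might be uncompressed) or in its metadata. In that case your code inspects a nearby dictionary that is not a page tree node but might still have a Count entry;
  - in embedded files; PDFs can contain embedded files, and if another PDF is embedded without further compression, your code treats the page tree nodes of the embedded PDF as page tree nodes of the outer PDF.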
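
Two of the points above can be reproduced with a few lines of VB.NET. The following is only a minimal sketch: the module and function names are made up for illustration, the two dictionaries are hand-written samples rather than excerpts from real files, and ordinal string comparison is used so the result does not depend on the current culture.

```vbnet
Imports System

Module NaiveSearchDemo
    ' Mirrors the test the question's code uses to locate a page tree node.
    Function NaiveFindsPagesNode(pdfText As String) As Boolean
        Return pdfText.IndexOf("/Type /Pages", StringComparison.Ordinal) > -1 OrElse
               pdfText.IndexOf("/Type/Pages", StringComparison.Ordinal) > -1
    End Function

    ' Mirrors how the question's code cuts out the text that follows "/Count ".
    Function NaiveCountText(dictionaryText As String) As String
        Dim countIndex As Integer = dictionaryText.IndexOf("/Count", StringComparison.Ordinal) + 7
        Dim rest As String = dictionaryText.Substring(countIndex)
        Dim endIndex As Integer = rest.IndexOfAny(New Char() {"/"c, ">"c})
        Return rest.Substring(0, endIndex).Trim()
    End Function

    Sub Main()
        ' Valid PDF syntax: a line break separates the key from its value.
        Dim lineBreakNode As String = "<< /Type" & vbLf & "/Pages /Kids [3 0 R] /Count 1 >>"
        Console.WriteLine(NaiveFindsPagesNode(lineBreakNode))   ' False: the node is missed

        ' Valid PDF syntax: the count lives in indirect object 12.
        Dim indirectCountNode As String = "<< /Type /Pages /Kids [3 0 R] /Count 12 0 R >>"
        Console.WriteLine(NaiveCountText(indirectCountNode))    ' 12 0 R: a reference, not a page count
    End Sub
End Module
```

The remaining points (compressed object streams, orphaned nodes, embedded files) cannot be fixed by a smarter text search, because the information needed to tell a live page tree node from a stale or foreign one is not in the matched text itself.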

There are surely more assumptions in your code, but the ones above are those that came to mind immediately.

I recommend using an existing PDF library to retrieve the page count.
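
As an illustration only, here is how this might look with iTextSharp 5.x. The answer does not name a library, so the choice of iTextSharp is an assumption, as are the module name, the function name, and the file path; other PDF libraries expose a comparable page count property.

```vbnet
Imports System
Imports iTextSharp.text.pdf   ' assumes the iTextSharp 5.x package is referenced

Module PageCountWithLibrary
    ' Returns the page count as determined by the library's own PDF parser.
    Function GetPageCount(path As String) As Integer
        Dim reader As New PdfReader(path)
        Try
            Return reader.NumberOfPages
        Finally
            reader.Close()   ' release the file even if reading fails
        End Try
    End Function

    Sub Main()
        ' Hypothetical path; replace it with a real file.
        Console.WriteLine(GetPageCount("C:\temp\sample.pdf"))
    End Sub
End Module
```

The library follows the cross-reference data and decompresses object streams itself, so the cases listed above that defeat a text search do not apply. Encrypted or damaged files may still raise an exception, which the caller should be prepared to handle.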

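If that is not possible, read the PDF the way it is meant to be read: read the trailer or the cross-reference stream dictionary to find the catalog, read the catalog to find the page tree root node, and read the Count of that root node. Use the cross-reference streams or tables to locate these objects. In other words, make sure you follow the specification, ISO 32000-1, instead of merely examining a few sample PDFs.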
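
The entry point of that procedure can be sketched as follows. This is only the first step, under stated assumptions: the module and function names are made up, the tail window of 1024 bytes is an arbitrary choice, and the sample bytes in Main are a hand-written fragment rather than a complete PDF. The function locates the last startxref keyword near the end of the file and returns the byte offset written after it, which is where the cross-reference table or cross-reference stream begins.

```vbnet
Imports System
Imports System.Text

Module StartXrefSketch
    ' Returns the byte offset that follows the last "startxref" keyword,
    ' or -1 if no such keyword with a number is found in the tail of the file.
    Function FindStartXref(pdfBytes As Byte()) As Long
        ' The keyword sits near the end of the file, so only the tail is examined.
        Dim tailLength As Integer = Math.Min(pdfBytes.Length, 1024)
        Dim tailStart As Integer = pdfBytes.Length - tailLength
        ' ISO-8859-1 maps every byte to exactly one character, so nothing is lost or merged.
        Dim tail As String = Encoding.GetEncoding("ISO-8859-1").GetString(pdfBytes, tailStart, tailLength)

        Dim keywordIndex As Integer = tail.LastIndexOf("startxref", StringComparison.Ordinal)
        If keywordIndex < 0 Then Return -1

        ' Skip the keyword and the whitespace after it, then collect the decimal digits.
        Dim pos As Integer = keywordIndex + "startxref".Length
        While pos < tail.Length AndAlso Char.IsWhiteSpace(tail(pos))
            pos += 1
        End While
        Dim digitsStart As Integer = pos
        While pos < tail.Length AndAlso Char.IsDigit(tail(pos))
            pos += 1
        End While
        If pos = digitsStart Then Return -1
        Return Long.Parse(tail.Substring(digitsStart, pos - digitsStart))
    End Function

    Sub Main()
        Dim sample As Byte() = Encoding.ASCII.GetBytes(
            "trailer" & vbLf & "<< /Root 1 0 R >>" & vbLf &
            "startxref" & vbLf & "416" & vbLf & "%%EOF" & vbLf)
        Console.WriteLine(FindStartXref(sample))   ' 416
    End Sub
End Module
```

From that offset a real reader still has to parse the cross-reference table or stream, resolve the /Root entry to the catalog, resolve the catalog's /Pages entry to the page tree root, and read its /Count, following indirect references and decompressing object streams along the way. None of that is shown here, which is a large part of why an existing library is the more practical route.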
