Can I detect the current page when generating a PDF from user-entered HTML with iTextSharp?

Date: 2013-11-26 17:33:49

Tags: vb.net itextsharp

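I am generating a very large PDF (300+ pages) from user-entered HTML. Thanks to some great samples out there, I have this working nicely. My next requirement is to generate a dynamic table of contents with internal links to the places in the PDF where the chapters begin. I have part of this working: I can create valid internal PDF links. The part I need help with is that the page numbers are unknown. I have tried creating the main PDF first and then looping through it to get the page numbers by searching for the text "Chapter 1", but given the size of the document and the number of chapters, that is far too slow.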

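Is it possible to detect the current page number while adding to the document? As I create the PDF from the HTML I know when I am at a new chapter, but is there a way to ask iTextSharp which page we are currently on, so that I can use that number in my table of contents? That way I could build the table of contents alongside the main document and then merge the two. Are there any better ideas out there?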

Here is how I generate the PDF from the user-entered HTML:

Dim document As New Document()

Dim strManualFile As String = "file.pdf"

PdfWriter.GetInstance(document, New FileStream(strManualFile, FileMode.Create, FileAccess.Write, FileShare.ReadWrite))
document.Open()

Dim htmlarraylistBody As List(Of IElement) = iTextSharp.text.html.simpleparser.HTMLWorker.ParseToList(New StringReader(GetManualHTML()), Nothing)
For l As Integer = 0 To htmlarraylistBody.Count - 1

   document.Add(DirectCast(htmlarraylistBody(l), IElement))

Next

document.Close()
document.Dispose()

1 Answer:

Answer 0 (score: 2)

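The first thing you should know is that PdfWriter.GetInstance() returns an object that you can query for the current page number. If you have control over the HTML, I would inject a flag variable that you can look for in your For loop. If the flag variable is found, record it instead of writing it; otherwise just add the content as normal.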

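Just a quick warning: HTMLWorker has been deprecated for a long time and is no longer maintained. All work is now being done in the XmlWorker library, which supports CSS. If you are staying on an older version because of the license change, you should probably read up on the myths and facts about the older license.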

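Below is a full working example that shows off the flag variable. At the top I create some sample HTML, which you would obviously remove and replace with your real HTML. Then I create a standard document and loop through each item just as you did. Inside the loop I check for the flag variable: if it is found I store the chapter number together with the current page number, otherwise I add the element just as you did.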

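This code targets iTextSharp 5.4.4. If you are using an older version of iTextSharp, the Using statements might not work; just convert them to Dim statements and remove the End Using lines (or upgrade to the latest version). See the code for additional comments.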

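''//Requires at the top of the file: Imports System.IO, Imports iTextSharp.text, Imports iTextSharp.text.pdf
''//File to write to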
Dim TestFile = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "Test.pdf")

''//Create a flag value to search for. We won't write this to the PDF, it is just for searching.
Dim FlagValue = "!!UNIQUE TEXT!!"

''//Build our sample HTML. The real version of this would get the HTML from another source ideally.
Dim sampleHTML = <body/>
For I As Integer = 1 To 10

    ''//Just before inserting our chapter headings we insert our flag value appended with the current chapter number.
    ''//NOTE: This might need to be played with a little bit. I'm not sure if a new page is created by the previous entity
    ''//      closing or the new entity starting.
    sampleHTML.Add(String.Format("{0}{1}", FlagValue, I))
    sampleHTML.Add(<h1><%= String.Format("Chapter {0}", I) %></h1>)

    ''//Add some paragraphs
    For J As Integer = 1 To 100
        sampleHTML.Add(<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit.
                          Suspendisse ac arcu porta, tempor justo eu, tincidunt eros.
                          Integer lorem dolor, pretium sit amet vehicula dapibus,
                          faucibus a tellus.</p>)
    Next
Next

''//This will be our collection of chapter numbers and the actual page numbers that they correspond to.
Dim PageNumbers As New Dictionary(Of String, Integer)

''//Standard PDF setup here, nothing special
Using fs As New FileStream(TestFile, FileMode.Create, FileAccess.Write, FileShare.None)
    Using doc As New Document()
        Using writer = PdfWriter.GetInstance(doc, fs)
            doc.Open()

            ''//Parse our HTML
            Dim htmlarraylistBody = iTextSharp.text.html.simpleparser.HTMLWorker.ParseToList(New StringReader(sampleHTML.ToString()), Nothing)

            ''//Loop through each item
            For Each Elem In htmlarraylistBody

                ''//Some HTML elements freak the system out so you should check if they are content first.
                If Elem.IsContent() Then

                    ''//If the current element is a paragraph and start with our flag value
                    If (TypeOf Elem Is Paragraph) AndAlso DirectCast(Elem, Paragraph).Content.StartsWith(FlagValue) Then

                        ''//Add that to our master collection but DO NOT write it to the PDF
                        PageNumbers.Add(DirectCast(Elem, Paragraph).Content.Replace(FlagValue, ""), writer.PageNumber)
                    Else

                        ''//Otherwise just write to the PDF normally
                        doc.Add(Elem)
                    End If
                End If
            Next

            doc.Close()
        End Using
    End Using
End Using
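
Once the loop has filled PageNumbers, the remaining step from the question is turning that map into table-of-contents entries. The following is only a rough sketch of that step, not part of the original answer: the names TocSketch, BuildTocLines and tocPageCount are hypothetical, the sample values are made up, and it assumes the table of contents is merged in front of the main document, so every recorded page number moves later by the number of pages the table of contents itself takes up. It uses only the standard library and produces plain strings; rendering them as linked PDF content is left out.

```vbnet
Imports System
Imports System.Collections.Generic

Module TocSketch

    ' Hypothetical helper: converts the chapter-to-page map into printable TOC lines.
    ' tocPageCount is an assumed value: the number of pages the table of contents
    ' occupies when it is placed in front of the main document.
    Function BuildTocLines(pageNumbers As Dictionary(Of String, Integer),
                           tocPageCount As Integer) As List(Of String)
        Dim lines As New List(Of String)
        For Each entry As KeyValuePair(Of String, Integer) In pageNumbers
            ' Pages in the main document shift later by the length of the prepended TOC.
            Dim finalPage As Integer = entry.Value + tocPageCount
            lines.Add(String.Format("Chapter {0} ... page {1}", entry.Key, finalPage))
        Next
        Return lines
    End Function

    Sub Main()
        ' Sample values only; the real ones come from the PageNumbers dictionary
        ' filled while adding elements to the document.
        Dim pageNumbers As New Dictionary(Of String, Integer)
        pageNumbers.Add("1", 1)
        pageNumbers.Add("2", 12)

        For Each tocLine As String In BuildTocLines(pageNumbers, 2)
            Console.WriteLine(tocLine)
        Next
    End Sub

End Module
```

Dictionary(Of TKey, TValue) does not formally guarantee enumeration order, so if the order of the entries matters it may be safer to collect the chapters in a List instead. If the table of contents is appended after the main document rather than placed in front of it, the offset is simply zero.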