Question

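I have been using pdftotext.exe to extract text from PDF files. The accuracy of the extracted text is good, but I cannot tell which parts of the text are bold or italic. How can I identify whether the extracted text is bold or italic?

I have tried other tools such as CSWTestingReflow, PDF parser, and so on, but I went with pdftotext.exe because its text accuracy is better.

Any ideas would be appreciated.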

Code:

objdos.ExecuteCommand """" & App.Path & "\pdftotext.exe" & """" & " -layout " & """" & sReadPDF & "_Text.pdf" & """"
''objdos.ExecuteCommand """" & App.Path & "\pdftotext.exe" & """" & " " & """" & sReadPDF & "_Text.pdf" & """"
If fso.FileExists(sReadPDF & "_Text.txt") = True Then
    'Read the text file
    Set adoStreamOut = New ADODB.Stream
    'adoStreamOut.Charset = "utf-8"
    adoStreamOut.Charset = "us-ascii"
    If adoStreamOut.State Then adoStreamOut.Close
    adoStreamOut.Open
    adoStreamOut.LoadFromFile Replace(sReadPDF, ".pdf", "") & "_Text.txt"
    sText = adoStreamOut.ReadText
End If

DoEvents
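sText = Trim(sText)
'Remove form feeds (page breaks written by pdftotext)
sText = Trim(Replace(sText, Chr(12), ""))
'Mark line breaks to keep (after "." or "?") and line ends that finish with hyphens
sText = Trim(Replace(sText, "." & vbCrLf, ".|||"))
sText = Trim(Replace(sText, "?" & vbCrLf, "?|||"))
sText = Trim(Replace(sText, "--" & vbCrLf, "||||||"))
sText = Trim(Replace(sText, "-" & vbCrLf, "-|||"))
'Join all remaining lines with a space
sText = Trim(Replace(sText, vbCrLf, " "))
'Restore the marked line breaks and rejoin words split by a single hyphen
sText = Trim(Replace(sText, ".|||", "." & vbCrLf))
sText = Trim(Replace(sText, "?|||", "?" & vbCrLf))
sText = Trim(Replace(sText, "-|||", ""))
sText = Trim(Replace(sText, "||||||", "--"))
'Convert double hyphens to an em dash
sText = Trim(Replace(sText, "--", "—"))
'Collapse runs of spaces into a single space
Do
    sText = Trim(Replace(sText, "  ", " "))
Loop Until InStr(sText, "  ") = 0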

How to extract text with pdftotext.exe and identify bold and italic text

0 answers: