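I have been using pdftotext.exe to extract text from PDFs. The accuracy of the extracted text is good, but I cannot tell which parts of it are bold or italic. How can I identify whether the extracted text is bold or italic?
I have tried some other tools, such as CSWTestingReflow, PDF parser, etc., but I use pdftotext.exe because its text accuracy is better.
Any ideas would be appreciated.
Code: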
objdos.ExecuteCommand """" & App.Path & "\pdftotext.exe" & """" & " -layout " & """" & sReadPDF & "_Text.pdf" & """"
''objdos.ExecuteCommand """" & App.Path & "\pdftotext.exe" & """" & " " & """" & sReadPDF & "_Text.pdf" & """"
If fso.FileExists(sReadPDF & "_Text.txt") Then
    'Read the text file written by pdftotext
    Set adoStreamOut = New ADODB.Stream
    'adoStreamOut.Charset = "utf-8"
    adoStreamOut.Charset = "us-ascii"
    If adoStreamOut.State Then adoStreamOut.Close
    adoStreamOut.Open
    'Load the same path that was checked above
    adoStreamOut.LoadFromFile sReadPDF & "_Text.txt"
    sText = adoStreamOut.ReadText
    adoStreamOut.Close
End If
DoEvents
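sText = Trim(sText)
'Drop form feeds (page breaks in the pdftotext output)
sText = Trim(Replace(sText, Chr(12), ""))
'Mark line breaks that follow a period, question mark, or dash with placeholders
sText = Trim(Replace(sText, "." & vbCrLf, ".|||"))
sText = Trim(Replace(sText, "?" & vbCrLf, "?|||"))
sText = Trim(Replace(sText, "--" & vbCrLf, "||||||"))
sText = Trim(Replace(sText, "-" & vbCrLf, "-|||"))
'Turn every remaining line break into a space
sText = Trim(Replace(sText, vbCrLf, " "))
'Restore sentence-ending breaks, rejoin hyphenated words, restore double dashes
sText = Trim(Replace(sText, ".|||", "." & vbCrLf))
sText = Trim(Replace(sText, "?|||", "?" & vbCrLf))
sText = Trim(Replace(sText, "-|||", ""))
sText = Trim(Replace(sText, "||||||", "--"))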
sText = Trim(Replace(sText, "--", "—"))
'Collapse runs of spaces into a single space
Do
    sText = Trim(Replace(sText, "  ", " "))
Loop Until InStr(sText, "  ") = 0