Question

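I am extracting text from a PDF and saving it as a .csv file. The image below shows the text I want to extract from the PDF:

Currently I can extract the text, but I cannot get rid of the numbers that represent page numbers and index numbers (that is, the 1, 5, 1.1, 5, 1.2 and so on at the beginning and end of each line of text). Below is my working code (I am using Python 3.5):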

CREATE PROCEDURE sp_add_nutri
       -- Add the parameters for the stored procedure here
       @Nutri_Id int,
       @Nutri_Name varchar(50),
       @uom char(10)

AS
BEGIN
       -- SET NOCOUNT ON added to prevent extra result sets from
       -- interfering with SELECT statements.
       SET NOCOUNT ON;

    -- Insert statements for procedure here
       INSERT INTO nutrient
              (Nutri_Id,Nutri_Name,uom)
       VALUES
              (@Nutri_Id, @Nutri_Name, @uom)
END

Thanks in advance for your help.

Answer 1

Section 2.4 of the pdfminer documentation explains how to do this.

For the record, I will copy and paste the relevant code here.

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument

# Password for the PDF; leave it empty if the file is not encrypted.
password = ''

# Open a PDF document.
with open('mypdf.pdf', 'rb') as fp:
    parser = PDFParser(fp)
    document = PDFDocument(parser, password)
    # Get the outlines of the document.
    outlines = document.get_outlines()
    for (level, title, dest, a, se) in outlines:
        # Drop the first space-separated token (the index number, e.g. "1.1").
        print(' '.join(title.split(' ')[1:]))

Adjust the print statement as needed to answer the question.
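One possible adjustment, as a rough sketch: instead of printing, collect the titles, strip the leading index number with a regular expression, and write one title per row to a CSV file. The names `strip_index` and `write_titles_csv` are hypothetical helpers, not part of pdfminer, and the pattern assumes that an index number consists only of digits and dots followed by whitespace.

```python
import csv
import re

# ASSUMPTION: a leading index looks like "1", "1.1" or "5.2.3" followed by
# whitespace. Titles without such a prefix are left unchanged.
_LEADING_INDEX = re.compile(r'^\d+(?:\.\d+)*\.?\s+')


def strip_index(title):
    """Return the title without its leading index number, if it has one."""
    return _LEADING_INDEX.sub('', title.strip())


def write_titles_csv(titles, path):
    """Write one cleaned title per row to a CSV file (hypothetical helper)."""
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        for title in titles:
            writer.writerow([strip_index(title)])
```

In the loop over the outlines, each `title` could be appended to a list, and that list passed to `write_titles_csv(titles, 'toc.csv')` once the loop has finished. Unlike dropping the first space-separated token, the regular expression leaves a title untouched when it has no index number in front of it.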

Answer 2

You can extract the table of contents with mutool:

mutool show your.pdf outline > toc.txt

Then convert the contents of the text file to a CSV file.
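The conversion step could look roughly like the sketch below. The layout of the outline output differs between mutool versions, so inspect toc.txt first. This sketch assumes each line has the form "tabs for nesting, title, TAB, destination"; that layout is an assumption, and `outline_to_rows` and `toc_to_csv` are hypothetical names.

```python
import csv
import re


def outline_to_rows(lines):
    """Yield (level, title) pairs from outline lines.

    ASSUMPTION: each line is "<tabs for nesting><title><TAB><destination>".
    Adjust the parsing if your mutool version prints a different layout.
    """
    for line in lines:
        line = line.rstrip('\n')
        if not line.strip():
            continue  # skip blank lines
        body = line.lstrip('\t')
        level = len(line) - len(body)  # number of leading tabs
        title = body.split('\t')[0].strip()  # discard the destination field
        # Remove a leading index number such as "1" or "1.2".
        title = re.sub(r'^\d+(?:\.\d+)*\.?\s+', '', title)
        yield level, title


def toc_to_csv(txt_path, csv_path):
    """Read the outline text file and write (level, title) rows as CSV."""
    with open(txt_path, encoding='utf-8') as src, \
         open(csv_path, 'w', newline='', encoding='utf-8') as dst:
        csv.writer(dst).writerows(outline_to_rows(src))
```

Calling `toc_to_csv('toc.txt', 'toc.csv')` would then produce one row per outline entry, with the nesting level in the first column and the cleaned title in the second.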

I learned about mutool from this answer: Extract toc from pdf by mutool

Extract text (table of contents) from a PDF, ignoring page numbers and index numbers
