Question

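I have a procedure that I am looking to automate, which involves getting a series of tables from a PDF file. Currently I can do it by opening the file in any viewer (Adobe, Sumatra, okular, etc.) and pressing Ctrl+A, Ctrl+C, Ctrl+V into Notepad. The pasted text keeps each line aligned in a reasonable enough format that I can run a regex on it and then copy and paste the result into Excel for whatever is needed afterwards.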
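For readers unfamiliar with the manual workflow, the regex step might look roughly like the sketch below. The sample text, the function name `split_aligned_lines` and the two-space threshold are assumptions made for illustration, not something taken from the actual files.

```python
import re

# Hypothetical sample of text pasted from a PDF viewer; columns are
# separated by runs of spaces. The data is made up for illustration.
pasted = """\
Item        Qty    Price
Widget A      3     4.50
Gadget B     12    10.00
"""

def split_aligned_lines(text, min_gap=2):
    """Split each non-empty line into cells at runs of >= min_gap spaces."""
    separator = re.compile(r" {%d,}" % min_gap)
    return [separator.split(line.strip())
            for line in text.splitlines() if line.strip()]

rows = split_aligned_lines(pasted)
for row in rows:
    print(row)
```

A single space inside a cell (as in "Widget A") is kept, because only runs of two or more spaces are treated as column separators. Tables whose cells are separated by a single space would need a different rule.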

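When trying to do this with Python I tried various modules, PDFMiner being the main one that worked, using this example for instance. But it returns the data in a single column. Other options include getting it as an html table, but in that case it adds extra splits in the middle of the table, which makes the parsing more complex, and it occasionally even switches columns between the first and second pages.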

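I have a temporary solution for now, but I am worried that I am reinventing the wheel when I am probably just missing a core option in the parser, or some fundamental option of the PDF renderer that I need to consider to solve this.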

Any ideas on how to approach this?

Answer 1

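I ended up implementing a solution based on this one, which was itself modified from code by tgray. It has worked consistently in all the cases I have tested so far, but I have not yet worked out how to manipulate pdfminer's parameters directly to get the desired behaviour.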

Answer 2

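Posting this just to have a piece of code that works with py35 for csv-like parsing. The splitting into columns is the simplest possible, but it worked for me.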

Credit goes to this answer, which I used as a starting point.

I also put in openpyxl, since I prefer to have the results directly in Excel.

# works with py35 & pip-installed pdfminer.six in 2017
def pdf_to_csv(filename):
    from io import StringIO
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams, LTChar  # LTChar is defined in pdfminer.layout
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter

    class CsvConverter(TextConverter):
        def __init__(self, *args, **kwargs):
            TextConverter.__init__(self, *args, **kwargs)

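        def end_page(self, page):
            # this override replaces the base-class end_page, so layout analysis
            # is not run and the page still holds individual LTChar objects
            from collections import defaultdict
            lines = defaultdict(dict)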
            for child in self.cur_item._objs:
                if isinstance(child, LTChar):
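                    # bbox is (x0, y0, x1, y1); use the upper-right corner
                    _, _, x, y = child.bbox
                    # negate y so that an ascending sort runs from the top of the page down
                    line = lines[int(-y)]
                    line[x] = child.get_text()
                    # the line is now an unsorted dict keyed by x position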

            for y in sorted(lines.keys()):
                line = lines[y]
                # combine close letters to form columns
                xpos = tuple(sorted(line.keys()))
                new_line = []
                temp_text = ''
                for i in range(len(xpos)-1):
                    temp_text += line[xpos[i]]
                    if xpos[i+1] - xpos[i] > 8:
                        # the 8 is representing font-width
                        # needs adjustment for your specific pdf
                        new_line.append(temp_text)
                        temp_text = ''
                # adding the last column which also manually needs the last letter
                new_line.append(temp_text+line[xpos[-1]])

                self.outfp.write(";".join(new_line))
                self.outfp.write("\n")

    # ... the following part of the code is a remix of the 
    # convert() function in the pdfminer/tools/pdf2text module
    rsrc = PDFResourceManager()
    outfp = StringIO()
    device = CsvConverter(rsrc, outfp, codec="utf-8", laparams=LAParams())

    fp = open(filename, 'rb')
    parser = PDFParser(fp)
    doc = PDFDocument(parser)
    password = ""
    maxpages = 0     # 0 means no page limit
    caching = True
    pagenos = set()  # an empty set means all pages

    interpreter = PDFPageInterpreter(rsrc, device)

    for i, page in enumerate(PDFPage.get_pages(fp,
                                pagenos, maxpages=maxpages,
                                password=password,caching=caching,
                                check_extractable=True)):
        outfp.write("START PAGE %d\n" % i)
        if page is not None:
            interpreter.process_page(page)
        outfp.write("END PAGE %d\n" % i)

    device.close()
    fp.close()

    return outfp.getvalue()

fn = 'your_file.pdf'
result = pdf_to_csv(fn)

import openpyxl as pxl

# splitlines() avoids the empty trailing entry that split('\n') would produce
lines = result.splitlines()
wb = pxl.Workbook()
ws = wb.active
for line in lines:
    ws.append(line.split(';'))
    # appending a list gives a complete row in xlsx
wb.save('your_file.xlsx')
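
To try out the column-splitting rule without a PDF at hand, the grouping logic from `end_page` can be pulled into a standalone function. This is only a sketch under assumptions: the function name `chars_to_rows`, the `(x, y, text)` tuple format and the sample coordinates are made up for illustration, and the default gap of 8 is the same font-width guess as above, so it needs adjusting per document.

```python
from collections import defaultdict

def chars_to_rows(chars, gap=8.0):
    """Group (x, y, text) glyphs into rows by integer y, then start a new
    column wherever the horizontal gap between neighbours exceeds `gap`."""
    lines = defaultdict(dict)
    for x, y, text in chars:
        # Negate y so that an ascending sort goes from the top of the page
        # down (PDF coordinates grow upwards).
        lines[int(-y)][x] = text

    rows = []
    for key in sorted(lines):
        line = lines[key]
        xs = sorted(line)
        cells, current = [], line[xs[0]]
        for prev, cur in zip(xs, xs[1:]):
            if cur - prev > gap:
                # Gap wider than the threshold: close the current column.
                cells.append(current)
                current = ""
            current += line[cur]
        cells.append(current)
        rows.append(cells)
    return rows

# Made-up glyph positions: two rows with two columns each.
sample = [
    (10, 700.2, "A"), (16, 700.2, "b"), (60, 700.2, "1"), (66, 700.2, "2"),
    (10, 688.0, "C"), (60, 688.0, "3"),
]
rows = chars_to_rows(sample)
for row in rows:
    print(";".join(row))
```

Because rows are keyed by `int(-y)`, glyphs whose baselines differ by a fraction of a point can still land in different rows. Rounding to a coarser grid is one possible adjustment if that happens.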

Getting data from a PDF file with the same layout as copy + paste
