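Extracting text from a PDF file as a table

Time: 2019-06-26 12:42:08

Tags: python json python-3.x python-3.7 pdftotext

I am trying to build my own "table extraction" feature for PDF files: "table-like" columns can be defined across the top of a PDF document, and the text is then extracted in a tabular format.

In the example below I have defined column boundaries at roughly 30% and 60% of the page width: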

[Image: Table Columns]

The columns are supplied like this:

{"1":{"position":"33"},"2":{"position":"60"}}
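To make the meaning of this structure concrete, here is a minimal sketch of how the percentages could be turned into crop boxes. It assumes each "position" is the right edge of a column as a percentage of the page width, and that the area after the last boundary forms one extra column. The function name column_areas and the page size used in the example are hypothetical, chosen only for illustration.

```python
import json


def column_areas(columns_json, page_width, page_height):
    """Return one (x, y, W, H) crop box per column, including the last one."""
    columns = json.loads(columns_json)
    # Right edges in points, ordered by the numeric column keys "1", "2", ...
    edges = [page_width * float(columns[key]["position"]) / 100
             for key in sorted(columns, key=int)]
    # Assumption: the final column runs to the right edge of the page.
    edges.append(page_width)

    areas = []
    left = 0
    for right in edges:
        areas.append((round(left), 0, round(right - left), round(page_height)))
        left = right
    return areas


# Hypothetical 600 x 800 point page.
spec = '{"1":{"position":"33"},"2":{"position":"60"}}'
for area in column_areas(spec, 600, 800):
    print(area)
```

Each tuple maps onto the -x, -y, -W and -H arguments of pdftotext, where -W and -H are the width and height of the crop area rather than its right and bottom coordinates.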

Below is my Python function, which reads the PDF file, splits it into the individual columns, and extracts the text.

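import json
import os
import subprocess
from collections import defaultdict

from PyPDF2 import PdfFileReader  # legacy PyPDF2 (< 3.0) API

# Assumption: path to the Ghostscript executable; adjust for your system.
GHOSTSCRIPT = 'gs'


def convertPDFTextToTableData(pdf_file, save_dir, COLUMNS):

    # Get the width/height of the first page from its media box.
    with open(pdf_file, 'rb') as fh:
        media_box = PdfFileReader(fh).getPage(0).mediaBox
        width = float(media_box[2]) - float(media_box[0])
        height = float(media_box[3]) - float(media_box[1])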

    # Use ghostscript to get the number of pages
    os.system(GHOSTSCRIPT + ' -dBATCH -q -dNODISPLAY -c "("' +
              pdf_file + '") (r) file runpdfbegin pdfpagecount = quit" >tmp.tmp')

    with open("tmp.tmp", "r") as f:
        npages = int(f.read())
    os.remove("tmp.tmp")

    column = defaultdict(list)

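    # Left edge (in points) of the column currently being extracted.
    firstWidth = 0
    for i in range(len(COLUMNS)):

        # 'position' is the column's right edge as a percentage of the page
        # width. pdftotext's -W option takes the width of the crop area, so
        # the left edge is subtracted to turn the right edge into a width.
        pixelsrightcorner = round(
            width*(float(COLUMNS[str(i + 1)]['position']) / 100)-firstWidth)
        area = (firstWidth, 0, pixelsrightcorner, int(height))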

        cmd = ['pdftotext', '-f', str(1), '-l', str(npages), '-x', str(area[0]), '-y', str(area[1]),
               '-W', str(area[2]), '-H', str(area[3]), str(pdf_file), '-layout', '-']

        proc = subprocess.Popen(
            cmd, stdout=subprocess.PIPE, bufsize=0, text=True)
        out, err = proc.communicate()

        for line in out.splitlines():
            column[i + 1].append({"row": line})

        firstWidth = round(pixelsrightcorner + firstWidth)

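    # Last column (rest of the page): starts at the right edge of the last
    # defined column and spans the remaining share of the page width.
    lastColumn = len(COLUMNS)
    pixelsrightcorner = (
        (100 - float(COLUMNS[str(lastColumn)]['position']))/100)*width

    area = (firstWidth, 0, int(pixelsrightcorner), int(height))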

    cmd = ['pdftotext', '-f', str(1), '-l', str(npages), '-x', str(area[0]), '-y', str(area[1]),
           '-W', str(area[2]), '-H', str(area[3]), str(pdf_file), '-layout', '-']

    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, bufsize=0, text=True)
    out, err = proc.communicate()

    for line in out.splitlines():
        column[lastColumn + 1].append({"row": str(line)})

    # Ensure that all arrays are the same length.
    # Pad the shorter arrays with {"row": ""} entries.
    longest_array = max(map(len, column.values()))
    final = {k: v + [{'row': ''}] * (longest_array - len(v))
             for k, v in column.items()}

    # Create JSON file.
    with open(os.path.join(save_dir, 'table_data.json'), "w") as f:
        json.dump(final, f)

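Below is the output of the code. As you can see, the text extraction itself works as it should. The formatting of the extracted text does not: the rows are not laid out correctly.

[Image: Output]

Below is the JSON string that is actually written to table_data.json. Note that I have tried to shorten it, because it is very long.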

{
   "1":[
      {
         "row":"Commercial Invoice #1200"
      },
      [...]
      {
         "row":"www.domain.com"
      },
      [...]
      {
         "row":" Faucets PO#900"
      },
      {
         "row":""
      },
      {
         "row":" Faucets PO#901"
      },
      [...]
      {
         "row":" Total"
      },
      {
         "row":""
      }
   ],
   "2":[
      {
         "row":"Your Invoice: TMS"
      },
      {
         "row":""
      },
      [...]
   ],
   "3":[
      {
         "row":"   Date: 25\/06 \u2013 2019"
      },
      [...]
      {
         "row":"USD 900"
      },
      {
         "row":""
      },
      {
         "row":"USD 100"
      },
      [...]
      {
         "row":"USD 1000"
      },
      [...]
   ]
}
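To check whether the rows of the different columns line up, the per-column lists in this JSON can be transposed into row tuples. The following is only a small sketch: columns_to_rows is a hypothetical helper, and the sample dictionary is a hand-shortened excerpt in the same shape as the output above, not the real file.

```python
def columns_to_rows(table_data):
    """Transpose {"1": [{"row": ...}, ...], ...} into a list of row tuples."""
    keys = sorted(table_data, key=int)
    per_column = [[cell["row"] for cell in table_data[key]] for key in keys]
    # zip() stops at the shortest column; the padded lists are equally long.
    return list(zip(*per_column))


# Hypothetical excerpt with the same structure as table_data.json.
sample = {
    "1": [{"row": " Faucets PO#900"}, {"row": ""}, {"row": " Total"}],
    "2": [{"row": ""}, {"row": ""}, {"row": ""}],
    "3": [{"row": "USD 900"}, {"row": ""}, {"row": "USD 1000"}],
}

for row in columns_to_rows(sample):
    print(row)
```

If the columns were extracted with matching line breaks, every tuple is one table row. Any shift between columns shows up as cells that belong together landing in different tuples.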

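I think the coordinates passed to pdftotext (-x -y -W -H) are causing some of the layout/formatting problems, because the tool only looks at one part of the page and interprets it as if it were the whole PDF file. That causes the layout (the lines) to shift.

Any ideas on how to solve this? I know pdftotext also offers -bbox-layout, but that option can only extract the whole PDF file, not a specific part of it.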
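One possible direction, offered as a sketch rather than a verified fix: run pdftotext -bbox-layout once on the whole file and do the column split in Python, assigning each word to a column by its xMin and grouping words into rows by their yMin. The code assumes the output is XHTML in which page elements carry a width attribute and word elements carry xMin and yMin. The names words_to_table, y_tolerance and SAMPLE are hypothetical, and SAMPLE is hand-written input, not real pdftotext output.

```python
import xml.etree.ElementTree as ET


def words_to_table(bbox_xhtml, positions, y_tolerance=2.0):
    """Bucket words into rows of len(positions) + 1 cells.

    positions: column boundaries as percentages of the page width.
    y_tolerance: maximum yMin difference (points) within one row.
    """
    root = ET.fromstring(bbox_xhtml)
    table = []
    for page in root.iter():
        # Tags may carry an XHTML namespace prefix, so match on the suffix.
        if not page.tag.endswith("page"):
            continue
        page_width = float(page.get("width"))
        bounds = [page_width * p / 100 for p in positions]

        words = sorted(
            (float(w.get("yMin")), float(w.get("xMin")), w.text or "")
            for w in page.iter() if w.tag.endswith("word"))

        rows = []  # each entry: (yMin of the row's first word, list of cells)
        for y, x, text in words:
            # Column index = number of boundaries to the left of the word.
            col = sum(x >= b for b in bounds)
            if not rows or y - rows[-1][0] > y_tolerance:
                rows.append((y, [[] for _ in range(len(bounds) + 1)]))
            rows[-1][1][col].append((x, text))

        for _, cells in rows:
            table.append([" ".join(t for _, t in sorted(c)) for c in cells])
    return table


# Hand-written stand-in for the output of `pdftotext -bbox-layout`.
SAMPLE = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body><doc>'
    '<page width="600.000000" height="800.000000">'
    '<word xMin="20.0" yMin="100.0" xMax="60.0" yMax="110.0">Faucets</word>'
    '<word xMin="65.0" yMin="100.0" xMax="100.0" yMax="110.0">PO#900</word>'
    '<word xMin="400.0" yMin="100.5" xMax="420.0" yMax="110.5">USD</word>'
    '<word xMin="425.0" yMin="100.5" xMax="445.0" yMax="110.5">900</word>'
    '<word xMin="20.0" yMin="130.0" xMax="50.0" yMax="140.0">Total</word>'
    '<word xMin="400.0" yMin="130.0" xMax="420.0" yMax="140.0">USD</word>'
    '<word xMin="425.0" yMin="130.0" xMax="450.0" yMax="140.0">1000</word>'
    '</page></doc></body></html>'
)

for row in words_to_table(SAMPLE, [33, 60]):
    print(row)
```

For a real file, the input string could come from something like subprocess.run(['pdftotext', '-bbox-layout', pdf_file, '-'], capture_output=True, text=True).stdout. Because rows are built from word coordinates instead of per-column line counts, empty cells do not shift the rows of the other columns. The y_tolerance value would likely need tuning per document.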

0 answers:

No answers