
Question

I am trying to parse a PDF file that has a table-like layout. Consider the following PDF file:

I want users to be able to define columns (the layout) for the PDF file, like this:

Here is my code:

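import pdfplumber

# Column areas as (x0, top, x1, bottom) bounding boxes, in PDF points.
areas = {}
areas[0] = (0, 0, 150, 792)
areas[1] = (150, 0, 350, 792)
areas[2] = (350, 0, 612, 792)

with pdfplumber.open(mypdf_file) as pdf:
    p0 = pdf.pages[0]
    for i, area in areas.items():
        # Crop the first page to this column and extract its words.
        column = p0.crop(area)
        words = column.extract_words()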

Below is the extracted output that words contains:

[[{'bottom': Decimal('99.708'),
  'text': 'Page'
 },
 {'bottom': Decimal('99.708'),
  'text': '1,'
 },
 {'bottom': Decimal('99.708'),
  'text': 'col'
 },
 {'bottom': Decimal('99.708'),
  'text': '1.'
}]
[{'bottom': Decimal('128.988'),
  'text': 'Page'
 },
 {'bottom': Decimal('128.988'),
  'text': '1,'
 },
 {'bottom': Decimal('128.988'),
  'text': 'col'
 },
 {'bottom': Decimal('128.988'),
  'text': '2.'
}]
[{'bottom': Decimal('143.628'),
  'text': 'Page'
 },
 {'bottom': Decimal('143.628'),
  'text': '1,'
 },
 {'bottom': Decimal('143.628'),
  'text': 'col'
 },
 {'bottom': Decimal('143.628'),
  'text': '3'
}]

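I am trying to find a way to parse this information so that it represents my image above. The problem is that I only have the bbox information (bottom) and no actual line breaks. So if I parse the data above: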

from operator import itemgetter

cols = {}
for i, area in enumerate(areas):
    [....]
    cols[i + 1] = " ".join(map(itemgetter("text"), words))

I get the words, joined into a single line per column, as:

{1: 'Page 1, col 1.', 2: 'Page 1, col 2.', 3: 'Page 1, col 3'}
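Since only the bottom value is available as a line indicator, one possible starting point is to group a column's words into lines by comparing their bottom values. This is only a minimal sketch: the function name `group_into_lines`, the tolerance of 2 points, and the extra "next line" words in the sample are all hypothetical and not part of my real data. It also assumes that words sharing a line arrive in reading order.

```python
from decimal import Decimal


def group_into_lines(words, tolerance=Decimal("2")):
    """Group word dicts into lines by their 'bottom' value.

    A word joins the current line when its bottom is within `tolerance`
    of the bottom of the first word on that line (assumed threshold).
    """
    lines = []
    # sorted() is stable, so words with equal bottoms keep their input order.
    for word in sorted(words, key=lambda w: w["bottom"]):
        if lines and word["bottom"] - lines[-1]["bottom"] <= tolerance:
            lines[-1]["words"].append(word["text"])
        else:
            lines.append({"bottom": word["bottom"], "words": [word["text"]]})
    return [(line["bottom"], " ".join(line["words"])) for line in lines]


# First four entries mirror the extracted output; the last two are made up
# to show a second line in the same column.
column_words = [
    {"bottom": Decimal("99.708"), "text": "Page"},
    {"bottom": Decimal("99.708"), "text": "1,"},
    {"bottom": Decimal("99.708"), "text": "col"},
    {"bottom": Decimal("99.708"), "text": "1."},
    {"bottom": Decimal("114.348"), "text": "next"},
    {"bottom": Decimal("114.348"), "text": "line"},
]

lines = group_into_lines(column_words)
print(lines)
```

Each result is a (bottom, text) pair, so the vertical position is kept and can still be compared against lines from the other columns.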

Expected output

I am trying to use the bottom value to determine whether words are on the same line, so that the result can be parsed like a table.

However, I am not sure how to approach this. Could each line/word be checked against the lines in the next column to see whether they are on the same row?

The output would look something like this:

{
    "1": [{
        "row": "Page 1, col 1.",
        "row": "",
        "row": ""
    }],
    "2": [{
        "row": "",
        "row": "Page 1, col 2.",
        "row": ""
    }],
    "3": [{
        "row": "",
        "row": "",
        "row": "Page 1, col 3."
    }]
  }
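To make the idea concrete, here is a rough sketch of how such a structure might be built, assuming each column has already been reduced to (bottom, text) pairs. The name `align_columns` and the 2-point tolerance are hypothetical. Because a dict cannot hold repeated "row" keys, the sketch returns a list of row strings per column instead.

```python
from decimal import Decimal


def align_columns(columns, tolerance=Decimal("2")):
    """Map {column: [(bottom, text), ...]} to {column: [row texts]}.

    Rows are derived from the bottom values seen in any column, so every
    column ends up with the same number of rows (empty string = no text).
    """
    # Collect all line positions and merge those closer than `tolerance`.
    bottoms = sorted(b for lines in columns.values() for b, _ in lines)
    rows = []
    for b in bottoms:
        if not rows or b - rows[-1] > tolerance:
            rows.append(b)

    table = {}
    for col, lines in columns.items():
        cells = [""] * len(rows)
        for bottom, text in lines:
            # Put the line into the row whose position is nearest.
            idx = min(range(len(rows)), key=lambda i: abs(rows[i] - bottom))
            cells[idx] = (cells[idx] + " " + text).strip()
        table[col] = cells
    return table


# Bottom values and texts taken from the extracted output above.
columns = {
    1: [(Decimal("99.708"), "Page 1, col 1.")],
    2: [(Decimal("128.988"), "Page 1, col 2.")],
    3: [(Decimal("143.628"), "Page 1, col 3")],
}

table = align_columns(columns)
print(table)
```

With the sample values this yields three rows, and each column has text in exactly one of them, which matches the diagonal layout shown above.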

Extract words from a PDF and parse them like a table


0 answers: