Question

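Consider the following loop, which iterates over every page of a PDF, reads the text, and then splits each page further using defined column positions.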

The column positions are defined as follows (via the command line):

'{"1":{"position":"15"}, "2":{"position": "20"}}'
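For reference, a minimal sketch of how a definition string like this could be turned into an ordered list of positions. It assumes the string arrives as a single command-line argument (hard-coded here so the sketch runs on its own) and that the positions are meant to be integers; the question does not show how COLUMNS is actually built.

```python
import json

# Assumption: in the real script this string would come from the command line;
# it is hard-coded here so the sketch is self-contained.
raw = '{"1":{"position":"15"}, "2":{"position": "20"}}'

definitions = json.loads(raw)

# Order the entries by their numeric column index and convert each
# position from a string to an integer.
COLUMNS = [int(definitions[key]["position"]) for key in sorted(definitions, key=int)]

print(COLUMNS)  # [15, 20]
```

Sorting with `key=int` keeps column "10" after column "9", which a plain string sort would not.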

This is my script:

npages = 2  # Number of pages in the PDF.
column = {}


for n in range(npages):

    for i, col in enumerate(COLUMNS):
        out = [...] #The specific text from the PDF page, inside the defined column area
        column[i+1] = ({"row": str(out)})

Now, suppose I have a PDF file that is two pages long. It contains the following text:

Page 1:

Page 1 Col 1 Text                 Page 1 Col 2 Text

Page 2:

Page 2 Col 1 Text                 Page 2 Col 2 Text

Currently, my code outputs the following:

{  
   "1":{  
      "row":"Page 2 \u2013 Col 1.\n\n\f"
   },
   "2":{  
      "row":"Page 2 \u2013 Col 2\n\n\f"
   }
}

Ideally, what I would like to produce is JSON output like this:

{  
   "1":[  
      {  
         "row":"Page 1 Col 1 Text"
      },
      {  
         "row":"Page 2 Col 1 Text"
      }
   ],
   "2":[  
      {  
         "row":"Page 1 Col 2 Text"
      },
      {  
         "row":"Page 2 Col 2 Text"
      }
   ]
}

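So, basically, the column boundaries are shared across all pages, and the content of each column must be added under the correct column index. In addition, each line of out (split on \n) should be added as its own row entry inside that column index.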

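Is this even possible in Python 3? Would I be better off saving the text content of the PDF to files and then building the JSON string from each file in a folder?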

Answer 1

Assuming everything else in your example works, use a defaultdict for column and append your information.

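import collections

# Missing keys start out as an empty list, so results from every page accumulate.
column = collections.defaultdict(list)
for n in range(npages):
    # Start counting at 1 so the keys match the column numbers.
    for i, col in enumerate(COLUMNS, 1):
        out = [...]  # The specific text from the PDF page, inside the defined column area
        column[i].append({"row": str(out)})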
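The question also asks for one row entry per line of text, which the loop above does not do on its own. Below is a runnable sketch of that variation. The function `extract_column_text` and the `SAMPLE_TEXT` table are hypothetical stand-ins for the elided extraction step, filled with the sample text from the question; only the `defaultdict` pattern comes from the answer.

```python
import collections
import json

# Hypothetical stand-in for the real PDF extraction: canned text keyed by
# (page index, column index), with trailing newlines and a form feed like
# the output shown in the question.
SAMPLE_TEXT = {
    (0, 0): "Page 1 Col 1 Text\n\n\f",
    (0, 1): "Page 1 Col 2 Text\n\n\f",
    (1, 0): "Page 2 Col 1 Text\n\n\f",
    (1, 1): "Page 2 Col 2 Text\n\n\f",
}


def extract_column_text(page_index, column_index):
    """Return the text of one column on one page (placeholder implementation)."""
    return SAMPLE_TEXT[(page_index, column_index)]


def collect_rows(npages, ncolumns):
    """Group every non-empty line of text under its 1-based column number."""
    column = collections.defaultdict(list)
    for n in range(npages):
        for i in range(1, ncolumns + 1):
            out = extract_column_text(n, i - 1)
            # splitlines() breaks on "\n" and on the form feed "\f";
            # blank pieces are skipped so they do not become empty rows.
            for line in out.splitlines():
                line = line.strip()
                if line:
                    column[i].append({"row": line})
    return column


columns = collect_rows(npages=2, ncolumns=2)
print(json.dumps(columns, indent=3))
```

`json.dumps` accepts a `defaultdict` directly and writes the integer keys as the strings "1" and "2", so the printed structure matches the desired output in the question.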

Python 3: generating an associative array with the same key names
