How to extract structured data from a PDF using regular expressions

Time: 2019-06-24 23:42:12

Tags: python pdf

I have a PDF that repeats the following block many times:

31-10-2018
NATIONAL
Initial Hearing
Imputed: Maynor Steven Sevilla Flores
Crime: murder
Relation of facts: murder at 10 am in the neighborhood cox 20...…

NOTE: xxxxxxxx...
NOTE2:xxxxxxxx...
DATA: xxxxxxx...

01-11-2018
NATIONAL
Initial Hearing
Imputed: James Graden 
Crime: murder
Relation of facts: murder at 11 am in the neighborhood bit 45...…

.
.
.

I want to implement this in Python. What I have so far:

import PyPDF2
import re

PATH_DOWNLOAD_PDF = '/home/Dev/Freelance/Webscrapping/test/file.pdf'
pdf_file = open(PATH_DOWNLOAD_PDF, 'rb')
# PdfFileReader is the reader class in PyPDF2 releases before 3.0
read_pdf = PyPDF2.PdfFileReader(pdf_file)
#.
#.
#.

I need to read the PDF and use Python regular expressions to get the following result:

Expected result: a Python list of dictionaries:

[
 {
  "Date": "31-10-2018",
  "Judge": "NATIONAL",
  "Initial Hearing":
        {
         "imputed": "Maynor Steven Sevilla Flores",
         "Crime": "murder",
         "Relation of facts": "murder at 10 am in the neighborhood cox 20..."
        }
 },
 {
  "Date": "01-11-2018",
  "Judge": "NATIONAL",
  "Initial Hearing":
        {
         "imputed": "James Graden",
         "Crime": "murder",
         "Relation of facts": "murder at 11 am in the neighborhood bit 45...…"
        }
 }
]

I only have a little programming experience. Please help.
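
One way this could be approached is sketched below. It is a minimal sketch, not a tested solution for the real file: it assumes the text has already been extracted from the PDF into a single string (for example by joining the text of each page) and that the extraction keeps the same line breaks as the sample above, which PDF text extraction does not always do. The names RECORD_RE, parse_hearings, and SAMPLE are made up for illustration, and the sample text is a shortened copy of the block from the question.

```python
import re

# One record = a date line, a judge line, a hearing-type line, then the
# three labelled lines. Assumes the extracted text keeps this line layout.
RECORD_RE = re.compile(
    r"^(?P<date>\d{2}-\d{2}-\d{4})[ \t]*\n"
    r"(?P<judge>[^\n]+)\n"
    r"(?P<hearing>[^\n]+)\n"
    r"Imputed:[ \t]*(?P<imputed>[^\n]*)\n"
    r"Crime:[ \t]*(?P<crime>[^\n]*)\n"
    r"Relation of facts:[ \t]*(?P<facts>[^\n]*)",
    re.MULTILINE,
)


def parse_hearings(text):
    """Return a list of dictionaries, one per record found in text."""
    records = []
    for m in RECORD_RE.finditer(text):
        records.append({
            "Date": m.group("date"),
            "Judge": m.group("judge").strip(),
            # The hearing type ("Initial Hearing") becomes the nested key.
            m.group("hearing").strip(): {
                "imputed": m.group("imputed").strip(),
                "Crime": m.group("crime").strip(),
                "Relation of facts": m.group("facts").strip(),
            },
        })
    return records


# Shortened stand-in for the text extracted from the PDF.
SAMPLE = """31-10-2018
NATIONAL
Initial Hearing
Imputed: Maynor Steven Sevilla Flores
Crime: murder
Relation of facts: murder at 10 am in the neighborhood cox 20...

NOTE: xxxxxxxx...
NOTE2:xxxxxxxx...
DATA: xxxxxxx...

01-11-2018
NATIONAL
Initial Hearing
Imputed: James Graden 
Crime: murder
Relation of facts: murder at 11 am in the neighborhood bit 45...
"""

records = parse_hearings(SAMPLE)
```

The NOTE, NOTE2, and DATA lines are skipped simply because the pattern only anchors on a line that starts with a date. If the extracted text arrives without line breaks, the pattern would need to match on the labels (Imputed:, Crime:, Relation of facts:) instead of on newlines.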

0 answers:

No answers