Cleaning up a text file after parsing a PDF

Time: 2014-08-03 17:15:44

Tags: python parsing pdf python-3.x text

I have parsed a PDF and cleaned it up as much as I can, but I am stuck on aligning the information in the resulting text file.

My output looks like this:

Zone
1
Report Name
ARREST
Incident Time
01:41
Location of Occurrence
1300 block Liverpool St
Neighborhood
Highland Park
Incident
14081898
Age
27
Gender
M
Section
3921(a)
3925
903
Description
Theft by Unlawful Taking or Disposition - Movable item
Receiving Stolen Property.
Criminal Conspiracy.

I would like it to look like this:

Zone:    1
Report Name:    ARREST
Incident Time:    01:41
Location of Occurrence:    1300 block Liverpool St
Neighborhood:    Highland Park
Incident:    14081898
Age:    27
Gender:    M
Section, Description:
3921(a): Theft by Unlawful Taking or Disposition - Movable item
3925: Receiving Stolen Property.
903: Criminal Conspiracy.

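I tried enumerating over the list, but the problem is that some fields are missing for some incidents, so the wrong information gets picked up.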
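One possible way around the missing-field problem is to decide what each line is by checking it against the known labels, instead of assuming the value always sits at index i+1. The following is only a sketch in Python 3. It assumes the cleaned text alternates label and value lines as in the sample above, and that the label set is exactly the one shown there; the names SINGLE_FIELDS, MULTI_FIELDS, parse_record, and format_record are made up for illustration.

```python
# Labels followed by a single value (assumed from the sample output above).
SINGLE_FIELDS = [
    "Zone", "Report Name", "Incident Time", "Location of Occurrence",
    "Neighborhood", "Incident", "Age", "Gender",
]
# Labels followed by a run of values, one per charge (also an assumption).
MULTI_FIELDS = ["Section", "Description"]
ALL_LABELS = set(SINGLE_FIELDS) | set(MULTI_FIELDS)


def parse_record(lines):
    """Collect the lines that follow each known label into a dict of lists.

    A label immediately followed by another label simply keeps an empty
    list, so a missing value never steals the next label as its value.
    """
    record = {}
    current = None
    for raw in lines:
        line = raw.strip()          # drop the trailing newline
        if not line:
            continue
        if line in ALL_LABELS:
            current = line
            record[current] = []
        elif current is not None:
            record[current].append(line)
    return record


def format_record(record):
    """Render a parsed record in the 'Label:    value' layout shown above."""
    out = []
    for label in SINGLE_FIELDS:
        if label in record:
            value = " ".join(record[label])
            out.append(f"{label}:    {value}".rstrip())
    sections = record.get("Section", [])
    descriptions = record.get("Description", [])
    if sections or descriptions:
        out.append("Section, Description:")
        # zip() pairs entries by position and stops at the shorter list.
        for section, description in zip(sections, descriptions):
            out.append(f"{section}: {description}")
    return "\n".join(out)
```

With this approach an incident that has no time yields a bare "Incident Time:" line rather than borrowing the next label. Splitting the whole file into one list of lines per incident is left out here, since it depends on which label reliably starts a record in the real data.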

Here is the code that parses the PDF:

import os
import sys  # needed for the sys.stdout fallback in parsePDF
import urllib2
import time
from datetime import datetime, timedelta
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams

def parsePDF(infile, outfile):

    password = ''
    pagenos = set()
    maxpages = 0
    # output option
    outtype = 'text'
    imagewriter = None
    rotation = 0
    stripcontrol = False
    layoutmode = 'normal'
    codec = 'utf-8'
    pageno = 1
    scale = 1
    caching = True
    showpageno = True
    laparams = LAParams()
    rsrcmgr = PDFResourceManager(caching=caching)

    if outfile:
        outfp = file(outfile, 'w+')
    else:
        outfp = sys.stdout

    device = TextConverter(rsrcmgr, outfp, codec=codec, laparams=laparams, imagewriter=imagewriter)
    fp = file(infile, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.get_pages(fp, pagenos,
                                      maxpages=maxpages, password=password,
                                      caching=caching, check_extractable=True):

        interpreter.process_page(page)
    fp.close()
    device.close()
    outfp.close()
    return  


# Set time zone to EST
#os.environ['TZ'] = 'America/New_York'
#time.tzset()

# make sure folder system is set up
if not os.path.exists("../pdf/"):
    os.makedirs("../pdf/")
if not os.path.exists("../txt/"):
    os.makedirs("../txt/")

# Get yesterday's name and lowercase it
yesterday = (datetime.today() - timedelta(1))
yesterday_string = yesterday.strftime("%A").lower()

# Also make a numerical representation of date for filename purposes
yesterday_short = yesterday.strftime("%Y%m%d")

# Get pdf from blotter site, save it in a file
pdf = urllib2.urlopen("http://www.city.pittsburgh.pa.us/police/blotter/blotter_" + yesterday_string + ".pdf").read()
# PDF data is binary, so write it in binary mode
f = file("../pdf/" + yesterday_short + ".pdf", "wb")
f.write(pdf)
f.close()

# Convert pdf to text file
parsePDF("../pdf/" + yesterday_short + ".pdf", "../txt/" + yesterday_short + ".txt")

# Save text file contents in variable
parsed_pdf = file("../txt/" + yesterday_short + ".txt", "r").read()

This is what I have so far:

import os

OddsnEnds = [ "PITTSBURGH BUREAU OF POLICE", "Incident Blotter", "Sorted by:", "DISCLAIMER:", "Incident Date", "assumes", "Page", "Report Name"]    


if not os.path.exists("../out/"):
    os.makedirs("../out/")  
with open("../txt/20140731.txt", 'r') as file:
    blotterList = file.readlines()

with open("../out/test2.txt", 'w') as outfile:
    cleanList = []
    for line in blotterList:
        if not any ([o in line for o in OddsnEnds]):
            cleanList.append(line)
    while '\n' in cleanList:
        cleanList.remove('\n')
    for i in [i for i, j in enumerate(cleanList) if j == 'ARREST\n']:
        print ('Incident:%s' % cleanList[i])
    for i in [i for i, j in enumerate(cleanList) if j == 'Incident Time\n']:
        print ('Time:%s' % cleanList[i+1])

But enumerating gives me this output:

Time:16:20

Time:17:40

Time:17:53

Time:18:05

Time:Location of Occurrence

because no time was given for that incident. Also note that every string ends with \n.
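Since every string carries a trailing \n, it may be simpler to strip the lines once up front, so later comparisons can use 'Incident Time' rather than 'Incident Time\n'. A minimal Python 3 sketch follows; the helper name clean_lines is hypothetical, and skip_markers stands in for the OddsnEnds list above.

```python
def clean_lines(raw_lines, skip_markers):
    """Strip whitespace, then drop blank lines and lines containing a marker."""
    cleaned = []
    for raw in raw_lines:
        line = raw.strip()
        if not line:
            continue                       # blank line
        if any(marker in line for marker in skip_markers):
            continue                       # header/footer residue
        cleaned.append(line)
    return cleaned
```

This does the filtering in a single pass, which also replaces the repeated cleanList.remove('\n') loop.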

Any and all ideas and help are greatly appreciated.

2 Answers:

Answer 0: (score: 0)

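My favorite way to scrape text from PDF files is pdftotext (from the poppler utilities) with the -layout option. It does a very good job of preserving the document's original layout.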

You can call it from Python with the subprocess module.
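A rough Python 3 sketch of that call is shown here. It assumes the pdftotext binary from poppler is installed and on the PATH; the function names and file paths are placeholders, not part of any library.

```python
import subprocess


def build_pdftotext_command(pdf_path, txt_path):
    """Return the argument list for 'pdftotext -layout <pdf> <txt>'."""
    return ["pdftotext", "-layout", pdf_path, txt_path]


def pdf_to_text(pdf_path, txt_path):
    """Run pdftotext and return the extracted text.

    check=True raises CalledProcessError if pdftotext exits with an error.
    """
    subprocess.run(build_pdftotext_command(pdf_path, txt_path), check=True)
    # errors="replace" keeps reading even if some bytes are not valid UTF-8.
    with open(txt_path, encoding="utf-8", errors="replace") as handle:
        return handle.read()
```

A call might look like pdf_to_text("../pdf/20140731.pdf", "../txt/20140731.txt"). Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.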

Answer 1: (score: 0)

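In general, extracting text from PDF files (especially when you want to keep the text's formatting, spacing, and layout) is a task that cannot be expected to work with 100% accuracy every time. I learned this from a support technician at the company that makes a popular library (xpdf) for extracting text from PDFs, while I was working on a project in this area some time ago. At that time I explored several libraries for extracting text from PDFs, including xpdf and a few others. There are definite technical reasons why they cannot always give perfect results (although they do in many cases); those reasons have to do with the nature of the PDF format and with how PDFs are generated. When you extract text from some PDFs, the layout and spacing may not be preserved even if you use the library's options (such as keep_format = True or an equivalent).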

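The only permanent solution to this problem is to not need to extract text from PDF files at all. Instead, always try to work with the data format and data source from which the PDF was generated, and use that for text extraction and manipulation. Of course, that is easier said than done if you do not have access to those sources.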