Question

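Every month I need to extract some data from .pdf files to create an Excel spreadsheet.

I can convert the .pdf file to text, but I'm not sure how to extract and save the specific information I want. This is the code I have so far: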

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = file(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()
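    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages,
                                  password=password, caching=caching,
                                  check_extractable=True):
        interpreter.process_page(page)

    # Read the buffer once, after every page has been processed. Reading it
    # inside the loop and appending repeats earlier pages in multi-page PDFs.
    text = retstr.getvalue()

    fp.close()
    device.close()
    retstr.close()
    return text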

print convert_pdf_to_txt("FA20150518.pdf")
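
A note for readers on Python 3: the code above is Python 2 (`cStringIO`, `file()`, the `print` statement). Below is a minimal sketch of the same conversion step, assuming the `pdfminer.six` package is installed. The helper names `pdf_to_text` and `text_to_lines` are made up for this illustration and are not part of the original code.

```python
def pdf_to_text(path):
    """Return the text of the PDF at `path` (requires pdfminer.six)."""
    # Imported here so the rest of the sketch runs without pdfminer.six.
    from pdfminer.high_level import extract_text
    return extract_text(path)


def text_to_lines(text):
    """Split text into stripped, non-empty lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


# Demonstration on a literal string, so no PDF file is needed here.
print(text_to_lines("FACTURA\n  20150518\n\nCLIENTE\n43000335\n"))
# → ['FACTURA', '20150518', 'CLIENTE', '43000335']
```

With a real file, the call would be `text_to_lines(pdf_to_text("FA20150518.pdf"))`.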

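OK, now I have the text returned by convert_pdf_to_txt.

I want to extract this information: customer, bill number, price, expiration date, and payment method.

The customer name is always below "EMAIL: buendialogistica@gmail.com"

The bill number is always below "FACTURA"

The price is always two lines below "Vencimientos:"

The expiration date is always below "Vencimientos:"

The payment method is always below "Banco:"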

I am thinking of doing something like this: convert the text into a list and search it line by line. This is the text I get from the PDF:

    >>> 
AVILA 72, VALLDOREIX
08197 SANT CUGAT DEL VALLES
(BARCELONA)
TELF: 935441851
NIF: B65512725
EMAIL: buendialogistica@gmail.com

JOSE LUIS MARTINEZ LOPEZ

AVDA. DEL ESLA, 33-D
24240 SANTA MARIA DEL PARAMO
LEON
TELF: 600871170

FECHA
17/06/15

FACTURA
  20150518

CLIENTE
43000335

N.I.F.

71548163 B

PÁG.

1

Nº VIAJE

RUTA

DESTINATARIO / REFERENCIA

KG

BULTOS

IMPORTE

2015064210-08/06/15

CERDANYOLA DEL VALLES -> VINAROS

FERRER ALIMENTACION - VINAROZ

2,000.0

1

         150,00

TOTAL IMP.

%

IMPORTE

BASE

         150,00

         150,00

%
 21,00

IVA

%

REC.

TOTAL FRA.

(€)

          31,50

         181,50

Eur

Forma Pago:
Banco:

CONTADO

Vencimientos:
17/06/15
181,50

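Search for the customer:

 lines = text.splitlines()   # text is the string returned by convert_pdf_to_txt
 customer = None
 i = 0
 while i < len(lines):
     if lines[i] == "EMAIL: buendialogistica@gmail.com":
         customer = lines[i + 1]   # next line; blank lines would need skipping first
         break
     i += 1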

I don't know how to save the data to Excel yet, but I'm sure I can find examples on the forum. First I just need to extract this data.

Answer 1

You have the right idea:

string = convert_pdf_to_txt("FA20150518.pdf")
# Drop empty lines so each value directly follows its label
lines = list(filter(bool, string.split('\n')))
custData = {}
for i in range(len(lines)):
    if 'EMAIL:' in lines[i]:
        custData['Name'] = lines[i+1]
    elif 'FACTURA' in lines[i]:
        custData['BillNumber'] = lines[i+1]
    elif 'Vencimientos:' in lines[i]:
        # the due date is on the next line, the price one line further
        custData['price'] = lines[i+2]
    elif 'Banco:' in lines[i]:
        custData['paymentType'] = lines[i+1]
print(custData)
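
The same idea can be written as a small table of rules, which also makes it easy to add the expiration date. This is only a sketch under a few assumptions: the `RULES` table and the `extract_fields` name are made up for illustration, the offsets come from the rules stated in the question, and the `sample` string is a shortened copy of the text shown in the question. Note that it swaps the `in` test for `startswith`, so a label only counts when it begins a line.

```python
# (label, field name, number of non-empty lines between label and value)
RULES = [
    ("EMAIL:", "Name", 1),
    ("FACTURA", "BillNumber", 1),
    ("Vencimientos:", "expirationDate", 1),
    ("Vencimientos:", "price", 2),
    ("Banco:", "paymentType", 1),
]


def extract_fields(text):
    """Return a dict of the fields found in `text`."""
    # Strip every line and drop the empty ones.
    lines = [line.strip() for line in text.split("\n") if line.strip()]
    data = {}
    for i, line in enumerate(lines):
        for label, field, offset in RULES:
            # Keep the first match only, and never index past the end.
            if (line.startswith(label) and field not in data
                    and i + offset < len(lines)):
                data[field] = lines[i + offset]
    return data


sample = """EMAIL: buendialogistica@gmail.com

JOSE LUIS MARTINEZ LOPEZ

FACTURA
  20150518

Forma Pago:
Banco:

CONTADO

Vencimientos:
17/06/15
181,50
"""

print(extract_fields(sample))
```

On the sample this yields the name `JOSE LUIS MARTINEZ LOPEZ`, bill number `20150518`, payment type `CONTADO`, expiration date `17/06/15` and price `181,50`.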

Answer 2

Let's take a simpler example that I hope represents your problem.

You have a string stringPDF like this:

name1 \n
\n
value1 \n
name2 \n
value2 \n
\n
name3 \n
otherValue \n
value3 \n

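The value is X lines after the name (in your example X is usually 1 and sometimes 2, but let's just say it can be any number). \n represents a line break (when you print the string, it is printed on multiple lines).

First, we convert the string into a list of lines by splitting at the line breaks: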

>>> stringList=stringPDF.split("\n")
>>> print(stringList)
['name1 ', '', 'value1 ', 'name2 ', 'value2 ', '', 'name3 ', 'otherValue ', 'value3 ', '']

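Depending on your string, you may need to clean it up. Here I have extra spaces at the end of each entry ('name1 ' instead of 'name1'). I remove them with a list comprehension and strip():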

stringList=[line.strip() for line in stringList]

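Once we have a clean list, we can define a simple function that returns the value, given the name and X (the number of lines between the name and the value):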

def get_value(l,name,Xline):
    indexName=l.index(name)  #find the index of the name in the list
    indexValue=indexName+Xline # add X to this index
    return l[indexValue]  #get the value

>>> print(get_value(stringList, "name2", 1))
value2
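
One thing to keep in mind: `list.index` raises `ValueError` when the name is not in the list, and `l[indexValue]` raises `IndexError` when the offset runs past the end. A possible variant that returns a default instead is sketched below. The name `get_value_or_default` and its `default` parameter are assumptions made for this illustration.

```python
def get_value_or_default(lines, name, offset, default=""):
    """Return the entry `offset` positions after `name`, or `default`."""
    try:
        index = lines.index(name)  # raises ValueError if the name is absent
    except ValueError:
        return default
    target = index + offset
    # Guard against an offset that points outside the list.
    return lines[target] if 0 <= target < len(lines) else default


string_pdf = ("name1 \n\nvalue1 \nname2 \nvalue2 \n\n"
              "name3 \notherValue \nvalue3 \n")
string_list = [line.strip() for line in string_pdf.split("\n")]

print(get_value_or_default(string_list, "name1", 2))                # value1
print(get_value_or_default(string_list, "name3", 2))                # value3
print(get_value_or_default(string_list, "missing", 1, default="n/a"))  # n/a
```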

Answer 3

Try something like this:

txtList = convert_pdf_to_txt("FA20150518.pdf").splitlines()
nameIdx, billNumIdx, priceIdx, expirDateIdx, paymentIdx = -1, -1, -1, -1, -1

for idx, line in enumerate(txtList):
    if "EMAIL: buendialogistica@gmail.com" in line:
        nameIdx = idx + 1 # in your example it should be +2...

    if "FACTURA" in line:
        billNumIdx = idx + 1

    if "Vencimientos:" in line:
        priceIdx = idx + 2
        expirDateIdx = idx + 1

    if "Banco:" in line:
        paymentIdx = idx + 1

name = txtList[nameIdx] if nameIdx != -1 else ''
billNum = txtList[billNumIdx] if billNumIdx != -1 else ''
price = txtList[priceIdx] if priceIdx != -1 else ''
expirDate = txtList[expirDateIdx] if expirDateIdx != -1 else ''
payment = txtList[paymentIdx] if paymentIdx != -1 else ''

If you are sure the key lines contain only what you are looking for ("FACTURA" and so on), you can replace the conditions with:

if line == "FACTURA":
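
A small illustration of why the exact comparison can matter. The list below is an assumed, shortened layout, not real extraction output. It is built from values that appear elsewhere in this thread, where the payment text itself ends in "FACTURA".

```python
# Assumed, shortened invoice lines; the payment text also contains "FACTURA".
lines = [
    "FACTURA",
    "  20150481",
    "Banco:",
    "DIA 5 FECHA FACTURA",
    "Vencimientos:",
]

# Substring test: matches the label and the payment text.
substring_hits = [i for i, line in enumerate(lines) if "FACTURA" in line]

# Exact test (after stripping spaces): matches the label only.
exact_hits = [i for i, line in enumerate(lines) if line.strip() == "FACTURA"]

print(substring_hits)  # [0, 3]
print(exact_hits)      # [0]
```

In the loop above the last match wins, so with the substring test `billNumIdx` would end up pointing at the line after the payment text instead of the bill number.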

Answer 4

Thanks for your help. I took code from the two examples you gave me, and now I can extract all the information I want.

# -*- coding: cp1252 -*-
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = file(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()
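    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages,
                                  password=password, caching=caching,
                                  check_extractable=True):
        interpreter.process_page(page)

    # Read the buffer once, after every page has been processed. Reading it
    # inside the loop and appending repeats earlier pages in multi-page PDFs.
    text = retstr.getvalue()

    fp.close()
    device.close()
    retstr.close()
    return text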


factura = "FA20150483.pdf"
# example 1

string = convert_pdf_to_txt(factura)
lines = list(filter(bool,string.split('\n')))
custData = {}
for i in range(len(lines)):
    if 'EMAIL:' in lines[i]:
        custData['Name'] = lines[i+1]
    elif 'FACTURA' in lines[i]:
        custData['BillNumber'] = lines[i+1]
    elif 'Vencimientos:' in lines[i]:
        custData['price'] = lines[i+2]
    elif 'Banco:' in lines[i]:
        custData['paymentType'] = lines[i+1]



# example 2
txtList = convert_pdf_to_txt(factura).splitlines()
nameIdx, billNumIdx, priceIdx, expirDateIdx, paymentIdx = -1, -1, -1, -1, -1

for idx, line in enumerate(txtList):
    if line == "EMAIL: buendialogistica@gmail.com":
        nameIdx = idx + 2  # the name is two lines below the e-mail line

    if line == "FACTURA":
        billNumIdx = idx + 1

    if "Vencimientos:" in line:
        priceIdx = idx + 2
        expirDateIdx = idx + 1

    if "Banco:" in line:
        paymentIdx = idx + 1

name = txtList[nameIdx] if nameIdx != -1 else ''
billNum = txtList[billNumIdx] if billNumIdx != -1 else ''
price = txtList[priceIdx] if priceIdx != -1 else ''
expirDate = txtList[expirDateIdx] if expirDateIdx != -1 else ''
payment = txtList[paymentIdx] if paymentIdx != -1 else ''


print expirDate

# "Â" is most likely leftover non-breaking-space bytes from the PDF text
billNum = billNum.replace("Â Â ", "")
print billNum

custData['Name'] = custData['Name'].replace("Â", "")
print custData['Name']

custData['paymentType'] = custData['paymentType'].replace("Â", "")
print custData['paymentType']

print price

Some examples:

    >>> 
25/06/15
20150480
BABY RACE S.L.
REMESA DIA 25 FECHA FACTURA
15,23
>>> ================================ RESTART ================================
>>> 
05/06/15
20150481
LOFT CUINA, S.L.
DIA 5 FECHA FACTURA
91,79
>>> ================================ RESTART ================================
>>> 
05/06/15
20150482
GRAFIQUES MOGENT S.L.
DIA 5 FECHA FACTURA
128,42
>>> ================================ RESTART ================================
>>> 
30/06/15
20150483
CHIEMIVALL SL
30 DIAS FECHA FACTURA
1.138,58
>>>
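
For the remaining step (getting the values into a spreadsheet), one possible approach is to write a CSV file, which Excel can open. This is a sketch and not part of the code above: it is written for Python 3 and uses only the standard library, the column names and the helpers `parse_amount`, `parse_date` and `write_invoices` are invented for this example, and the two rows are copied from the output shown above. It assumes amounts use a dot for thousands and a comma for decimals, and that dates are dd/mm/yy.

```python
import csv
import io
from datetime import datetime
from decimal import Decimal


def parse_amount(text):
    """Convert '1.138,58' (dot = thousands, comma = decimals) to Decimal."""
    return Decimal(text.strip().replace(".", "").replace(",", "."))


def parse_date(text):
    """Convert a 'dd/mm/yy' string to a date object."""
    return datetime.strptime(text.strip(), "%d/%m/%y").date()


def write_invoices(out, rows):
    """Write a header and one line per invoice to the file object `out`."""
    writer = csv.writer(out, delimiter=";")
    writer.writerow(["expiration_date", "bill_number", "customer",
                     "payment", "price"])
    for expir, bill, customer, payment, price in rows:
        writer.writerow([parse_date(expir).isoformat(), bill, customer,
                         payment, str(parse_amount(price))])


# Two rows copied from the output above.
rows = [
    ("25/06/15", "20150480", "BABY RACE S.L.",
     "REMESA DIA 25 FECHA FACTURA", "15,23"),
    ("30/06/15", "20150483", "CHIEMIVALL SL",
     "30 DIAS FECHA FACTURA", "1.138,58"),
]

buffer = io.StringIO()  # in-memory stand-in for a real file
write_invoices(buffer, rows)
print(buffer.getvalue())
```

To write an actual file, pass `open("invoices.csv", "w", newline="")` instead of the `StringIO` buffer. The semicolon delimiter is a guess that suits Excel installations configured for a comma decimal separator. A native .xlsx file would need a third-party library such as openpyxl.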

Extract specific data from .pdf and save in Excel file

4 answers: