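Every month I need to extract some data from .pdf files to build an Excel sheet.
I can convert the .pdf file to text, but I don't know how to extract and save the specific information I want. This is the code I have so far: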
# Python 2 code: cStringIO, file() and the print statement do not exist in Python 3
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = file(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages,
                                  password=password, caching=caching,
                                  check_extractable=True):
        interpreter.process_page(page)
    # Read the buffer once, after all pages are processed; reading it inside
    # the loop would append the earlier pages again on every iteration.
    text = retstr.getvalue()
    fp.close()
    device.close()
    retstr.close()
    return text

print convert_pdf_to_txt("FA20150518.pdf")
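OK, now I have the text returned by convert_pdf_to_txt.
I want to extract this information: customer, bill number, price, expiration date and payment method.
The customer name always comes right after "EMAIL: buendialogistica@gmail.com".
The bill number is always on the line below "FACTURA".
The price is always two lines below "Vencimientos:".
The expiration date is always on the line below "Vencimientos:".
The payment method is always on the line below "Banco:".
I was thinking of doing something like this, if I can convert the text into a list and do the following: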
Search for the customer in the extracted text, which looks like this:
>>>
AVILA 72, VALLDOREIX
08197 SANT CUGAT DEL VALLES
(BARCELONA)
TELF: 935441851
NIF: B65512725
EMAIL: buendialogistica@gmail.com
JOSE LUIS MARTINEZ LOPEZ
AVDA. DEL ESLA, 33-D
24240 SANTA MARIA DEL PARAMO
LEON
TELF: 600871170
FECHA
17/06/15
FACTURA
20150518
CLIENTE
43000335
N.I.F.
71548163 B
PÁG.
1
Nº VIAJE
RUTA
DESTINATARIO / REFERENCIA
KG
BULTOS
IMPORTE
2015064210-08/06/15
CERDANYOLA DEL VALLES -> VINAROS
FERRER ALIMENTACION - VINAROZ
2,000.0
1
150,00
TOTAL IMP.
%
IMPORTE
BASE
150,00
150,00
%
21,00
IVA
%
REC.
TOTAL FRA.
(€)
31,50
181,50
Eur
Forma Pago:
Banco:
CONTADO
Vencimientos:
17/06/15
181,50
The search itself could look like this:
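# listitem is the list of text lines, lengthlist = len(listitem)
i = 0
while i < lengthlist:
    if listitem[i] == "EMAIL: buendialogistica@gmail.com":
        i += 1
        Customer = listitem[i]  # the customer name is on the next line
        i = lengthlist          # stop the loop
    else:
        i += 1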
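I don't know how to save this to Excel yet, but I'm sure I can find examples on the forum; first I just need to extract this data.
Answer 0 (score: 4)
You have the right idea: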
string = convert_pdf_to_txt("FA20150518.pdf")
# split into lines and drop the empty ones
lines = list(filter(bool, string.split('\n')))
custData = {}
for i in range(len(lines)):
    # each value sits one or two lines below its label
    if 'EMAIL:' in lines[i]:
        custData['Name'] = lines[i+1]
    elif 'FACTURA' in lines[i]:
        custData['BillNumber'] = lines[i+1]
    elif 'Vencimientos:' in lines[i]:
        custData['price'] = lines[i+2]
    elif 'Banco:' in lines[i]:
        custData['paymentType'] = lines[i+1]
print(custData)
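One thing to watch with this approach: the offset from label to value depends on whether blank lines are kept. Here is a minimal Python 3 sketch of that, assuming the extracted text has one blank line between the EMAIL line and the customer name (the short `text` sample is made up from the invoice in the question, and the names `raw_lines` and `compact_lines` are only for illustration).

```python
# Made-up excerpt; the blank line after the EMAIL line is an assumption.
text = (
    "EMAIL: buendialogistica@gmail.com\n"
    "\n"
    "JOSE LUIS MARTINEZ LOPEZ\n"
    "AVDA. DEL ESLA, 33-D\n"
)
label = "EMAIL: buendialogistica@gmail.com"

raw_lines = text.splitlines()                            # blank lines kept
compact_lines = [ln for ln in raw_lines if ln.strip()]   # blank lines dropped

# With blank lines kept the name is 2 lines below the label,
# with blank lines dropped it is 1 line below.
name_from_raw = raw_lines[raw_lines.index(label) + 2]
name_from_compact = compact_lines[compact_lines.index(label) + 1]

print(name_from_raw)
print(name_from_compact)
```

Note that `filter(bool, ...)` only removes completely empty strings, so a line containing just spaces would survive it; the sketch uses `strip()` to drop those as well.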
Answer 1 (score: 1)
Let's take a simpler example that I hope represents your problem.
You have a string stringPDF that looks like this:
name1 \n
\n
value1 \n
name2 \n
value2 \n
\n
name3 \n
otherValue \n
value3 \n
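The value is X lines after the name (in your example X is usually 1, sometimes 2, but let's just say it can be any number). \n denotes a line break (when you print the string, it is printed on several lines).
First, we convert the string into a list of lines by splitting it at the line breaks: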
>>> stringList=stringPDF.split("\n")
>>> print(stringList)
['name1 ', '', 'value1 ', 'name2 ', 'value2 ', '', 'name3 ', 'otherValue ', 'value3 ', '']
Depending on your string, you may need to clean it up. Here I have some trailing spaces ('name1 ' instead of 'name1'). I remove them with a list comprehension and strip():
stringList=[line.strip() for line in stringList]
Once we have a clean list, we can define a simple function that returns the value, given the name and X (the number of lines between the name and the value):
def get_value(l, name, Xline):
    indexName = l.index(name)       # find the index of the name in the list
    indexValue = indexName + Xline  # add X to this index
    return l[indexValue]            # get the value
>>> print(get_value(stringList, "name2", 1))
value2
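As written, get_value raises ValueError when the name is not in the list and IndexError when the offset runs past the end. A possible variant in Python 3 that returns a default instead, applied to the same toy string; the `default` parameter and the `offsets` table are my own additions for illustration, not part of the function above.

```python
def get_value(lines, name, offset, default=None):
    """Return the line `offset` positions after `name`, or `default`."""
    try:
        index = lines.index(name)        # position of the label
    except ValueError:                   # label not present at all
        return default
    target = index + offset
    return lines[target] if target < len(lines) else default

string_pdf = "name1 \n\nvalue1 \nname2 \nvalue2 \n\nname3 \notherValue \nvalue3 \n"
string_list = [line.strip() for line in string_pdf.split("\n")]

# One offset per label (hypothetical table for this toy string).
offsets = {"name1": 2, "name2": 1, "name3": 2}
values = {name: get_value(string_list, name, x) for name, x in offsets.items()}
print(values)
```

Keeping the offsets in one table makes it easy to adjust a single field when the PDF layout turns out to have an extra line.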
Answer 2 (score: 1)
Try something like this:
txtList = convert_pdf_to_txt("FA20150518.pdf").splitlines()
# -1 means "label not found"
nameIdx, billNumIdx, priceIdx, expirDateIdx, paymentIdx = -1, -1, -1, -1, -1
for idx, line in enumerate(txtList):
    if "EMAIL: buendialogistica@gmail.com" in line:
        nameIdx = idx + 1  # in your example it should be +2...
    if "FACTURA" in line:
        billNumIdx = idx + 1
    if "Vencimientos:" in line:
        priceIdx = idx + 2
        expirDateIdx = idx + 1
    if "Banco:" in line:
        paymentIdx = idx + 1
name = txtList[nameIdx] if nameIdx != -1 else ''
billNum = txtList[billNumIdx] if billNumIdx != -1 else ''
price = txtList[priceIdx] if priceIdx != -1 else ''
expirDate = txtList[expirDateIdx] if expirDateIdx != -1 else ''
payment = txtList[paymentIdx] if paymentIdx != -1 else ''
If you are sure the key lines contain only what you are looking for ("FACTURA" and so on), you can replace the conditions with if line == "FACTURA":
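The difference between the two kinds of condition matters for invoices whose payment line itself contains the word, such as "DIA 5 FECHA FACTURA" in the output posted further down. A small Python 3 sketch comparing them; the `lines` list is hypothetical (labels in the order of the sample invoice, values from that later output), and `extract` and `LABELS` are illustrative names.

```python
# Hypothetical, shortened line list for one invoice.
lines = [
    "FACTURA",
    "20150481",
    "Banco:",
    "DIA 5 FECHA FACTURA",
    "Vencimientos:",
    "05/06/15",
    "91,79",
]

LABELS = {"FACTURA": "BillNumber", "Banco:": "paymentType"}

def extract(lines, exact):
    """Store the line following each label; `exact` selects == over `in`."""
    data = {}
    for i, line in enumerate(lines[:-1]):   # the last line has no successor
        for label, key in LABELS.items():
            hit = (line.strip() == label) if exact else (label in line)
            if hit:
                data[key] = lines[i + 1]
    return data

by_substring = extract(lines, exact=False)
by_equality = extract(lines, exact=True)
print(by_substring)   # BillNumber is overwritten by the line after the payment text
print(by_equality)
```

With the substring test, the payment line matches "FACTURA" a second time and the bill number ends up as "Vencimientos:"; with the equality test it stays "20150481".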
Answer 3 (score: 1)
Thanks for your help. I took code from the two examples you gave me, and now I can extract all the information I want.
# -*- coding: cp1252 -*-
# Python 2 code
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = file(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages,
                                  password=password, caching=caching,
                                  check_extractable=True):
        interpreter.process_page(page)
    # Read the buffer once, after all pages are processed
    text = retstr.getvalue()
    fp.close()
    device.close()
    retstr.close()
    return text

factura = "FA20150483.pdf"

# example 1: dictionary built from the non-empty lines
string = convert_pdf_to_txt(factura)
lines = list(filter(bool, string.split('\n')))
custData = {}
for i in range(len(lines)):
    if 'EMAIL:' in lines[i]:
        custData['Name'] = lines[i+1]
    elif 'FACTURA' in lines[i]:
        custData['BillNumber'] = lines[i+1]
    elif 'Vencimientos:' in lines[i]:
        custData['price'] = lines[i+2]
    elif 'Banco:' in lines[i]:
        custData['paymentType'] = lines[i+1]

# example 2: indexes into the full line list (blank lines kept)
txtList = convert_pdf_to_txt(factura).splitlines()
nameIdx, billNumIdx, priceIdx, expirDateIdx, paymentIdx = -1, -1, -1, -1, -1
for idx, line in enumerate(txtList):
    if line == "EMAIL: buendialogistica@gmail.com":
        nameIdx = idx + 2  # in your example it should be +2...
    if line == "FACTURA":
        billNumIdx = idx + 1
    if "Vencimientos:" in line:
        priceIdx = idx + 2
        expirDateIdx = idx + 1
    if "Banco:" in line:
        paymentIdx = idx + 1
name = txtList[nameIdx] if nameIdx != -1 else ''
billNum = txtList[billNumIdx] if billNumIdx != -1 else ''
price = txtList[priceIdx] if priceIdx != -1 else ''
expirDate = txtList[expirDateIdx] if expirDateIdx != -1 else ''
payment = txtList[paymentIdx] if paymentIdx != -1 else ''

# print the fields, stripping the stray "Â" characters left by the encoding
print expirDate
billNum = billNum.replace("Â Â ", "")
print billNum
custData['Name'] = custData['Name'].replace("Â", "")
print custData['Name']
custData['paymentType'] = custData['paymentType'].replace("Â", "")
print custData['paymentType']
print price
A few examples:
>>>
25/06/15
20150480
BABY RACE S.L.
REMESA DIA 25 FECHA FACTURA
15,23
>>> ================================ RESTART ================================
>>>
05/06/15
20150481
LOFT CUINA, S.L.
DIA 5 FECHA FACTURA
91,79
>>> ================================ RESTART ================================
>>>
05/06/15
20150482
GRAFIQUES MOGENT S.L.
DIA 5 FECHA FACTURA
128,42
>>> ================================ RESTART ================================
>>>
30/06/15
20150483
CHIEMIVALL SL
30 DIAS FECHA FACTURA
1.138,58
>>>
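For the Excel step mentioned in the question, one simple option is to write the fields to a CSV file, which Excel can open. A minimal Python 3 sketch using only the standard library; the `rows` are copied from the first and last outputs above, while `parse_amount`, the column names and the idea of converting prices to numbers are assumptions of mine.

```python
import csv
import io

def parse_amount(text):
    """Convert a Spanish-formatted amount such as '1.138,58' to a float."""
    return float(text.replace(".", "").replace(",", "."))

# Values taken from the example outputs above.
rows = [
    {"expirDate": "25/06/15", "billNum": "20150480", "name": "BABY RACE S.L.",
     "payment": "REMESA DIA 25 FECHA FACTURA", "price": "15,23"},
    {"expirDate": "30/06/15", "billNum": "20150483", "name": "CHIEMIVALL SL",
     "payment": "30 DIAS FECHA FACTURA", "price": "1.138,58"},
]

# An in-memory buffer keeps the sketch self-contained; to write a real file,
# use open("invoices.csv", "w", newline="") instead (hypothetical file name).
buffer = io.StringIO()
fieldnames = ["expirDate", "billNum", "name", "payment", "price"]
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
for row in rows:
    writer.writerow(dict(row, price=parse_amount(row["price"])))

print(buffer.getvalue())
```

Converting the price before writing means the thousands separator in "1.138,58" does not end up as text in the sheet; if you prefer to keep the original strings, skip parse_amount and write the rows as they are.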