How do I create a page range in tabula-py?

Asked: 2018-03-30 12:51:19

Tags: python pandas pdf range tabula

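In Python 3, I have a PDF file, "Ativos_Fevereiro_2018_servidores.pdf", with 6,041 pages. I'm on an Ubuntu machine. The file is available at: https://drive.google.com/file/d/1P8kF0gUOVls6sOGed4R0C2PlVF5RFtU6/view?usp=sharing

At the top of each page there are two lines of text. Below them is a table with a header row and two columns. Each table has 36 rows, except on the last page, which has fewer.

At the end of each page, after the table, there is one more line of text.

I want to create a CSV from this PDF using only the tables on each page, ignoring the text before and after each table.

To avoid Java memory errors, I thought I would split the file into groups of 300 pages. This is how I did it in tabula-py: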

import tabula
import pandas as pd


dfs = []

for i in range(1,6041, 300):
    if i != 1:
        i = i + 1

    i2 = i + 300

    if i2 > 6041:
        i2 = 6041

    print(i)
    print(i2)

    try:
        df = tabula.read_pdf("Ativos_Fevereiro_2018.pdf", encoding='latin-1', spreadsheet=True, pages='i-i2', header=0)
        dfs.append(df)
        print('Page ', len(df), ' parsed.')
    except:
        print('Error on page: ', i)

output = pd.concat(dfs)
output.to_csv('servidores_rj_ativos_fev_18.csv', encoding='utf-8', index=False)

But the page range I build is wrong:

1
301
Error: Syntax error in page range specification
Error on page:  1
302
602
...
Error: Syntax error in page range specification
Error on page:  5702
6002
6041
Error: Syntax error in page range specification
Error on page:  6002
Traceback (most recent call last):
  File "roboseguranca_pdftocsv.py", line 26, in <module>
    output = pd.concat(dfs)
  File "/home/reinaldo/Documentos/Code/intercept/seguranca/lib/python3.6/site-packages/pandas/core/reshape/concat.py", line 212, in concat
    copy=copy)
  File "/home/reinaldo/Documentos/Code/intercept/seguranca/lib/python3.6/site-packages/pandas/core/reshape/concat.py", line 245, in __init__
    raise ValueError('No objects to concatenate')
ValueError: No objects to concatenate

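How can I fix the range error?

1 Answer:

Answer 0 (score: 1)

The range must be passed as a string. In the question, pages='i-i2' passes the literal text i-i2 rather than the values of the variables, so convert the integers to strings and join them with '-':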

pages=(str(i)+'-'+str(i2))
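
As a minimal sketch of that fix applied to the whole file, the helper below builds non-overlapping range strings. `page_ranges` is a hypothetical name used for illustration, not part of tabula-py. It also sidesteps a side effect of the loop in the question, whose printed bounds (5702 to 6002, then 6002 to 6041) show a boundary page being requested twice.

```python
def page_ranges(total_pages, chunk_size):
    """Yield 'start-end' strings that cover pages 1..total_pages once each.

    Hypothetical helper for illustration; not a tabula-py function.
    """
    for start in range(1, total_pages + 1, chunk_size):
        # Clamp the last chunk so it never runs past the final page.
        end = min(start + chunk_size - 1, total_pages)
        yield str(start) + '-' + str(end)


# 6,041 pages in groups of 300, as described in the question.
ranges = list(page_ranges(6041, 300))
print(ranges[0], ranges[1], ranges[-1])
```

Each string can then be passed as the pages argument of tabula.read_pdf inside the loop.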

A few other things:

  • Use encoding='utf-8' in the tabula.read_pdf call.
  • If you also want to see the error that was raised, extend the except clause, for example:

except Exception as e:
    print('Error in range ', i, '-', i2, ': ', e)
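
To illustrate the error-reporting advice, here is a rough sketch that records which ranges failed instead of silently dropping them. The names `read_in_chunks` and `fake_reader` are assumptions made for this example. `fake_reader` only stands in for a real reader so the sketch runs without a PDF or Java installed.

```python
def read_in_chunks(read_chunk, ranges):
    """Call read_chunk(pages) for each range and return (results, failures).

    read_chunk is any callable that takes a page-range string; both names
    are hypothetical and chosen for this sketch.
    """
    results, failures = [], []
    for pages in ranges:
        try:
            results.append(read_chunk(pages))
        except Exception as e:  # broad on purpose: report it and keep going
            print('Error in range', pages, ':', e)
            failures.append((pages, str(e)))
    return results, failures


def fake_reader(pages):
    """Stand-in reader: parses 'start-end' and fails on non-numeric input."""
    start, end = pages.split('-')
    return (int(start), int(end))


# The literal 'i-i2' from the question fails; the valid ranges still succeed.
results, failures = read_in_chunks(fake_reader, ['1-300', 'i-i2', '301-600'])
```

With tabula-py, read_chunk could be something like `lambda pages: tabula.read_pdf(path, pages=pages)`, where path is your PDF file. Checking that results is non-empty before calling pd.concat avoids the "No objects to concatenate" error shown in the question's traceback.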