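I am getting the following error while trying to read a PDF file using tabula (tabula-py).
Is there any way to read a PDF in Python, with pandas or some other library?
Please suggest.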
>>> from tabula import read_pdf
>>> df = read_pdf('OpTransactionHistory28-08-2018.pdf')
Aug 29, 2018 10:40:27 AM org.apache.pdfbox.pdmodel.font.FileSystemFontProvider loadDiskCache
WARNING: New fonts found, font cache will be re-built
Aug 29, 2018 10:40:27 AM org.apache.pdfbox.pdmodel.font.FileSystemFontProvider <init>
WARNING: Building on-disk font cache, this may take a while
Aug 29, 2018 10:40:32 AM org.apache.pdfbox.pdmodel.font.FileSystemFontProvider <init>
WARNING: Finished building on-disk font cache, found 328 fonts
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/karn/.local/lib/python3.6/site-packages/tabula/wrapper.py", line 119, in read_pdf
    return pd.read_csv(io.BytesIO(output), **pandas_options)
  File "/home/karn/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 678, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/karn/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 446, in _read
    data = parser.read(nrows)
  File "/home/karn/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 1036, in read
    ret = self._engine.read(nrows)
  File "/home/karn/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 1848, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 876, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 891, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 945, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 932, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2112, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 8 fields in line 4, saw 9
One workaround is to convert the file with pdftotext.
$ pdftotext OpTransactionHistory28-08-2018.pdf
I just looked at the link that @ace provided and found the relevant options:
>>> from tabula import read_pdf
>>> df = read_pdf('OpTransactionHistory28-08-2018.pdf', pages='all', encoding='ISO-8859-1', multiple_tables=True)
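With multiple_tables=True, tabula-py returns a list of DataFrames rather than a single one, so the result has to be iterated. The following is only a minimal sketch of inspecting such a list: describe_tables is a hypothetical helper, and the hand-built sample frames stand in for real read_pdf output.

```python
import pandas as pd


def describe_tables(tables):
    """Return (index, row count, column count) for each DataFrame in a list.

    Hypothetical helper, not part of tabula-py.
    """
    return [(i, df.shape[0], df.shape[1]) for i, df in enumerate(tables)]


# Stand-in for the list returned by read_pdf(..., multiple_tables=True);
# the two frames deliberately have different column counts.
tables = [
    pd.DataFrame([[1, 2, 3], [4, 5, 6]]),
    pd.DataFrame([[7, 8, 9, 10]]),
]

for index, rows, cols in describe_tables(tables):
    print(f"table {index}: {rows} rows x {cols} columns")
```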
Answer 0: (score: 1)
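An error in the pandas layer is usually caused by tables having different numbers of columns, because pandas tries to build a single DataFrame from the tabula-java output. Using multiple_tables=True avoids this limitation, since the tables then have explicit boundaries in the output and each one is parsed separately.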
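The same kind of ParserError can be reproduced with pandas alone, which may help confirm the diagnosis. This is only a sketch: the CSV strings below are made-up stand-ins for tabula-java output, not data from the PDF in the question.

```python
import io

import pandas as pd

# Made-up CSV text: a 3-column table followed directly by a 4-column row,
# roughly what happens when two differently shaped tables are concatenated.
combined_text = (
    "a,b,c\n"
    "1,2,3\n"
    "4,5,6\n"
    "7,8,9,10\n"
)

try:
    pd.read_csv(io.StringIO(combined_text))
    error_message = None
except pd.errors.ParserError as exc:
    # pandas cannot fit the wider row into the single DataFrame it is building
    error_message = str(exc)

print(error_message)

# Parsing each table on its own avoids the mismatch, which is the idea
# behind asking for multiple tables instead of one DataFrame.
separate_texts = [
    "a,b,c\n1,2,3\n4,5,6\n",
    "w,x,y,z\n7,8,9,10\n",
]
frames = [pd.read_csv(io.StringIO(text)) for text in separate_texts]
print([frame.shape for frame in frames])
```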
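I have also noticed this related error, but it looks different from what I have seen: https://github.com/chezou/tabula-py#i-faced-cparsererror-how-can-i-extract-multiple-tables
It would be appreciated if you could provide your pandas version.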