Reading table data from a PDF file

Time: 2018-04-04 09:41:03

Tags: python file pdf

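I am trying to read numeric data from a table as a string in Python. (I have tried many different ways of converting the table to CSV, Excel, etc., but nothing seems to work perfectly, so I want to try the string approach.) Each row looks like this: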

"ebit 34 894 38 445 28 013 26 356 12 387 -8 680 -2 760 838"

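There are 8 columns here. The last number on the right, 838, belongs to one column, -2 760 belongs to one column, 12 387 belongs to one column, and so on. Does anyone know how to work out which digits belong to which column?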

1 answer:

Answer 0 (score: 1)

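It's hard to solve this without access to your actual data, but basically you need to parse the PDF table with something other than copy and paste, because copying makes the spacing between columns indistinguishable from the spaces used as thousands separators.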
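To see why the copied string is ambiguous, here is a small sketch using the row from the question. Splitting on whitespace yields 15 numeric tokens for 8 columns, and nothing in the string itself says how they should be grouped.

```python
# Row copied from the question: spaces separate columns AND thousands groups.
row = "ebit 34 894 38 445 28 013 26 356 12 387 -8 680 -2 760 838"

label, *tokens = row.split()

# 15 tokens for 8 columns: the grouping cannot be recovered from the text alone.
print(label, len(tokens))
print(tokens)
```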

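First, I'd suggest using something like Xpdf tools, a set of command-line utilities for parsing PDF documents. One of the utilities is called pdftotext.exe, and I tested it on a sample PDF file named intrum_q317_presentation.pdf.

For example, to extract the table on page 17 of this document, you can run this command: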

"C:\Program Files\xpdf-tools-win-4.00\bin64\pdftotext.exe" -table -f 17 -l 17 intrum_q317_presentation.pdf parsed_output.txt
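If you would rather drive this step from Python, the same command can be wrapped with subprocess. This is only a sketch: it assumes pdftotext is on your PATH (otherwise pass the full path to the executable as `exe`), and the helper names build_pdftotext_command and extract_page are made up for illustration.

```python
import subprocess

def build_pdftotext_command(pdf_path, txt_path, page, exe="pdftotext"):
    """Build the same single-page command line as the one shown above.

    `exe` is an assumption: pass the full path to pdftotext.exe
    if the tool is not on your PATH.
    """
    return [exe, "-table", "-f", str(page), "-l", str(page), pdf_path, txt_path]

def extract_page(pdf_path, txt_path, page, exe="pdftotext"):
    # check=True raises CalledProcessError if pdftotext exits with an error
    subprocess.run(build_pdftotext_command(pdf_path, txt_path, page, exe), check=True)

# Example call (not executed here):
# extract_page("intrum_q317_presentation.pdf", "parsed_output.txt", 17)
```

Passing the arguments as a list avoids the quoting problems that the space in "Program Files" causes on the command line.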

This produces the following output (in parsed_output.txt):

Cash flow statement

                                                                  Q3   Q3    Dev    YTD     YTD     Dev

SEK M                                                         2017     2016  %      2017    2016    %

Operating earnings (EBIT)                                         977  506   93     1 921   1 379   39

Depreciation                                                      163  40    308    245     120     104

Amortization and revaluation of purchased debt                    866  389   123    1 845   1 137   62

Income tax paid                                                   -97  -33   194    -283    -187    51

Changes in factoring receivables                                  7    -25   -128   -39     -45     -13

Other changes in working capital                                  5    -60   -108   -8      -119    n/a

Financial net & other non-cash items                          -125     -6    1983   -486    -74     557

Cash flow from operating activities (CFFO)                    1 796    811   121    3 195   2 211   45

Purchases of tangible and intangible fixed assets (CAPEX)         -38  -33   15     -115    -103    12

Purchases of debt                                             -1  124  -732  54     -4 317  -2 188  97

Purchases of shares in subsidiaries and associated companies      -2   -1    100    -171    -89     92

Liquid assets in acquired subsidiaries                            0    0            975     1

Other cash flow form investing activities                         -1   2     -150   -2      6       -133

Cash flow from investing activities (CFFI)                    -1  165  -764  52     -3 630  -2 373  53

Cash flow from investing activities (CFFI)

excl liquid assets in acquired subsidiaries                   -1  165  -764  52     -4 605  -2 374  94

Free cash flow (CFFO - CFFI)                                      631  47    1 243  -435    -167    160

Free cash flow (CFFO - CFFI) excl liquid

assets in acquired subsidiaries                                   631  47    1 243  -1 410  -168    739

                                                                                                17

You can see that this is very similar to your string, but with wider gaps between the columns.
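Those wider gaps are what make the columns recoverable. As a quick illustration, splitting on runs of two or more spaces works for a fully populated row but silently drops empty cells, which is why a position-based approach is more robust. The two sample lines are taken from the output above.

```python
import re

full = "Operating earnings (EBIT)                                         977  506   93     1 921   1 379   39"

# Split only on runs of two or more spaces, so "1 921" stays in one piece.
full_cells = re.split(r"\s{2,}", full.strip())
print(full_cells)
# ['Operating earnings (EBIT)', '977', '506', '93', '1 921', '1 379', '39']

# Caveat: a row with blank cells yields fewer pieces, so the columns shift.
sparse = "Liquid assets in acquired subsidiaries                            0    0            975     1"
sparse_cells = re.split(r"\s{2,}", sparse.strip())
print(sparse_cells)
# ['Liquid assets in acquired subsidiaries', '0', '0', '975', '1']
```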

We can then use some Python to parse it into a two-dimensional array:

from tabulate import tabulate
import re

template = ''

with open('C:\\parsed_output.txt') as f:
    # drop blank lines, and strip the trailing newline so it is not counted as cell content
    raw_lines = [line.rstrip('\n') for line in f if line.strip() != '']
    lines = raw_lines[1:-1] # ignore first and last lines
    for raw_line in lines:
        length = max([len(template), len(raw_line)])
        old_template = template.ljust(length)
        line = raw_line.ljust(length)
        template = ''
        for i in range(0,length):
            template += ' ' if (old_template[i]==' ' and line[i]==' ') else 'X'

# try to work out the column widths, based on alignment of spaces:
column_widths = [len(x) for x in template.split()]
column_count = len(column_widths)
column_starts = [0]
start = 0
for i in range(1, column_count):
    start = template.find(' X',start) + 1
    column_starts.append(start)
column_starts.append(len(template)) # add final value to terminate right-most column

# now divide up each line using our column widths
rows=[]
for raw_line in lines:
    line = raw_line.ljust(len(template))
    row=[]
    for i in range(0, column_count):
        value = line[column_starts[i]:column_starts[i+1]].strip()
        if i>0: value = re.sub(r'\s+', '', value) # remove thousands-separator spaces (raw string avoids an invalid-escape warning)
        row.append(value)
    rows.append(row)

print(tabulate(rows, tablefmt='grid'))

...which gives the following result:

+--------------------------------------------------------------+-------+------+------+-------+-------+------+
|                                                              | Q3    | Q3   | Dev  | YTD   | YTD   | Dev  |
+--------------------------------------------------------------+-------+------+------+-------+-------+------+
| SEK M                                                        | 2017  | 2016 | %    | 2017  | 2016  | %    |
+--------------------------------------------------------------+-------+------+------+-------+-------+------+
| Operating earnings (EBIT)                                    | 977   | 506  | 93   | 1921  | 1379  | 39   |
+--------------------------------------------------------------+-------+------+------+-------+-------+------+
| Depreciation                                                 | 163   | 40   | 308  | 245   | 120   | 104  |
+--------------------------------------------------------------+-------+------+------+-------+-------+------+
| Amortization and revaluation of purchased debt               | 866   | 389  | 123  | 1845  | 1137  | 62   |
+--------------------------------------------------------------+-------+------+------+-------+-------+------+
| Income tax paid                                              | -97   | -33  | 194  | -283  | -187  | 51   |
+--------------------------------------------------------------+-------+------+------+-------+-------+------+
| Changes in factoring receivables                             | 7     | -25  | -128 | -39   | -45   | -13  |
+--------------------------------------------------------------+-------+------+------+-------+-------+------+
| Other changes in working capital                             | 5     | -60  | -108 | -8    | -119  | n/a  |
+--------------------------------------------------------------+-------+------+------+-------+-------+------+
| Financial net & other non-cash items                         | -125  | -6   | 1983 | -486  | -74   | 557  |
+--------------------------------------------------------------+-------+------+------+-------+-------+------+
| Cash flow from operating activities (CFFO)                   | 1796  | 811  | 121  | 3195  | 2211  | 45   |
+--------------------------------------------------------------+-------+------+------+-------+-------+------+
| Purchases of tangible and intangible fixed assets (CAPEX)    | -38   | -33  | 15   | -115  | -103  | 12   |
+--------------------------------------------------------------+-------+------+------+-------+-------+------+
| Purchases of debt                                            | -1124 | -732 | 54   | -4317 | -2188 | 97   |
+--------------------------------------------------------------+-------+------+------+-------+-------+------+
| Purchases of shares in subsidiaries and associated companies | -2    | -1   | 100  | -171  | -89   | 92   |
+--------------------------------------------------------------+-------+------+------+-------+-------+------+
| Liquid assets in acquired subsidiaries                       | 0     | 0    |      | 975   | 1     |      |
+--------------------------------------------------------------+-------+------+------+-------+-------+------+
| Other cash flow form investing activities                    | -1    | 2    | -150 | -2    | 6     | -133 |
+--------------------------------------------------------------+-------+------+------+-------+-------+------+
| Cash flow from investing activities (CFFI)                   | -1165 | -764 | 52   | -3630 | -2373 | 53   |
+--------------------------------------------------------------+-------+------+------+-------+-------+------+
| Cash flow from investing activities (CFFI)                   |       |      |      |       |       |      |
+--------------------------------------------------------------+-------+------+------+-------+-------+------+
| excl liquid assets in acquired subsidiaries                  | -1165 | -764 | 52   | -4605 | -2374 | 94   |
+--------------------------------------------------------------+-------+------+------+-------+-------+------+
| Free cash flow (CFFO - CFFI)                                 | 631   | 47   | 1243 | -435  | -167  | 160  |
+--------------------------------------------------------------+-------+------+------+-------+-------+------+
| Free cash flow (CFFO - CFFI) excl liquid                     |       |      |      |       |       |      |
+--------------------------------------------------------------+-------+------+------+-------+-------+------+
| assets in acquired subsidiaries                              | 631   | 47   | 1243 | -1410 | -168  | 739  |
+--------------------------------------------------------------+-------+------+------+-------+-------+------+
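The cells in this grid are still strings. If you need actual numbers, a small conversion step could follow. The helper name to_number is made up for illustration, and treating blank cells and 'n/a' as None is an assumption about how you want to handle missing values.

```python
def to_number(cell):
    """Convert a cleaned cell such as '1921' or '-4317' to int.

    Returns None for blank cells and for non-numeric markers like 'n/a'.
    """
    try:
        return int(cell)
    except ValueError:
        return None

# One row from the grid above: label first, then six value cells.
row = ['Other changes in working capital', '5', '-60', '-108', '-8', '-119', 'n/a']
values = [to_number(cell) for cell in row[1:]]
print(values)  # [5, -60, -108, -8, -119, None]
```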

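Of course, this approach isn't perfect (for example, 'Q3 2017' should be in a single cell), and it isn't guaranteed to work with your exact data (you may need to adjust the column widths manually, for instance), but it should get you started.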
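If the automatic detection guesses wrong for your file, one option is to hard-code the column boundaries instead. The following is a sketch under stated assumptions: the helper slice_row, the sample line, and the offsets are all made up for illustration, and you would read the real offsets off your own pdftotext output.

```python
import re

def slice_row(line, starts):
    """Cut `line` at the given character offsets and clean each cell.

    `starts` holds the start offset of every column; the last column
    runs to the end of the line.
    """
    bounds = list(starts) + [None]
    cells = [line[bounds[i]:bounds[i + 1]].strip() for i in range(len(starts))]
    # Drop thousands-separator spaces in every column except the label.
    return [cells[0]] + [re.sub(r"\s+", "", c) for c in cells[1:]]

# Made-up fixed-width line: label at offset 0, value columns at offsets 12 and 20.
line = "EBIT".ljust(12) + "1 921".ljust(8) + "1 379"
print(slice_row(line, [0, 12, 20]))  # ['EBIT', '1921', '1379']
```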