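I'm trying to read numeric data from a table as strings in Python. (I've tried many different ways of converting the table to CSV, Excel, etc., but nothing seems to work perfectly, so I'd like to try the string approach.) Each row looks like this:
"ebit 34 894 38 445 28 013 26 356 12 387 -8 680 -2 760 838"
There are 8 columns here. The last number on the right, 838, belongs to one column, -2 760 belongs to one column, 12 387 belongs to one column, and so on. Does anyone know how to work out which numbers belong to which column?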
Answer 0 (score: 1)
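It's hard to solve this without access to your actual data, but basically you need to parse the PDF table some way other than copy-and-paste, because copying makes the spaces between columns indistinguishable from the spaces used as thousands separators.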
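To see the ambiguity concretely, here is a minimal sketch (not part of the original answer) that enumerates every way the sample row could be split into 8 columns. It assumes a token can continue the previous number only if it is exactly three unsigned digits, i.e. it looks like a thousands group; the function name `candidate_groupings` is made up for illustration.

```python
import re
from itertools import product


def candidate_groupings(tokens, n_columns):
    """Return every way to merge tokens into exactly n_columns numbers.

    Assumption: a token may be glued onto the previous number only if it
    is exactly three digits with no sign (a plausible thousands group).
    """
    optional = [i for i, t in enumerate(tokens)
                if i > 0 and re.fullmatch(r"\d{3}", t)]
    results = []
    # Try every combination of "glue to previous" / "start a new number".
    for choice in product([True, False], repeat=len(optional)):
        joins = {i for i, join in zip(optional, choice) if join}
        columns = []
        for i, tok in enumerate(tokens):
            if i in joins:
                columns[-1] += tok
            else:
                columns.append(tok)
        if len(columns) == n_columns:
            results.append(columns)
    return results


line = "ebit 34 894 38 445 28 013 26 356 12 387 -8 680 -2 760 838"
label, *tokens = line.split()
groupings = candidate_groupings(tokens, 8)
print(len(groupings))
for grouping in groupings:
    print(grouping)
```

Under this rule the sample row has 8 equally valid readings, only one of which is the intended one, so the flattened string alone cannot tell you where the column boundaries are.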
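First, I'd suggest using something like Xpdf tools, a set of command-line utilities for parsing PDF documents. One of these utilities is called pdftotext.exe, which I tested on a document named intrum_q317_presentation.pdf.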
For example, to extract the table on page 17 of this document, you can run this command:
"C:\Program Files\xpdf-tools-win-4.00\bin64\pdftotext.exe" -table -f 17 -l 17 intrum_q317_presentation.pdf parsed_output.txt
This produces the following output (in parsed_output.txt):
Cash flow statement
                                                                  Q3      Q3     Dev     YTD     YTD     Dev
SEK M                                                           2017    2016       %    2017    2016       %
Operating earnings (EBIT)                                        977     506      93   1 921   1 379      39
Depreciation                                                     163      40     308     245     120     104
Amortization and revaluation of purchased debt                   866     389     123   1 845   1 137      62
Income tax paid                                                  -97     -33     194    -283    -187      51
Changes in factoring receivables                                   7     -25    -128     -39     -45     -13
Other changes in working capital                                   5     -60    -108      -8    -119     n/a
Financial net & other non-cash items                            -125      -6    1983    -486     -74     557
Cash flow from operating activities (CFFO)                     1 796     811     121   3 195   2 211      45
Purchases of tangible and intangible fixed assets (CAPEX)        -38     -33      15    -115    -103      12
Purchases of debt                                             -1 124    -732      54  -4 317  -2 188      97
Purchases of shares in subsidiaries and associated companies      -2      -1     100    -171     -89      92
Liquid assets in acquired subsidiaries                             0       0             975       1
Other cash flow form investing activities                         -1       2    -150      -2       6    -133
Cash flow from investing activities (CFFI)                    -1 165    -764      52  -3 630  -2 373      53
Cash flow from investing activities (CFFI)
excl liquid assets in acquired subsidiaries                   -1 165    -764      52  -4 605  -2 374      94
Free cash flow (CFFO - CFFI)                                     631      47   1 243    -435    -167     160
Free cash flow (CFFO - CFFI) excl liquid
assets in acquired subsidiaries                                  631      47   1 243  -1 410    -168     739
17
You can see that this is very similar to your string, but with more spacing between the columns.
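If you would rather drive the extraction from Python, the same command can be wrapped with `subprocess`. This is only a sketch: the helper names are hypothetical, the flags are the ones from the command above, and it assumes the Xpdf build of pdftotext is on your PATH (otherwise pass the full path as `exe`). Other pdftotext builds may not accept `-table`.

```python
import subprocess


def build_pdftotext_command(pdf_path, txt_path, page, exe="pdftotext"):
    """Build the argument list for extracting one page in table mode.

    `exe` is an assumption: it must point at the Xpdf pdftotext binary,
    for example the full path under "C:\\Program Files" on Windows.
    """
    return [exe, "-table", "-f", str(page), "-l", str(page),
            pdf_path, txt_path]


def extract_page(pdf_path, txt_path, page, exe="pdftotext"):
    """Run pdftotext; raises CalledProcessError if it exits with an error."""
    command = build_pdftotext_command(pdf_path, txt_path, page, exe)
    subprocess.run(command, check=True)


# Example call (not executed here, since it needs the PDF and the tool):
# extract_page("intrum_q317_presentation.pdf", "parsed_output.txt", 17)
```

Passing the arguments as a list avoids quoting problems with paths that contain spaces, such as the Program Files path above.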
We can then use some Python to parse this into a two-dimensional array:
import re

from tabulate import tabulate

# Adjust this path to wherever pdftotext wrote its output.
with open('C:\\parsed_output.txt') as f:
    # Drop blank lines and strip trailing newlines so they cannot leak into the template.
    raw_lines = [line.rstrip() for line in f if line.strip() != '']
lines = raw_lines[1:-1]  # ignore first (title) and last (page number) lines

# Build a template that has 'X' wherever any line has a non-space character
# and ' ' only where every line has a space.
template = ''
for raw_line in lines:
    length = max(len(template), len(raw_line))
    old_template = template.ljust(length)
    line = raw_line.ljust(length)
    template = ''.join(
        ' ' if (old_template[i] == ' ' and line[i] == ' ') else 'X'
        for i in range(length)
    )

# Try to work out the columns, based on alignment of spaces:
# each run of 'X' in the template is treated as one column.
column_count = len(template.split())
column_starts = [0]
start = 0
for i in range(1, column_count):
    start = template.find(' X', start) + 1
    column_starts.append(start)
column_starts.append(len(template))  # add final value to terminate right-most column

# Now divide up each line using the column boundaries.
rows = []
for raw_line in lines:
    line = raw_line.ljust(len(template))
    row = []
    for i in range(column_count):
        value = line[column_starts[i]:column_starts[i + 1]].strip()
        if i > 0:
            # Remove thousands-separator spaces inside the numeric cells.
            value = re.sub(r'\s+', '', value)
        row.append(value)
    rows.append(row)

print(tabulate(rows, tablefmt='grid'))
...which gives the following result:
+--------------------------------------------------------------+-------+------+------+-------+-------+------+
| | Q3 | Q3 | Dev | YTD | YTD | Dev |
+--------------------------------------------------------------+-------+------+------+-------+-------+------+
| SEK M | 2017 | 2016 | % | 2017 | 2016 | % |
+--------------------------------------------------------------+-------+------+------+-------+-------+------+
| Operating earnings (EBIT) | 977 | 506 | 93 | 1921 | 1379 | 39 |
+--------------------------------------------------------------+-------+------+------+-------+-------+------+
| Depreciation | 163 | 40 | 308 | 245 | 120 | 104 |
+--------------------------------------------------------------+-------+------+------+-------+-------+------+
| Amortization and revaluation of purchased debt | 866 | 389 | 123 | 1845 | 1137 | 62 |
+--------------------------------------------------------------+-------+------+------+-------+-------+------+
| Income tax paid | -97 | -33 | 194 | -283 | -187 | 51 |
+--------------------------------------------------------------+-------+------+------+-------+-------+------+
| Changes in factoring receivables | 7 | -25 | -128 | -39 | -45 | -13 |
+--------------------------------------------------------------+-------+------+------+-------+-------+------+
| Other changes in working capital | 5 | -60 | -108 | -8 | -119 | n/a |
+--------------------------------------------------------------+-------+------+------+-------+-------+------+
| Financial net & other non-cash items | -125 | -6 | 1983 | -486 | -74 | 557 |
+--------------------------------------------------------------+-------+------+------+-------+-------+------+
| Cash flow from operating activities (CFFO) | 1796 | 811 | 121 | 3195 | 2211 | 45 |
+--------------------------------------------------------------+-------+------+------+-------+-------+------+
| Purchases of tangible and intangible fixed assets (CAPEX) | -38 | -33 | 15 | -115 | -103 | 12 |
+--------------------------------------------------------------+-------+------+------+-------+-------+------+
| Purchases of debt | -1124 | -732 | 54 | -4317 | -2188 | 97 |
+--------------------------------------------------------------+-------+------+------+-------+-------+------+
| Purchases of shares in subsidiaries and associated companies | -2 | -1 | 100 | -171 | -89 | 92 |
+--------------------------------------------------------------+-------+------+------+-------+-------+------+
| Liquid assets in acquired subsidiaries | 0 | 0 | | 975 | 1 | |
+--------------------------------------------------------------+-------+------+------+-------+-------+------+
| Other cash flow form investing activities | -1 | 2 | -150 | -2 | 6 | -133 |
+--------------------------------------------------------------+-------+------+------+-------+-------+------+
| Cash flow from investing activities (CFFI) | -1165 | -764 | 52 | -3630 | -2373 | 53 |
+--------------------------------------------------------------+-------+------+------+-------+-------+------+
| Cash flow from investing activities (CFFI) | | | | | | |
+--------------------------------------------------------------+-------+------+------+-------+-------+------+
| excl liquid assets in acquired subsidiaries | -1165 | -764 | 52 | -4605 | -2374 | 94 |
+--------------------------------------------------------------+-------+------+------+-------+-------+------+
| Free cash flow (CFFO - CFFI) | 631 | 47 | 1243 | -435 | -167 | 160 |
+--------------------------------------------------------------+-------+------+------+-------+-------+------+
| Free cash flow (CFFO - CFFI) excl liquid | | | | | | |
+--------------------------------------------------------------+-------+------+------+-------+-------+------+
| assets in acquired subsidiaries | 631 | 47 | 1243 | -1410 | -168 | 739 |
+--------------------------------------------------------------+-------+------+------+-------+-------+------+
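Of course, it isn't perfect (for example, 'Q3 2017' should be in a single cell), and it isn't guaranteed to work with your exact data (for example, you may need to adjust the column widths manually), but it should get you started.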
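As one possible follow-up for the imperfections mentioned above, the sketch below merges the two header rows, joins labels that were wrapped onto two lines, and converts the cleaned cells to integers. The helper names `to_number` and `tidy` are hypothetical, and the sketch assumes a label-only row always belongs to the row that follows it, which holds for this table but may not for yours.

```python
def to_number(cell):
    """Convert a cleaned cell such as '-1165' to int; None for '' or 'n/a'."""
    try:
        return int(cell)
    except ValueError:
        return None


def tidy(rows, header_rows=2):
    """Merge header rows, join wrapped labels, and convert values to int."""
    # Join the header cells column by column, skipping empty parts.
    header = [" ".join(filter(None, parts)) for parts in zip(*rows[:header_rows])]
    data = []
    pending_label = ""
    for label, *cells in rows[header_rows:]:
        if not any(cells):
            # Assumption: a row with only a label is the first half
            # of a label that wrapped onto the next line.
            pending_label = label
            continue
        full_label = f"{pending_label} {label}".strip()
        data.append([full_label] + [to_number(c) for c in cells])
        pending_label = ""
    return header, data


# A few rows copied from the grid above, in the shape the parser produces.
rows = [
    ["", "Q3", "Q3", "Dev", "YTD", "YTD", "Dev"],
    ["SEK M", "2017", "2016", "%", "2017", "2016", "%"],
    ["Operating earnings (EBIT)", "977", "506", "93", "1921", "1379", "39"],
    ["Other changes in working capital", "5", "-60", "-108", "-8", "-119", "n/a"],
    ["Free cash flow (CFFO - CFFI) excl liquid", "", "", "", "", "", ""],
    ["assets in acquired subsidiaries", "631", "47", "1243", "-1410", "-168", "739"],
]
header, data = tidy(rows)
print(header)
for row in data:
    print(row)
```

Cells that are empty or contain 'n/a' become `None` rather than 0, so missing values stay distinguishable from real zeros.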