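Parsing financial statements with a regular expression

Asked: 2018-12-08 03:56:59

Tags: python regex python-3.x

I am working on a regular expression that returns text matching a specific pattern as groups. This is the regex I am using: r"([\w+ \-? \w+]* [\w+ ]+ [\(?\w+ \)?]*) (\(?[\d,-]+\)?) (\(?[\d,-]+\)?)". Here are sample lines I am parsing, along with the output I want for each: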

1) String: LOSS BEFORE INCOME TAXES (900,000) (900,000)
Desired output: [('LOSS BEFORE INCOME TAXES', '(900,000)', '(900,000)')]
Final result: correct 

2) String: INCOME TAXES (RECOVERED) (90,000) (90,000)
Desired output: [('INCOME TAXES (RECOVERED)', '(90,000)', '(90,000)')]
Final result: correct

3) String: RETAINED EARNINGS - BEGINNING OF YEAR 9,999,999 9,999,999
Desired output: [('RETAINED EARNINGS - BEGINNING OF YEAR', '9,999,999', '9,999,999')]
Final result: correct

4) String: EXPENSES
Desired output: ['EXPENSES']
Final result: correct

5) String: Subcontracts 8,058 2,655
Desired output: [('Subcontracts', '8,000,000')]
Final result: ['Subcontracts 8', '', '058 2', '', '655', '']

6) String: Business taxes 116 -
Desired output: [('Business taxes', '116', '-')]
Final result: ['Business taxes 116 ', '', '']

7) String: 600,000 600,000
Desired output: [(600,000), (600,000)]
Final result: ['642', '', '437 629', '', '070', '']

8) String: Salaries, wages and benefits 400,000 400,000
Desired output: [('Salaries, wages and benefits', '400,000', '400,000')]
Final result: [(' wages and benefits', '463,437', '466,742')]

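I'm not sure what I'm doing wrong or what I'm missing, but cases 5, 6, 7 and 8 fail. How can I adjust the regex above so that it handles all of these cases? Thanks in advance!

3 Answers:

Answer 0 (score: 1)

You can try this, mate: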

^([a-z, \(\)-]*?)?\(?([\d,]+)?\)?\s*?\(?([\d,-]+)?\)?$
  

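Explanation

  • ^ - anchors the match at the start of the string.
  • ([a-z, \(\)-]*?)? - matches any character from a to z, a comma, a space, (, ) or -, zero or more times (lazy mode).
  • \(? - matches ( (the ? makes it optional).
  • ([\d,]+)? - matches one or more digits or commas (the ? makes the group optional).
  • \)? - matches ) (the ? makes it optional).
  • \s*? - matches zero or more whitespace characters (lazy mode).
  • \(?([\d,-]+)?\)? - matches one or more digits, commas or -, optionally wrapped in parentheses.
  • $ - end of the string.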
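
The pattern can be tried against the sample lines from the question with a few lines of Python. This is only a minimal sketch, not part of the original answer: it assumes the pattern is meant to be used with the case-insensitive flag (the character class lists only a-z, while most sample lines are upper-case), and `parse_line` is a hypothetical helper name. Note that the parentheses around the amounts sit outside the capture groups, so they are not part of the captured values, and that the description is captured with its trailing space, which the helper strips.

```python
import re

# Pattern from the answer above. re.IGNORECASE is an assumption, added
# because the character class only lists lower-case letters.
PATTERN = re.compile(
    r"^([a-z, \(\)-]*?)?\(?([\d,]+)?\)?\s*?\(?([\d,-]+)?\)?$",
    re.IGNORECASE,
)


def parse_line(line):
    """Return (description, first_amount, second_amount), or None if no match.

    Groups that did not take part in the match come back as None.
    """
    m = PATTERN.match(line)
    if m is None:
        return None
    description, first, second = m.groups()
    # The lazy description group keeps the space before the first amount.
    return ((description or "").strip(), first, second)


samples = [
    "LOSS BEFORE INCOME TAXES (900,000) (900,000)",
    "INCOME TAXES (RECOVERED) (90,000) (90,000)",
    "RETAINED EARNINGS - BEGINNING OF YEAR 9,999,999 9,999,999",
    "EXPENSES",
    "Subcontracts 8,058 2,655",
    "Business taxes 116 -",
    "600,000 600,000",
    "Salaries, wages and benefits 400,000 400,000",
]

for line in samples:
    print(parse_line(line))
```

Because the parentheses are dropped, a caller that needs to tell negative amounts from positive ones would have to move `\(?` and `\)?` inside the capture groups.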

Demo

Answer 1 (score: 1)

I think this regex will meet your requirements:

^([A-Z][A-Za-z0-9 (),%;-]+?[^(\d\s])? ?(?:(\(?[\d,]+\)?|-)\s+(\(?[\d,]+\)?|-))?$

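It looks for a run of characters that starts with an upper-case letter, may include any of the characters (),%;- and does not end with (, a digit or whitespace, followed by two optional amounts, each made of digits and commas, optionally wrapped in (), or a lone -. All groups are optional, so that lines with no description or no numbers still match.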

In Python:

import re

# Sample lines from the question, plus two extra edge cases
data = """LOSS BEFORE INCOME TAXES (900,000) (900,000)
INCOME TAXES (RECOVERED) (90,000) (90,000)
RETAINED EARNINGS - BEGINNING OF YEAR 9,999,999 9,999,999
EXPENSES
Subcontracts 8,058 2,655
Business taxes 116 -
600,000 600,000
GROSS PROFIT (50%; 2016 - 50%) 500,000 500,000
Bad debts - 50
Salaries, wages and benefits 400,000 400,000"""
# Raw string so the backslashes reach the regex engine unchanged;
# re.MULTILINE makes ^ and $ match at every line boundary.
regex = re.compile(r'^([A-Z][A-Za-z0-9 (),%;-]+?[^(\d\s])? ?(?:(\(?[\d,]+\)?|-)\s+(\(?[\d,]+\)?|-))?$', re.MULTILINE)
print(regex.findall(data))

Demo on rextester
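
When the statement is processed one line at a time, the same pattern can be applied with `match` instead of `findall`. The following is a minimal sketch under that assumption; `LINE_RE` and `parse_line` are hypothetical names, not part of the answer above. One practical difference: `findall` reports a group that did not take part in the match as an empty string, while `match.groups()` reports it as `None`.

```python
import re

# Same pattern as in the answer above, split across two raw strings for
# readability. No re.MULTILINE here because each line is matched on its own.
LINE_RE = re.compile(
    r"^([A-Z][A-Za-z0-9 (),%;-]+?[^(\d\s])? ?"
    r"(?:(\(?[\d,]+\)?|-)\s+(\(?[\d,]+\)?|-))?$"
)


def parse_line(line):
    """Return (description, first_amount, second_amount), or None if no match."""
    m = LINE_RE.match(line)
    return m.groups() if m else None


for line in [
    "INCOME TAXES (RECOVERED) (90,000) (90,000)",
    "EXPENSES",
    "600,000 600,000",
    "Bad debts - 50",
]:
    print(parse_line(line))
```

Unlike a pattern that keeps the parentheses outside its groups, this one captures them together with the amount, so a caller can still tell `(90,000)` from `90,000`.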

Answer 2 (score: -1)

Try this regex:

r"([\w ,()-]*)[\(?[\d, -]*\)?]*[\(?[\d, -]*\)?]*"