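I read a PDF into Python and want to extract specific paragraphs from it. To do that, I am trying to select them with a regular expression. Here is a sample of the text to illustrate the situation.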
INTERNATIONAL MONETARY FUND 7\n\x0cBELGIUM\n\n\n\nPOLICY DISCUSSIONS—MAINTAINING THE REFORM\nMOMENTUM\n7. The current recovery is an opportunity to strengthen the resilience and growth\npotential of the Belgian economy. The government's ability to deal with future shocks will depend\non whether it implements the right policies now while the economy continues to recover.\n\n\uf0b7 First, with public debt above 100 percent of GDP and only starting to come down, Belgium still\n has a long way to go to rebuild buffers and achieve a more sustainable fiscal position. This will\n require following through on plans to gradually move toward structural balance.\n\n\uf0b7 Second, with real GDP growth projected at only around 1½ percent for the foreseeable future,\n further labor and product market reforms are needed to increase productivity growth, raise\n potential output, and integrate vulnerable groups into the labor market.\n\n\uf0b7 Third, although the financial sector has recovered since the crisis and is generally sound, cyclical\n vulnerabilities are rising and new challenges are emerging, suggesting the need for vigilance\n and proactive policies.3\n\n8. The government agreed last summer on a new package of measures related to\ntaxation, the labor market, and social benefits (Table 2 and Box 1). The most notable reform was\na reduction in Belgium's corporate income tax (CIT) rate from 34 percent to 25 percent, to be\nphased in over the next three years (SMEs will benefit from a reduced rate of 20 percent starting in\n2018). To compensate for the resulting revenue loss, the notional interest rate deduction (NID) was\nmodified to apply only to incremental corporate equity rather than to the total stock, and new anti-\ntax avoidance measures were introduced consistent with Belgium's EU obligations.4 Together, the\nmeasures are designed to enhance Belgium's competitiveness while preserving revenue neutrality.\n\n9. 
Policy discussions focused on the importance of maintaining the reform momentum\nand not yielding to complacency. Achieving the balanced budget goal will require efforts at all\nlevels of government to make spending more efficient and safeguard revenues (Section A).\nA combination of policies and reforms could help raise productivity growth, including increasing\ninvestment in infrastructure and enhancing competition in services (Section B). To fully realize\nBelgium's employment potential, it will be critical to address the severe fragmentation of the labor\nmarket (Section C). To preserve financial stability, the authorities should address vulnerabilities in the\nmortgage market and carefully navigate the transition toward a European Banking Union (Section D).\n\n\n\n\n3\n A comprehensive assessment of Belgium's financial sector took place in 2017 under the Financial Sector\nAssessment Program (FSAP).\n4\n The NID aims to neutralize the CIT treatment of debt and equity by supplementing the deductibility of interest with\na deduction that is the product of corporate equity and a notional interest rate.\n\n\n8
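Each paragraph starts with a one- or two-digit number, followed by a dot and three to seven spaces. It ends at the next double newline `\n\n` that is followed by a one- or two-digit number and a dot. Note that this number should also serve as the start of the next paragraph. In the example above, I should find three paragraphs: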
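The first paragraph:
The current recovery is an opportunity to strengthen the resilience and growth\npotential of the Belgian economy. The government's ability to deal with future shocks will depend\non whether it implements the right policies now while the economy continues to recover.\n\n\uf0b7 First, with public debt above 100 percent of GDP and only starting to come down, Belgium still\n has a long way to go to rebuild buffers and achieve a more sustainable fiscal position. This will\n require following through on plans to gradually move toward structural balance.\n\n\uf0b7 Second, with real GDP growth projected at only around 1½ percent for the foreseeable future,\n further labor and product market reforms are needed to increase productivity growth, raise\n potential output, and integrate vulnerable groups into the labor market.\n\n\uf0b7 Third, although the financial sector has recovered since the crisis and is generally sound, cyclical\n vulnerabilities are rising and new challenges are emerging, suggesting the need for vigilance\n and proactive policies.3\n\n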
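The second paragraph:
The government agreed last summer on a new package of measures related to\ntaxation, the labor market, and social benefits (Table 2 and Box 1). The most notable reform was\na reduction in Belgium's corporate income tax (CIT) rate from 34 percent to 25 percent, to be\nphased in over the next three years (SMEs will benefit from a reduced rate of 20 percent starting in\n2018). To compensate for the resulting revenue loss, the notional interest rate deduction (NID) was\nmodified to apply only to incremental corporate equity rather than to the total stock, and new anti-\ntax avoidance measures were introduced consistent with Belgium's EU obligations.4 Together, the\nmeasures are designed to enhance Belgium's competitiveness while preserving revenue neutrality.\n\n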
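And finally the third:
Policy discussions focused on the importance of maintaining the reform momentum\nand not yielding to complacency. Achieving the balanced budget goal will require efforts at all\nlevels of government to make spending more efficient and safeguard revenues (Section A).\nA combination of policies and reforms could help raise productivity growth, including increasing\ninvestment in infrastructure and enhancing competition in services (Section B). To fully realize\nBelgium's employment potential, it will be critical to address the severe fragmentation of the labor\nmarket (Section C). To preserve financial stability, the authorities should address vulnerabilities in the\nmortgage market and carefully navigate the transition toward a European Banking Union (Section D).\n\n\n\n\n3\n A comprehensive assessment of Belgium's financial sector took place in 2017 under the Financial Sector\nAssessment Program (FSAP).\n4\n The NID aims to neutralize the CIT treatment of debt and equity by supplementing the deductibility of interest with\na deduction that is the product of corporate equity and a notional interest rate.\n\n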
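I tried the following regex: `r'(?m)[0-99].*[.] {3,7} (.*?) \n\n'`, where `(.*?)` is meant to select everything from the start to the end, `(?m)[0-99].*[.] {3,7}` is meant to identify the start of each line, and `\n\n` specifies the end. However, it does not find anything.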
Answer 0 (score: 3)
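The `[0-99]` pattern is wrong because it matches any single digit from `0` to `9`. See "Why doesn't [01-12] range work as expected?". `re.M` (`(?m)`) changes the behavior of the `^` and `$` anchors, but your pattern uses neither of them.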
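Both points can be checked quickly. The following is a minimal sketch; the string `text` and the variable names are made up for illustration and are not part of the question's data.

```python
import re

# Inside a character class, 0-9 is a range and the trailing 9 is just a
# repeated literal, so [0-99] is equivalent to [0-9]: exactly one digit.
assert re.fullmatch(r'[0-99]', '7') is not None
assert re.fullmatch(r'[0-99]', '42') is None      # two digits do not match
assert re.fullmatch(r'\d\d?', '42') is not None   # one or two digits

# re.M only changes where ^ and $ may match. Without an anchor in the
# pattern, the flag has no effect at all.
text = "intro\n7.   first"
without_flag = re.findall(r'^\d\d?\.', text)      # ^ only at string start
with_flag = re.findall(r'(?m)^\d\d?\.', text)     # ^ also after a newline
print(without_flag, with_flag)
```

Here `without_flag` is empty because the only numbered line is not at the very start of the string, while `with_flag` contains `'7.'`.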
You can use
r'(?sm)^\d\d?\. {3,7}(.*?)(?=\n\n\d\d?\. |\Z)'
See the regex demo.
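Details
- `(?sm)` - enables the `re.DOTALL` and `re.MULTILINE` options
- `^` - start of a line
- `\d\d?` - 1 or 2 digits (`0` to `99`)
- `\.` - a dot
- ` {3,7}` - 3 to 7 spaces (replace with `[^\S\r\n]{3,7}` to match any horizontal whitespace)
- `(.*?)` - Group 1: any 0 or more characters, as few as possible
- `(?=\n\n\d\d?\. |\Z)` - a position that is immediately followed by two newlines (`\n\n`), then 1 or 2 digits (`\d\d?`) and a dot followed by a space, or (`|`) the end of the whole string (`\Z`).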
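The breakdown above can also be written into the pattern itself with `re.VERBOSE`. This is only a sketch: it uses the horizontal-whitespace variant mentioned in the list, and the names `PARAGRAPH` and `sample`, as well as the miniature input, are assumptions made for illustration.

```python
import re

# Same structure as the answer's pattern, spelled out with re.VERBOSE.
# In verbose mode unescaped spaces are ignored, so a literal space is
# written as [ ].
PARAGRAPH = re.compile(r'''
    ^\d\d?\.            # 1 or 2 digits and a dot at the start of a line
    [^\S\r\n]{3,7}      # 3 to 7 horizontal whitespace chars (spaces, tabs)
    (.*?)               # group 1: the paragraph body, as short as possible
    (?=\n\n\d\d?\.[ ]   # stop before: blank line, number, dot, space
      |\Z)              # or at the very end of the string
''', re.S | re.M | re.X)

# Hypothetical miniature input, not the IMF text from the question.
sample = "Title\n7.    First body\nmore text.\n\n8. \t  Second body."
bodies = PARAGRAPH.findall(sample)
print(bodies)
```

The second paragraph in `sample` mixes a tab into the run of whitespace, which the plain ` {3,7}` version would reject.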
Python demo:
import re

# Sample text from the question. This copy had lost the run of spaces after
# each paragraph number, so four spaces are assumed here to match the
# "three to seven spaces" that the question describes.
s = "INTERNATIONAL MONETARY FUND 7\n\x0cBELGIUM\n\n\n\nPOLICY DISCUSSIONS—MAINTAINING THE REFORM\nMOMENTUM\n7.    The current recovery is an opportunity to strengthen the resilience and growth\npotential of the Belgian economy. The government's ability to deal with future shocks will depend\non whether it implements the right policies now while the economy continues to recover.\n\n\uf0b7 First, with public debt above 100 percent of GDP and only starting to come down, Belgium still\n has a long way to go to rebuild buffers and achieve a more sustainable fiscal position. This will\n require following through on plans to gradually move toward structural balance.\n\n\uf0b7 Second, with real GDP growth projected at only around 1½ percent for the foreseeable future,\n further labor and product market reforms are needed to increase productivity growth, raise\n potential output, and integrate vulnerable groups into the labor market.\n\n\uf0b7 Third, although the financial sector has recovered since the crisis and is generally sound, cyclical\n vulnerabilities are rising and new challenges are emerging, suggesting the need for vigilance\n and proactive policies.3\n\n8.    The government agreed last summer on a new package of measures related to\ntaxation, the labor market, and social benefits (Table 2 and Box 1). The most notable reform was\na reduction in Belgium's corporate income tax (CIT) rate from 34 percent to 25 percent, to be\nphased in over the next three years (SMEs will benefit from a reduced rate of 20 percent starting in\n2018). To compensate for the resulting revenue loss, the notional interest rate deduction (NID) was\nmodified to apply only to incremental corporate equity rather than to the total stock, and new anti-\ntax avoidance measures were introduced consistent with Belgium's EU obligations.4 Together, the\nmeasures are designed to enhance Belgium's competitiveness while preserving revenue neutrality.\n\n9.    Policy discussions focused on the importance of maintaining the reform momentum\nand not yielding to complacency. Achieving the balanced budget goal will require efforts at all\nlevels of government to make spending more efficient and safeguard revenues (Section A).\nA combination of policies and reforms could help raise productivity growth, including increasing\ninvestment in infrastructure and enhancing competition in services (Section B). To fully realize\nBelgium's employment potential, it will be critical to address the severe fragmentation of the labor\nmarket (Section C). To preserve financial stability, the authorities should address vulnerabilities in the\nmortgage market and carefully navigate the transition toward a European Banking Union (Section D).\n\n\n\n\n3\n A comprehensive assessment of Belgium's financial sector took place in 2017 under the Financial Sector\nAssessment Program (FSAP).\n4\n The NID aims to neutralize the CIT treatment of debt and equity by supplementing the deductibility of interest with\na deduction that is the product of corporate equity and a notional interest rate.\n\n\n8"
# Group 1 holds each paragraph body without its leading number.
for r in re.findall(r'(?sm)^\d\d?\. {3,7}(.*?)(?=\n\n\d\d?\. |\Z)', s):
    print(r, "\n---------")
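If the paragraph numbers are needed as well, the digits can be captured in a group of their own. The sketch below assumes a hypothetical helper named `split_numbered_paragraphs` and a small made-up input; it is one possible variation, not part of the original answer.

```python
import re

def split_numbered_paragraphs(text):
    """Map each paragraph number to its body (hypothetical helper)."""
    # Same pattern as in the answer, with the number captured as group 1
    # and the body as group 2.
    pattern = re.compile(r'(?sm)^(\d\d?)\. {3,7}(.*?)(?=\n\n\d\d?\. |\Z)')
    return {int(m.group(1)): m.group(2).strip()
            for m in pattern.finditer(text)}

# Made-up miniature input using the same layout as the PDF text.
sample = "HEADER\n7.    First paragraph,\nsecond line.\n\n8.    Last paragraph.\n"
paragraphs = split_numbered_paragraphs(sample)
for number, body in paragraphs.items():
    print(number, repr(body))
```

Because the last match runs to the end of the string, anything that follows the final paragraph (footnotes, page numbers) ends up in its body and may need separate trimming.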