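Complex regex returns fewer matches than expected

Date: 2018-12-22 20:34:49

Tags: python regex text-mining

I am fiddling with a regular expression in Python 2.7 to capture numbered footnotes in a text. The text, converted from a PDF, looks like this: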

test_str = u"""
7. On 6 March 2013, the Appeals Chamber filed the Decision on Victim 
Participation, in which it decided that the victims “may, through their legal 

1
 The full citation, including the ICC registration reference of all designations and abbreviations used in 
this judgment are included in Annex 1. 
2
 A more detailed procedural history is set out in Annex 2 of this judgment. 
ICC-01/04-02/12-271-Corr  07-04-2015  7/117  EK  A

 8/117 
representatives, participate in the present appeal proceedings for the purpose of 
presenting their views and concerns in respect of their personal interests in the issues 
on appeal”.3

8. On 19 March 2013, the Prosecutor filed, confidentially, ex parte, available to the 
Prosecutor and Mr Ngudjolo only, the Document in Support of the Appeal. The 
Prosecutor filed a confidential redacted version of the Document in Support of the 
Appeal on 22 March 2013, and a public redacted version of the Document in Support 
of the Appeal on 3 April 2013. In the redacted version of the Document in Support of 
the Appeal, the Prosecutor’s entire third ground of appeal was redacted. 

"""

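Note that numbered paragraphs are regular content of my text; they start with a number followed by a dot (e.g. '5.'). Ideally, I would like to get something like: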

[(1, "The full citation, including the ICC registration reference of all designations and abbreviations used in 
this judgment are included in Annex 1. "), (2, "A more detailed procedural history is set out in Annex 2 of this judgment.")]

My Python code for capturing the footnotes is:

regex = ur"""
(\r?\n)(?P<num>\d+)(?!\.) #first line
(?P<text>(?:\s(.|\r?\n)+?\s?(?:\n\n|\Z))) #following lines
"""
result = re.findall(regex, test_str, re.U|re.VERBOSE | re.X |re.MULTILINE)

This gives me

[(u'\n', u'1', u'\n The full citation, including the ICC registration reference of all designations and abbreviations used in \nthis judgment are included in Annex 1. \n\n', u'.')]

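i.e. only the first footnote, whereas I need both, of course.

Any ideas are welcome!

2 Answers:

Answer 0: (score: 1)

You can use this regex to split the data into the two parts you want: the first is the number, and the second is the paragraph text that follows it,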

(?s)(\d+)\n +(.*?)\s*(?=\d+\n)

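Explanation:

  • (?s) -> enables the dot to match newlines, which we need here
  • (\d+) -> matches one or more digits and puts them in group 1
  • \n + -> matches the newline and then consumes the leading spaces, which we do not want in the second capture group
  • (.*?) -> this group lazily captures the intended text and places it in group 2
  • \s* -> consumes trailing whitespace so that it is not captured as part of the intended text
  • (?=\d+\n) -> a lookahead that stops the capture of the wanted text just before the next number followed by a newline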

Live Demo
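
The breakdown above can be checked with a small sketch. It uses the same pattern, written with re.VERBOSE so that each part carries its own comment. The sample string is a made-up stand-in for the PDF text, not the original data, and the name FOOTNOTE_RE is only illustrative.

```python
import re

# Same pattern as in the explanation, spelled out in verbose mode.
# In verbose mode a bare space is ignored, so " +" is written as "[ ]+".
FOOTNOTE_RE = re.compile(r"""
    (\d+)        # group 1: the footnote number
    \n[ ]+       # the newline plus the spaces before the footnote text
    (.*?)        # group 2: the footnote text, matched lazily
    \s*          # trailing whitespace, kept out of group 2
    (?=\d+\n)    # stop just before the next digits followed by a newline
""", re.DOTALL | re.VERBOSE)  # re.DOTALL plays the role of (?s)

# Hypothetical sample shaped like the question's text (assumption).
sample = (
    "body text\n"
    "\n"
    "1\n"
    " First note, line one \n"
    "line two. \n"
    "2\n"
    " Second note. \n"
    "3\n"
)

notes = FOOTNOTE_RE.findall(sample)
print(notes)
```

Two things follow from the lookahead: a footnote is only matched if some later digits followed by a newline exist, and any line of footnote text that happens to end in digits (a year at the end of a line, say) would cut the capture short at that point.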

Here is a modified version of your code, which prints the number and text pairs you are after:

import re

test_str = u"""
7. On 6 March 2013, the Appeals Chamber filed the Decision on Victim 
Participation, in which it decided that the victims “may, through their legal 

1
 The full citation, including the ICC registration reference of all designations and abbreviations used in 
this judgment are included in Annex 1. 
2
 A more detailed procedural history is set out in Annex 2 of this judgment. 
ICC-01/04-02/12-271-Corr  07-04-2015  7/117  EK  A

 8/117 
representatives, participate in the present appeal proceedings for the purpose of 
presenting their views and concerns in respect of their personal interests in the issues 
on appeal”.
3

8. On 19 March 2013, the Prosecutor filed, confidentially, ex parte, available to the 
Prosecutor and Mr Ngudjolo only, the Document in Support of the Appeal. The 
Prosecutor filed a confidential redacted version of the Document in Support of the 
Appeal on 22 March 2013, and a public redacted version of the Document in Support 
of the Appeal on 3 April 2013. In the redacted version of the Document in Support of 
the Appeal, the Prosecutor’s entire third ground of appeal was redacted. 

"""

result = re.findall(r'(?s)(\d+)\n +(.*?)\s*(?=\d+\n)', test_str)

print(result)

Answer 1: (score: 1)

I believe this regex: (^\d+(?!\.).*?)(?=^\s*\d) works as you describe.

Demo

Python demo

>>> import re
>>> print ''.join(re.findall(r'(^\d+(?!\.).*?)(?=^\s*\d)', test_str, flags=re.M|re.S))
1
 The full citation, including the ICC registration reference of all designations and abbreviations used in 
this judgment are included in Annex 1. 
2
 A more detailed procedural history is set out in Annex 2 of this judgment. 
ICC-01/04-02/12-271-Corr  07-04-2015  7/117  EK  A

If you want to capture the footnote number separately from the text:

>>> re.findall(r'^(\d+)((?!\.).*?)(?=^\s*\d)', test_str, flags=re.M|re.S)
[(u'1', u'\n The full citation, including the ICC registration reference of all designations and abbreviations used in \nthis judgment are included in Annex 1. \n'), (u'2', u'\n A more detailed procedural history is set out in Annex 2 of this judgment. \nICC-01/04-02/12-271-Corr  07-04-2015  7/117  EK  A\n')]
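
To get closer to the (number, text) tuples asked for in the question, the matches can be post-processed. The following is only a sketch: extract_footnotes is a hypothetical helper, the sample string is made up, and the pattern uses the lookahead (?=^\s*\d) from the first regex in this answer.

```python
import re

# A number at the start of a line that is not followed by a dot, then
# everything up to the next line whose first non-blank character is a digit.
FOOTNOTE_RE = re.compile(r'^(\d+)(?!\.)(.*?)(?=^\s*\d)', flags=re.M | re.S)


def extract_footnotes(text):
    """Hypothetical helper: return [(number, text)] with tidy whitespace."""
    # " ".join(body.split()) collapses line breaks and runs of spaces.
    return [(int(num), " ".join(body.split()))
            for num, body in FOOTNOTE_RE.findall(text)]


# Made-up sample shaped like the question's text (assumption).
sample = (
    "7. A numbered paragraph \n"
    "\n"
    "1\n"
    " First note, line one \n"
    "line two. \n"
    "2\n"
    " Second note. \n"
    "\n"
    "8. Next paragraph.\n"
)

print(extract_footnotes(sample))
```

One caveat with this pattern: for a paragraph number with two or more digits, such as "12.", \d+ can backtrack to "1" and still satisfy (?!\.), so the paragraph would be picked up as a footnote. A stricter guard such as (?![\d.]) avoids that.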