I am trying to work out a regular expression in Python 2.7 that captures numbered footnotes in a text. The text, converted from a PDF, looks like this:
test_str = u"""
7. On 6 March 2013, the Appeals Chamber filed the Decision on Victim
Participation, in which it decided that the victims “may, through their legal
1
The full citation, including the ICC registration reference of all designations and abbreviations used in
this judgment are included in Annex 1.
2
A more detailed procedural history is set out in Annex 2 of this judgment.
ICC-01/04-02/12-271-Corr 07-04-2015 7/117 EK A
8/117
representatives, participate in the present appeal proceedings for the purpose of
presenting their views and concerns in respect of their personal interests in the issues
on appeal”.3
8. On 19 March 2013, the Prosecutor filed, confidentially, ex parte, available to the
Prosecutor and Mr Ngudjolo only, the Document in Support of the Appeal. The
Prosecutor filed a confidential redacted version of the Document in Support of the
Appeal on 22 March 2013, and a public redacted version of the Document in Support
of the Appeal on 3 April 2013. In the redacted version of the Document in Support of
the Appeal, the Prosecutor’s entire third ground of appeal was redacted.
"""
Note that numbered paragraphs are a regular part of my text; they start with a number followed by a dot (e.g. '5.'). Ideally, I would like to get something like:
[(1,"The full citation, including the ICC registration reference of all designations and abbreviations used in
this judgment are included in Annex 1. "), (2, "A more detailed procedural history is set out in Annex 2 of this judgment.")]
My Python code for getting the footnotes is:
regex = ur"""
(\r?\n)(?P<num>\d+)(?!\.) #first line
(?P<text>(?:\s(.|\r?\n)+?\s?(?:\n\n|\Z))) #following lines
"""
result = re.findall(regex, test_str, re.U|re.VERBOSE | re.X |re.MULTILINE)
This gives me
[(u'\n', u'1', u'\n The full citation, including the ICC registration reference of all designations and abbreviations used in \nthis judgment are included in Annex 1. \n\n', u'.')]
i.e. only the first footnote, whereas I need all of them, of course.
Any ideas are welcome!
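Answer 0 (score: 1)
You can use this regex to split the data into the two parts you want, the first being the footnote number and the second the paragraph text that follows it: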
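(?s)(\d+)\n +(.*?)\s*(?=\d+\n)
Explanation:
(?s) -> enables the dot to match newlines, which is needed here
(\d+) -> matches one or more digits and puts them in group 1
\n + -> matches the newline; the " +" consumes the leading spaces that are not wanted in the second capture group
(.*?) -> captures the intended text and puts it in group 2
\s* -> consumes any trailing whitespace so that it is not captured with the intended text
(?=\d+\n) -> lookahead that stops the capture of the intended text
Here is a modified version of your code, which gives the output you expected: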
import re
test_str = u"""
7. On 6 March 2013, the Appeals Chamber filed the Decision on Victim
Participation, in which it decided that the victims “may, through their legal
1
The full citation, including the ICC registration reference of all designations and abbreviations used in
this judgment are included in Annex 1.
2
A more detailed procedural history is set out in Annex 2 of this judgment.
ICC-01/04-02/12-271-Corr 07-04-2015 7/117 EK A
8/117
representatives, participate in the present appeal proceedings for the purpose of
presenting their views and concerns in respect of their personal interests in the issues
on appeal”.
3
8. On 19 March 2013, the Prosecutor filed, confidentially, ex parte, available to the
Prosecutor and Mr Ngudjolo only, the Document in Support of the Appeal. The
Prosecutor filed a confidential redacted version of the Document in Support of the
Appeal on 22 March 2013, and a public redacted version of the Document in Support
of the Appeal on 3 April 2013. In the redacted version of the Document in Support of
the Appeal, the Prosecutor’s entire third ground of appeal was redacted.
"""
result = re.findall(r'(?s)(\d+)\n +(.*?)\s*(?=\d+\n)', test_str)
print(result)
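To make the whitespace handling visible, here is a minimal sketch that runs the same pattern on a shortened sample. The sample text, the explicit "\n" and leading-space layout, and the names `sample`, `pattern` and `pairs` are assumptions made for illustration, not part of the original data. It also converts the footnote numbers to `int`, as in the output format the question asks for.

```python
import re

# Hypothetical, shortened sample. Each footnote number sits alone on a line
# and the footnote text starts with a space, as the pattern requires.
sample = (
    u"legal\n"
    u"1\n"
    u" The full citation is in Annex 1. \n"
    u"2\n"
    u" A procedural history is in Annex 2. \n"
    u"3\n"
    u"8. On 19 March 2013, the Prosecutor filed the appeal.\n"
)

# Same pattern as in the answer above.
pattern = re.compile(r'(?s)(\d+)\n +(.*?)\s*(?=\d+\n)')

# findall returns (number, text) string pairs; convert the number to int.
pairs = [(int(num), text) for num, text in pattern.findall(sample)]
print(pairs)
```

Note that the lookahead needs a following line made up only of digits, so a footnote is captured only when another number line comes after it, and the pattern does not match at all if the footnote text does not begin with at least one space.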
Answer 1 (score: 1)
I believe this regex: (^\d+(?!\.).*?)(?=^\s*\d)
works as you describe.
Python demo:
>>> import re
>>> print ''.join(re.findall(r'(^\d+(?!\.).*?)(?=^\s*\d)', test_str, flags=re.M|re.S))
1
The full citation, including the ICC registration reference of all designations and abbreviations used in
this judgment are included in Annex 1.
2
A more detailed procedural history is set out in Annex 2 of this judgment.
ICC-01/04-02/12-271-Corr 07-04-2015 7/117 EK A
If you want to capture the footnote number separately from the text:
>>> re.findall(r'^(\d+)((?!\.).*?)(?=\s*^\d)', test_str, flags=re.M|re.S)
[(u'1', u'\n The full citation, including the ICC registration reference of all designations and abbreviations used in \nthis judgment are included in Annex 1. \n'), (u'2', u'\n A more detailed procedural history is set out in Annex 2 of this judgment. \nICC-01/04-02/12-271-Corr 07-04-2015 7/117 EK A\n')]
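The captured text still carries the line breaks and padding from the PDF conversion. As a rough sketch of one way to tidy it up, the tuples can be post-processed so that the number becomes an `int` and runs of whitespace collapse to single spaces. The shortened sample and the names `sample`, `raw` and `footnotes` are assumptions made for illustration only.

```python
import re

# Hypothetical, shortened sample in the same layout as the question's text.
sample = (
    u"7. On 6 March 2013, the Appeals Chamber filed the Decision\n"
    u"1\n"
    u" The full citation is\n"
    u"included in Annex 1. \n"
    u"2\n"
    u" A procedural history is in Annex 2. \n"
    u"8. On 19 March 2013, the Prosecutor filed the appeal.\n"
)

# Same pattern and flags as in the answer above.
raw = re.findall(r'^(\d+)((?!\.).*?)(?=\s*^\d)', sample, flags=re.M | re.S)

# Convert the number to int and collapse every run of whitespace
# (including newlines) into a single space.
footnotes = [(int(num), u" ".join(text.split())) for num, text in raw]
print(footnotes)
```

This only normalizes whitespace. A page footer line such as the "ICC-01/04-02/12-271-Corr ..." line in the demo output above would still end up inside the last footnote of a page and would need separate filtering.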