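re.findall with re.M does not find the multi-line matches I am searching for
I am trying to extract all multi-line strings that match a pattern from a file.
Example from the file book.txt: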
Title: Le Morte D'Arthur, Volume I (of II)
King Arthur and of his Noble Knights of the Round Table
Author: Thomas Malory
Editor: William Caxton
Release Date: March, 1998 [Etext #1251]
Posting Date: November 6, 2009
Language: English
Title: Pride and Prejudice
Author: Jane Austen
Posting Date: August 26, 2008 [EBook #1342]
Release Date: June, 1998
Last Updated: October 17, 2016
Language: English
The following code returns only the first line of the title, Le Morte D'Arthur, Volume I (of II):
re.findall('^Title:\s(.+)$', book, re.M)
I would like the output to be
[' Le Morte D'Arthur, Volume I (of II)\n King Arthur and of his Noble Knights of the Round Table', ' Pride and Prejudice']
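To clarify:
- The second line is optional; it exists only in some files. After the second line there is more text that I do not want to read.
- Using re.findall(r'Title: (.+\n.+)$', text, flags=re.MULTILINE) works, but fails if the second line is blank.
- I am running Python 3.7.
- I convert the txt file to a string and then run re on the str.
- The following do not work either: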
re.findall(r'^Title:\s(.+)$', text, re.S)
re.findall(r'^Title:\s(.+)$', text, re.DOTALL)
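To make the behaviour of these attempts concrete, here is a minimal sketch run against a shortened copy of the sample. The names `sample`, `pattern`, `multiline` and `dotall` are illustrative and do not come from the original post.

```python
import re

# Shortened copy of the sample above; `sample` is an illustrative name.
sample = (
    "Title: Le Morte D'Arthur, Volume I (of II)\n"
    "King Arthur and of his Noble Knights of the Round Table\n"
    "Author: Thomas Malory\n"
    "Title: Pride and Prejudice\n"
    "Author: Jane Austen\n"
)

pattern = r'^Title:\s(.+)$'

# re.M: ^ and $ match at every line boundary, but . still stops at a
# newline, so each match covers only the first line of a title.
multiline = re.findall(pattern, sample, re.M)

# re.S (an alias of re.DOTALL): . now crosses newlines, but ^ matches only
# at the start of the string, so the greedy .+ runs to the end of the text.
dotall = re.findall(pattern, sample, re.S)

print(multiline)    # one entry per Title line, first line only
print(len(dotall))  # a single match that covers the rest of the text
```

Neither flag on its own expresses "this line plus an optional continuation line", which is why both attempts miss the desired result.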
Answer 0 (score: 1)
My guess is that this expression
(?<=Title:\s)(.*?)\s*(?=Author)
might be close to what you are trying to design.
import re

# The lookahead needs an "Author" line after each title, so the test
# string includes one for every record.
regex = r"(?<=Title:\s)(.*?)\s*(?=Author)"
test_str = ("Title: Le Morte D'Arthur, Volume I (of II)\n"
            " King Arthur and of his Noble Knights of the Round Table\n"
            "Author: Thomas Malory\n"
            "Title: Pride and Prejudice\n"
            "Author: Jane Austen\n")
# re.DOTALL lets . cross the newline between the title and its second line.
print(re.findall(regex, test_str, re.DOTALL))
Output:
["Le Morte D'Arthur, Volume I (of II)\n King Arthur and of his Noble Knights of the Round Table", 'Pride and Prejudice']
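The expression above only finds a title when the word Author follows it. As a hedged variation on the same idea (lazy match plus lookahead), the sketch below stops at the next line that looks like a `Key:` header instead. It assumes header keys consist only of letters and spaces; `sample`, `pattern` and `titles` are illustrative names, and the second record deliberately omits its Author line to show the difference.

```python
import re

# Illustrative data: the second record has no Author line on purpose.
sample = (
    "Title: Le Morte D'Arthur, Volume I (of II)\n"
    " King Arthur and of his Noble Knights of the Round Table\n"
    "Author: Thomas Malory\n"
    "Title: Pride and Prejudice\n"
    "Posting Date: August 26, 2008 [EBook #1342]\n"
)

# re.M lets ^ match at each line start; re.S lets the lazy .*? cross lines.
# The lookahead stops the match at the next line that starts with "Key:".
pattern = r"^Title:\s(.*?)\s*(?=^[A-Za-z ]+:)"
titles = re.findall(pattern, sample, re.M | re.S)

print(titles)
```

If a continuation line can itself contain only letters and spaces followed by a colon, this assumption breaks and the lookahead would need to be tightened to the known header names.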
Answer 1 (score: 1)
You can use the regex with the DOTALL flag, which allows . to match newline characters as well:
re.findall('^Title:\s(.+)$', book, re.DOTALL)
Output:
Le Morte D'Arthur, Volume I (of II)\n King Arthur and of his Noble Knights of the Round Table
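One caveat: without re.M, ^ anchors only at the start of the string, so when more lines follow the title, DOTALL on its own tends to capture everything up to the end of the text. An alternative is to keep re.M and make the continuation line an optional part of the group. This is a minimal sketch, assuming every header line has the form `Key: value` with a key made of letters and spaces, and that a continuation line has no such prefix; `sample`, `pattern` and `titles` are illustrative names.

```python
import re

# Shortened copy of the sample from the question.
sample = (
    "Title: Le Morte D'Arthur, Volume I (of II)\n"
    "King Arthur and of his Noble Knights of the Round Table\n"
    "Author: Thomas Malory\n"
    "Title: Pride and Prejudice\n"
    "Author: Jane Austen\n"
)

# First line of the title, then optionally one more non-empty line,
# accepted only if it does not start with a "Key:" header prefix.
pattern = r"^Title:\s(.+(?:\n(?![A-Za-z ]+:).+)?)"
titles = re.findall(pattern, sample, re.M)

print(titles)
```

Because the continuation line is wrapped in an optional group and must contain at least one character, a missing or blank second line simply leaves the first line as the whole match.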