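I have two types of documents containing text that I want to copy and save. In one document type the interesting text is delimited by markers named TAGSTART and TAGEND. In the other document type the interesting text is delimited by CORESTART and COREEND. Here are two samples: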
intro intro intro intro intro intro
BEGIN A This is where some text starts
That is not interesting or wanted
CORESTART save text A save text A save text A save text A
save text A save text A save text A save text A save text A
save text A COREEND
This is an addendum that is not needed but is just in the way
END A outro outro outro outro outro outro
outro outro outro outro outro outro outro
intro intro intro intro intro intro
INIT B This is where some text starts
That is not interesting or wanted
TAGSTART B save text B save text B save text B save text B
save text B save text B save text B save text B save text B
save text B TAGEND B
This is an addendum that is not needed but is just in the way
TERM B outro outro outro outro outro outro
outro outro outro outro outro outro outro
And this Python script works for the first type of file:
import os
import re
import codecs

# walk the directory tree
rootDir = '.'
for dirName, subdirs, files in os.walk(rootDir):
    # exclude hidden files and directories
    files = [f for f in files if not f[0] == '.']
    subdirs[:] = [d for d in subdirs if not d[0] == '.']
    for fname in files:
        if fname.endswith(('.txt', '.TXT')):
            # create the full path
            filename = os.path.join(dirName, fname)
            with codecs.open(filename, encoding='utf-8', errors='ignore') as infile, codecs.open('SAVED.txt', 'a', encoding='utf-8') as outfile:
                stuff = infile.read()
                saveTEXT = '\n' + ''.join(re.findall(r"CORESTART(.+?)COREEND", stuff, re.DOTALL|re.MULTILINE)) + '\n'
                outfile.write(saveTEXT)
                infile.close()
                outfile.close()
If I change the regex to
saveTEXT = '\n' + ''.join(re.findall(r"TAGSTART B(.+?)TAGEND B", stuff, re.DOTALL|re.MULTILINE)) + '\n'
I can get what I want from the second type of file. However, the compound regex fails:
saveTEXT = '\n' + ''.join(re.findall(r"CORESTART|TAGSTART B(.+?)COREEND|TAGEND B", stuff, re.DOTALL|re.MULTILINE)) + '\n'
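Nothing is found. I tried wrapping the original regexes in parens, but then I got an error saying that a string was expected but a tuple was found. I tried putting \b around the words in the regex to indicate word boundaries, like so: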
saveTEXT = '\n' + ''.join(re.findall(r"\bCORESTART B\b|\bTAGSTART B\b(.+?)\bCOREEND B\b|\bTAGEND B\b", stuff, re.DOTALL|re.MULTILINE)) + '\n'
But that also came up empty. I was completely stumped when I tried using the raw string:
[\bCORESTART\b|\bTAGSTART B\b](.+?)[\bCOREEND\b|\bTAGEND B\b]
Could I get some guidance on what I am overlooking? My brain is fried.
Answer 0 (score: 3)
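@bobblebubble's regex in the deleted answer is the right approach (except that "CORESTART" may be followed by a space + "B" that you do not want in the match). That said, I suggest adding (?: B)? to the (TAG|CORE)START B(.+?)(\1END B) regex, so that the " B" suffix becomes optional:
(TAG|CORE)START(?: B)?(.+?)(\1END(?: B)?)
See the regex demo.
You also have to use re.finditer to extract the strings, because re.findall will return the values of all capture groups.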
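To make the findall/finditer difference concrete, here is a minimal sketch. The pattern is the one suggested above; the `sample` string is a shortened, made-up stand-in for the two document types, not the real files.

```python
import re

# Pattern from the answer; `sample` is a hypothetical, shortened input.
pattern = re.compile(r'(TAG|CORE)START(?: B)?(.+?)(\1END(?: B)?)', re.DOTALL)
sample = "x CORESTART keep A COREEND y\nx TAGSTART B keep B TAGEND B y"

# With several capture groups, findall returns one tuple per match.
tuples = pattern.findall(sample)
print(tuples)
# [('CORE', ' keep A ', 'COREEND'), ('TAG', ' keep B ', 'TAGEND B')]

# finditer yields match objects, so only group 2 (the wanted text) is kept.
texts = [m.group(2) for m in pattern.finditer(sample)]
print(texts)
# [' keep A ', ' keep B ']
```

Passing the list of tuples to ''.join raises a TypeError complaining that a str was expected but a tuple was found, which is probably the error described in the question.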
import re
p = re.compile(r'(TAG|CORE)START(?: B)?(.+?)(\1END(?: B)?)', re.DOTALL)
test_str = "intro intro intro intro intro intro\nBEGIN A This is where some text starts\nThat is not interesting or wanted\nCORESTART save text A save text A save text A save text A \nsave text A save text A save text A save text A save text A \nsave text A COREEND\nThis is an addendum that is not needed but is just in the way\nEND A outro outro outro outro outro outro \noutro outro outro outro outro outro outro \n.\n\nintro intro intro intro intro intro\nINIT B This is where some text starts\nThat is not interesting or wanted\nTAGSTART B save text B save text B save text B save text B \nsave text B save text B save text B save text B save text B \nsave text B TAGEND B\nThis is an addendum that is not needed but is just in the way\nTERM B outro outro outro outro outro outro \noutro outro outro outro outro outro outro "
print([x.group(2) for x in p.finditer(test_str)])
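As a side note on why the compound attempt in the question finds nothing: | has the lowest precedence in a regex, so the alternation is not limited to the marker words. The sketch below illustrates this with made-up one-line inputs (`doc_a` and `doc_b` are assumptions, not the real files). It also shows a variant with non-capturing groups, which is an alternative of my own and not the regex recommended above.

```python
import re

# Hypothetical, shortened stand-ins for the two document types.
doc_a = "junk CORESTART keep A COREEND junk"
doc_b = "junk TAGSTART B keep B TAGEND B junk"

# Attempt from the question. It is parsed as three separate alternatives:
#   CORESTART   |   TAGSTART B(.+?)COREEND   |   TAGEND B
loose = re.compile(r"CORESTART|TAGSTART B(.+?)COREEND|TAGEND B", re.DOTALL)
loose_a = loose.findall(doc_a)  # only the bare word CORESTART matches
loose_b = loose.findall(doc_b)  # only the bare words TAGEND B match
print(loose_a, loose_b)
# [''] ['']  (the group took no part in either match)

# Non-capturing groups confine each alternation to the markers themselves.
grouped = re.compile(r"(?:CORESTART|TAGSTART B)(.+?)(?:COREEND|TAGEND B)", re.DOTALL)
grouped_a = grouped.findall(doc_a)
grouped_b = grouped.findall(doc_b)
print(grouped_a, grouped_b)
# [' keep A '] [' keep B ']
```

Unlike the backreference version, this variant would also accept a CORESTART that is closed by TAGEND B, so it is only suitable if the markers are never mixed within one file.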
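Note that re.MULTILINE is redundant in your regex, because this flag only changes the behavior of the ^ and $ anchors so that they match at the start and end of each line rather than of the whole string (respectively). The pattern uses neither anchor, so I removed the flag from the regex declaration.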
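A small sketch of what re.MULTILINE does and does not affect. The strings here are made up for illustration only.

```python
import re

text = "first line\nsecond line"

# Without re.MULTILINE, ^ matches only at the very start of the string.
plain = re.findall(r'^\w+', text)
# With re.MULTILINE, ^ also matches right after each newline.
multi = re.findall(r'^\w+', text, re.MULTILINE)
print(plain, multi)
# ['first'] ['first', 'second']

# A pattern without ^ or $ behaves the same with or without the flag.
marked = "CORESTART a\nb COREEND"
without_flag = re.findall(r'CORESTART(.+?)COREEND', marked, re.DOTALL)
with_flag = re.findall(r'CORESTART(.+?)COREEND', marked, re.DOTALL | re.MULTILINE)
print(without_flag == with_flag)
# True
```

It is re.DOTALL, not re.MULTILINE, that lets the . in (.+?) cross line breaks, which is why that flag is kept.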