Question

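I have two types of documents containing text that I want to copy and save. In one document type, the interesting text is delimited by markers named TAGSTART and TAGEND. In the other, the interesting text is delimited by CORESTART and COREEND. Here are two samples: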

intro intro intro intro intro intro
BEGIN A This is where some text starts
That is not interesting or wanted
CORESTART save text A save text A save text A save text A 
save text A save text A save text A save text A save text A 
save text A COREEND
This is an addendum that is not needed but is just in the way
END A outro outro outro outro outro outro 
outro outro outro outro outro outro outro

intro intro intro intro intro intro
INIT B This is where some text starts
That is not interesting or wanted
TAGSTART B save text B save text B save text B save text B 
save text B save text B save text B save text B save text B 
save text B TAGEND B
This is an addendum that is not needed but is just in the way
TERM B outro outro outro outro outro outro 
outro outro outro outro outro outro outro

This Python script works for the first type of file:

import os
import re
import codecs
# walk the directory tree
rootDir = '.'
for dirName, subdirs, files in os.walk(rootDir):
    #    exclude hidden files and directories
    files = [f for f in files if not f[0] == '.']
    subdirs[:] = [d for d in subdirs if not d[0] == '.']
    for fname in files:
        if fname.endswith(('.txt', '.TXT')):
            #    create the full path
            filename = os.path.join(dirName, fname)
            #    the with statement closes both files, so no explicit close() is needed
            with codecs.open(filename, encoding='utf-8', errors='ignore') as infile, codecs.open('SAVED.txt', 'a', encoding='utf-8') as outfile:
                stuff = infile.read()
                saveTEXT = '\n' + ''.join(re.findall(r"CORESTART(.+?)COREEND", stuff, re.DOTALL|re.MULTILINE)) + '\n'
                outfile.write(saveTEXT)

If I change the regex to

      saveTEXT = '\n' + ''.join(re.findall(r"TAGSTART B(.+?)TAGEND B", stuff, re.DOTALL|re.MULTILINE)) + '\n'

I can get what I want from the second type of file. However, a compound regex fails:

      saveTEXT = '\n' + ''.join(re.findall(r"CORESTART|TAGSTART B(.+?)COREEND|TAGEND B", stuff, re.DOTALL|re.MULTILINE)) + '\n'

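Nothing is found. I tried enclosing the original regexes in parentheses, but then I get an error saying that a string was expected but a tuple was found. I tried using \b around the words in the regex to indicate word boundaries, like this: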

       saveTEXT = '\n' + ''.join(re.findall(r"\bCORESTART B\b|\bTAGSTART B\b(.+?)\bCOREEND B\b|\bTAGEND B\b", stuff, re.DOTALL|re.MULTILINE)) + '\n'

But that also comes up empty. And my mind was completely blown when I tried using raw strings:

[\bCORESTART\b|\bTAGSTART B\b](.+?)[\bCOREEND\b|\bTAGEND B\b]

Can I get some guidance on what I am overlooking? My brain is fried.

Answer 1

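@bobblebubble's regex in the deleted answer is the right approach if some slight deviation is allowed (say, "CORESTART" may be followed by a space + "B" that you do not want to get in the match). That said, I suggest adding (?: B)? to the (TAG|CORE)START B(.+?)(\1END B) regex: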

(TAG|CORE)START(?: B)?(.+?)(\1END(?: B)?)

See the regex demo.
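For context on why the compound pattern in the question comes up empty: `|` has the lowest precedence in a regex, so `CORESTART|TAGSTART B(.+?)COREEND|TAGEND B` is read as three separate alternatives (`CORESTART`, `TAGSTART B(.+?)COREEND`, `TAGEND B`), and the capture group only takes part in the middle one. A minimal sketch of the difference follows; the `sample` string is made up for illustration and is not from the question's files.

```python
import re

# Hypothetical one-line sample containing both marker styles.
sample = "x CORESTART keep A COREEND y TAGSTART B keep B TAGEND B z"

# Ungrouped: "|" splits the whole pattern into three alternatives.
# The bare markers match, but the capture group stays empty each time.
loose = re.findall(r"CORESTART|TAGSTART B(.+?)COREEND|TAGEND B", sample, re.DOTALL)
print(loose)

# Grouped: the alternation is confined to (TAG|CORE), and the
# backreference \1 forces the end marker to use the same prefix.
grouped = re.compile(r"(TAG|CORE)START(?: B)?(.+?)\1END(?: B)?", re.DOTALL)
texts = [m.group(2) for m in grouped.finditer(sample)]
print(texts)
```

The backreference is what keeps a CORESTART from being paired with a TAGEND. The character-class attempt in the question, `[\bCORESTART\b|...]`, does not help here, because square brackets match a single character from a set rather than a whole word.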

You also have to use re.finditer to extract the strings, because re.findall would return the values of all the capture groups.
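This is most likely also where the "expected a string but found a tuple" error in the question came from: once a pattern has more than one capture group, `re.findall` returns a list of tuples, and `''.join(...)` cannot join tuples. A small sketch, using a made-up `sample` string:

```python
import re

pattern = re.compile(r"(TAG|CORE)START(?: B)?(.+?)(\1END(?: B)?)", re.DOTALL)
sample = "CORESTART one COREEND TAGSTART B two TAGEND B"  # hypothetical input

# With three capture groups, findall yields one 3-tuple per match.
tuples = pattern.findall(sample)
print(tuples)

# Joining those tuples directly raises TypeError.
try:
    "".join(tuples)
    join_failed = False
except TypeError:
    join_failed = True

# finditer gives match objects, so a single group can be picked out.
picked = [m.group(2) for m in pattern.finditer(sample)]
print("".join(picked))
```

Indexing the tuples from `findall` (for example `[t[1] for t in tuples]`) would give the same strings; `finditer` just makes the intent more explicit.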

IDEONE demo:

import re
p = re.compile(r'(TAG|CORE)START(?: B)?(.+?)(\1END(?: B)?)', re.DOTALL)
test_str = "intro intro intro intro intro intro\nBEGIN A This is where some text starts\nThat is not interesting or wanted\nCORESTART save text A save text A save text A save text A \nsave text A save text A save text A save text A save text A \nsave text A COREEND\nThis is an addendum that is not needed but is just in the way\nEND A outro outro outro outro outro outro \noutro outro outro outro outro outro outro \n.\n\nintro intro intro intro intro intro\nINIT B This is where some text starts\nThat is not interesting or wanted\nTAGSTART B save text B save text B save text B save text B \nsave text B save text B save text B save text B save text B \nsave text B TAGEND B\nThis is an addendum that is not needed but is just in the way\nTERM B outro outro outro outro outro outro \noutro outro outro outro outro outro outro "
print([x.group(2) for x in p.finditer(test_str)])
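For completeness, here is a rough sketch of how the combined pattern might be dropped into the directory-walking script from the question. The helper names `extract_marked_text` and `collect` are made up for this example, and skipping the output file during the walk is my own assumption (the original script would re-read SAVED.txt if it sits under the root directory).

```python
import os
import re

# Same pattern as above, minus the third group, which is not needed here.
MARKED = re.compile(r"(TAG|CORE)START(?: B)?(.+?)\1END(?: B)?", re.DOTALL)


def extract_marked_text(text):
    """Return the text found between matching START/END markers."""
    return [m.group(2) for m in MARKED.finditer(text)]


def collect(root_dir, out_path):
    """Append the marked text of every .txt file under root_dir to out_path."""
    out_abs = os.path.abspath(out_path)
    with open(out_path, "a", encoding="utf-8") as outfile:
        for dir_name, subdirs, files in os.walk(root_dir):
            # Exclude hidden directories, as in the original script.
            subdirs[:] = [d for d in subdirs if not d.startswith(".")]
            for fname in files:
                if fname.startswith(".") or not fname.endswith((".txt", ".TXT")):
                    continue
                path = os.path.join(dir_name, fname)
                # Assumption: do not feed the output file back into itself.
                if os.path.abspath(path) == out_abs:
                    continue
                with open(path, encoding="utf-8", errors="ignore") as infile:
                    pieces = extract_marked_text(infile.read())
                outfile.write("\n" + "".join(pieces) + "\n")
```

Calling `collect('.', 'SAVED.txt')` would then roughly mirror what the original loop does, with both marker styles handled by one pattern.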

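Note that re.MULTILINE is redundant in your regexes, because this flag only redefines the behavior of the ^ and $ anchors so that they match at the start and end of each line rather than of the whole string. Your patterns contain neither anchor, so I removed the flag from the regex declaration.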
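A quick way to convince yourself of this is to compare the flags on a small two-line string. The strings below are invented for the demonstration; the point is only that `re.DOTALL` is what lets `.` cross the newline, while `re.MULTILINE` changes nothing unless `^` or `$` is present.

```python
import re

text = "CORESTART line one\nline two COREEND"  # hypothetical two-line input

# Without DOTALL, "." stops at the newline, so nothing matches.
without_dotall = re.findall(r"CORESTART(.+?)COREEND", text)

# DOTALL alone is enough; adding MULTILINE gives the identical result.
with_dotall = re.findall(r"CORESTART(.+?)COREEND", text, re.DOTALL)
with_both = re.findall(r"CORESTART(.+?)COREEND", text, re.DOTALL | re.MULTILINE)

# MULTILINE only shows an effect once an anchor is involved.
string_start = re.findall(r"^line", "line a\nline b")
line_starts = re.findall(r"^line", "line a\nline b", re.MULTILINE)

print(without_dotall, with_dotall == with_both, string_start, line_starts)
```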
