Python - Capture multiple substrings between multiple start and end substrings

Asked: 2018-11-02 16:42:56

Tags: python regex string python-3.x

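I have a very badly formatted .txt file, and I am trying to capture the complete words or sentences that sit between certain start and end strings. So far I have found about 4 kinds of substring patterns in the text, and I want to capture every string that lies between these start and end substrings. I can capture the first occurrence, but not the second, third, and so on.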

Start and end strings: FOO, BARS, BAR, BAR2

text = 'I do not want this FOO string1 BARS I do not want this FOO string 2 BAR I do not want this FOO string3 BAR2 I do not want this FOO string4 BARS '


snippet1 = text[text.index('FOO')+len('FOO'):text.index('BARS')] \
            if text[text.index('FOO')+len('FOO'):text.index('BARS')] else ''

snippet2 = text[text.index('FOO')+len('FOO'):text.index('BAR')] \
            if text[text.index('FOO')+len('FOO'):text.index('BAR')] else ''

snippet3 = text[text.index('FOO')+len('FOO'):text.index('BAR2')] \
            if text[text.index('FOO')+len('FOO'):text.index('BAR2')] else ''

# print(type(snippet1))
print('')
print('snippet1:',snippet1) #Output: snippet1:  string1
print('')
print('snippet2',snippet2) # Output: snippet2  string1
print('')
print('snippet3',snippet3) # Output: snippet3  string1 BARS I do not want this FOO string 2 BAR I do not want this FOO string3

# How do I get this output? Is it possible to code this?
snippet1:  string1
snippet2:  string2
snippet3:  string3

2 Answers:

Answer 0 (score: 2)

IIUC: you can use a regex

import re
txt='I do not want this FOO string1 BARS I do not want this FOO string 2 BAR I do not want this FOO string3 BAR2 I do not want this FOO string4 BARS '
re.findall('FOO(.*?)BAR', txt)

This produces a list of the matched strings:

[' string1 ', ' string 2 ', ' string3 ', ' string4 ']

Updated to match multiple end keywords:

import re
txt='I do not want this FOO string1 BARS I do not want this FOO string 2 SECTION I do not want this FOO string3 BAR2 I do not want this FOO string4 BARS'
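# Use a non-capturing group for alternation; [BAR|SECTION] would be a
# character class that matches any single one of those letters or '|'.
re.findall('FOO(.*?)(?:BAR|SECTION)', txt)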

This results in:

[' string1 ', ' string 2 ', ' string3 ', ' string4 ']
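
If it also matters which end marker closed each snippet, the alternation can be built from a list of markers. The following is only a sketch: `extract_between` is a hypothetical helper name, and sorting the end markers longest-first is an assumption made so that `BARS` and `BAR2` are reported instead of their shared prefix `BAR`.

```python
import re

def extract_between(text, start, ends):
    """Return (snippet, end_marker) pairs for every start...end span in text."""
    # Try longer end markers first, so 'BARS' and 'BAR2' win over 'BAR'.
    ordered = sorted(ends, key=len, reverse=True)
    alternation = "|".join(re.escape(e) for e in ordered)
    # \s* on both sides drops the whitespace around the captured snippet.
    pattern = re.compile(re.escape(start) + r"\s*(.*?)\s*(" + alternation + ")")
    return [(m.group(1), m.group(2)) for m in pattern.finditer(text)]

text = ('I do not want this FOO string1 BARS I do not want this FOO string 2 BAR '
        'I do not want this FOO string3 BAR2 I do not want this FOO string4 BARS ')
pairs = extract_between(text, 'FOO', ['BARS', 'BAR', 'BAR2'])
print(pairs)
```

`re.escape` keeps the markers literal in case real markers contain characters that are special in a regex.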

Answer 1 (score: 1)

What you want is something like this.

def find_substrings(text, start_marker, end_marker):
    index = 0
    results = []

    while True:
        index = text.find(start_marker, index)
        if index == -1: # If the start string wasn't found then there are no more instances left in the string
            break
        index2 = text.find(end_marker, index+len(start_marker))
        if index2 == -1: # Sub string was not terminated. 
            break
        results.append(text[index+len(start_marker):index2])
        index = index2 + len(end_marker)

    return results

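Right now you are using index (which works like find, except that it raises a ValueError when nothing is found), but every call starts searching from the beginning of the string, so you keep getting the first occurrence.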
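
A small illustration of that point, using a shortened sample string that is not from the question: both `str.index` and `str.find` accept an optional start position, and passing it is what lets the search move past the previous hit.

```python
text = "x FOO a BAR y FOO b BAR z"

first = text.index("FOO")                       # search starts at position 0
second = text.index("FOO", first + len("FOO"))  # resume just after the first hit

missing = text.find("QUX")   # find reports "not found" by returning -1
try:
    text.index("QUX")        # index reports it by raising ValueError
    raised = False
except ValueError:
    raised = True

print(first, second, missing, raised)
```

Without the second argument, `text.index("FOO")` would return the same position on every call.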

text = 'I do not want this FOO string1 BARS I do not want this FOO string 2 BAR I do not want this FOO string3 BAR2 I do not want this FOO string4 BARS '
find_substrings(text, "FOO ", " BAR")

This returns

['string1', 'string 2', 'string3', 'string4']