Matching a string but not substrings, where order matters

Time: 2021-06-26 16:48:48

Tags: python string

I am trying to match strings in Python:

For example, if my phrase is "long string"

I want to match "long string", "Long StrInG", and "long!!!string", but not "Long strings" or "stringlong". That is, I want to match every instance of my string in any text, ignoring case, without also catching substrings.

i.e., when I do
string = "hello"
strings = "hellos"
string in strings evaluates to True, but I don't want this to count as a match

I also want the string to catch any instance inside a longer sentence where the words are separated by whitespace or punctuation:

i.e., string = "long string" should match
"hello ~~~!!!!! long !@#!@#!@ string"

Whitespace also matters - I don't want to match 
string = "longstringlongstring" or "longs trying"

Here is what I have tried so far:

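import string

# text: the text we are checking for an instance of the phrase
# phrase: the string to look for in text
# (wrapped in a function so the return statements are valid; the name is illustrative)
def contains_phrase(text, phrase):
    # Replace punctuation with spaces and lowercase everything else
    cleaned_text = ""
    for char in text:
        if char in string.punctuation:
            cleaned_text += " "
        else:
            cleaned_text += char.lower()
    # Collapse runs of whitespace into single spaces
    cleaned_string = " ".join(cleaned_text.split())

    # Count every character of the cleaned text that equals some character of the phrase
    counter = 0
    for char in cleaned_string:
        for char2 in phrase:
            if char == char2:
                counter += 1
    if counter == len(phrase):
        return True
    return False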

I realize that I can't use a list because order does not matter. Any suggestions would be much appreciated!

1 answer:

Answer 0: (score: 1)

Use a regular expression:

import re
import string

# given phrase
phrase = "long string"

# this says what can go between two words of the phrase above
between = "[" + r"\s" + re.escape(string.punctuation) + "]+"

# the pattern
pat = r"\b" + between.join(phrase.split()) + r"\b"
reg = re.compile(pat, flags=re.I)

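Here between consists of whitespace (\s) plus all the punctuation characters from string.punctuation, and it must occur at least once (because of the []+ around it). We re.escape the punctuation because it contains regex metacharacters (e.g., $) that need to be matched literally. The pattern is then formed by joining the words of the phrase with this between, and finally a word boundary (\b) is placed at each end to ensure an exact match, e.g., to stop long stringS from matching. re.I tells the compiled regex to ignore case.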
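To make the role of the word boundaries concrete, here is a minimal sketch that compiles the same pattern with and without \b. The names core, with_boundaries, and without_boundaries are illustrative and not part of the answer. It also shows a side effect worth knowing: the underscore counts as a word character for \b, so a phrase wrapped in underscores is not matched even though _ is in string.punctuation.

```python
import re
import string

phrase = "long string"
between = "[" + r"\s" + re.escape(string.punctuation) + "]+"
core = between.join(phrase.split())

# Same pattern as in the answer, and a variant without the \b anchors
with_boundaries = re.compile(r"\b" + core + r"\b", flags=re.I)
without_boundaries = re.compile(core, flags=re.I)

# Without \b the trailing "s" is simply ignored, so the phrase still matches
print(without_boundaries.search("Long strings") is not None)  # True
# With \b, "string" must end at a word boundary, so "strings" is rejected
print(with_boundaries.search("Long strings") is not None)     # False
# "_" is a word character, so there is no boundary between "_" and "l"
print(with_boundaries.search("_long string_") is not None)    # False
```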

For this phrase, pat looks like

\blong[\s!"\#\$%\&\'\(\)\*\+,\-\./:;<=>\?@\[\\\]\^_`\{\|\}\~]+string\b

If the phrase is a single word, e.g. phrase = "this", then pat is

\bthis\b

i.e., there is no punctuation or whitespace part in between, because there is only one word.

Finally, for a 3-word phrase, e.g., phrase = "no escape needed", pat is

\bno[\s!"\#\$%\&\'\(\)\*\+,\-\./:;<=>\?@\[\\\]\^_`\{\|\}\~]+escape[\s!"\#\$%\&\'\(\)\*\+,\-\./:;<=>\?@\[\\\]\^_`\{\|\}\~]+needed\b

i.e., the regex is formed dynamically from the phrase.
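Since the pattern depends only on the phrase, the construction can be wrapped in a small function. The sketch below assumes the words of the phrase are plain letters or digits (they are not escaped, as in the answer); compile_phrase is a hypothetical name, not something from the answer.

```python
import re
import string

def compile_phrase(phrase):
    """Hypothetical helper: build the case-insensitive regex for a phrase."""
    # Anything allowed between two words: whitespace or punctuation, one or more
    between = "[" + r"\s" + re.escape(string.punctuation) + "]+"
    return re.compile(r"\b" + between.join(phrase.split()) + r"\b", flags=re.I)

one_word = compile_phrase("this")
print(one_word.pattern)  # \bthis\b

three_words = compile_phrase("no escape needed")
print(three_words.search("No -- escape, NEEDED!") is not None)  # True
print(three_words.search("no escapes needed") is not None)      # False
```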


Sample runs for testing (there is a match if the result is not None):

>>> re.search(reg, "long string") is not None
True

>>> re.search(reg, "Long StrInG") is not None
True

>>> re.search(reg, "long!!!string") is not None
True

>>> re.search(reg, "Long strings") is not None
False

>>> re.search(reg, "stringlong") is not None
False

>>> re.search(reg, "hello ~~~!!!!! long !@#!@#!@ string") is not None
True

>>> re.search(reg, "longstring") is not None
False
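The question also asks for every instance inside a longer text. One way to get that, sketched here with a made-up sample sentence, is to iterate over the matches with finditer instead of testing a single search result:

```python
import re
import string

phrase = "long string"
between = "[" + r"\s" + re.escape(string.punctuation) + "]+"
reg = re.compile(r"\b" + between.join(phrase.split()) + r"\b", flags=re.I)

# Illustrative text containing two real instances and two near misses
text = "A long string, then LONG...string, but not longstring or long strings."
matches = [m.group(0) for m in reg.finditer(text)]
print(matches)  # ['long string', 'LONG...string']
```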

You can refer to the regex details here