Algorithm to match 2 lists with wildcards

Time: 2012-01-13 07:30:40

Tags: python string algorithm pattern-matching string-matching

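I'm looking for an efficient way to match 2 lists, one which contains complete information and one which contains wildcards. I've been able to do this with wildcards of fixed length, but I'm now trying to do it with wildcards of variable length.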

So that:

match( ['A', 'B', '*', 'D'], ['A', 'B', 'C', 'C', 'C', 'D'] )

would return True, as long as all elements are in the same order in both lists.

I'm working with lists of objects, but used strings above for simplicity.

6 answers:

Answer 0: (score: 4)

[Edited to explain why I'm not using regular expressions, following the OP's comment about comparing objects]

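It looks like you're not working with strings but comparing objects, so I'm giving an explicit algorithm. Regular expressions are a good solution tailored to strings, don't get me wrong, but from what you say in the comments on your question, an explicit, simple algorithm may make things easier for you.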

It turns out this can be solved with a much simpler algorithm than this previous answer:

def matcher (l1, l2):
    if (l1 == []):
        return (l2 == [] or l2 == ['*'])
    if (l2 == [] or l2[0] == '*'):
        return matcher(l2, l1)
    if (l1[0] == '*'):
        return (matcher(l1, l2[1:]) or matcher(l1[1:], l2))
    if (l1[0] == l2[0]):
        return matcher(l1[1:], l2[1:])
    else:
        return False

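The key idea is that when you encounter a wildcard, you can explore two options:

  • either advance in the list that contains the wildcard (and consider that the wildcard has matched whatever came before),
  • or advance in the list that doesn't contain the wildcard (and consider that whatever is at the head of that list has to be matched by the wildcard).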
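
For long lists, the two options above can also be explored without recursion or slicing. The following is only a sketch under assumptions that are not in the answer: the wildcard appears in the pattern only (the recursive matcher accepts it in either list), and the names match_iterative and WILDCARD are made up for illustration. It remembers the last wildcard seen and, on a mismatch, lets that wildcard absorb one more item before retrying.

```python
WILDCARD = '*'  # assumed sentinel; elements only need to support ==


def match_iterative(pattern, items):
    """Return True if items matches pattern; WILDCARD spans zero or more items."""
    p = i = 0
    star = -1    # index of the most recent wildcard in pattern, -1 if none yet
    resume = 0   # position in items to retry from after a mismatch
    while i < len(items):
        if p < len(pattern) and pattern[p] == WILDCARD:
            # Remember the wildcard and first try letting it match nothing.
            star, resume = p, i
            p += 1
        elif p < len(pattern) and pattern[p] == items[i]:
            # Plain elements agree: advance in both lists.
            p += 1
            i += 1
        elif star != -1:
            # Mismatch: let the last wildcard absorb one more item and retry.
            resume += 1
            p, i = star + 1, resume
        else:
            return False
    # items is exhausted; only trailing wildcards may remain in the pattern.
    return all(term == WILDCARD for term in pattern[p:])
```

Since it only uses indexing and ==, it should work on lists of arbitrary objects as well as on strings.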

Answer 1: (score: 1)

How about the following:

import re

def match(pat, lst):
  regex = ''.join(term if term != '*' else '.*' for term in pat) + '$'
  s = ''.join(lst)
  return re.match(regex, s) is not None

print(match( ['A', 'B', '*', 'D'], ['A', 'B', 'C', 'C', 'C', 'D'] ))

It uses regular expressions. Wildcards (*) are changed to .* and all other search terms are kept as-is.

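One caveat is that if your search terms could contain things that have special meaning in the regex language, those would need to be properly escaped. It's pretty easy to handle this in the match function, I just wasn't sure if this was something you required.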
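
As a rough sketch of the escaping mentioned above (the name match_escaped is hypothetical, and it still assumes every element is a string), each literal term can be passed through re.escape so that only '*' is treated as a wildcard:

```python
import re


def match_escaped(pat, lst):
    # Escape every literal term so characters such as '.' or '+' match
    # themselves; only the '*' wildcard is turned into the regex '.*'.
    regex = ''.join('.*' if term == '*' else re.escape(term) for term in pat)
    # re.DOTALL lets '.*' span newlines; fullmatch anchors at both ends.
    return re.fullmatch(regex, ''.join(lst), flags=re.DOTALL) is not None
```

Note that joining the elements into one string still loses the boundaries between them, so this only behaves like an element-wise match when every element is a single character.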

Answer 2: (score: 1)

I'd recommend converting ['A', 'B', '*', 'D'] to '^AB.*D$', converting ['A', 'B', 'C', 'C', 'C', 'D'] to 'ABCCCD', and then using the re module (regular expressions) to do the match.

This is valid if the elements of your lists are strings of only one character each.

Something like:

import re
def myMatch( patternList, stringList ):
    # convert pattern to flat string with wildcards
    # convert AB*D to valid regex ^AB.*D$
    pattern = ''.join(patternList) 
    regexPattern = '^' + pattern.replace('*','.*') + '$' 
    # perform matching
    against = ''.join(stringList) # convert ['A','B','C','C','C','D'] to ABCCCD
    # return whether there is a match
    return (re.match(regexPattern,against) is not None)

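If the lists contain numbers or words, pick a character that you don't expect to appear in either, for example #. Then ['Aa','Bs','Ce','Cc','CC','Dd'] can be converted to Aa#Bs#Ce#Cc#CC#Dd, the wildcard pattern ['Aa','Bs','*','Dd'] can be converted to ^Aa#Bs#.*#Dd$, and the match performed.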

Practically speaking, this just means that every ''.join(...) becomes '#'.join(...) in myMatch.
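
One possible way to write the separator variant is sketched below. It departs slightly from the description above: instead of '#'.join(...), every element is followed by the separator, so a wildcard can also stand for zero elements and can never match part of an element. The names match_tokens and SEP are assumptions for illustration, and the separator must not occur inside any element.

```python
import re

SEP = '#'  # assumed separator; must never occur inside an element


def match_tokens(pattern_list, string_list):
    sep = re.escape(SEP)
    parts = []
    for term in pattern_list:
        if term == '*':
            # Zero or more whole elements, each followed by its separator.
            parts.append('(?:[^%s]*%s)*' % (sep, sep))
        else:
            # A literal element, escaped, followed by its separator.
            parts.append(re.escape(term) + sep)
    # Terminate every element with the separator, e.g. 'Aa#Bs#Dd#'.
    against = ''.join(item + SEP for item in string_list)
    return re.fullmatch(''.join(parts), against) is not None
```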

Answer 3: (score: 0)

I agree with the comments saying this could be done with regular expressions. For example:

import re

lst = ['A', 'B', 'C', 'C', 'C', 'D']
pattern = ['A', 'B', 'C+', 'D']

print(re.match(''.join(pattern), ''.join(lst))) # Will successfully match

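Edit: As pointed out in the comments, it might only be known in advance that some character has to be matched, but not which one. In that case, regular expressions are still useful: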

import re

lst = ['A', 'B', 'C', 'C', 'C', 'D']
pattern = r'AB(\w)\1*D'

print(re.match(pattern, ''.join(lst)).groups())

Answer 4: (score: 0)

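I agree, regular expressions are usually the way to go with this sort of thing. This algorithm works, but it just looks convoluted to me. It was fun to write though.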

def match(listx, listy):
    listx, listy = map(iter, (listx, listy))
    while 1:
        try:
            x = next(listx)
        except StopIteration:
            # listx is exhausted; check whether listy is exhausted as well.
            try:
                y = next(listy)
            except StopIteration:
                # This means there are no more values to be compared in either
                # listx or listy; since no mismatch was found earlier, the
                # lists match.
                return True
            else:
                # This means that there are values in listy that are not in
                # listx.
                return False
        else:
            try:
                y = next(listy)
            except StopIteration:
                # Here listx still has a value (x) but listy is exhausted.
                return False
        if x == y:
            pass
        elif x == '*':
            try:
                # Get the value in listx after '*'.
                x = next(listx)
            except StopIteration:
                # This means that listx terminates with '*'. If there are any
                # remaining values of listy, they will, by definition, match.
                return True
            while 1:
                if x == y:
                    # I didn't shift to the next value in listy because I
                    # assume that a '*' matches the empty string as well as
                    # any other.
                    break
                else:
                    try:
                        y = next(listy)
                    except StopIteration:
                        # This means there is at least one remaining value in
                        # listx that is not in listy, because listy has no
                        # more values.
                        return False
                    else:
                        pass
        # Same algorithm as above, given there is a '*' in listy.
        elif y == '*':
            try:
                y = next(listy)
            except StopIteration:
                return True
            while 1:
                if x == y:
                    break
                else:
                    try:
                        x = next(listx)
                    except StopIteration:
                        return False
                    else:
                        pass

Answer 5: (score: 0)

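I have this piece of C++ code which seems to do what you are trying to do (the inputs are strings instead of arrays of characters, but you'll have to adapt things anyway).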

bool Utils::stringMatchWithWildcards (const std::string str, const std::string strWithWildcards)
{
    PRINT("Starting in stringMatchWithWildcards('" << str << "','" << strWithWildcards << "')");
    const std::string wildcard="*";

    const bool startWithWildcard=(strWithWildcards.find(wildcard)==0);
    int pos=strWithWildcards.rfind(wildcard);
    const bool endWithWildcard = (pos!=std::string::npos) && (pos+wildcard.size()==strWithWildcards.size());

    // Basically, the point is to split the string with wildcards in strings with no wildcard.
    // Then search in the first string for the different chunks of the second in the correct order
    std::vector<std::string> vectStr;
    boost::split(vectStr, strWithWildcards, boost::is_any_of(wildcard));
    // I expected all the chunks in vectStr to be non-empty. It doesn't seem to be the case so let's remove them.
    vectStr.erase(std::remove_if(vectStr.begin(), vectStr.end(), std::mem_fun_ref(&std::string::empty)), vectStr.end());

    // Check if at least one element (to have first and last element)
    if (vectStr.empty())
    {
        const bool matchEmptyCase = (startWithWildcard || endWithWildcard || str.empty());
        PRINT("Match " << (matchEmptyCase?"":"un") << "successful (empty case) : '" << str << "' and '" << strWithWildcards << "'");
        return matchEmptyCase;
    }

    // First Element
    std::vector<std::string>::const_iterator vectStrIt = vectStr.begin();
    std::string aStr=*vectStrIt;
    if (!startWithWildcard && str.find(aStr, 0)!=0) {
        PRINT("Match unsuccessful (beginning) : '" << str << "' and '" << strWithWildcards << "'");
        return false;
    }

    // "Normal" Elements
    bool found(true);
    pos=0;
    std::vector<std::string>::const_iterator vectStrEnd = vectStr.end();
    for ( ; vectStrIt!=vectStrEnd ; vectStrIt++)
    {
        aStr=*vectStrIt;
        PRINT( "Searching '" << aStr << "' in '" << str << "' from  " << pos);
        pos=str.find(aStr, pos);
        if (pos==std::string::npos)
        {
            PRINT("Match unsuccessful ('" << aStr << "' not found) : '" << str << "' and '" << strWithWildcards << "'");
            return false;
        } else
        {
            PRINT( "Found at position " << pos);
            pos+=aStr.size();
        }
    }

    // Last Element
    const bool matchEnd = (endWithWildcard || str.rfind(aStr)+aStr.size()==str.size());
    PRINT("Match " << (matchEnd?"":"un") << "successful (usual case) : '" << str << "' and '" << strWithWildcards);
    return matchEnd;
}

   /* Tested on these values :
   assert( stringMatchWithWildcards("ABC","ABC"));
   assert( stringMatchWithWildcards("ABC","*"));
   assert( stringMatchWithWildcards("ABC","*****"));
   assert( stringMatchWithWildcards("ABC","*BC"));
   assert( stringMatchWithWildcards("ABC","AB*"));
   assert( stringMatchWithWildcards("ABC","A*C"));
   assert( stringMatchWithWildcards("ABC","*C"));
   assert( stringMatchWithWildcards("ABC","A*"));

   assert(!stringMatchWithWildcards("ABC","BC"));
   assert(!stringMatchWithWildcards("ABC","AB"));
   assert(!stringMatchWithWildcards("ABC","AB*D"));
   assert(!stringMatchWithWildcards("ABC",""));

   assert( stringMatchWithWildcards("",""));
   assert( stringMatchWithWildcards("","*"));
   assert(!stringMatchWithWildcards("","ABC"));
   */

This isn't something I'm really proud of, but it seems to be working so far. I hope you find it useful.
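
Since the question is about Python lists, the same chunk-splitting idea might be adapted roughly as follows. This is a loose sketch rather than a port of the C++ above: the names match_chunks and find_sublist are hypothetical, the pattern comes first in the argument list, and the first and last chunks are pinned to the two ends of the list before the middle chunks are searched for in order.

```python
def find_sublist(items, chunk, start, stop):
    """Return the first index >= start where chunk fits before stop, else -1."""
    for i in range(start, stop - len(chunk) + 1):
        if items[i:i + len(chunk)] == chunk:
            return i
    return -1


def match_chunks(pattern, items, wildcard='*'):
    items = list(items)
    # Split the pattern into wildcard-free chunks (some may be empty).
    chunks, current = [], []
    for term in pattern:
        if term == wildcard:
            chunks.append(current)
            current = []
        else:
            current.append(term)
    chunks.append(current)

    if len(chunks) == 1:
        # No wildcard at all: plain equality.
        return items == chunks[0]

    first, last = chunks[0], chunks[-1]
    if len(first) + len(last) > len(items):
        return False
    # The first chunk must be a prefix and the last chunk a suffix.
    end = len(items) - len(last)
    if items[:len(first)] != first or items[end:] != last:
        return False
    # Middle chunks must appear in order between the prefix and the suffix.
    pos = len(first)
    for chunk in chunks[1:-1]:
        found = find_sublist(items, chunk, pos, end)
        if found == -1:
            return False
        pos = found + len(chunk)
    return True
```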