Question

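I have a large list of strings (no more than 2k), and I want to find the most common partial string match in the list. For example, I am trying to satisfy the following test case in an efficient way.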

data = [
    'abcdef',
    'abcxyz',
    'xyz',
    'def',
]
result = magic_function(data)
assert result == 'abc'

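I took inspiration from this stackoverflow post, but it does not work here because some elements in the list are completely different from the others, so there is no prefix common to every string.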

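from itertools import takewhile

def magic_function(data):
    # Longest prefix shared by ALL strings; returns '' for the data above
    return ''.join(c[0] for c in takewhile(lambda x: all(x[0] == y for y in x), zip(*data)))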

Answer 1

You will probably need to tune this and test its performance.

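I basically feed every leading partial substring (prefix) of each word in data, up to the word's length, into a Counter and create a ranking based on len(substring) * occurrences, penalizing substrings that occur only once by counting them as 0.1: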

data = [
    'abcdef',
    'abcxyz',
    'xyz',
    'def',
]    

def magic(d):
    """Applies magic(tm) to the list of strings given as 'd'.
    Returns a list of (prefix, count) tuples, most promising candidate first."""
    from collections import Counter
    myCountings = Counter()

    def allParts(word):
        """Generator that yields every proper prefix of 'word'."""
        for i in range(1, len(word)):
            yield word[:i]

    for part in d:
        # count them all
        myCountings.update(allParts(part))

    # get all as tuples and sort based on the heuristic length * occurrences
    # (a prefix that occurs only once is weighted as 0.1 occurrences)
    return sorted(myCountings.most_common(),
                  key=lambda x: len(x[0]) * (x[1] if x[1] > 1 else 0.1), reverse=True)

m = magic(data)
print(m)  # e.g. use m[0][0] for the top-ranked candidate

Output:

 [('abc', 2), ('ab', 2), ('a', 2), ('abcde', 1), ('abcxy', 1), 
  ('abcd', 1), ('abcx', 1), ('xy', 1), ('de', 1), ('x', 1), ('d', 1)]

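You will need to tweak the sort criterion a little and use only the first entry of the resulting list, but you can use this as a starting point.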
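As a minimal sketch of that last step, the ranking can be wrapped so that only the best-scoring prefix is returned. The name `magic_function` comes from the question; `rank_prefixes` is a hypothetical helper that repeats the same prefix counting so the block runs on its own.

```python
from collections import Counter


def rank_prefixes(strings):
    """Rank proper prefixes by length * count, penalising prefixes seen once."""
    counts = Counter()
    for s in strings:
        # Proper prefixes only: s[:1] .. s[:-1], same as allParts() above.
        counts.update(s[:i] for i in range(1, len(s)))
    return sorted(
        counts.most_common(),
        key=lambda item: len(item[0]) * (item[1] if item[1] > 1 else 0.1),
        reverse=True,
    )


def magic_function(strings):
    """Return the top-ranked prefix, or '' if there is nothing to rank."""
    ranking = rank_prefixes(strings)
    return ranking[0][0] if ranking else ""


data = ["abcdef", "abcxyz", "xyz", "def"]
print(magic_function(data))  # abc
```

The empty-string fallback is an assumption about what an empty input (or a list of single-character strings) should produce; adjust it to whatever your callers expect.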

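If you prefer longer matches over several short ones, you can adjust the ranking by multiplying the length by a factor; the right value depends on your data...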
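One possible reading of that suggestion is sketched below. Note the assumption: a constant factor applied to every length would leave the order unchanged, so this sketch raises the length to a power instead. `rank_weighted` and `length_weight` are hypothetical names, not part of the answer's code.

```python
from collections import Counter


def rank_weighted(strings, length_weight=1.0):
    """Rank proper prefixes by len(prefix) ** length_weight * count.

    length_weight is a hypothetical tuning knob: values above 1 push
    longer prefixes up the ranking, values below 1 push them down.
    """
    counts = Counter()
    for s in strings:
        counts.update(s[:i] for i in range(1, len(s)))

    def score(item):
        prefix, count = item
        # Same penalty as in the answer: a prefix seen once counts as 0.1.
        effective = count if count > 1 else 0.1
        return len(prefix) ** length_weight * effective

    return sorted(counts.items(), key=score, reverse=True)


data = ["abcdef", "abcxyz", "xyz", "def"]
print(rank_weighted(data)[0][0])                    # abc
print(rank_weighted(data, length_weight=10)[0][0])  # abcde
```

With the default weight the result matches the original heuristic; a very large weight lets a long prefix that occurs only once outrank a shorter shared one, which is probably too aggressive for most data, so the value would need testing against real input.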
